xyzpy.gen.farming¶

Objects for labelling and successively running functions.

Classes¶

`Runner`	Container class with all the information needed to systematically run a function over many parameters and capture the output in a dataset.
`Harvester`	Container class for collecting and aggregating data to disk.
`Sampler`	Like a Harvester, but randomly samples combos and writes the table of

Functions¶

`label`(var_names[, fn_args, var_dims, var_coords, ...])	Convenient decorator to automatically wrap a function as a Runner or Harvester.
`cultivate`(fn, *[, var_names, data_name, runner_opts, ...])	Convenience function to run a full cycle of annotating a function,

Module Contents¶

class xyzpy.gen.farming.Runner(fn, var_names, fn_args=None, var_dims=None, var_coords=None, constants=None, resources=None, attrs=None, **default_runner_settings)[source]¶

Bases: object

Container class with all the information needed to systematically run a function over many parameters and capture the output in a dataset.

Parameters:

fn (callable) – Function that produces a single instance of a result.
var_names (str, sequence of str, or None) – The ordered name(s) of the output variable(s) of fn. Set this explicitly to None if fn outputs already labelled data as a dict, Dataset, or DataArray.
fn_args (str, or sequence of str, optional) – The ordered name(s) of the input argument(s) of fn. This is only needed if the cases or combos supplied are not dict-like.
var_dims (dict-like, optional) – Mapping of output variables to their named internal dimensions, which can be the names of constants.
var_coords (dict-like, optional) – Mapping of the output variables' named internal dimensions to the actual values they take.
constants (dict-like, optional) – Constant arguments to be supplied to fn. These can be used as ‘var_dims’, and will be saved as coords if so, otherwise as attributes.
resources (dict-like, optional) – Like constants but not saved to the dataset, e.g. if very big.
attrs (dict-like, optional) – Any other miscellaneous information to be saved with the dataset.
default_runner_settings – These keyword arguments will be supplied as defaults to any runner.
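
To make the shapes of var_names, var_dims and var_coords concrete, here is a minimal sketch in plain Python. It does not call xyzpy; the function decay and the names peak, signal and time are hypothetical, chosen only to show how the three declarations would line up with a function's outputs.

```python
import math

# Hypothetical function: returns one scalar and one 1D series per call.
def decay(rate, amplitude):
    times = [0.0, 0.5, 1.0, 1.5]
    signal = [amplitude * math.exp(-rate * t) for t in times]
    return max(signal), signal

# Ordered names of the two outputs of `decay`.
var_names = ("peak", "signal")
# 'peak' is a scalar; 'signal' has one internal dimension called 'time'.
var_dims = {"peak": (), "signal": ("time",)}
# The values taken along that internal dimension.
var_coords = {"time": [0.0, 0.5, 1.0, 1.5]}

peak, signal = decay(rate=1.0, amplitude=2.0)
outputs = dict(zip(var_names, (peak, signal)))

# Each declared internal dimension must match the length of the output.
for name, dims in var_dims.items():
    for dim in dims:
        assert len(outputs[name]) == len(var_coords[dim])
```

A scalar output maps to an empty tuple of dimensions, which matches the var_dims shown in the Runner repr in the label example.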

fn¶

_var_names = (None,)¶

_fn_args¶

_var_dims¶

_var_coords¶

_constants¶

_resources¶

_attrs¶

_last_ds = None¶

default_runner_settings¶

__call__(*args, **kwargs)[source]¶

_get_fn_args()[source]¶

_set_fn_args(fn_args)[source]¶

_del_fn_args()[source]¶

fn_args¶

_get_var_names()[source]¶

_set_var_names(var_names)[source]¶

_del_var_names()[source]¶

var_names¶

_get_var_dims()[source]¶

_set_var_dims(var_dims, var_names=None)[source]¶

_del_var_dims()[source]¶

var_dims¶

_get_var_coords()[source]¶

_set_var_coords(var_coords)[source]¶

_del_var_coords()[source]¶

var_coords¶

_get_constants()[source]¶

_set_constants(constants)[source]¶

_del_constants()[source]¶

constants¶

_get_resources()[source]¶

_set_resources(resources)[source]¶

_del_resources()[source]¶

resources¶

property last_ds¶

run_combos(combos, constants=(), **runner_settings)[source]¶

Run combos using the function map and save to dataset.

Parameters:

combos (dict_like[str, iterable]) – The values of each function argument with which to evaluate all combinations.
constants (dict, optional) – Extra constant arguments for this run, repeated arguments will take precedence over stored constants but for this run only.
runner_settings – Keyword arguments supplied to combo_runner().
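
The idea of evaluating "all combinations" can be sketched with the standard library. This is an illustration of the concept only, not xyzpy's implementation: run_combos_sketch is a hypothetical helper, and it returns a plain dict keyed by argument values rather than a dataset.

```python
import itertools

def run_combos_sketch(fn, combos, stored_constants=None, constants=None):
    """Call fn on the Cartesian product of the values in combos."""
    # Per-run constants take precedence over stored ones, for this call only:
    # the stored mapping itself is never modified.
    merged = {**(stored_constants or {}), **(constants or {})}
    names = list(combos)
    results = {}
    for values in itertools.product(*(combos[n] for n in names)):
        kwargs = dict(zip(names, values))
        results[values] = fn(**kwargs, **merged)
    return results

def foo(x, y, scale=1):
    return scale * (x + y), scale * (x - y)

stored = {"scale": 1}
results = run_combos_sketch(
    foo, {"x": [1, 2], "y": [10, 20, 30]}, stored, constants={"scale": 2}
)
```

Two values of x and three of y give six evaluations, and the stored constant scale=1 is left untouched after the run.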

run_cases(cases, constants=(), fn_args=None, **runner_settings)[source]¶

Run cases using the function and save to dataset.

Parameters:

cases (sequence of mappings or tuples) – A sequence of cases.
constants (dict, optional) – Extra constant arguments for this run, repeated arguments will take precedence over stored constants but for this run only.
runner_settings – Supplied to case_runner().
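
Unlike combos, cases are explicit individual settings. The sketch below, in plain Python, shows why fn_args is needed when a case is a tuple rather than a mapping. The helper normalize_cases is hypothetical and only illustrates the naming step, not how xyzpy handles cases internally.

```python
from collections.abc import Mapping

def normalize_cases(cases, fn_args=None):
    """Turn a mix of mapping and tuple cases into keyword dictionaries."""
    normalized = []
    for case in cases:
        if isinstance(case, Mapping):
            # Already labelled: argument names come from the keys.
            normalized.append(dict(case))
        else:
            # Positional case: names must be supplied via fn_args.
            if fn_args is None:
                raise ValueError("fn_args is required for tuple cases")
            normalized.append(dict(zip(fn_args, case)))
    return normalized

def foo(x, y):
    return x + y, x - y

cases = [(1, 2), {"x": 3, "y": 5}, (10, 4)]
named = normalize_cases(cases, fn_args=("x", "y"))
results = [foo(**case) for case in named]
```

Only the three listed settings are evaluated; no combinations between them are generated.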

Crop(name=None, parent_dir=None, save_fn=None, batchsize=None, num_batches=None)[source]¶

Return a Crop instance with this runner, from which fn will be set, and then combos can be sown, grown, and reaped into the Runner.last_ds. See Crop.

Return type: Crop

__repr__()[source]¶

xyzpy.gen.farming.label(var_names, fn_args=None, var_dims=None, var_coords=None, constants=None, resources=None, attrs=None, harvester=False, sampler=False, engine=None, **default_runner_settings)[source]¶

Convenient decorator to automatically wrap a function as a Runner or Harvester.

Parameters:

var_names (str, sequence of str, or None) – The ordered name(s) of the output variable(s) of fn. Set this explicitly to None if fn outputs already labelled data as a dict, Dataset, or DataArray.
fn_args (str, or sequence of str, optional) – The ordered name(s) of the input argument(s) of fn. This is only needed if the cases or combos supplied are not dict-like.
var_dims (dict-like, optional) – Mapping of output variables to their named internal dimensions, which can be the names of constants.
var_coords (dict-like, optional) – Mapping of the output variables' named internal dimensions to the actual values they take.
constants (dict-like, optional) – Constant arguments to be supplied to fn. These can be used as ‘var_dims’, and will be saved as coords if so, otherwise as attributes.
resources (dict-like, optional) – Like constants but not saved to the dataset, e.g. if very big.
attrs (dict-like, optional) – Any other miscellaneous information to be saved with the dataset.
harvester (bool or str, optional) – If True, wrap the runner as a Harvester; if a string, create the harvester with that as the data_name.
default_runner_settings – These keyword arguments will be supplied as defaults to any runner.

Examples

Declare a function as a runner directly:

>>> import xyzpy as xyz

>>> @xyz.label(var_names=['sum', 'diff'])
... def foo(x, y):
...     return x + y, x - y
...

>>> foo
<xyzpy.Runner>
    fn: <function foo at 0x7f1fd8e5b1e0>
    fn_args: ('x', 'y')
    var_names: ('sum', 'diff')
    var_dims: {'sum': (), 'diff': ()}

>>> foo(1, 2)  # can still call it normally
(3, -1)

class xyzpy.gen.farming.Harvester(runner: Runner, data_name=None, chunks=None, engine='h5netcdf', full_ds=None)[source]¶

Bases: object

Container class for collecting and aggregating data to disk.

Parameters:

runner (Runner) – Performs the runs and describes the results.
data_name (str, optional) – Base file path to save data to.
chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded, and merged into, using on-disk dask arrays.
engine (str, optional) – Engine to use to save and load datasets.
full_ds (xarray.Dataset) – Initialize the Harvester with this dataset as the initial full dataset.

Members:

full_ds (xarray.Dataset) – Dataset containing all data harvested so far, by default synced to disk.
last_ds (xarray.Dataset) – Dataset containing just the data from the last harvesting run.
runner¶

data_name = None¶

engine = 'h5netcdf'¶

chunks = None¶

_full_ds = None¶

property fn¶

__call__(*args, **kwargs)[source]¶

property last_ds¶: Dataset containing the last run’s data.

load_full_ds(chunks=None, engine=None)[source]¶

Load the disk dataset into full_ds.

Parameters:

chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded, and merged into, using on-disk dask arrays.
engine (str, optional) – Engine to use to save and load datasets.

property full_ds¶: Dataset containing all saved runs.

save_full_ds(new_full_ds=None, engine=None)[source]¶

Save full_ds onto disk. The old file is moved and kept as a backup in case of errors when writing the new dataset to disk.

Parameters:

new_full_ds (xarray.Dataset, optional) – Save this dataset as the new full dataset, else use the current full dataset.
engine (str, optional) – Engine to use to save and load datasets.

delete_ds(backup=False)[source]¶: Delete the on-disk dataset, optionally backing it up first.

add_ds(new_ds, sync=True, overwrite=None, chunks=None, engine=None)[source]¶

Merge a new dataset into the in-memory full dataset.

Parameters:

new_ds (xr.Dataset or xr.DataArray) – Data to be merged into the full dataset.
sync (bool, optional) – If True (default), load and save the disk dataset before and after merging in the new data.
overwrite ({None, False, True}, optional) –
How to combine data from the new run into the current full_ds:
- None (default): attempt the merge and only raise if data conflicts.
- True: overwrite conflicting current data with that from the new dataset.
- False: drop any conflicting data from the new dataset.
chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded, and merged into, using on-disk dask arrays.
engine (str, optional) – Engine to use to save and load datasets.
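
The three overwrite modes can be illustrated with plain dictionaries standing in for datasets. This is an analogy under stated assumptions, not the xarray-based merge that the Harvester performs: merge_points is a hypothetical helper, keys play the role of coordinates, and a "conflict" is taken to mean the same key holding a different value.

```python
def merge_points(full, new, overwrite=None):
    """Merge `new` into `full` following the three overwrite modes."""
    conflicts = {k for k in new if k in full and full[k] != new[k]}
    if overwrite is None:
        # Attempt the merge and only raise if data conflicts.
        if conflicts:
            raise ValueError(f"conflicting data at {sorted(conflicts)}")
        return {**full, **new}
    if overwrite:
        # New data wins wherever there is a conflict.
        return {**full, **new}
    # overwrite is False: drop conflicting entries from the new data.
    kept = {k: v for k, v in new.items() if k not in conflicts}
    return {**full, **kept}

full = {(1, 10): 11, (1, 20): 21}
new = {(1, 20): 99, (2, 10): 12}

merged_true = merge_points(full, new, overwrite=True)
merged_false = merge_points(full, new, overwrite=False)

try:
    merge_points(full, new, overwrite=None)
    raised_on_conflict = False
except ValueError:
    raised_on_conflict = True
```

In every mode the non-conflicting point (2, 10) is added; the modes differ only in what happens at (1, 20).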

expand_dims(name, value, engine=None)[source]¶: Add a new coordinate dimension with name and value. The change is immediately synced with the on-disk dataset. Useful if you want to expand the parameter space along a previously constant argument.

drop_sel(labels=None, *, errors='raise', engine=None, **labels_kwargs)[source]¶: Drop specific values of coordinates from this harvester and its dataset. See http://xarray.pydata.org/en/latest/generated/xarray.Dataset.drop_sel.html. The change is immediately synced with the on-disk dataset. Useful for tidying unneeded data points.

_maybe_expand_combos(combos)[source]¶: Expand combos with ellipses into full coordinate values from the current full dataset.

harvest_combos(combos, *, cases=None, missing_only=False, sync=True, overwrite=None, chunks=None, engine=None, **runner_settings)[source]¶

Run combos, automatically merging into an on-disk dataset.

Parameters:

combos (dict_like[str, iterable]) – The combos to run. The only difference here is that you can supply an ellipsis ..., meaning that all values for that coordinate will be loaded from the current full dataset.
missing_only (bool, optional) – If True, only run combos that are not already present in the on-disk dataset.
sync (bool, optional) – If True (default), load and save the disk dataset before and after merging in the new data.
overwrite ({None, False, True}, optional) –
- None (default): attempt the merge and only raise if data conflicts.
- True: overwrite any conflicting current data with that from the new dataset.
- False: drop any conflicting data from the new dataset.
chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded, and merged into, using on-disk dask arrays.
engine (str, optional) – Engine to use to save and load datasets.
runner_settings – Supplied to combo_runner().
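
The ellipsis expansion and the missing_only filter can be sketched in plain Python. Both helpers below are hypothetical and only mimic the described behaviour: existing_coords stands in for the coordinates of the current full dataset, and done for the set of points that already hold data.

```python
import itertools

def expand_combos(combos, existing_coords):
    """Replace any `...` value with the coordinate values already stored."""
    return {
        name: list(existing_coords[name]) if values is ... else list(values)
        for name, values in combos.items()
    }

def missing_cases(combos, done):
    """List the combinations that are not already present in `done`."""
    names = list(combos)
    return [
        dict(zip(names, values))
        for values in itertools.product(*combos.values())
        if values not in done
    ]

# Stand-ins for the state of an existing dataset.
existing_coords = {"x": [1, 2, 3], "y": [10, 20]}
done = set(itertools.product(existing_coords["x"], existing_coords["y"]))

# Reuse every stored value of x, and extend y with one new value.
expanded = expand_combos({"x": ..., "y": [10, 30]}, existing_coords)
to_run = missing_cases(expanded, done)
```

Here only the three points with y=30 remain to be run, since the points with y=10 are already present.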

harvest_cases(cases, *, sync=True, overwrite=None, chunks=None, engine=None, **runner_settings)[source]¶

Run cases, automatically merging into an on-disk dataset.

Parameters:

cases (list of dict or tuple) – The cases to run.
sync (bool, optional) – If True (default), load and save the disk dataset before and after merging in the new data.
overwrite ({None, False, True}, optional) –
What to do regarding clashes with old data:
- None (default): attempt the merge and only raise if data conflicts.
- True: overwrite conflicting current data with that from the new dataset.
- False: drop any conflicting data from the new dataset.
chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded, and merged into, using on-disk dask arrays.
engine (str, optional) – Engine to use to save and load datasets.
runner_settings – Supplied to case_runner().

Crop(name=None, parent_dir=None, save_fn=None, batchsize=None, num_batches=None)[source]¶

Return a Crop instance with this Harvester, from which fn will be set, and then combos can be sown, grown, and reaped into the Harvester.full_ds. See Crop.

Return type: Crop

__repr__()[source]¶

cultivate(combos=None, cases=None, constants=None, name=None, parent_dir=None, batchsize=None, num_batches=None, missing_only=True, shuffle=True, subprocess='auto', num_workers=None, num_threads=None, gpus=None, affinities=None, log=False, raise_errors=True, verbosity=1, on_existing='ask', on_error='ask', clean_up=None, **grow_kwargs)[source]¶

Convenience method to run a full cycle of parsing combos into missing cases only, then persistently growing those cases, and finally merging the results into the full dataset.

Parameters:

combos (dict_like[str, iterable]) – The combos to run. The only difference here is that you can supply an ellipsis ..., meaning that all values for that coordinate will be loaded from the current full dataset.
cases (sequence of mappings or tuples, optional) – A sequence of (partial) individual settings to run. For each case, all settings given by combos will be generated.
constants (dict, optional) – Extra constant arguments for this run.
name (str, optional) – Name for the crop to be used for on-disk storage of batches, results and logs. You can use different names to grow results for the same dataset concurrently.
parent_dir (str, optional) – Parent directory in which to create the crop folder (.xyz-{name}/). Defaults to the current working directory.
batchsize (int, optional) – If given, the target number of cases to sow in each batch. This is computed from num_batches if not given and 1 if neither given.
num_batches (int, optional) – If given, the target number of batches to sow. This is computed from batchsize if not given and 1 if neither given.
missing_only (bool, optional) – If True (default), only run cases that are not already present in the on-disk dataset. If False, the new results will overwrite any existing results.
shuffle (bool, optional) – If True (default), shuffle the order of cases before sowing and growing. This can be a useful basic form of load balancing.
subprocess ("auto" or bool, optional) – Whether to grow each batch in a fresh subprocess. This adds about 1 second of overhead per batch, but allows the number of threads, cpu affinity and gpu assignment to be controlled. If “auto” (default) subprocesses are used when num_threads, gpus or affinities are specified. See xyzpy.Crop.grow() for details.
num_workers (int, optional) – Maximum number of batches to run concurrently. In subprocess mode this caps simultaneous subprocesses (defaults to 1 if not given). In in-process mode this is the joblib loky pool size (None = serial). Forwarded to xyzpy.Crop.grow().
num_threads (int, optional) – Number of threads each worker is allowed to use, applied via the standard env vars (OMP_NUM_THREADS, MKL_NUM_THREADS, etc.) in each subprocess. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().
gpus (int, str, or sequence of int, optional) – GPU device IDs to assign to subprocesses via CUDA_VISIBLE_DEVICES; the pool also caps concurrency. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().
affinities (int, str, or sequence of int, optional) – CPU core IDs to pin subprocesses to via taskset; the pool also caps concurrency. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().
log (bool, optional) – Whether to save subprocess stdout and stderr to files in the crop directory under logs/batch-{batch_id}.log. Subprocess-mode only. Forwarded to xyzpy.Crop.grow().
raise_errors (bool, optional) – If True (default), raise any errors that occur during growing, otherwise just log them and continue with the next batch.
verbosity (int, optional) – The level of logging to print during the sow/grow/reap process. 0: no output, 1: progress bars, 2: progress bars with current setting postfixed.
on_existing ({'ask', 'reap', 'delete', 'skip', 'raise'}, optional) – What to do if a crop with the same name already exists on disk. Default is 'ask' (interactive prompt).
on_error ({'ask', 'reap', 'delete', 'skip', 'raise'}, optional) – What to do if an error occurs during grow/reap. Default is 'ask' (interactive prompt).
clean_up (bool or None, optional) – Whether to delete the on-disk batch, result and log files after successfully reaping.
grow_kwargs – Further keyword arguments forwarded to xyzpy.Crop.grow() (e.g. executor, min_wait, …).
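
Two of the rules above lend themselves to a short sketch: how batchsize and num_batches determine each other, and how subprocess="auto" is resolved. Both helpers are hypothetical and written in plain Python. The use of ceiling division, the reading of "neither given" as a batch size of 1, and letting batchsize win when both are given are assumptions, not confirmed details of xyzpy.

```python
import math

def resolve_batches(num_cases, batchsize=None, num_batches=None):
    """Return (batchsize, num_batches) for a given number of cases."""
    if batchsize is None and num_batches is None:
        batchsize = 1  # assumed default when neither is given
    if batchsize is None:
        # Derive the batch size from the requested number of batches.
        batchsize = math.ceil(num_cases / num_batches)
    # Assumption: if both were given, batchsize takes priority here.
    return batchsize, math.ceil(num_cases / batchsize)

def resolve_subprocess(subprocess="auto", num_threads=None, gpus=None,
                       affinities=None):
    """Decide whether batches are grown in fresh subprocesses."""
    if subprocess == "auto":
        # Any resource control option implies subprocess mode.
        return any(opt is not None for opt in (num_threads, gpus, affinities))
    return bool(subprocess)

by_size = resolve_batches(10, batchsize=4)
by_count = resolve_batches(10, num_batches=4)
default = resolve_batches(10)

auto_plain = resolve_subprocess()
auto_threads = resolve_subprocess(num_threads=4)
forced_off = resolve_subprocess(subprocess=False, gpus=[0])
```

Because both quantities are described as targets, the last batch may hold fewer cases than the others.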