xyzpy.gen.farming

Objects for labelling and successively running functions.

Classes

Runner

Container class with all the information needed to systematically run a function over many parameters and capture the output in a dataset.

Harvester

Container class for collecting and aggregating data to disk.

Sampler

Like a Harvester, but randomly samples combos and writes the table of results to a pandas.DataFrame.

Functions

label(var_names[, fn_args, var_dims, var_coords, ...])

Convenient decorator to automatically wrap a function as a Runner or Harvester.

cultivate(fn, *[, var_names, data_name, runner_opts, ...])

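Convenience function to run a full cycle of annotating a function, parsing combos into missing cases only, then persistently growing those cases, and finally merging the results into the full dataset.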

Module Contents

class xyzpy.gen.farming.Runner(fn, var_names, fn_args=None, var_dims=None, var_coords=None, constants=None, resources=None, attrs=None, **default_runner_settings)[source]

Bases: object

Container class with all the information needed to systematically run a function over many parameters and capture the output in a dataset.

Parameters:
  • fn (callable) – Function that produces a single instance of a result.

  • var_names (str, sequence of str, or None) – The ordered name(s) of the output variable(s) of fn. Set this explicitly to None if fn outputs already labelled data as a dict, Dataset, or DataArray.

  • fn_args (str, or sequence of str, optional) – The ordered name(s) of the input argument(s) of fn. This is only needed if the cases or combos supplied are not dict-like.

  • var_dims (dict-like, optional) – Mapping of output variables to their named internal dimensions; these can be the names of constants.

  • var_coords (dict-like, optional) – Mapping of the output variables’ named internal dimensions to the actual values they take.

  • constants (dict-like, optional) – Constant arguments to be supplied to fn. These can be used as ‘var_dims’, and will be saved as coords if so, otherwise as attributes.

  • resources (dict-like, optional) – Like constants but not saved to the dataset, e.g. if very big.

  • attrs (dict-like, optional) – Any other miscellaneous information to be saved with the dataset.

  • default_runner_settings – These keyword arguments will be supplied as defaults to any runner.

fn
_var_names = (None,)
_fn_args
_var_dims
_var_coords
_constants
_resources
_attrs
_last_ds = None
default_runner_settings
__call__(*args, **kwargs)[source]
_get_fn_args()[source]
_set_fn_args(fn_args)[source]
_del_fn_args()[source]
fn_args
_get_var_names()[source]
_set_var_names(var_names)[source]
_del_var_names()[source]
var_names
_get_var_dims()[source]
_set_var_dims(var_dims, var_names=None)[source]
_del_var_dims()[source]
var_dims
_get_var_coords()[source]
_set_var_coords(var_coords)[source]
_del_var_coords()[source]
var_coords
_get_constants()[source]
_set_constants(constants)[source]
_del_constants()[source]
constants
_get_resources()[source]
_set_resources(resources)[source]
_del_resources()[source]
resources
property last_ds
run_combos(combos, constants=(), **runner_settings)[source]

Run combos using the function map and save to dataset.

Parameters:
  • combos (dict_like[str, iterable]) – The values of each function argument with which to evaluate all combinations.

  • constants (dict, optional) – Extra constant arguments for this run. Repeated arguments take precedence over stored constants, but for this run only.

  • runner_settings – Keyword arguments supplied to combo_runner().
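
As a rough illustration of what "all combinations" means here, the following standard-library sketch expands a combos mapping into one keyword-argument dict per setting. It is not xyzpy's implementation: the helper name expand_combos and its stored_constants / run_constants arguments are hypothetical, chosen only to mirror the precedence rule described above.

```python
import itertools


def expand_combos(combos, stored_constants=None, run_constants=None):
    """Yield one keyword-argument dict per combination of values.

    Hypothetical helper for illustration only, not part of xyzpy.
    """
    # Constants passed for this run take precedence over stored ones.
    constants = {**(stored_constants or {}), **(run_constants or {})}
    names = list(combos)
    # The Cartesian product visits every combination of the supplied values.
    for values in itertools.product(*(combos[name] for name in names)):
        yield {**constants, **dict(zip(names, values))}


settings = list(
    expand_combos(
        {"x": [1, 2], "y": [10, 20, 30]},
        stored_constants={"scale": 1.0},
        run_constants={"scale": 2.0},
    )
)
print(len(settings))  # 2 * 3 combinations
```

Each resulting dict corresponds to one call of the labelled function.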

run_cases(cases, constants=(), fn_args=None, **runner_settings)[source]

Run cases using the function and save to dataset.

Parameters:
  • cases (sequence of mappings or tuples) – A sequence of cases.

  • constants (dict, optional) – Extra constant arguments for this run. Repeated arguments take precedence over stored constants, but for this run only.

  • runner_settings – Supplied to case_runner().
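
Unlike combos, cases are explicit individual settings. The sketch below shows one way tuple cases could be matched against fn_args to give the same dict form as mapping cases. It is a standard-library illustration under that assumption, and normalise_cases is a hypothetical name, not an xyzpy function.

```python
def normalise_cases(cases, fn_args=None):
    """Convert every case into a dict of keyword arguments.

    Hypothetical helper for illustration only, not part of xyzpy.
    """
    normalised = []
    for case in cases:
        if isinstance(case, dict):
            # Mapping cases already name their arguments.
            normalised.append(dict(case))
        else:
            # Tuple cases are positional, so the argument names are required.
            if fn_args is None:
                raise ValueError("fn_args is needed to interpret tuple cases")
            normalised.append(dict(zip(fn_args, case)))
    return normalised


cases = normalise_cases([(1, 10), {"x": 2, "y": 30}], fn_args=("x", "y"))
print(cases)
```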

Crop(name=None, parent_dir=None, save_fn=None, batchsize=None, num_batches=None)[source]

Return a Crop instance with this runner, from which fn will be set, and then combos can be sown, grown, and reaped into the Runner.last_ds. See Crop.

Return type:

Crop

__repr__()[source]
xyzpy.gen.farming.label(var_names, fn_args=None, var_dims=None, var_coords=None, constants=None, resources=None, attrs=None, harvester=False, sampler=False, engine=None, **default_runner_settings)[source]

Convenient decorator to automatically wrap a function as a Runner or Harvester.

Parameters:
  • var_names (str, sequence of str, or None) – The ordered name(s) of the output variable(s) of fn. Set this explicitly to None if fn outputs already labelled data as a dict, Dataset, or DataArray.

  • fn_args (str, or sequence of str, optional) – The ordered name(s) of the input argument(s) of fn. This is only needed if the cases or combos supplied are not dict-like.

  • var_dims (dict-like, optional) – Mapping of output variables to their named internal dimensions; these can be the names of constants.

  • var_coords (dict-like, optional) – Mapping of the output variables’ named internal dimensions to the actual values they take.

  • constants (dict-like, optional) – Constant arguments to be supplied to fn. These can be used as ‘var_dims’, and will be saved as coords if so, otherwise as attributes.

  • resources (dict-like, optional) – Like constants but not saved to the dataset, e.g. if very big.

  • attrs (dict-like, optional) – Any other miscellaneous information to be saved with the dataset.

  • harvester (bool or str, optional) – If True, wrap the runner as a Harvester; if a string, create the harvester with that as the data_name.

  • default_runner_settings – These keyword arguments will be supplied as defaults to any runner.

Examples

Declare a function as a runner directly:

>>> import xyzpy as xyz

>>> @xyz.label(var_names=['sum', 'diff'])
... def foo(x, y):
...     return x + y, x - y
...

>>> foo
<xyzpy.Runner>
    fn: <function foo at 0x7f1fd8e5b1e0>
    fn_args: ('x', 'y')
    var_names: ('sum', 'diff')
    var_dims: {'sum': (), 'diff': ()}

>>> foo(1, 2)  # can still call it normally
(3, -1)
class xyzpy.gen.farming.Harvester(runner: Runner, data_name=None, chunks=None, engine='h5netcdf', full_ds=None)[source]

Bases: object

Container class for collecting and aggregating data to disk.

Parameters:
  • runner (Runner) – Performs the runs and describes the results.

  • data_name (str, optional) – Base file path to save data to.

  • chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded and merged into with on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.

  • full_ds (xarray.Dataset) – Initialize the Harvester with this dataset as the initial full dataset.

Members:
  • full_ds (xarray.Dataset) – Dataset containing all data harvested so far, by default synced to disk.

  • last_ds (xarray.Dataset) – Dataset containing just the data from the last harvesting run.

runner
data_name = None
engine = 'h5netcdf'
chunks = None
_full_ds = None
property fn
__call__(*args, **kwargs)[source]
property last_ds

Dataset containing the last run’s data.

load_full_ds(chunks=None, engine=None)[source]

Load the disk dataset into full_ds.

Parameters:
  • chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded and merged into with on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.

property full_ds

Dataset containing all saved runs.

save_full_ds(new_full_ds=None, engine=None)[source]

Save full_ds onto disk. The old file is moved and kept as a backup in case of errors when writing the new dataset to disk.

Parameters:
  • new_full_ds (xarray.Dataset, optional) – Save this dataset as the new full dataset, else use the current full dataset.

  • engine (str, optional) – Engine to use to save and load datasets.

delete_ds(backup=False)[source]

Delete the on-disk dataset, optionally backing it up first.

add_ds(new_ds, sync=True, overwrite=None, chunks=None, engine=None)[source]

Merge a new dataset into the in-memory full dataset.

Parameters:
  • new_ds (xr.Dataset or xr.DataArray) – Data to be merged into the full dataset.

  • sync (bool, optional) – If True (default), load and save the disk dataset before and after merging in the new data.

  • overwrite ({None, False, True}, optional) –

    How to combine data from the new run into the current full_ds:

    • None (default): attempt the merge and only raise if data conflicts.

    • True: overwrite conflicting current data with that from the new dataset.

    • False: drop any conflicting data from the new dataset.

  • chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded and merged into with on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.
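
The three overwrite modes can be pictured with plain dictionaries standing in for datasets. This is only a conceptual sketch: the real merge operates on xarray objects, and the helper merge_results and the use of coordinate tuples as keys are assumptions made for illustration.

```python
def merge_results(full, new, overwrite=None):
    """Merge `new` into a copy of `full`, keyed by coordinate tuples.

    Hypothetical stand-in for the dataset merge, for illustration only.
    """
    merged = dict(full)
    for key, value in new.items():
        if key in merged and merged[key] != value:
            if overwrite is None:
                # Default: a genuine conflict is an error.
                raise ValueError(f"conflicting data at {key!r}")
            if overwrite is False:
                # Keep the existing value and drop the new one.
                continue
        # No conflict, or overwrite=True: take the new value.
        merged[key] = value
    return merged


full = {(1,): 1.0}
new = {(1,): 2.0, (2,): 3.0}

kept_new = merge_results(full, new, overwrite=True)
kept_old = merge_results(full, new, overwrite=False)
print(kept_new, kept_old)
```

With overwrite=None the same call raises, because the two inputs disagree at the key (1,).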

expand_dims(name, value, engine=None)[source]

Add a new coordinate dimension with name and value. The change is immediately synced with the on-disk dataset. Useful if you want to expand the parameter space along a previously constant argument.

drop_sel(labels=None, *, errors='raise', engine=None, **labels_kwargs)[source]

Drop specific values of coordinates from this harvester and its dataset. See http://xarray.pydata.org/en/latest/generated/xarray.Dataset.drop_sel.html. The change is immediately synced with the on-disk dataset. Useful for tidying unneeded data points.

_maybe_expand_combos(combos)[source]

Expand combos with ellipses into full coordinate values from the current full dataset.

harvest_combos(combos, *, cases=None, missing_only=False, sync=True, overwrite=None, chunks=None, engine=None, **runner_settings)[source]

Run combos, automatically merging into an on-disk dataset.

Parameters:
  • combos (dict_like[str, iterable]) – The combos to run. The only difference here is that you can supply an ellipsis ..., meaning that all values for that coordinate will be loaded from the current full dataset.

  • missing_only (bool, optional) – If True, only run combos that are not already present in the on-disk dataset.

  • sync (bool, optional) – If True (default), load and save the disk dataset before and after merging in the new data.

  • overwrite ({None, False, True}, optional) –

    • None (default): attempt the merge and only raise if data conflicts.

    • True: overwrite any conflicting current data with that from the new dataset.

    • False: drop any conflicting data from the new dataset.

  • chunks (bool, optional) – If not None, passed to xarray so that the full dataset is loaded and merged into with on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.

  • runner_settings – Supplied to combo_runner().
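
The ellipsis shorthand for combos can be sketched as follows, with a plain dict of coordinate values standing in for the current full dataset. The helper expand_ellipsis and the existing_coords mapping are hypothetical names used only for this illustration.

```python
def expand_ellipsis(combos, existing_coords):
    """Replace any `...` value with the coordinate values already stored.

    Hypothetical helper for illustration only, not part of xyzpy.
    """
    return {
        name: list(existing_coords[name]) if values is Ellipsis else list(values)
        for name, values in combos.items()
    }


# Coordinates assumed to be present in the current full dataset.
existing_coords = {"x": [1, 2, 3], "y": [10, 20]}

# Run a new value of y for every x that has already been harvested.
combos = expand_ellipsis({"x": ..., "y": [30]}, existing_coords)
print(combos)
```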

harvest_cases(cases, *, sync=True, overwrite=None, chunks=None, engine=None, **runner_settings)[source]

Run cases, automatically merging into an on-disk dataset.

Parameters:
  • cases (list of dict or tuple) – The cases to run.

  • sync (bool, optional) – If True (default), load and save the disk dataset before and after merging in the new data.

  • overwrite ({None, False, True}, optional) –

    What to do regarding clashes with old data:

    • None (default): attempt the merge and only raise if data conflicts.

    • True: overwrite conflicting current data with that from the new dataset.

    • False: drop any conflicting data from the new dataset.

  • chunks (bool, optional) – If not None, passed to xarray so that the full dataset is loaded and merged into with on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.

  • runner_settings – Supplied to case_runner().

Crop(name=None, parent_dir=None, save_fn=None, batchsize=None, num_batches=None)[source]

Return a Crop instance with this Harvester, from which fn will be set, and then combos can be sown, grown, and reaped into the Harvester.full_ds. See Crop.

Return type:

Crop

__repr__()[source]
cultivate(combos=None, cases=None, constants=None, name=None, parent_dir=None, batchsize=None, num_batches=None, missing_only=True, shuffle=True, subprocess='auto', num_workers=None, num_threads=None, gpus=None, affinities=None, log=False, raise_errors=True, verbosity=1, on_existing='ask', on_error='ask', clean_up=None, **grow_kwargs)[source]

Convenience method to run a full cycle of parsing combos into missing cases only, then persistently growing those cases, and finally merging the results into the full dataset.

Parameters:
  • combos (dict_like[str, iterable]) – The combos to run. The only difference here is that you can supply an ellipsis ..., meaning that all values for that coordinate will be loaded from the current full dataset.

  • cases (sequence of mappings or tuples, optional) – A sequence of (partial) individual settings to run. For each case, all settings given by combos will be generated.

  • constants (dict, optional) – Extra constant arguments for this run.

  • name (str, optional) – Name for the crop to be used for on-disk storage of batches, results and logs. You can use different names to grow results for the same dataset concurrently.

  • parent_dir (str, optional) – Parent directory in which to create the crop folder (.xyz-{name}/). Defaults to the current working directory.

  • batchsize (int, optional) – If given, the target number of cases to sow in each batch. This is computed from num_batches if not given and 1 if neither given.

  • num_batches (int, optional) – If given, the target number of batches to sow. This is computed from batchsize if not given and 1 if neither given.

  • missing_only (bool, optional) – If True (default), only run cases that are not already present in the on-disk dataset. If False, the new results will overwrite any existing results.

  • shuffle (bool, optional) – If True (default), shuffle the order of cases before sowing and growing. This can be a useful basic form of load balancing.

  • subprocess ("auto" or bool, optional) – Whether to grow each batch in a fresh subprocess. This adds about 1 second of overhead per batch, but allows the number of threads, cpu affinity and gpu assignment to be controlled. If “auto” (default) subprocesses are used when num_threads, gpus or affinities are specified. See xyzpy.Crop.grow() for details.

  • num_workers (int, optional) – Maximum number of batches to run concurrently. In subprocess mode this caps simultaneous subprocesses (defaults to 1 if not given). In in-process mode this is the joblib loky pool size (None = serial). Forwarded to xyzpy.Crop.grow().

  • num_threads (int, optional) – Number of threads each worker is allowed to use, applied via the standard env vars (OMP_NUM_THREADS, MKL_NUM_THREADS, etc.) in each subprocess. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().

  • gpus (int, str, or sequence of int, optional) – GPU device IDs to assign to subprocesses via CUDA_VISIBLE_DEVICES; the pool also caps concurrency. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().

  • affinities (int, str, or sequence of int, optional) – CPU core IDs to pin subprocesses to via taskset; the pool also caps concurrency. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().

  • log (bool, optional) – Whether to save subprocess stdout and stderr to files in the crop directory under logs/batch-{batch_id}.log. Subprocess-mode only. Forwarded to xyzpy.Crop.grow().

  • raise_errors (bool, optional) – If True (default), raise any errors that occur during growing, otherwise just log them and continue with the next batch.

  • verbosity (int, optional) – The level of logging to print during the sow/grow/reap process. 0: no output, 1: progress bars, 2: progress bars with current setting postfixed.

  • on_existing ({'ask', 'reap', 'delete', 'skip', 'raise'}, optional) – What to do if a crop with the same name already exists on disk. Default is 'ask' (interactive prompt).

  • on_error ({'ask', 'reap', 'delete', 'skip', 'raise'}, optional) – What to do if an error occurs during grow/reap. Default is 'ask' (interactive prompt).

  • clean_up (bool or None, optional) – Whether to delete the on-disk batch, result and log files after successfully reaping.

  • grow_kwargs – Further keyword arguments forwarded to xyzpy.Crop.grow() (e.g. executor, min_wait, …).
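
The "auto" rule for subprocess described above can be summarised in a few lines. This sketch only restates the documented rule. The function resolve_subprocess is hypothetical, and treating an explicit True or False as final is an assumption rather than confirmed behaviour.

```python
def resolve_subprocess(subprocess="auto", num_threads=None, gpus=None, affinities=None):
    """Decide whether batches would be grown in fresh subprocesses.

    Hypothetical helper for illustration only, not part of xyzpy.
    """
    if subprocess == "auto":
        # Any resource-control option implies subprocess mode.
        return any(opt is not None for opt in (num_threads, gpus, affinities))
    # Assumption: an explicit True/False is respected as given.
    return bool(subprocess)


in_process = resolve_subprocess()
with_threads = resolve_subprocess(num_threads=4)
with_gpus = resolve_subprocess(gpus=[0, 1])
print(in_process, with_threads, with_gpus)
```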

xyzpy.gen.farming.cultivate(fn, *, var_names=None, data_name=None, runner_opts=None, harvester_opts=None, combos=None, cases=None, constants=None, name=None, parent_dir=None, batchsize=None, num_batches=None, missing_only=True, shuffle=True, subprocess='auto', num_workers=None, num_threads=None, gpus=None, affinities=None, log=False, raise_errors=True, verbosity=1, on_existing='ask', on_error='ask', clean_up=None, **grow_kwargs)[source]

Convenience function to run a full cycle of annotating a function, parsing combos into missing cases only, then persistently growing those cases, and finally merging the results into the full dataset.

Parameters:
  • fn (callable) – The function to run over combos and cases. This will be wrapped in a Runner and Harvester to perform the cultivation process. If var_names is None, it should return a dict, Dataset or DataArray.

  • var_names (str, sequence of str, or None) – The ordered name(s) of the ouput variable(s) of fn. Set this explicitly to None if fn outputs already labelled data as a dict, Dataset, or DataArray.

  • data_name (str, optional) – If given, the on-disk file to sync results with. If not set there will be no persistent results, since the harvester created in this functional interface is ephemeral.

  • runner_opts (dict, optional) – Keyword arguments to be supplied to Runner.

  • harvester_opts (dict, optional) – Keyword arguments to be supplied to Harvester.

  • combos (dict_like[str, iterable]) – The combos to run. The only difference here is that you can supply an ellipsis ..., meaning that all values for that coordinate will be loaded from the current full dataset.

  • cases (sequence of mappings or tuples, optional) – A sequence of (partial) individual settings to run. For each case, all settings given by combos will be generated.

  • constants (dict, optional) – Extra constant arguments for this run.

  • name (str, optional) – Name for the crop to be used for on-disk storage of batches, results and logs. You can use different names to grow results for the same dataset concurrently.

  • parent_dir (str, optional) – Parent directory in which to create the crop folder (.xyz-{name}/). Defaults to the current working directory.

  • batchsize (int, optional) – If given, the target number of cases to sow in each batch. This is computed from num_batches if not given and 1 if neither given.

  • num_batches (int, optional) – If given, the target number of batches to sow. This is computed from batchsize if not given and 1 if neither given.

  • missing_only (bool, optional) – If True (default), only run cases that are not already present in the on-disk dataset.

  • shuffle (bool, optional) – If True (default), shuffle the order of cases before sowing and growing. This can be a useful basic form of load balancing.

  • subprocess ("auto" or bool, optional) – Whether to grow each batch in a fresh subprocess. This adds about 1 second of overhead per batch, but allows the number of threads, cpu affinity and gpu assignment to be controlled. If “auto” (default) subprocesses are used when num_threads, gpus or affinities are specified. See xyzpy.Crop.grow() for details.

  • num_workers (int, optional) – Maximum number of batches to run concurrently. In subprocess mode this caps simultaneous subprocesses (defaults to 1 if not given). In in-process mode this is the joblib loky pool size (None = serial). Forwarded to xyzpy.Crop.grow().

  • num_threads (int, optional) – Number of threads each worker is allowed to use, applied via the standard env vars (OMP_NUM_THREADS, MKL_NUM_THREADS, etc.) in each subprocess. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().

  • gpus (int, str, or sequence of int, optional) – GPU device IDs to assign to subprocesses via CUDA_VISIBLE_DEVICES; the pool also caps concurrency. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().

  • affinities (int, str, or sequence of int, optional) – CPU core IDs to pin subprocesses to via taskset; the pool also caps concurrency. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().

  • log (bool, optional) – Whether to save subprocess stdout and stderr to files in the crop directory under logs/batch-{batch_id}.log. Subprocess-mode only. Forwarded to xyzpy.Crop.grow().

  • raise_errors (bool, optional) – If True (default), raise any errors that occur during growing, otherwise just log them and continue with the next batch.

  • verbosity (int, optional) – The level of logging to print during the sow/grow/reap process. 0: no output, 1: progress bars, 2: progress bars with current setting postfixed.

  • on_existing ({'ask', 'reap', 'delete', 'skip', 'raise'}, optional) – What to do if a crop with the same name already exists on disk. Default is 'ask' (interactive prompt).

  • on_error ({'ask', 'reap', 'delete', 'skip', 'raise'}, optional) – What to do if an error occurs during grow/reap. Default is 'ask' (interactive prompt).

  • clean_up (bool or None, optional) – Whether to delete the on-disk batch, result and log files after successfully reaping.

  • grow_kwargs – Further keyword arguments forwarded to xyzpy.Crop.grow() (e.g. executor, min_wait, …).
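
The first step of the cycle, parsing combos into missing cases only, might look roughly like this when completed results are represented as a list of argument dicts. It is a conceptual sketch using only the standard library. missing_cases is a hypothetical helper, and the real check is made against the on-disk dataset.

```python
import itertools


def missing_cases(combos, completed):
    """Return the combinations of `combos` not found in `completed`.

    Hypothetical helper for illustration only, not part of xyzpy.
    """
    names = list(combos)
    todo = []
    for values in itertools.product(*(combos[name] for name in names)):
        case = dict(zip(names, values))
        # Only keep settings that have no stored result yet.
        if case not in completed:
            todo.append(case)
    return todo


completed = [{"x": 1, "y": 10}, {"x": 2, "y": 10}]
todo = missing_cases({"x": [1, 2], "y": [10, 20]}, completed)
print(todo)
```

Only the remaining cases would then be sown into batches, grown, and merged back into the full dataset.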

class xyzpy.gen.farming.Sampler(runner, data_name=None, default_combos=None, full_df=None, engine='pickle')[source]

Like a Harvester, but randomly samples combos and writes the table of results to a pandas.DataFrame.

Parameters:
  • runner (xyzpy.Runner) – Runner describing a labelled function to run.

  • data_name (str, optional) – If given, the on-disk file to sync results with.

  • default_combos (dict_like[str, iterable], optional) – The default combos to sample from (which can be overridden).

  • full_df (pandas.DataFrame, optional) – If given, use this dataframe as the initial ‘full’ data.

  • engine ({'pickle', 'csv', 'json', 'hdf', ...}, optional) – How to save and load the on-disk dataframe. See load_df() and save_df().

full_df

Dataframe describing all data harvested so far.

Type:

pandas.DataFrame

last_df

Dataframe describing the data harvested on the previous run.

Type:

pandas.DataFrame

runner
data_name = None
default_combos
_full_df = None
_last_df = None
engine = 'pickle'
property fn
load_full_df(engine=None)[source]

Load the on-disk full dataframe into memory.

property full_df

The dataframe describing all data harvested so far.

property last_df

The dataframe describing the last set of data harvested.

save_full_df(new_full_df=None, engine=None)[source]

Save full_df onto disk.

Parameters:
  • new_full_df (pandas.DataFrame, optional) – Save this dataframe as the new full dataframe, else use the current full_df.

  • engine (str, optional) – Which engine to save the dataframe with, if None use the default.

delete_df(backup=False)[source]

Delete the on-disk dataframe, optionally backing it up first.

add_df(new_df, sync=True, engine=None)[source]

Merge a new dataframe into the in-memory full dataframe.

Parameters:
  • new_df (pandas.DataFrame or dict) – Data to be appended to the full dataframe.

  • sync (bool, optional) – If True (default), load and save the disk dataframe before and after merging in the new data.

  • engine (str, optional) – Which engine to save the dataframe with.

gen_cases_fnargs(n, combos=None)[source]
sample_combos(n, combos=None, engine=None, **case_runner_settings)[source]

Sample the target function many times, randomly choosing parameter combinations from combos (or Sampler.default_combos).

Parameters:
  • n (int) – How many samples to run.

  • combos (dict_like[str, iterable], optional) – A mapping of function arguments to potential choices. Any keys in here will override default_combos. You can also supply a callable to manually return a random choice e.g. from a probability distribution.

  • engine (str, optional) – Which method to use to sync with the on-disk dataframe.

  • case_runner_settings – Supplied to case_runner() and so onto combo_runner(). This includes parallel=True etc.
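
The sampling rule described above (override default_combos, then pick each argument at random or call a supplied callable) can be sketched with the standard library alone. The helper sample_cases and its seed argument are hypothetical, and calling the callable with no arguments is an assumption made for this illustration.

```python
import random


def sample_cases(n, combos=None, default_combos=None, seed=None):
    """Draw `n` random settings, one value per function argument.

    Hypothetical helper for illustration only, not part of xyzpy.
    """
    rng = random.Random(seed)
    # Keys given in `combos` override those in `default_combos`.
    merged = {**(default_combos or {}), **(combos or {})}
    cases = []
    for _ in range(n):
        case = {}
        for name, choices in merged.items():
            if callable(choices):
                # Assumption: a callable takes no arguments and returns a value.
                case[name] = choices()
            else:
                case[name] = rng.choice(list(choices))
        cases.append(case)
    return cases


cases = sample_cases(
    5,
    combos={"y": lambda: 0.5},
    default_combos={"x": [1, 2, 3], "y": [10, 20]},
    seed=0,
)
print(cases)
```

Each sampled dict would correspond to one row in the resulting table of results.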

Crop(name=None, parent_dir=None, save_fn=None, batchsize=None, num_batches=None)[source]

Return a Crop instance with this Sampler, from which fn will be set, and then samples can be sown, grown, and reaped into the Sampler.full_df. See Crop.

Return type:

Crop

__repr__()[source]