xyzpy.gen.farming#

Objects for labelling and successively running functions.

Functions

label(var_names[, fn_args, var_dims, ...])

Convenient decorator to automatically wrap a function as a Runner or Harvester.

Classes

Harvester(runner[, data_name, chunks, ...])

Container class for collecting and aggregating data to disk.

Runner(fn, var_names[, fn_args, var_dims, ...])

Container class with all the information needed to systematically run a function over many parameters and capture the output in a dataset.

Sampler(runner[, data_name, default_combos, ...])

Like a Harvester, but randomly samples combos and writes the table of results to a pandas.DataFrame.

class xyzpy.gen.farming.Harvester(runner, data_name=None, chunks=None, engine='h5netcdf', full_ds=None)[source]#

Container class for collecting and aggregating data to disk.

Parameters
  • runner (Runner) – Performs the runs and describes the results.

  • data_name (str, optional) – Base file path to save data to.

  • chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded and merged using on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.

  • full_ds (xarray.Dataset) – Initialize the Harvester with this dataset as the initial full dataset.

Members
  • full_ds (xarray.Dataset) – Dataset containing all data harvested so far, by default synced to disk.

  • last_ds (xarray.Dataset) – Dataset containing just the data from the last harvesting run.

Crop(name=None, parent_dir=None, save_fn=None, batchsize=None, num_batches=None)[source]#

Return a Crop instance with this Harvester, from which fn will be set, and then combos can be sown, grown, and reaped into the Harvester.full_ds. See Crop.

Return type

Crop

add_ds(new_ds, sync=True, overwrite=None, chunks=None, engine=None)[source]#

Merge a new dataset into the in-memory full dataset.

Parameters
  • new_ds (xr.Dataset or xr.DataArray) – Data to be merged into the full dataset.

  • sync (bool, optional) – If True (default), load and save the disk dataset before and after merging in the new data.

  • overwrite ({None, False, True}, optional) –

    How to combine data from the new run into the current full_ds:

    • None (default): attempt the merge and only raise if data conflicts.

    • True: overwrite conflicting current data with that from the new dataset.

    • False: drop any conflicting data from the new dataset.

  • chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded and merged using on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.
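
The three overwrite modes can be pictured with a plain-dict sketch. This is purely an illustration of the merge semantics described above, not xyzpy's actual implementation, which operates on xarray datasets:

```python
def merge(current, new, overwrite=None):
    """Illustrate the overwrite modes of merging new data into old."""
    merged = dict(current)
    for key, value in new.items():
        if key in current and current[key] != value:
            if overwrite is None:
                # default: only conflicting data raises
                raise ValueError(f"conflict on {key!r}")
            elif overwrite is False:
                # drop the conflicting value from the new data
                continue
            # overwrite=True: fall through and take the new value
        merged[key] = value
    return merged

current = {('x', 1): 0.5, ('x', 2): 0.7}
new = {('x', 2): 0.9, ('x', 3): 1.1}

print(merge(current, new, overwrite=True))   # new value wins at ('x', 2)
print(merge(current, new, overwrite=False))  # old value kept at ('x', 2)
```

Non-conflicting keys (here `('x', 3)`) are merged in under every mode; only genuinely clashing coordinates trigger the `overwrite` behavior.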

delete_ds(backup=False)[source]#

Delete the on-disk dataset, optionally backing it up first.

drop_sel(labels=None, *, errors='raise', engine=None, **labels_kwargs)[source]#

Drop specific values of coordinates from this harvester and its dataset. See http://xarray.pydata.org/en/latest/generated/xarray.Dataset.drop_sel.html. The change is immediately synced with the on-disk dataset. Useful for tidying unneeded data points.

expand_dims(name, value, engine=None)[source]#

Add a new coordinate dimension with name and value. The change is immediately synced with the on-disk dataset. Useful if you want to expand the parameter space along a previously constant argument.

property full_ds#

Dataset containing all saved runs.

harvest_cases(cases, *, sync=True, overwrite=None, chunks=None, engine=None, **runner_settings)[source]#

Run cases, automatically merging into an on-disk dataset.

Parameters
  • cases (list of dict or tuple) – The cases to run.

  • sync (bool, optional) – If True (default), load and save the disk dataset before and after merging in the new data.

  • overwrite ({None, False, True}, optional) –

    What to do regarding clashes with old data:

    • None (default): attempt the merge and only raise if data conflicts.

    • True: overwrite conflicting current data with that from the new dataset.

    • False: drop any conflicting data from the new dataset.

  • chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded and merged using on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.

  • runner_settings – Supplied to case_runner().

harvest_combos(combos, *, sync=True, overwrite=None, chunks=None, engine=None, **runner_settings)[source]#

Run combos, automatically merging into an on-disk dataset.

Parameters
  • combos (dict_like[str, iterable]) – The combos to run. The only difference here is that you can supply an ellipsis ..., meaning all values for that coordinate will be loaded from the current full dataset.

  • sync (bool, optional) – If True (default), load and save the disk dataset before and after merging in the new data.

  • overwrite ({None, False, True}, optional) –

    • None (default): attempt the merge and only raise if data conflicts.

    • True: overwrite conflicting current data with that from the new dataset.

    • False: drop any conflicting data from the new dataset.

  • chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded and merged using on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.

  • runner_settings – Supplied to combo_runner().
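
Conceptually, a combos mapping is expanded into the outer product of its values, with each combination becoming one case to run. A minimal stdlib sketch of that expansion (illustrative only, not the combo_runner internals):

```python
from itertools import product

def expand_combos(combos):
    """Expand a combos mapping into a list of case dicts (outer product)."""
    names = list(combos)
    return [dict(zip(names, values)) for values in product(*combos.values())]

combos = {'x': [1, 2, 3], 'y': ['a', 'b']}
cases = expand_combos(combos)
print(len(cases))  # 3 * 2 = 6 cases
print(cases[0])    # {'x': 1, 'y': 'a'}
```

This outer-product structure is what makes the results naturally fit an N-dimensional labelled dataset, with one dimension per combo key.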

property last_ds#

Dataset containing the last run’s data.

load_full_ds(chunks=None, engine=None)[source]#

Load the disk dataset into full_ds.

Parameters
  • chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded and merged using on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.

save_full_ds(new_full_ds=None, engine=None)[source]#

Save full_ds onto disk.

Parameters
  • new_full_ds (xarray.Dataset, optional) – Save this dataset as the new full dataset, else use the current full dataset.

  • engine (str, optional) – Engine to use to save and load datasets.

class xyzpy.gen.farming.Runner(fn, var_names, fn_args=None, var_dims=None, var_coords=None, constants=None, resources=None, attrs=None, **default_runner_settings)[source]#

Container class with all the information needed to systematically run a function over many parameters and capture the output in a dataset.

Parameters
  • fn (callable) – Function that produces a single instance of a result.

  • var_names (str, sequence of str, or None) – The ordered name(s) of the output variable(s) of fn. Set this explicitly to None if fn outputs already labelled data as a Dataset or DataArray.

  • fn_args (str, or sequence of str, optional) – The ordered name(s) of the input argument(s) of fn. This is only needed if the cases or combos supplied are not dict-like.

  • var_dims (dict-like, optional) – Mapping of output variables to their named internal dimensions, can be the names of constants.

  • var_coords (dict-like, optional) – Mapping of output variables’ named internal dimensions to the actual values they take.

  • constants (dict-like, optional) – Constant arguments to be supplied to fn. These can be used as ‘var_dims’, and will be saved as coords if so, otherwise as attributes.

  • resources (dict-like, optional) – Like constants but not saved to the dataset, e.g. if very big.

  • attrs (dict-like, optional) – Any other miscellaneous information to be saved with the dataset.

  • default_runner_settings – These keyword arguments will be supplied as defaults to any runner.
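
The relationship between fn’s return value and var_names can be sketched as follows: each positional output is paired with the corresponding name. This is a simplified illustration of what a Runner records, not its actual internals:

```python
def label_outputs(fn, var_names, **kwargs):
    """Call fn and pair each positional output with its variable name."""
    results = fn(**kwargs)
    if len(var_names) == 1:
        # a single-output function returns a bare value, not a tuple
        results = (results,)
    return dict(zip(var_names, results))

def foo(x, y):
    return x + y, x - y

print(label_outputs(foo, ['sum', 'diff'], x=3, y=1))  # {'sum': 4, 'diff': 2}
```

In the real Runner these labelled outputs additionally pick up the dimensions and coordinates declared in var_dims and var_coords when assembled into the dataset.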

Crop(name=None, parent_dir=None, save_fn=None, batchsize=None, num_batches=None)[source]#

Return a Crop instance with this runner, from which fn will be set, and then combos can be sown, grown, and reaped into the Runner.last_ds. See Crop.

Return type

Crop

property constants#

Mapping of constant arguments supplied to the Runner’s function.

property fn_args#

List of the names of the arguments that the Runner’s function takes.

property resources#

Mapping of constant arguments supplied to the Runner’s function that are not saved with the dataset.

run_cases(cases, constants=(), fn_args=None, **runner_settings)[source]#

Run cases using the function and save to dataset.

Parameters
  • cases (sequence of mappings or tuples) – A sequence of cases.

  • constants (dict, optional) – Extra constant arguments for this run; repeated arguments take precedence over stored constants, but only for this run.

  • runner_settings – Supplied to case_runner().
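
Cases can be given as mappings or, when fn_args was supplied, as plain tuples; the two forms describe the same runs. A stdlib sketch of that normalization (illustrative only):

```python
def normalize_cases(cases, fn_args=None):
    """Convert tuple-style cases into dict-style ones using fn_args."""
    out = []
    for case in cases:
        if not isinstance(case, dict):
            if fn_args is None:
                raise ValueError("tuple-style cases require fn_args")
            # zip positional values with the declared argument names
            case = dict(zip(fn_args, case))
        out.append(case)
    return out

# dict-style and tuple-style cases are equivalent:
print(normalize_cases([{'x': 1, 'y': 2}]))
print(normalize_cases([(1, 2)], fn_args=('x', 'y')))
```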

run_combos(combos, constants=(), **runner_settings)[source]#

Run combos using the function map and save to dataset.

Parameters
  • combos (dict_like[str, iterable]) – The values of each function argument with which to evaluate all combinations.

  • constants (dict, optional) – Extra constant arguments for this run; repeated arguments take precedence over stored constants, but only for this run.

  • runner_settings – Keyword arguments supplied to combo_runner().

property var_coords#

Mapping of each variable named dimension to its coordinate values.

property var_dims#

Mapping of each output variable to its named dimensions.

property var_names#

List of the names of the variables that the Runner’s function produces.

class xyzpy.gen.farming.Sampler(runner, data_name=None, default_combos=None, full_df=None, engine='pickle')[source]#

Like a Harvester, but randomly samples combos and writes the table of results to a pandas.DataFrame.

Parameters
  • runner (xyzpy.Runner) – Runner describing a labelled function to run.

  • data_name (str, optional) – If given, the on-disk file to sync results with.

  • default_combos (dict_like[str, iterable], optional) – The default combos to sample from (which can be overridden).

  • full_df (pandas.DataFrame, optional) – If given, use this dataframe as the initial ‘full’ data.

  • engine ({'pickle', 'csv', 'json', 'hdf', ...}, optional) – How to save and load the on-disk dataframe. See load_df() and save_df().

full_df#

Dataframe describing all data harvested so far.

Type

pandas.DataFrame

last_df#

Dataframe describing the data harvested on the previous run.

Type

pandas.DataFrame

Crop(name=None, parent_dir=None, save_fn=None, batchsize=None, num_batches=None)[source]#

Return a Crop instance with this Sampler, from which fn will be set, and then samples can be sown, grown, and reaped into the Sampler.full_df. See Crop.

Return type

Crop

add_df(new_df, sync=True, engine=None)[source]#

Merge a new dataframe into the in-memory full dataframe.

Parameters
  • new_df (pandas.DataFrame or dict) – Data to be appended to the full dataframe.

  • sync (bool, optional) – If True (default), load and save the disk dataframe before and after merging in the new data.

  • engine (str, optional) – Which engine to save the dataframe with.

delete_df(backup=False)[source]#

Delete the on-disk dataframe, optionally backing it up first.

property full_df#

The dataframe describing all data harvested so far.

gen_cases_fnargs(n, combos=None)[source]#

property last_df#

The dataframe describing the last set of data harvested.

load_full_df(engine=None)[source]#

Load the on-disk full dataframe into memory.

sample_combos(n, combos=None, engine=None, **case_runner_settings)[source]#

Sample the target function many times, randomly choosing parameter combinations from combos (or Sampler.default_combos).

Parameters
  • n (int) – How many samples to run.

  • combos (dict_like[str, iterable], optional) – A mapping of function arguments to potential choices. Any keys in here will override default_combos. You can also supply a callable to manually return a random choice e.g. from a probability distribution.

  • engine (str, optional) – Which method to use to sync with the on-disk dataframe.

  • case_runner_settings – Supplied to case_runner() and so onto combo_runner(). This includes parallel=True etc.
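
The sampling behavior, including callable combo values, can be sketched with the stdlib alone. This illustrates the semantics only, not the Sampler internals:

```python
import random

def sample_cases(n, combos, seed=None):
    """Draw n random cases; each combo value is either a sequence to
    choose from, or a callable returning a single random value."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n):
        case = {}
        for name, choices in combos.items():
            if callable(choices):
                # a callable lets you sample from any distribution
                case[name] = choices()
            else:
                case[name] = rng.choice(list(choices))
        cases.append(case)
    return cases

combos = {
    'x': [1, 2, 3],
    'sigma': lambda: random.gauss(0.0, 1.0),  # custom distribution
}
cases = sample_cases(5, combos, seed=0)
print(len(cases))  # 5
```

Unlike harvest_combos, which runs the full outer product, each sampled case is an independent random draw, which is why the results form a flat table (a dataframe) rather than an N-dimensional dataset.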

save_full_df(new_full_df=None, engine=None)[source]#

Save full_df onto disk.

Parameters
  • new_full_df (pandas.DataFrame, optional) – Save this dataframe as the new full dataframe, else use the current full_df.

  • engine (str, optional) – Which engine to save the dataframe with, if None use the default.

xyzpy.gen.farming.label(var_names, fn_args=None, var_dims=None, var_coords=None, constants=None, resources=None, attrs=None, harvester=False, sampler=False, **default_runner_settings)[source]#

Convenient decorator to automatically wrap a function as a Runner or Harvester.

Parameters
  • var_names (str, sequence of str, or None) – The ordered name(s) of the output variable(s) of fn. Set this explicitly to None if fn outputs already labelled data as a Dataset or DataArray.

  • fn_args (str, or sequence of str, optional) – The ordered name(s) of the input argument(s) of fn. This is only needed if the cases or combos supplied are not dict-like.

  • var_dims (dict-like, optional) – Mapping of output variables to their named internal dimensions, can be the names of constants.

  • var_coords (dict-like, optional) – Mapping of output variables’ named internal dimensions to the actual values they take.

  • constants (dict-like, optional) – Constant arguments to be supplied to fn. These can be used as ‘var_dims’, and will be saved as coords if so, otherwise as attributes.

  • resources (dict-like, optional) – Like constants but not saved to the dataset, e.g. if very big.

  • attrs (dict-like, optional) – Any other miscellaneous information to be saved with the dataset.

  • harvester (bool or str, optional) – If True, wrap the runner as a Harvester; if a string, create the harvester with that string as the data_name.

  • sampler (bool or str, optional) – If True, wrap the runner as a Sampler; if a string, create the sampler with that string as the data_name.

  • default_runner_settings – These keyword arguments will be supplied as defaults to any runner.

Examples

Declare a function as a runner directly:

>>> import xyzpy as xyz

>>> @xyz.label(var_names=['sum', 'diff'])
... def foo(x, y):
...     return x + y, x - y
...

>>> foo
<xyzpy.Runner>
    fn: <function foo at 0x7f1fd8e5b1e0>
    fn_args: ('x', 'y')
    var_names: ('sum', 'diff')
    var_dims: {'sum': (), 'diff': ()}

>>> foo(1, 2)  # can still call it normally
(3, -1)