xyzpy.gen.farming#

Objects for labelling and successively running functions.

Module Contents#

Classes#

Runner

Container class with all the information needed to systematically run a function over many parameters and capture the output in a dataset.

Harvester

Container class for collecting and aggregating data to disk.

Sampler

Like a Harvester, but randomly samples combos and writes the table of results to a pandas.DataFrame.

Functions#

label(var_names[, fn_args, var_dims, var_coords, ...])

Convenient decorator to automatically wrap a function as a Runner or Harvester.

class xyzpy.gen.farming.Runner(fn, var_names, fn_args=None, var_dims=None, var_coords=None, constants=None, resources=None, attrs=None, **default_runner_settings)[source]#

Bases: object

Container class with all the information needed to systematically run a function over many parameters and capture the output in a dataset.

Parameters:
  • fn (callable) – Function that produces a single instance of a result.

  • var_names (str, sequence of str, or None) – The ordered name(s) of the output variable(s) of fn. Set this explicitly to None if fn outputs already labelled data as a Dataset or DataArray.

  • fn_args (str, or sequence of str, optional) – The ordered name(s) of the input argument(s) of fn. This is only needed if the cases or combos supplied are not dict-like.

  • var_dims (dict-like, optional) – Mapping of output variables to their named internal dimensions, which can be the names of constants.

  • var_coords (dict-like, optional) – Mapping of output variables' named internal dimensions to the actual values they take.

  • constants (dict-like, optional) – Constant arguments to be supplied to fn. These can be used in ‘var_dims’, and will be saved as coords if so, otherwise as attributes.

  • resources (dict-like, optional) – Like constants but not saved to the dataset, e.g. if very big.

  • attrs (dict-like, optional) – Any other miscellaneous information to be saved with the dataset.

  • default_runner_settings – These keyword arguments will be supplied as defaults to any runner.
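
A minimal sketch of constructing a Runner directly; the function and its argument names here are illustrative, not part of the API:

>>> import xyzpy as xyz

>>> def fn(a, b):  # hypothetical example function with two outputs
...     return a + b, a - b
...

>>> r = xyz.Runner(fn, var_names=['sum', 'diff'])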

property last_ds#
fn_args#
var_names#
var_dims#
var_coords#
constants#
resources#
__call__(*args, **kwargs)[source]#
_get_fn_args()[source]#
_set_fn_args(fn_args)[source]#
_del_fn_args()[source]#
_get_var_names()[source]#
_set_var_names(var_names)[source]#
_del_var_names()[source]#
_get_var_dims()[source]#
_set_var_dims(var_dims, var_names=None)[source]#
_del_var_dims()[source]#
_get_var_coords()[source]#
_set_var_coords(var_coords)[source]#
_del_var_coords()[source]#
_get_constants()[source]#
_set_constants(constants)[source]#
_del_constants()[source]#
_get_resources()[source]#
_set_resources(resources)[source]#
_del_resources()[source]#
run_combos(combos, constants=(), **runner_settings)[source]#

Run combos using the function map and save to dataset.

Parameters:
  • combos (dict_like[str, iterable]) – The values of each function argument with which to evaluate all combinations.

  • constants (dict, optional) – Extra constant arguments for this run; repeated arguments take precedence over stored constants, but for this run only.

  • runner_settings – Keyword arguments supplied to combo_runner().
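
A sketch continuing the illustrative runner above: combos maps each function argument to an iterable of values, and every combination is evaluated (the resulting dataset is also stored as last_ds):

>>> r.run_combos({'a': [1, 2, 3], 'b': [4, 5]})  # runs all 6 combinations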

run_cases(cases, constants=(), fn_args=None, **runner_settings)[source]#

Run cases using the function and save to dataset.

Parameters:
  • cases (sequence of mappings or tuples) – A sequence of cases.

  • constants (dict, optional) – Extra constant arguments for this run; repeated arguments take precedence over stored constants, but for this run only.

  • runner_settings – Supplied to case_runner().
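
A sketch: each case specifies a single point in parameter space, given as a mapping (or, if fn_args is set, a plain tuple):

>>> r.run_cases([{'a': 1, 'b': 4}, {'a': 3, 'b': 5}])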

Crop(name=None, parent_dir=None, save_fn=None, batchsize=None, num_batches=None)[source]#

Return a Crop instance with this runner, from which fn will be set, and then combos can be sown, grown, and reaped into the Runner.last_ds. See Crop.

Return type:

Crop
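
A rough sketch of the sow/grow/reap cycle; the method names sow_combos, grow_missing and reap are assumptions here, so see the Crop documentation for the authoritative interface:

>>> crop = r.Crop(name='example_crop')  # hypothetical crop name
>>> crop.sow_combos({'a': [1, 2, 3], 'b': [4, 5]})  # write batches to disk (assumed method)
>>> crop.grow_missing()  # run unfinished batches, possibly elsewhere (assumed method)
>>> crop.reap()  # collect results into r.last_ds (assumed method)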

__repr__()[source]#

Return repr(self).

xyzpy.gen.farming.label(var_names, fn_args=None, var_dims=None, var_coords=None, constants=None, resources=None, attrs=None, harvester=False, sampler=False, engine=None, **default_runner_settings)[source]#

Convenient decorator to automatically wrap a function as a Runner or Harvester.

Parameters:
  • var_names (str, sequence of str, or None) – The ordered name(s) of the output variable(s) of fn. Set this explicitly to None if fn outputs already labelled data as a Dataset or DataArray.

  • fn_args (str, or sequence of str, optional) – The ordered name(s) of the input argument(s) of fn. This is only needed if the cases or combos supplied are not dict-like.

  • var_dims (dict-like, optional) – Mapping of output variables to their named internal dimensions, which can be the names of constants.

  • var_coords (dict-like, optional) – Mapping of output variables' named internal dimensions to the actual values they take.

  • constants (dict-like, optional) – Constant arguments to be supplied to fn. These can be used in ‘var_dims’, and will be saved as coords if so, otherwise as attributes.

  • resources (dict-like, optional) – Like constants but not saved to the dataset, e.g. if very big.

  • attrs (dict-like, optional) – Any other miscellaneous information to be saved with the dataset.

  • harvester (bool or str, optional) – If True, wrap the runner as a Harvester; if a string, create the Harvester with that string as its data_name.

  • default_runner_settings – These keyword arguments will be supplied as defaults to any runner.

Examples

Declare a function as a runner directly:

>>> import xyzpy as xyz

>>> @xyz.label(var_names=['sum', 'diff'])
... def foo(x, y):
...     return x + y, x - y
...

>>> foo
<xyzpy.Runner>
    fn: <function foo at 0x7f1fd8e5b1e0>
    fn_args: ('x', 'y')
    var_names: ('sum', 'diff')
    var_dims: {'sum': (), 'diff': ()}

>>> foo(1, 2)  # can still call it normally
(3, -1)
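
Passing the harvester option instead yields a Harvester; a sketch, with a hypothetical data_name:

>>> @xyz.label(var_names=['sum', 'diff'], harvester='foo_data.h5')
... def foo(x, y):
...     return x + y, x - y
...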
class xyzpy.gen.farming.Harvester(runner, data_name=None, chunks=None, engine='h5netcdf', full_ds=None)[source]#

Bases: object

Container class for collecting and aggregating data to disk.

Parameters:
  • runner (Runner) – Performs the runs and describes the results.

  • data_name (str, optional) – Base file path to save data to.

  • chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded and merged using on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.

  • full_ds (xarray.Dataset) – Initialize the Harvester with this dataset as the initial full dataset.

Members:
  • full_ds (xarray.Dataset) – Dataset containing all data harvested so far, by default synced to disk.

  • last_ds (xarray.Dataset) – Dataset containing just the data from the last harvesting run.
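
A minimal sketch of wrapping the illustrative runner from above in a Harvester; the file name is hypothetical:

>>> h = xyz.Harvester(r, data_name='fn_data.h5')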

property fn#
property last_ds#

Dataset containing the last run's data.

property full_ds#

Dataset containing all saved runs.

__call__(*args, **kwargs)[source]#
load_full_ds(chunks=None, engine=None)[source]#

Load the disk dataset into full_ds.

Parameters:
  • chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded and merged using on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.

save_full_ds(new_full_ds=None, engine=None)[source]#

Save full_ds onto disk.

Parameters:
  • new_full_ds (xarray.Dataset, optional) – Save this dataset as the new full dataset, else use the current full dataset.

  • engine (str, optional) – Engine to use to save and load datasets.

delete_ds(backup=False)[source]#

Delete the on-disk dataset, optionally backing it up first.

add_ds(new_ds, sync=True, overwrite=None, chunks=None, engine=None)[source]#

Merge a new dataset into the in-memory full dataset.

Parameters:
  • new_ds (xr.Dataset or xr.DataArray) – Data to be merged into the full dataset.

  • sync (bool, optional) – If True (default), load and save the disk dataset before and after merging in the new data.

  • overwrite ({None, False, True}, optional) –

    How to combine data from the new run into the current full_ds:

    • None (default): attempt the merge and only raise if data conflicts.

    • True: overwrite conflicting current data with that from the new dataset.

    • False: drop any conflicting data from the new dataset.

  • chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded and merged using on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.

expand_dims(name, value, engine=None)[source]#

Add a new coordinate dimension with name and value. The change is immediately synced with the on-disk dataset. Useful if you want to expand the parameter space along a previously constant argument.
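
For example, to promote a previously constant argument to a full coordinate dimension (a sketch with hypothetical names):

>>> h.expand_dims('temperature', 300)  # all existing data gets temperature=300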

drop_sel(labels=None, *, errors='raise', engine=None, **labels_kwargs)[source]#

Drop specific values of coordinates from this harvester and its dataset. See http://xarray.pydata.org/en/latest/generated/xarray.Dataset.drop_sel.html. The change is immediately synced with the on-disk dataset. Useful for tidying unneeded data points.
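
A sketch mirroring xarray's drop_sel, with hypothetical coordinate values:

>>> h.drop_sel(a=[1, 2])  # drop data points where coordinate 'a' is 1 or 2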

harvest_combos(combos, *, sync=True, overwrite=None, chunks=None, engine=None, **runner_settings)[source]#

Run combos, automatically merging into an on-disk dataset.

Parameters:
  • combos (dict_like[str, iterable]) – The combos to run. The only difference here is that you can supply an ellipsis ..., meaning that all values for that coordinate will be loaded from the current full dataset.

  • sync (bool, optional) – If True (default), load and save the disk dataset before and after merging in the new data.

  • overwrite ({None, False, True}, optional) –

    • None (default): attempt the merge and only raise if data conflicts.

    • True: overwrite conflicting current data with that from the new dataset.

    • False: drop any conflicting data from the new dataset.

  • chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded and merged using on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.

  • runner_settings – Supplied to combo_runner().
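
A sketch of the ellipsis feature, with hypothetical names: 'a' takes every value already present in the full dataset, while 'b' is extended with new values:

>>> h.harvest_combos({'a': ..., 'b': [6, 7]})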

harvest_cases(cases, *, sync=True, overwrite=None, chunks=None, engine=None, **runner_settings)[source]#

Run cases, automatically merging into an on-disk dataset.

Parameters:
  • cases (list of dict or tuple) – The cases to run.

  • sync (bool, optional) – If True (default), load and save the disk dataset before and after merging in the new data.

  • overwrite ({None, False, True}, optional) –

    What to do regarding clashes with old data:

    • None (default): attempt the merge and only raise if data conflicts.

    • True: overwrite conflicting current data with that from the new dataset.

    • False: drop any conflicting data from the new dataset.

  • chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded and merged using on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.

  • runner_settings – Supplied to case_runner().
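
A sketch with hypothetical cases:

>>> h.harvest_cases([{'a': 1, 'b': 4}, {'a': 2, 'b': 5}])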

Crop(name=None, parent_dir=None, save_fn=None, batchsize=None, num_batches=None)[source]#

Return a Crop instance with this Harvester, from which fn will be set, and then combos can be sown, grown, and reaped into the Harvester.full_ds. See Crop.

Return type:

Crop

__repr__()[source]#

Return repr(self).

class xyzpy.gen.farming.Sampler(runner, data_name=None, default_combos=None, full_df=None, engine='pickle')[source]#

Like a Harvester, but randomly samples combos and writes the table of results to a pandas.DataFrame.

Parameters:
  • runner (xyzpy.Runner) – Runner describing a labelled function to run.

  • data_name (str, optional) – If given, the on-disk file to sync results with.

  • default_combos (dict_like[str, iterable], optional) – The default combos to sample from (which can be overridden).

  • full_df (pandas.DataFrame, optional) – If given, use this dataframe as the initial ‘full’ data.

  • engine ({'pickle', 'csv', 'json', 'hdf', ...}, optional) – How to save and load the on-disk dataframe. See load_df() and save_df().

full_df#

Dataframe describing all data harvested so far.

Type:

pandas.DataFrame

last_df#

Dataframe describing the data harvested on the previous run.

Type:

pandas.DataFrame
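
A minimal sketch of constructing a Sampler around the illustrative runner from above; the file name and combos are hypothetical:

>>> s = xyz.Sampler(r, data_name='fn_samples.pkl',
...                 default_combos={'a': [1, 2, 3], 'b': [4, 5]})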

property fn#
property full_df#

The dataframe describing all data harvested so far.

property last_df#

The dataframe describing the last set of data harvested.

load_full_df(engine=None)[source]#

Load the on-disk full dataframe into memory.

save_full_df(new_full_df=None, engine=None)[source]#

Save full_df onto disk.

Parameters:
  • new_full_df (pandas.DataFrame, optional) – Save this dataframe as the new full dataframe, else use the current full_df.

  • engine (str, optional) – Which engine to save the dataframe with, if None use the default.

delete_df(backup=False)[source]#

Delete the on-disk dataframe, optionally backing it up first.

add_df(new_df, sync=True, engine=None)[source]#

Merge a new dataframe into the in-memory full dataframe.

Parameters:
  • new_df (pandas.DataFrame or dict) – Data to be appended to the full dataframe.

  • sync (bool, optional) – If True (default), load and save the disk dataframe before and after merging in the new data.

  • engine (str, optional) – Which engine to save the dataframe with.

gen_cases_fnargs(n, combos=None)[source]#
sample_combos(n, combos=None, engine=None, **case_runner_settings)[source]#

Sample the target function many times, randomly choosing parameter combinations from combos (or Sampler.default_combos).

Parameters:
  • n (int) – How many samples to run.

  • combos (dict_like[str, iterable], optional) – A mapping of function arguments to potential choices. Any keys in here will override default_combos. You can also supply a callable that returns a random choice, e.g. drawn from a probability distribution.

  • engine (str, optional) – Which method to use to sync with the on-disk dataframe.

  • case_runner_settings – Supplied to case_runner() and so onto combo_runner(). This includes parallel=True etc.
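
A sketch: run 100 random samples, overriding 'b' with a callable that draws from a continuous distribution (names hypothetical):

>>> import random
>>> s.sample_combos(100, combos={'b': lambda: random.uniform(0.0, 1.0)})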

Crop(name=None, parent_dir=None, save_fn=None, batchsize=None, num_batches=None)[source]#

Return a Crop instance with this Sampler, from which fn will be set, and then samples can be sown, grown, and reaped into the Sampler.full_df. See Crop.

Return type:

Crop

__repr__()[source]#

Return repr(self).