xyzpy.gen.farming¶
Objects for labelling and successively running functions.
Classes¶
Functions¶
Module Contents¶
- class xyzpy.gen.farming.Runner(fn, var_names, fn_args=None, var_dims=None, var_coords=None, constants=None, resources=None, attrs=None, **default_runner_settings)[source]¶
Bases: object
Container class with all the information needed to systematically run a function over many parameters and capture the output in a dataset.
- Parameters:
fn (callable) – Function that produces a single instance of a result.
var_names (str, sequence of str, or None) – The ordered name(s) of the output variable(s) of fn. Set this explicitly to None if fn outputs already labelled data as a dict, Dataset, or DataArray.
fn_args (str, or sequence of str, optional) – The ordered name(s) of the input argument(s) of fn. This is only needed if the cases or combos supplied are not dict-like.
var_dims (dict-like, optional) – Mapping of output variables to their named internal dimensions, which can be the names of constants.
var_coords (dict-like, optional) – Mapping of output variables' named internal dimensions to the actual values they take.
constants (dict-like, optional) – Constant arguments to be supplied to fn. These can be used as ‘var_dims’, and will be saved as coords if so, otherwise as attributes.
resources (dict-like, optional) – Like constants but not saved to the dataset, e.g. if very big.
attrs (dict-like, optional) – Any other miscellaneous information to be saved with the dataset.
default_runner_settings – These keyword arguments will be supplied as defaults to any runner.
- fn¶
- _var_names = (None,)¶
- _fn_args¶
- _var_dims¶
- _var_coords¶
- _constants¶
- _resources¶
- _attrs¶
- _last_ds = None¶
- default_runner_settings¶
- fn_args¶
- var_names¶
- var_dims¶
- var_coords¶
- constants¶
- resources¶
- property last_ds¶
- run_combos(combos, constants=(), **runner_settings)[source]¶
Run combos using the function map and save to dataset.
- Parameters:
combos (dict_like[str, iterable]) – The values of each function argument with which to evaluate all combinations.
constants (dict, optional) – Extra constant arguments for this run. Repeated arguments take precedence over stored constants, but for this run only.
runner_settings – Keyword arguments supplied to combo_runner().
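As a rough illustration of what "all combinations" means here, the following plain-Python sketch expands a combos mapping into one set of keyword arguments per call and layers per-run constants over stored ones. It is not xyzpy's implementation; `expand_combos`, `foo` and `scale` are hypothetical names used only for this example.

```python
from itertools import product


def expand_combos(combos, stored_constants=None, run_constants=None):
    """Yield one keyword-argument dict per combination of values."""
    # Constants given for this run win over stored ones; neither dict is mutated.
    constants = {**(stored_constants or {}), **(run_constants or {})}
    names = list(combos)
    for values in product(*(combos[name] for name in names)):
        yield {**constants, **dict(zip(names, values))}


def foo(x, y, scale=1):
    # Hypothetical function standing in for `fn`.
    return scale * (x + y)


stored = {"scale": 1}
settings = list(
    expand_combos({"x": [1, 2], "y": [10, 20, 30]}, stored, {"scale": 2})
)
results = [foo(**kwargs) for kwargs in settings]
```

Two values of `x` and three of `y` give six calls, each using `scale=2`, while the `stored` dict itself is left unchanged for later runs.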
- run_cases(cases, constants=(), fn_args=None, **runner_settings)[source]¶
Run cases using the function and save to dataset.
- Parameters:
cases (sequence of mappings or tuples) – A sequence of cases.
constants (dict, optional) – Extra constant arguments for this run. Repeated arguments take precedence over stored constants, but for this run only.
runner_settings – Supplied to case_runner().
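Unlike combos, cases are individual points rather than a product of values. The sketch below shows, in plain Python, why fn_args is only needed when cases are given as tuples: tuples must be paired with argument names, while mappings already carry them. `label_cases` is a hypothetical helper and not part of xyzpy.

```python
from collections.abc import Mapping


def label_cases(cases, fn_args=None):
    """Turn each case into a dict of keyword arguments."""
    if isinstance(fn_args, str):
        fn_args = (fn_args,)
    labelled = []
    for case in cases:
        if isinstance(case, Mapping):
            # Dict-like cases are already labelled.
            labelled.append(dict(case))
        elif fn_args is None:
            raise ValueError("fn_args is needed to label tuple cases")
        else:
            # Pair positional values with the ordered argument names.
            labelled.append(dict(zip(fn_args, case)))
    return labelled


tuple_cases = [(1, 10), (2, 20), (3, 30)]
dict_cases = [{"x": 1, "y": 10}, {"x": 2, "y": 20}, {"x": 3, "y": 30}]

from_tuples = label_cases(tuple_cases, fn_args=("x", "y"))
from_dicts = label_cases(dict_cases)
```

Both forms describe the same three runs once the tuples have been labelled.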
- xyzpy.gen.farming.label(var_names, fn_args=None, var_dims=None, var_coords=None, constants=None, resources=None, attrs=None, harvester=False, sampler=False, engine=None, **default_runner_settings)[source]¶
Convenient decorator to automatically wrap a function as a Runner or Harvester.
- Parameters:
var_names (str, sequence of str, or None) – The ordered name(s) of the output variable(s) of fn. Set this explicitly to None if fn outputs already labelled data as a dict, Dataset, or DataArray.
fn_args (str, or sequence of str, optional) – The ordered name(s) of the input argument(s) of fn. This is only needed if the cases or combos supplied are not dict-like.
var_dims (dict-like, optional) – Mapping of output variables to their named internal dimensions, which can be the names of constants.
var_coords (dict-like, optional) – Mapping of output variables' named internal dimensions to the actual values they take.
constants (dict-like, optional) – Constant arguments to be supplied to fn. These can be used as ‘var_dims’, and will be saved as coords if so, otherwise as attributes.
resources (dict-like, optional) – Like constants but not saved to the dataset, e.g. if very big.
attrs (dict-like, optional) – Any other miscellaneous information to be saved with the dataset.
harvester (bool or str, optional) – If True, wrap the runner as a Harvester; if a string, create the harvester with that as the data_name.
default_runner_settings – These keyword arguments will be supplied as defaults to any runner.
Examples
Declare a function as a runner directly:
>>> import xyzpy as xyz
>>> @xyz.label(var_names=['sum', 'diff'])
... def foo(x, y):
...     return x + y, x - y
...
>>> foo
<xyzpy.Runner>
    fn: <function foo at 0x7f1fd8e5b1e0>
    fn_args: ('x', 'y')
    var_names: ('sum', 'diff')
    var_dims: {'sum': (), 'diff': ()}
>>> foo(1, 2)  # can still call it normally
(3, -1)
- class xyzpy.gen.farming.Harvester(runner: Runner, data_name=None, chunks=None, engine='h5netcdf', full_ds=None)[source]¶
Bases: object
Container class for collecting and aggregating data to disk.
- Parameters:
runner (Runner) – Performs the runs and describes the results.
data_name (str, optional) – Base file path to save data to.
chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded, and new data merged into it, using on-disk dask arrays.
engine (str, optional) – Engine to use to save and load datasets.
full_ds (xarray.Dataset) – Initialize the Harvester with this dataset as the initial full dataset.
Members
-------
full_ds – Dataset containing all data harvested so far, by default synced to disk.
last_ds (xarray.Dataset) – Dataset containing just the data from the last harvesting run.
- runner¶
- data_name = None¶
- engine = 'h5netcdf'¶
- chunks = None¶
- _full_ds = None¶
- property fn¶
- property last_ds¶
Dataset containing the last run’s data.
- property full_ds¶
Dataset containing all saved runs.
- save_full_ds(new_full_ds=None, engine=None)[source]¶
Save full_ds onto disk. The old file is moved and kept as a backup in case of errors when writing the new dataset to disk.
- Parameters:
new_full_ds (xarray.Dataset, optional) – Save this dataset as the new full dataset, else use the current full dataset.
engine (str, optional) – Engine to use to save and load datasets.
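The move-aside-then-write pattern described for save_full_ds can be sketched with the standard library alone. This is only an illustration under stated assumptions: a JSON file stands in for the on-disk dataset, and the `.BAK` suffix, the restore-on-failure step and the name `save_with_backup` are all hypothetical rather than taken from xyzpy.

```python
import json
import os
import tempfile


def save_with_backup(path, data):
    """Write data to path, first moving any existing file aside as a backup."""
    backup = path + ".BAK"  # hypothetical backup name
    had_old = os.path.exists(path)
    if had_old:
        os.replace(path, backup)  # keep the old file in case the write fails
    try:
        with open(path, "w") as f:
            json.dump(data, f)
    except Exception:
        # Assumed recovery step: put the old file back if writing failed.
        if had_old:
            os.replace(backup, path)
        raise


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "full.json")
    save_with_backup(path, {"a": 1})
    save_with_backup(path, {"a": 1, "b": 2})
    with open(path) as f:
        current = json.load(f)
    with open(path + ".BAK") as f:
        previous = json.load(f)
```

After the second save, `current` holds the new data and `previous` holds what was on disk before it.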
- add_ds(new_ds, sync=True, overwrite=None, chunks=None, engine=None)[source]¶
Merge a new dataset into the in-memory full dataset.
- Parameters:
new_ds (xr.Dataset or xr.DataArray) – Data to be merged into the full dataset.
sync (bool, optional) – If True (default), load and save the disk dataset before and after merging in the new data.
overwrite ({None, False, True}, optional) – How to combine data from the new run into the current full_ds:
None (default): attempt the merge and only raise if data conflicts.
True: overwrite conflicting current data with that from the new dataset.
False: drop any conflicting data from the new dataset.
chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded, and new data merged into it, using on-disk dask arrays.
engine (str, optional) – Engine to use to save and load datasets.
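The three overwrite modes can be pictured with plain dictionaries keyed by coordinate tuples. This is a conceptual sketch only: the real method merges xarray datasets, and `merge_points` is a hypothetical helper.

```python
def merge_points(full, new, overwrite=None):
    """Merge `new` into a copy of `full` following the three overwrite modes."""
    merged = dict(full)
    for key, value in new.items():
        conflict = key in merged and merged[key] != value
        if conflict and overwrite is None:
            raise ValueError(f"conflicting data at {key!r}")
        if conflict and overwrite is False:
            continue  # keep the current value, drop the new one
        merged[key] = value
    return merged


full = {(1, 10): 11, (1, 20): 21}
new = {(1, 20): 99, (2, 10): 12}

overwritten = merge_points(full, new, overwrite=True)
kept = merge_points(full, new, overwrite=False)
try:
    merge_points(full, new)  # overwrite=None raises on the clash at (1, 20)
    conflict_message = None
except ValueError as err:
    conflict_message = str(err)
```

Identical overlapping values are not treated as a conflict in this sketch, matching the "only raise if data conflicts" wording.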
- expand_dims(name, value, engine=None)[source]¶
Add a new coordinate dimension with name and value. The change is immediately synced with the on-disk dataset. Useful if you want to expand the parameter space along a previously constant argument.
- drop_sel(labels=None, *, errors='raise', engine=None, **labels_kwargs)[source]¶
Drop specific values of coordinates from this harvester and its dataset. See http://xarray.pydata.org/en/latest/generated/xarray.Dataset.drop_sel.html. The change is immediately synced with the on-disk dataset. Useful for tidying unneeded data points.
- _maybe_expand_combos(combos)[source]¶
Expand combos with ellipses into full coordinate values from the current full dataset.
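A minimal sketch of the idea, assuming the existing coordinate values are available as a plain mapping: any argument given as `...` is replaced by the values already stored for that coordinate. `expand_ellipses` and `existing_coords` are hypothetical names, not xyzpy internals.

```python
def expand_ellipses(combos, existing_coords):
    """Replace each `...` entry with the values already known for that name."""
    expanded = {}
    for name, values in combos.items():
        if values is ...:
            expanded[name] = list(existing_coords[name])
        else:
            expanded[name] = list(values)
    return expanded


# Coordinates assumed to be present in the current full dataset.
existing_coords = {"x": [1, 2, 3], "y": [10, 20]}
expanded = expand_ellipses({"x": ..., "y": [30]}, existing_coords)
```

Here `x` reuses every stored value while `y` adds a new one, which is the typical way to extend a dataset along one argument.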
- harvest_combos(combos, *, cases=None, missing_only=False, sync=True, overwrite=None, chunks=None, engine=None, **runner_settings)[source]¶
Run combos, automatically merging into an on-disk dataset.
- Parameters:
combos (dict_like[str, iterable]) – The combos to run. The only difference here is that you can supply an ellipsis ..., meaning that all values for that coordinate will be loaded from the current full dataset.
missing_only (bool, optional) – If True, only run combos that are not already present in the on-disk dataset.
sync (bool, optional) – If True (default), load and save the disk dataset before and after merging in the new data.
overwrite ({None, False, True}, optional) –
None (default): attempt the merge and only raise if data conflicts.
True: overwrite any conflicting current data with that from the new dataset.
False: drop any conflicting data from the new dataset.
chunks (bool, optional) – If not None, passed to xarray so that the full dataset is loaded, and new data merged into it, using on-disk dask arrays.
engine (str, optional) – Engine to use to save and load datasets.
runner_settings – Supplied to combo_runner().
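The effect of missing_only can be sketched as filtering the full product of combos against the points that already exist. This is an illustration in plain Python, not the library's code; `missing_cases` is hypothetical and `done` is assumed to be a set of value tuples in the same order as the combos keys.

```python
from itertools import product


def missing_cases(combos, done):
    """Return the combinations of `combos` that are not yet in `done`."""
    names = tuple(combos)
    return [
        dict(zip(names, values))
        for values in product(*(combos[name] for name in names))
        if values not in done
    ]


combos = {"x": [1, 2], "y": [10, 20]}
done = {(1, 10), (2, 20)}  # points assumed to be in the on-disk dataset
todo = missing_cases(combos, done)
```

Only the two points not yet harvested remain in `todo`.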
- harvest_cases(cases, *, sync=True, overwrite=None, chunks=None, engine=None, **runner_settings)[source]¶
Run cases, automatically merging into an on-disk dataset.
- Parameters:
sync (bool, optional) – If True (default), load and save the disk dataset before and after merging in the new data.
overwrite ({None, False, True}, optional) – What to do regarding clashes with old data:
None (default): attempt the merge and only raise if data conflicts.
True: overwrite conflicting current data with that from the new dataset.
False: drop any conflicting data from the new dataset.
chunks (bool, optional) – If not None, passed to xarray so that the full dataset is loaded, and new data merged into it, using on-disk dask arrays.
engine (str, optional) – Engine to use to save and load datasets.
runner_settings – Supplied to case_runner().
- Crop(name=None, parent_dir=None, save_fn=None, batchsize=None, num_batches=None)[source]¶
Return a Crop instance with this Harvester, from which fn will be set, and then combos can be sown, grown, and reaped into the Harvester.full_ds. See Crop.
- Return type:
Crop
- cultivate(combos=None, cases=None, constants=None, name=None, parent_dir=None, batchsize=None, num_batches=None, missing_only=True, shuffle=True, subprocess='auto', num_workers=None, num_threads=None, gpus=None, affinities=None, log=False, raise_errors=True, verbosity=1, on_existing='ask', on_error='ask', clean_up=None, **grow_kwargs)[source]¶
Convenience method to run a full cycle of parsing combos into missing cases only, then persistently growing those cases, and finally merging the results into the full dataset.
- Parameters:
combos (dict_like[str, iterable]) – The combos to run. The only difference here is that you can supply an ellipsis ..., meaning that all values for that coordinate will be loaded from the current full dataset.
cases (sequence of mappings or tuples, optional) – A sequence of (partial) individual settings to run. For each case, all settings given by combos will be generated.
constants (dict, optional) – Extra constant arguments for this run.
name (str, optional) – Name for the crop to be used for on-disk storage of batches, results and logs. You can use different names to grow results for the same dataset concurrently.
parent_dir (str, optional) – Parent directory in which to create the crop folder (.xyz-{name}/). Defaults to the current working directory.
batchsize (int, optional) – If given, the target number of cases to sow in each batch. This is computed from num_batches if not given and 1 if neither given.
num_batches (int, optional) – If given, the target number of batches to sow. This is computed from batchsize if not given and 1 if neither given.
missing_only (bool, optional) – If True (default), only run cases that are not already present in the on-disk dataset. If False, the new results will overwrite any existing results.
shuffle (bool, optional) – If True (default), shuffle the order of cases before sowing and growing. This can be a useful basic form of load balancing.
subprocess ("auto" or bool, optional) – Whether to grow each batch in a fresh subprocess. This adds about 1 second of overhead per batch, but allows the number of threads, CPU affinity and GPU assignment to be controlled. If “auto” (default), subprocesses are used when num_threads, gpus or affinities are specified. See xyzpy.Crop.grow() for details.
num_workers (int, optional) – Maximum number of batches to run concurrently. In subprocess mode this caps simultaneous subprocesses (defaults to 1 if not given). In in-process mode this is the joblib loky pool size (None = serial). Forwarded to xyzpy.Crop.grow().
num_threads (int, optional) – Number of threads each worker is allowed to use, applied via the standard env vars (OMP_NUM_THREADS, MKL_NUM_THREADS, etc.) in each subprocess. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().
gpus (int, str, or sequence of int, optional) – GPU device IDs to assign to subprocesses via CUDA_VISIBLE_DEVICES; the pool also caps concurrency. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().
affinities (int, str, or sequence of int, optional) – CPU core IDs to pin subprocesses to via taskset; the pool also caps concurrency. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().
log (bool, optional) – Whether to save subprocess stdout and stderr to files in the crop directory under logs/batch-{batch_id}.log. Subprocess-mode only. Forwarded to xyzpy.Crop.grow().
raise_errors (bool, optional) – If True (default), raise any errors that occur during growing, otherwise just log them and continue with the next batch.
verbosity (int, optional) – The level of logging to print during the sow/grow/reap process. 0: no output, 1: progress bars, 2: progress bars with current setting postfixed.
on_existing ({'ask', 'reap', 'delete', 'skip', 'raise'}, optional) – What to do if a crop with the same name already exists on disk. Default is 'ask' (interactive prompt).
on_error ({'ask', 'reap', 'delete', 'skip', 'raise'}, optional) – What to do if an error occurs during grow/reap. Default is 'ask' (interactive prompt).
clean_up (bool or None, optional) – Whether to delete the on-disk batch, result and log files after successfully reaping.
grow_kwargs – Further keyword arguments forwarded to xyzpy.Crop.grow() (e.g. executor, min_wait, …).
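The rule for subprocess="auto" and the way a thread limit reaches a child process can be sketched with the standard library. This is an illustration of the documented behaviour, not xyzpy's code: `resolve_subprocess` and `thread_env` are hypothetical helpers, and only the two environment variables named above are set.

```python
import os
import subprocess
import sys


def resolve_subprocess(mode="auto", num_threads=None, gpus=None, affinities=None):
    """Decide whether batches would run in fresh subprocesses."""
    if mode == "auto":
        # Any per-process resource option implies subprocess mode.
        return any(opt is not None for opt in (num_threads, gpus, affinities))
    return bool(mode)


def thread_env(num_threads):
    """Copy the current environment and add thread-count limits."""
    env = dict(os.environ)
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS"):
        env[var] = str(num_threads)
    return env


# Launch a child interpreter and read back the limit it sees.
child_threads = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"],
    env=thread_env(2),
    capture_output=True,
    text=True,
    check=True,
).stdout.strip()
```

Environment variables such as these generally need to be set before numerical libraries are loaded, which is why a fresh process is needed to control them per batch.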
- xyzpy.gen.farming.cultivate(fn, *, var_names=None, data_name=None, runner_opts=None, harvester_opts=None, combos=None, cases=None, constants=None, name=None, parent_dir=None, batchsize=None, num_batches=None, missing_only=True, shuffle=True, subprocess='auto', num_workers=None, num_threads=None, gpus=None, affinities=None, log=False, raise_errors=True, verbosity=1, on_existing='ask', on_error='ask', clean_up=None, **grow_kwargs)[source]¶
Convenience function to run a full cycle of annotating a function, parsing combos into missing cases only, then persistently growing those cases, and finally merging the results into the full dataset.
- Parameters:
fn (callable) – The function to run over combos and cases. This will be wrapped in a Runner and Harvester to perform the cultivation process. If var_names is None, it should return a dict, Dataset or DataArray.
var_names (str, sequence of str, or None) – The ordered name(s) of the output variable(s) of fn. Set this explicitly to None if fn outputs already labelled data as a dict, Dataset, or DataArray.
data_name (str, optional) – If given, the on-disk file to sync results with. If not set there will be no persistent results, since the harvester created in this functional interface is ephemeral.
runner_opts (dict, optional) – Keyword arguments to be supplied to Runner.
harvester_opts (dict, optional) – Keyword arguments to be supplied to Harvester.
combos (dict_like[str, iterable]) – The combos to run. The only difference here is that you can supply an ellipsis ..., meaning that all values for that coordinate will be loaded from the current full dataset.
cases (sequence of mappings or tuples, optional) – A sequence of (partial) individual settings to run. For each case, all settings given by combos will be generated.
constants (dict, optional) – Extra constant arguments for this run.
name (str, optional) – Name for the crop to be used for on-disk storage of batches, results and logs. You can use different names to grow results for the same dataset concurrently.
parent_dir (str, optional) – Parent directory in which to create the crop folder (.xyz-{name}/). Defaults to the current working directory.
batchsize (int, optional) – If given, the target number of cases to sow in each batch. This is computed from num_batches if not given and 1 if neither given.
num_batches (int, optional) – If given, the target number of batches to sow. This is computed from batchsize if not given and 1 if neither given.
missing_only (bool, optional) – If True (default), only run cases that are not already present in the on-disk dataset.
shuffle (bool, optional) – If True (default), shuffle the order of cases before sowing and growing. This can be a useful basic form of load balancing.
subprocess ("auto" or bool, optional) – Whether to grow each batch in a fresh subprocess. This adds about 1 second of overhead per batch, but allows the number of threads, CPU affinity and GPU assignment to be controlled. If “auto” (default), subprocesses are used when num_threads, gpus or affinities are specified. See xyzpy.Crop.grow() for details.
num_workers (int, optional) – Maximum number of batches to run concurrently. In subprocess mode this caps simultaneous subprocesses (defaults to 1 if not given). In in-process mode this is the joblib loky pool size (None = serial). Forwarded to xyzpy.Crop.grow().
num_threads (int, optional) – Number of threads each worker is allowed to use, applied via the standard env vars (OMP_NUM_THREADS, MKL_NUM_THREADS, etc.) in each subprocess. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().
gpus (int, str, or sequence of int, optional) – GPU device IDs to assign to subprocesses via CUDA_VISIBLE_DEVICES; the pool also caps concurrency. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().
affinities (int, str, or sequence of int, optional) – CPU core IDs to pin subprocesses to via taskset; the pool also caps concurrency. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().
log (bool, optional) – Whether to save subprocess stdout and stderr to files in the crop directory under logs/batch-{batch_id}.log. Subprocess-mode only. Forwarded to xyzpy.Crop.grow().
raise_errors (bool, optional) – If True (default), raise any errors that occur during growing, otherwise just log them and continue with the next batch.
verbosity (int, optional) – The level of logging to print during the sow/grow/reap process. 0: no output, 1: progress bars, 2: progress bars with current setting postfixed.
on_existing ({'ask', 'reap', 'delete', 'skip', 'raise'}, optional) – What to do if a crop with the same name already exists on disk. Default is 'ask' (interactive prompt).
on_error ({'ask', 'reap', 'delete', 'skip', 'raise'}, optional) – What to do if an error occurs during grow/reap. Default is 'ask' (interactive prompt).
clean_up (bool or None, optional) – Whether to delete the on-disk batch, result and log files after successfully reaping.
grow_kwargs – Further keyword arguments forwarded to xyzpy.Crop.grow() (e.g. executor, min_wait, …).
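How partial cases combine with combos ("for each case, all settings given by combos will be generated") can be sketched in plain Python. `expand_cases_with_combos` and the argument names `model`, `size` and `seed` are hypothetical and serve only as an illustration.

```python
from itertools import product


def expand_cases_with_combos(cases, combos):
    """For every partial case, generate all combinations of the combos values."""
    names = list(combos)
    settings = []
    for case in cases:
        for values in product(*(combos[name] for name in names)):
            settings.append({**case, **dict(zip(names, values))})
    return settings


# Two hand-picked (model, size) pairs, each repeated over three seeds.
cases = [{"model": "a", "size": 10}, {"model": "b", "size": 20}]
combos = {"seed": [0, 1, 2]}
settings = expand_cases_with_combos(cases, combos)
```

This gives six full settings while avoiding the unwanted pairings (such as model "a" with size 20) that a pure product over all three arguments would produce.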
- class xyzpy.gen.farming.Sampler(runner, data_name=None, default_combos=None, full_df=None, engine='pickle')[source]¶
Like a Harvester, but randomly samples combos and writes the table of results to a pandas.DataFrame.
- Parameters:
runner (xyzpy.Runner) – Runner describing a labelled function to run.
data_name (str, optional) – If given, the on-disk file to sync results with.
default_combos (dict_like[str, iterable], optional) – The default combos to sample from (which can be overridden).
full_df (pandas.DataFrame, optional) – If given, use this dataframe as the initial ‘full’ data.
engine ({'pickle', 'csv', 'json', 'hdf', ...}, optional) – How to save and load the on-disk dataframe. See load_df() and save_df().
- full_df¶
Dataframe describing all data harvested so far.
- Type:
pandas.DataFrame
- last_df¶
Dataframe describing the data harvested on the previous run.
- Type:
pandas.DataFrame
- runner¶
- data_name = None¶
- default_combos¶
- _full_df = None¶
- _last_df = None¶
- engine = 'pickle'¶
- property fn¶
- property full_df¶
The dataframe describing all data harvested so far.
- property last_df¶
The dataframe describing the last set of data harvested.
- save_full_df(new_full_df=None, engine=None)[source]¶
Save full_df onto disk.
- Parameters:
new_full_df (pandas.DataFrame, optional) – Save this dataframe as the new full dataframe, else use the current full_df.
engine (str, optional) – Which engine to save the dataframe with, if None use the default.
- add_df(new_df, sync=True, engine=None)[source]¶
Merge a new dataframe into the in-memory full dataframe.
- Parameters:
new_df (pandas.DataFrame or dict) – Data to be appended to the full dataframe.
sync (bool, optional) – If True (default), load and save the disk dataframe before and after merging in the new data.
engine (str, optional) – Which engine to save the dataframe with.
- sample_combos(n, combos=None, engine=None, **case_runner_settings)[source]¶
Sample the target function many times, randomly choosing parameter combinations from combos (or SampleHarvester.default_combos).
- Parameters:
n (int) – How many samples to run.
combos (dict_like[str, iterable], optional) – A mapping of function arguments to potential choices. Any keys in here will override default_combos. You can also supply a callable to manually return a random choice, e.g. from a probability distribution.
engine (str, optional) – Which method to use to sync with the on-disk dataframe.
case_runner_settings – Supplied to case_runner() and so on to combo_runner(). This includes parallel=True etc.
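The sampling rule can be sketched in plain Python: each argument is either picked at random from its list of choices or produced by a user-supplied callable, and keys in combos override those in the defaults. This is an illustration only; `sample_settings` is hypothetical, and calling the callable with no arguments is an assumption rather than a documented signature.

```python
import random


def sample_settings(n, combos=None, default_combos=None, seed=None):
    """Draw n random settings from the merged default and per-call choices."""
    rng = random.Random(seed)
    # Keys given in `combos` override the defaults.
    choices = {**(default_combos or {}), **(combos or {})}
    samples = []
    for _ in range(n):
        sample = {}
        for name, options in choices.items():
            if callable(options):
                sample[name] = options()  # user-defined random draw
            else:
                sample[name] = rng.choice(list(options))
        samples.append(sample)
    return samples


defaults = {"x": [100], "y": ["a", "b"]}
samples = sample_settings(
    5,
    combos={"x": [1, 2, 3], "noise": lambda: random.gauss(0.0, 1.0)},
    default_combos=defaults,
    seed=0,
)
```

Each resulting dict corresponds to one row of the results table, with `x` drawn from the overriding list rather than the default.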