xyzpy

Classes

Crop

Encapsulates all the details describing a single 'crop', that is, its location, name, and batch size/number.

Harvester

Container class for collecting and aggregating data to disk.

Runner

Container class with all the information needed to systematically

Sampler

Like a Harvester, but randomly samples combos and writes the table of

RayExecutor

Basic concurrent.futures like interface using ray.

RayGPUExecutor

A RayExecutor that by default requests a single gpu per task.

AutoHeatMap

AutoHistogram

AutoLinePlot

AutoScatter

HeatMap

Histogram

LinePlot

Scatter

Benchmarker

Compare the performance of various kernels. Internally this makes

MemoryMonitor

Monitor this process' peak memory usage with specified sampling interval

RunningCovariance

Running covariance class.

RunningCovarianceMatrix

Running covariance matrix for n variables.

RunningStatistics

Running mean & standard deviation using Welford's

Timer

A very simple context manager class for timing blocks.

Functions

case_runner(fn, fn_args, cases[, combos, constants, ...])

Simple case runner that outputs the raw tuple of results.

case_runner_to_ds(fn, fn_args, cases, var_names[, ...])

Takes a list of cases to run fn over, possibly in parallel, and outputs a xarray.Dataset.

find_missing_cases(ds[, ignore_dims, method])

Find all cases in a dataset or DataArray with missing data.

is_case_missing(ds, setting[, method])

Does the dataset or dataarray ds not contain any non-null data for single location setting?

parse_into_cases([combos, cases, ds, method])

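Convert maybe combos and maybe cases to a single list of cases only, also optionally filtering based on whether any data at each location is already present in Dataset or DataArray ds.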

combo_runner(fn[, combos, cases, constants, split, ...])

Take a function fn and compute it over all combinations of named variable values, optionally showing progress and in parallel.

combo_runner_to_ds(fn, combos, var_names, *[, ...])

Evaluate a function over all cases and combinations and output to a xarray.Dataset.

clean_slurm_outputs(job[, directory, cancel_if_finished])

grow(batch_number[, crop, fn, num_workers, check_mpi, ...])

Automatically process a batch of cases into results. Should be run in an

load_crops([directory])

Automatically load all the crops found in the current directory.

manage_slurm_outputs(crop, job[, wait_time])

cultivate(fn, *[, var_names, data_name, runner_opts, ...])

Convenience function to run a full cycle of annotating a function,

label(var_names[, fn_args, var_dims, var_coords, ...])

Convenient decorator to automatically wrap a function as a

auto_xyz_ds(x[, y_z])

Automatically turn an array into a xarray dataset. Transpose y_z

cache_to_disk([fn, cachedir])

Cache this function to disk, using joblib.

check_runs(obj[, dim, var, sel])

Print out information about the range and any missing values for an

load_df(name[, engine, key])

Load a dataframe from disk.

load_ds(file_name[, engine, load_to_mem, create_new, ...])

Loads a xarray dataset. Basically xarray.open_dataset with some

merge_sync_conflict_datasets(base_name[, engine, ...])

Glob files based on base_name, merge them, save this new dataset if

save_df(df, name[, engine, key])

Save a dataframe to disk.

save_ds(ds, file_name[, engine])

Saves a xarray dataset.

save_merge_ds(ds, fname[, overwrite])

Save dataset ds, but check for an existing dataset with that name

sort_dims(ds)

Reorder variable dimensions to match ds.dims. This is an inplace

trimna(obj)

Drop values across dims where all values are NaN.

cimluv(hue[, hue_shift, sat1, sat2, val1, val2, N, ...])

Creates a color map for single hue, using HSLuv color space.

cimple(hue[, sat1, sat2, val1, val2, hue_shift, name, ...])

Creates a color map for a single hue.

cimple_bright(hue[, sat1, sat2, val1, val2, ...])

Creates a color map for a single hue, with bright defaults.

cmoke(hue[, hue_shift, sat1, sat2, val1, val2, N, reverse])

Creates a color map for single hue, using OKLCH color space.

convert_colors(cols, outformat[, informat])

Convert lists of colors between formats

get_neutral_style([draw_color])

infiniplot(ds, x[, y, z])

Helper class for the infiniplot functionality.

neutral_style([draw_color])

auto_iheatmap(x, **iheatmap_opts)

Auto version of iheatmap() that accepts array arguments

auto_ilineplot(x, y_z, **lineplot_opts)

Auto version of ilineplot() that accepts array arguments

auto_iscatter(x, y_z, **iscatter_opts)

Auto version of iscatter() that accepts array arguments

iheatmap(ds, x, y, z, **kwargs)

From ds plot variable z as a function of x and y using

ilineplot(ds, x, y[, z, y_err, x_err])

From ds plot lines of y as a function of x, optionally for

iscatter(ds, x, y[, z, y_err, x_err])

From ds plot a scatter of y against x, optionally for

auto_heatmap(x, **heatmap_opts)

Auto version of heatmap() that accepts array arguments

auto_histogram(x, **histogram_opts)

Auto version of histogram() that accepts array arguments

auto_lineplot(x, y_z, **lineplot_opts)

Auto version of lineplot() that accepts array arguments

auto_scatter(x, y_z, **scatter_opts)

Auto version of scatter() that accepts array arguments

heatmap(ds, x, y, z, **kwargs)

From ds plot variable z as a function of x and y using

histogram(ds, x[, z])

Dataset histogram.

lineplot(ds, x, y[, z, y_err, x_err])

From ds plot lines of y as a function of x, optionally for

scatter(ds, x, y[, z, y_err, x_err])

From ds plot a scatter of y against x, optionally for

visualize_matrix(array[, max_mag, magscale, ...])

Visualize array as a 2D colormapped image.

visualize_tensor(array[, spacing_factor, ...])

Visualize all entries of a tensor, with indices mapped into the plane

benchmark(fn[, setup, n, min_t, repeats, get, starmap])

Benchmark the time it takes to run fn.

estimate_from_repeats(fn, *fn_args[, rtol, tol_scale, ...])

format_number_with_error(x, err)

Given x with error err, format a string showing the relevant

get_peak_memory_usage()

Get the peak memory usage of the current process in gigabytes. This

getsizeof(obj)

Compute the real size of a Python object in bytes, taken from

progbar([it, nb])

Turn any iterable into a progress bar, with notebook option

report_memory()

Return a formatted memory usage summary for the current process.

report_memory_gpu()

Return a formatted GPU memory usage summary for the process.

unzip(its[, zip_level])

Split a nested iterable at a specified level, i.e. in numpy language

Package Contents

xyzpy.case_runner(fn, fn_args, cases, combos=None, constants=None, split=False, shuffle=False, parse=True, parallel=False, executor=None, num_workers=None, verbosity=1)[source]

Simple case runner that outputs the raw tuple of results.

Parameters:
  • fn (callable) – Function with which to evaluate the cases.

  • fn_args (tuple) – Names of case arguments that fn takes, can be None if each case is a dict.

  • cases (iterable[tuple] or iterable[dict]) – List of specific configurations that fn_args should take. If fn_args is None, each case should be a dict.

  • combos (dict_like[str, iterable], optional) – Optional specification of sub-combinations.

  • constants (dict, optional) – Constant function arguments.

  • split (bool, optional) – See combo_runner().

  • shuffle (bool or int, optional) – If given, compute the results in a random order (using random.seed and random.shuffle), which can be helpful for distributing resources when not all cases are computationally equal.

  • parallel (bool, optional) – Process combos in parallel, default number of workers picked.

  • executor (executor-like pool, optional) – Submit all combos to this pool executor. Must have submit or apply_async methods and API matching either concurrent.futures or an ipyparallel view. Pools from multiprocessing.pool are also supported.

  • num_workers (int, optional) – Explicitly choose how many workers to use, None for automatic.

  • verbosity ({0, 1, 2}, optional) –

    How much information to display:

    • 0: nothing,

    • 1: just progress,

    • 2: all information.

Returns:

results

Return type:

list of fn output for each case
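
The behaviour described above can be pictured with a minimal pure-Python sketch. It is not xyzpy's implementation: run_cases_sketch and the toy function fn are hypothetical names, and only the fn_args, cases and constants arguments are modelled (no combos, shuffling or parallelism).

```python
def run_cases_sketch(fn, fn_args, cases, constants=None):
    """Call fn once per case and return the raw list of results.

    Hypothetical stand-in for illustration only, not xyzpy.case_runner.
    """
    constants = dict(constants or {})
    results = []
    for case in cases:
        if isinstance(case, dict):
            # dict cases already name their arguments, so fn_args can be None
            kwargs = dict(case)
        else:
            # tuple cases are matched positionally against fn_args
            kwargs = dict(zip(fn_args, case))
        results.append(fn(**kwargs, **constants))
    return results


def fn(a, b, scale=1):
    return scale * (a + b)


results = run_cases_sketch(fn, ("a", "b"), [(1, 2), (3, 4)], constants={"scale": 10})
print(results)  # [30, 70]
```

Each case yields exactly one entry in the output, in the order the cases were supplied.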

xyzpy.case_runner_to_df

xyzpy.case_runner_to_ds(fn, fn_args, cases, var_names, var_dims=None, var_coords=None, combos=None, constants=None, resources=None, attrs=None, shuffle=False, to_df=False, parse=True, parallel=False, num_workers=None, executor=None, verbosity=1)[source]

Takes a list of cases to run fn over, possibly in parallel, and outputs a xarray.Dataset.

Parameters:
  • fn (callable) – Function to evaluate.

  • fn_args (str or iterable[str]) – Names and order of arguments to fn, can be None if cases are supplied as dicts.

  • cases (iterable[tuple] or iterable[dict]) – List of configurations used to generate results.

  • var_names (str or iterable of str) – Variable name(s) of the output(s) of fn.

  • var_dims (sequence of either strings or string sequences, optional) – ‘Internal’ names of dimensions for each variable, the values for each dimension should be contained as a mapping in either var_coords (not needed by fn) or constants (needed by fn).

  • var_coords (mapping, optional) – Mapping of extra coords the output variables may depend on.

  • combos (dict_like[str, iterable], optional) – If specified, run all combinations of some arguments in these mappings.

  • constants (mapping, optional) – Arguments to fn which are not iterated over, these will be recorded either as attributes or coordinates if they are named in var_dims.

  • resources (mapping, optional) – Like constants but they will not be recorded.

  • attrs (mapping, optional) – Any extra attributes to store.

  • shuffle (bool or int, optional) – If given, compute the results in a random order (using random.seed and random.shuffle), which can be helpful for distributing resources when not all cases are computationally equal.

  • parse (bool, optional) – Whether to perform parsing of the input arguments.

  • parallel (bool, optional) – Process combos in parallel, default number of workers picked.

  • executor (executor-like pool, optional) – Submit all combos to this pool executor. Must have submit or apply_async methods and API matching either concurrent.futures or an ipyparallel view. Pools from multiprocessing.pool are also supported.

  • num_workers (int, optional) – Explicitly choose how many workers to use, None for automatic.

  • verbosity ({0, 1, 2}, optional) –

    How much information to display:

    • 0: nothing,

    • 1: just progress,

    • 2: all information.

Returns:

ds – Dataset with minimal covering coordinates and all cases evaluated.

Return type:

xarray.Dataset

xyzpy.find_missing_cases(ds, ignore_dims=None, method='isnull')[source]

Find all cases in a dataset or DataArray with missing data.

Parameters:
  • ds (xarray.Dataset or xarray.DataArray) – Dataset or DataArray in which to find missing data

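  • ignore_dims (set, optional) – Internal variable dimensions, i.e. those to ignore when searching for missing cases. By default (None) this is set to any dimensions that don’t appear on all variables.

  • method ({"isnull", "isfinite"}, optional) – How to determine whether data is missing. “isnull” checks for null/nan values, while “isfinite” checks for all non-finite values (i.e. inf or nan).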

Returns:

cases_missing – List of cases with missing data, where each case is a dict mapping from dimension name to coordinate value.

Return type:

iterable[dict]

xyzpy.is_case_missing(ds, setting, method='isnull')[source]

Does the dataset or dataarray ds not contain any non-null data for single location setting?

Note that this only returns true if all data across all variables is completely missing at the location.

Returns:

missing

Return type:

bool
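
The "completely missing" rule in the note above can be sketched without xarray. This is only an illustration under stated assumptions: all_missing is a hypothetical helper, the variable names energy and overlap are made up, and a plain dict stands in for the values of every variable at one location.

```python
import math


def all_missing(values_at_location):
    """True only if every variable is NaN at this location (illustrative)."""
    return all(
        isinstance(v, float) and math.isnan(v)
        for v in values_at_location.values()
    )


nan = float("nan")
print(all_missing({"energy": nan, "overlap": nan}))  # True
print(all_missing({"energy": 1.5, "overlap": nan}))  # False
```

A location with data for even one variable therefore does not count as missing.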

xyzpy.parse_into_cases(combos=None, cases=None, ds=None, method='isnull')[source]

Convert maybe combos and maybe cases to a single list of cases only, also optionally filtering based on whether any data at each location is already present in Dataset or DataArray ds.

Note that this only checks whether all data across all variables is completely missing at the location. To check against a single variable only simply supply a DataArray instead of a Dataset, e.g. ds=ds["var_name"].

Parameters:
  • combos (dict_like[str, iterable], optional) – Parameter combinations.

  • cases (iterable[dict], optional) – Parameter configurations.

  • ds (xarray.Dataset or xarray.DataArray, optional) – Dataset or DataArray in which to check for existing data.

  • method ({"isnull", "isfinite"}, optional) – How to determine whether data is missing when ds is supplied. “isnull” checks for null/nan values, while “isfinite” checks for all non-finite values (i.e. inf or nan).

Returns:

new_cases – The combined and possibly filtered list of cases.

Return type:

iterable[dict]
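
As a rough picture of how combos and cases can be merged into one list of cases, here is a minimal sketch using only the standard library. expand_to_cases is a hypothetical name, the ordering of the output is an assumption, and the optional filtering against ds is left out.

```python
import itertools


def expand_to_cases(combos=None, cases=None):
    """Expand every case with every sub-combination (illustrative only)."""
    combos = dict(combos or {})
    # with no explicit cases, start from a single empty case
    cases = list(cases) if cases is not None else [{}]
    names = list(combos)
    expanded = []
    for case in cases:
        for values in itertools.product(*(combos[k] for k in names)):
            expanded.append({**case, **dict(zip(names, values))})
    return expanded


new_cases = expand_to_cases(
    combos={"c": [3, 4]},
    cases=[{"a": 1, "b": 3}, {"a": 2, "b": 4}],
)
for case in new_cases:
    print(case)
# {'a': 1, 'b': 3, 'c': 3}
# {'a': 1, 'b': 3, 'c': 4}
# {'a': 2, 'b': 4, 'c': 3}
# {'a': 2, 'b': 4, 'c': 4}
```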

xyzpy.combo_runner(fn, combos=None, *, cases=None, constants=None, split=False, flat=False, shuffle=False, parallel=False, executor=None, num_workers=None, verbosity=1, desc=None)[source]

Take a function fn and compute it over all combinations of named variable values, optionally showing progress and in parallel.

Parameters:
  • fn (callable) – Function to analyse.

  • combos (dict_like[str, iterable]) – All combinations of each argument to values mapping will be computed. Each argument range thus gets a dimension in the output array(s).

  • cases (sequence of mappings, optional) – Optional list of specific configurations. If both combos and cases are given, the function is computed for all sub-combinations in combos for each case in cases, so an argument can only appear in one or the other. Note that missing combinations of arguments will be represented by nan if creating a nested array.

  • constants (dict, optional) – Constant function arguments. Unlike combos and cases, these won’t produce dimensions in the output result when flat=False.

  • split (bool, optional) – Whether to split (unzip) the outputs of fn into multiple output arrays or not.

  • flat (bool, optional) – Whether to return a flat list of results or to return a nested tuple suitable to be supplied to numpy.array.

  • shuffle (bool or int, optional) – If given, compute the results in a random order (using random.seed and random.shuffle), which can be helpful for distributing resources when not all cases are computationally equal.

  • parallel (bool, optional) – Process combos in parallel, default number of workers picked.

  • executor (executor-like pool, optional) – Submit all combos to this pool executor. Must have submit or apply_async methods and API matching either concurrent.futures or an ipyparallel view. Pools from multiprocessing.pool are also supported.

  • num_workers (int, optional) – Explicitly choose how many workers to use, None for automatic.

  • verbosity ({0, 1, 2}, optional) –

    How much information to display:

    • 0: nothing,

    • 1: just progress,

    • 2: postfix the current settings to the progress bar.

  • desc (str, optional) – Description to show in the progress bar, if verbosity > 0.

Returns:

data – Nested tuple containing all combinations of running fn if flat == False else a flat list of results.

Return type:

nested tuple

Examples

>>> import xyzpy as xyz
>>> def fn(a, b, c, d):
...     return str(a) + str(b) + str(c) + str(d)

Run all possible combos:

>>> xyz.combo_runner(
...     fn,
...     combos={
...         'a': [1, 2],
...         'b': [3, 4],
...         'c': [5, 6],
...         'd': [7, 8],
...     },
... )
100%|##########| 16/16 [00:00<00:00, 84733.41it/s]

(((('1357', '1358'), ('1367', '1368')),
  (('1457', '1458'), ('1467', '1468'))),
 ((('2357', '2358'), ('2367', '2368')),
  (('2457', '2458'), ('2467', '2468'))))

Run only a selection of cases:

>>> xyz.combo_runner(
...     fn,
...     cases=[
...         {'a': 1, 'b': 3, 'c': 5, 'd': 7},
...         {'a': 2, 'b': 4, 'c': 6, 'd': 8},
...     ],
... )
100%|##########| 2/2 [00:00<00:00, 31418.01it/s]
(((('1357', nan), (nan, nan)),
  ((nan, nan), (nan, nan))),
 (((nan, nan), (nan, nan)),
  ((nan, nan), (nan, '2468'))))

Run only certain cases of some args, but all combinations of others:

>>> xyz.combo_runner(
...     fn,
...     cases=[
...         {'a': 1, 'b': 3},
...         {'a': 2, 'b': 4},
...     ],
...     combos={
...         'c': [3, 4],
...         'd': [4, 5],
...     },
... )
100%|##########| 8/8 [00:00<00:00, 92691.80it/s]
(((('1334', '1335'), ('1344', '1345')),
  ((nan, nan), (nan, nan))),
 (((nan, nan), (nan, nan)),
  (('2434', '2435'), ('2444', '2445'))))
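
To make the nested-tuple layout concrete, the following standard-library sketch rebuilds the result of the first example (all combos, flat=False). It is not how xyzpy computes it: combo_sketch and nest are hypothetical helpers, and the combo values are assumed to be lists.

```python
import itertools


def nest(flat, shape):
    """Fold a flat list into nested tuples with one level per argument."""
    if len(shape) == 1:
        return tuple(flat)
    step = len(flat) // shape[0]
    return tuple(
        nest(flat[i * step:(i + 1) * step], shape[1:])
        for i in range(shape[0])
    )


def combo_sketch(fn, combos):
    """Evaluate fn on every combination, first argument varying slowest."""
    names = list(combos)
    flat = [
        fn(**dict(zip(names, values)))
        for values in itertools.product(*combos.values())
    ]
    return nest(flat, [len(v) for v in combos.values()])


def fn(a, b, c, d):
    return str(a) + str(b) + str(c) + str(d)


data = combo_sketch(fn, {"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})
print(data[0][0])  # (('1357', '1358'), ('1367', '1368'))
print(data[1][1][1][1])  # 2468
```

Each argument in combos contributes one nesting level, which is why the result can be passed straight to numpy.array.
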

xyzpy.combo_runner_to_df

xyzpy.combo_runner_to_ds(fn, combos, var_names, *, var_dims=None, var_coords=None, cases=None, constants=None, resources=None, attrs=None, shuffle=False, parse=True, to_df=False, parallel=False, num_workers=None, executor=None, verbosity=1, desc=None)[source]

Evaluate a function over all cases and combinations and output to a xarray.Dataset.

Parameters:
  • fn (callable) – Function to evaluate.

  • combos (dict_like[str, iterable]) – Mapping of each individual function argument to sequence of values.

  • var_names (str, sequence of strings, or None) – Variable name(s) of the output(s) of fn, set to None if fn outputs data already labelled in a Dataset or DataArray.

  • var_dims (sequence of either strings or string sequences, optional) – ‘Internal’ names of dimensions for each variable, the values for each dimension should be contained as a mapping in either var_coords (not needed by fn) or constants (needed by fn).

  • var_coords (mapping, optional) – Mapping of extra coords the output variables may depend on.

  • cases (sequence of dicts, optional) – Individual cases to run for some or all function arguments.

  • constants (mapping, optional) – Arguments to fn which are not iterated over, these will be recorded either as attributes or coordinates if they are named in var_dims.

  • resources (mapping, optional) – Like constants but they will not be recorded.

  • attrs (mapping, optional) – Any extra attributes to store.

  • parallel (bool, optional) – Process combos in parallel, default number of workers picked.

  • executor (executor-like pool, optional) – Submit all combos to this pool executor. Must have submit or apply_async methods and API matching either concurrent.futures or an ipyparallel view. Pools from multiprocessing.pool are also supported.

  • num_workers (int, optional) – Explicitly choose how many workers to use, None for automatic.

  • verbosity ({0, 1, 2}, optional) –

    How much information to display:

    • 0: nothing,

    • 1: just progress,

    • 2: postfix the current settings to the progress bar.

  • desc (str, optional) – Description to show in the progress bar, if verbosity > 0.

Returns:

ds – Multidimensional labelled dataset containing all the results if to_df=False (the default), else a pandas dataframe with results as labelled rows.

Return type:

xarray.Dataset or pandas.DataFrame

class xyzpy.Crop(*, fn=None, name=None, parent_dir=None, save_fn=None, batchsize=None, num_batches=None, shuffle=False, farmer=None, autoload=True)[source]

Bases: object

Encapsulates all the details describing a single ‘crop’, that is, its location, name, and batch size/number. Also allows tracking of crop’s progress, and experimentally, automatic submission of workers to grid engine to complete un-grown cases. Can also be instantiated directly from a Runner or Harvester or Crop instance.

Parameters:
  • fn (callable, optional) – Target function - Crop name will be inferred from this if not given explicitly. If given, Sower will also default to saving a version of fn to disk for cropping.grow to use.

  • name (str, optional) – Custom name for this set of runs - must be given if fn is not.

  • parent_dir (str, optional) – If given, alternative directory to put the “.xyz-{name}/” folder in with all the cases and results.

  • save_fn (bool, optional) – Whether to save the function to disk for cropping.grow to use. Will default to True if fn is given.

  • batchsize (int, optional) – How many cases to group into a single batch per worker. By default, batchsize=1. Cannot be specified if num_batches is.

  • num_batches (int, optional) – How many total batches to aim for, cannot be specified if batchsize is.

  • farmer ({xyzpy.Runner, xyzpy.Harvester, xyzpy.Sampler}, optional) – A Runner, Harvester or Sampler instance, from which the fn can be inferred and which can also allow the Crop to reap itself straight to a dataset or dataframe.

  • autoload (bool, optional) – If True, check for the existence of a Crop written to disk with the same location, and if found, load it.

name = None
parent_dir = None
save_fn = None
batchsize = None
num_batches = None
shuffle = False
_batch_remainder = None
_all_nan_result = None
_num_sown_batches = -1
_num_results = -1
property runner
choose_batch_settings(*, combos=None, cases=None)[source]

Work out how to divide all cases into batches, i.e. ensure that batchsize * num_batches >= num_cases.
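
The constraint batchsize * num_batches >= num_cases amounts to a ceiling division. The sketch below shows one way to satisfy it; choose_batches_sketch is a hypothetical helper and the exact rule xyzpy applies (for example how a remainder is distributed) may differ.

```python
import math


def choose_batches_sketch(num_cases, batchsize=None, num_batches=None):
    """Pick (batchsize, num_batches) covering all cases (illustrative)."""
    if batchsize is not None and num_batches is not None:
        raise ValueError("Specify at most one of batchsize and num_batches.")
    if num_batches is not None:
        # aim for the requested number of batches
        batchsize = max(1, math.ceil(num_cases / num_batches))
    elif batchsize is None:
        batchsize = 1  # documented default
    num_batches = math.ceil(num_cases / batchsize)
    return batchsize, num_batches


print(choose_batches_sketch(10, batchsize=3))    # (3, 4)
print(choose_batches_sketch(10, num_batches=4))  # (3, 4)
print(choose_batches_sketch(10))                 # (1, 10)
```

In this sketch the last batch may hold fewer cases than batchsize, which is why the product is an upper bound rather than an equality.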

ensure_dirs_exists()[source]

Make sure the directory structure for this crop exists.

save_info(combos=None, cases=None, fn_args=None)[source]

Save information about the sowed cases.

load_info()[source]

Load the full settings from disk.

load_batch(batch_number)[source]

Load a specific batch from disk.

load_result(batch_number)[source]

Load a specific result from disk.

save_result(batch_number, result)[source]

Save a specific result to disk.

_sync_info_from_disk(only_missing=True)[source]

Load information about the saved cases.

save_function_to_disk()[source]

Save the base function to disk using cloudpickle

load_function()[source]

Load the saved function from disk, and try to re-insert it back into Harvester or Runner if present.

prepare(combos=None, cases=None, fn_args=None)[source]

Write information about this crop and the supplied combos to disk. Typically done at start of sow, not when Crop instantiated.

is_prepared()[source]

Check whether this crop has been written to disk.

calc_progress()[source]

Calculate how much progress has been made in growing the batches.

is_ready_to_reap()[source]

Have all batches been grown?

completed_results() → tuple[int, Ellipsis][source]

Return tuple of batches which have been grown already.

missing_results() → tuple[int, Ellipsis][source]

Return tuple of batches which haven’t been grown yet.

delete_all()[source]

Delete the crop directory and all its contents, and reset any loaded information on this Crop object.

handle_existing(action='ask', msg=None, e=None, overwrite=False)[source]

Handle an already prepared crop.

Parameters:
  • action ({'ask', 'reap', 'delete', 'skip', 'raise'}) – What to do with the existing crop. If 'ask' (default), interactively prompt the user. Otherwise, execute the specified action directly.

  • msg (str, optional) – Message to display when prompting.

  • e (Exception, optional) – Exception to re-raise if action is 'raise'.

  • overwrite (bool, optional) – Whether to overwrite existing data when reaping.

property all_nan_result

Get a stand-in result for cases which are missing still.

__str__()[source]
__repr__()[source]
parse_constants(constants=None)[source]
sow_combos(combos, cases=None, constants=None, shuffle=False, verbosity=1, desc='Sow', batchsize=None, num_batches=None)[source]

Sow combos to disk to be later grown, potentially in batches. Note if you have already sown this Crop, as long as the number of batches hasn’t changed (e.g. you have just tweaked the function or a constant argument), you can safely resow and only the batches will be overwritten, i.e. the results will remain.

Parameters:
  • combos (dict_like[str, iterable]) – The combinations to sow for all or some function arguments.

  • cases (iterable of mappings, optional) – Optionally provide a sequence of individual cases to sow for some or all function arguments.

  • constants (mapping, optional) – Provide additional constant function values to use when sowing.

  • shuffle (bool or int, optional) – If given, sow the combos in a random order (using random.seed and random.shuffle), which can be helpful for distributing resources when not all cases are computationally equal.

  • verbosity (int, optional) – How much information to show when sowing. 0: no output, 1: progress bar, 2: progress bar with each setting being sown.

  • desc (str, optional) – Description to show in the progress bar when sowing.

  • batchsize (int, optional) – If specified, set a new batchsize for the crop.

  • num_batches (int, optional) – If specified, set a new num_batches for the crop.

sow_cases(fn_args, cases, combos=None, constants=None, verbosity=1, batchsize=None, num_batches=None)[source]

Sow cases to disk to be later grown, potentially in batches.

Parameters:
  • fn_args (iterable[str] or str) – The names and order of the function arguments, can be None if each case is supplied as a dict.

  • cases (iterable of mappings) – Sequence of individual cases to sow for all or some function arguments.

  • combos (dict_like[str, iterable], optional) – Combinations to sow for some or all function arguments.

  • constants (mapping, optional) – Provide additional constant function values to use when sowing.

  • verbosity (int, optional) – How much information to show when sowing. 0: no output, 1: progress bar, 2: progress bar with each setting being sown.

  • batchsize (int, optional) – If specified, set a new batchsize for the crop.

  • num_batches (int, optional) – If specified, set a new num_batches for the crop.

sow_samples(n, combos=None, constants=None, verbosity=1)[source]

Sow n samples to disk.

grow_subprocess(batch_ids=None, num_workers=None, num_threads=None, gpus=None, affinities=None, raise_errors=False, log=False, min_wait=1e-06, max_wait=0.1, verbosity=1, verbosity_grow=0, desc='Grow')[source]

Grow particular or missing batches using a single fresh subprocess per batch. This has a higher overhead for starting each process, but is more robust memory wise, and allows controlling the number of threads used, CPU affinity and GPU assignment.

Parameters:
  • batch_ids (int or sequence of int, optional) – Which batch numbers to grow, defaults to all missing.

  • num_workers (int, optional) – The maximum number of concurrent subprocesses (default 1).

  • num_threads (int, optional) – The number of threads per subprocess (default 1).

  • gpus (int, str, or sequence of int, optional) – GPU device IDs to assign to subprocesses via CUDA_VISIBLE_DEVICES. Each subprocess gets a single GPU from this pool; the pool also limits concurrency to the number of GPUs provided. You can oversubscribe GPUs by repeating device IDs, e.g. 0,0,1,1 to allow 2 subprocesses to share each GPU.

  • affinities (int, str, or sequence of int, optional) – CPU core IDs to pin subprocesses to via taskset. Also limits concurrency to the number of affinities.

  • raise_errors (bool, optional) – Whether to raise errors encountered during growing.

  • log (bool, optional) – Whether to save subprocess stdout and stderr to log files in the crop directory under logs/batch-{batch_id}.log. Default is False, which discards stdout and only prints stderr on error.

  • min_wait (float, optional) – Minimum polling interval in seconds.

  • max_wait (float, optional) – Maximum polling interval in seconds.

  • verbosity (int, optional) – How much information to show when growing. 0: no output, 1: progress bar, 2: progress bar with each setting being grown.

  • verbosity_grow (int, optional) – Verbosity within each batch grow.

  • desc (str, optional) – Description to show in the progress bar when growing.
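
The oversubscription trick for gpus can be illustrated with a small sketch. parse_gpu_pool is a hypothetical helper, not part of xyzpy, and treating a bare int as a single device ID is an assumption.

```python
def parse_gpu_pool(gpus):
    """Normalise a gpus argument into a list of device IDs (illustrative)."""
    if isinstance(gpus, int):
        return [gpus]  # assumption: a bare int names one device
    if isinstance(gpus, str):
        return [int(token) for token in gpus.split(",")]
    return [int(g) for g in gpus]


pool = parse_gpu_pool("0,0,1,1")
print(pool)       # [0, 0, 1, 1]
print(len(pool))  # 4, i.e. up to four concurrent subprocesses

# each slot in the pool corresponds to one environment setting
envs = [{"CUDA_VISIBLE_DEVICES": str(g)} for g in pool]
print(envs[1])    # {'CUDA_VISIBLE_DEVICES': '0'}
```

Repeating an ID adds a slot to the pool, so two subprocesses can hold device 0 at the same time.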

grow(batch_ids=None, subprocess='auto', num_workers=None, num_threads=None, gpus=None, affinities=None, raise_errors=False, debugging=False, verbosity=1, verbosity_grow=0, log=False, desc='Grow', **combo_runner_opts)[source]

Grow specific batch numbers using this process.

Parameters:
  • batch_ids (int or sequence of ints, optional) – Which batch numbers to grow, by default all missing results.

  • subprocess ("auto" or bool, optional) – Whether to grow each batch in a fresh subprocess. This adds about 1 second of overhead per batch, but allows the number of threads, cpu affinity and gpu assignment to be controlled. If “auto” (default) then subprocesses will be used if num_threads, gpus or affinities are specified. See Crop.grow_subprocess() for details.

  • num_workers (int, optional) – Maximum number of batches to run concurrently. In subprocess mode this is the cap on simultaneous subprocesses (defaults to 1 if not given). In in-process mode this is the size of the joblib loky process pool used by combo_runner_core (None = serial).

  • num_threads (int, optional) – Number of threads each worker is allowed to use, applied via the standard env vars (OMP_NUM_THREADS, MKL_NUM_THREADS, OPENBLAS_NUM_THREADS, …). Only meaningful in subprocess mode (the env vars must be set before numerical libraries are imported); setting it implies subprocess=True when subprocess="auto". Passing this with subprocess=False raises ValueError.

  • gpus (int, str, or sequence of int, optional) – GPU device IDs to assign to subprocesses via CUDA_VISIBLE_DEVICES. Each subprocess gets a single GPU from this pool; the pool also caps concurrency to its size. Repeat IDs to oversubscribe (e.g. "0,0,1,1" shares each GPU between two workers). Subprocess-mode only — implies subprocess=True when subprocess="auto"; raises ValueError with subprocess=False.

  • affinities (int, str, or sequence of int, optional) – CPU core IDs to pin subprocesses to via taskset. Each subprocess gets one affinity from the pool, which also caps concurrency. Subprocess-mode only — implies subprocess=True when subprocess="auto"; raises ValueError with subprocess=False.

  • raise_errors (bool, optional) – Whether to raise errors if they occur during growing.

  • debugging (bool, optional) – Whether to set the logging level to debug.

  • verbosity (int, optional) – How much information to show when growing. 0: no output, 1: progress bar, 2: progress bar with each setting being grown.

  • verbosity_grow (int, optional) – How much information to show when growing each batch.

  • log (bool, optional) – Whether to save subprocess output to log files. Only used when subprocess=True.

  • desc (str, optional) – Description to show in the progress bar when growing.

  • **combo_runner_opts – Additional options forwarded to either Crop.grow_subprocess() (min_wait, max_wait, …) when subprocess is True, or to combo_runner_core (executor, parallel, …) when subprocess is False.
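
The interplay between subprocess="auto" and the subprocess-only options can be summarised in a few lines. This is a sketch of the documented rules only; resolve_subprocess is a hypothetical helper and the error message is made up.

```python
def resolve_subprocess(subprocess="auto", num_threads=None, gpus=None,
                       affinities=None):
    """Decide whether batches run in fresh subprocesses (illustrative)."""
    needs_subprocess = any(
        opt is not None for opt in (num_threads, gpus, affinities)
    )
    if subprocess == "auto":
        # "auto" switches to subprocess mode only when an option needs it
        return needs_subprocess
    if not subprocess and needs_subprocess:
        raise ValueError(
            "num_threads, gpus and affinities require subprocess mode."
        )
    return bool(subprocess)


print(resolve_subprocess())                 # False
print(resolve_subprocess(num_threads=2))    # True
print(resolve_subprocess(subprocess=True))  # True
```

Calling resolve_subprocess(False, gpus="0") raises ValueError in this sketch, mirroring the documented behaviour.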

grow_missing(**combo_runner_opts)[source]

Grow any missing results using this process.

reap_combos(wait=False, clean_up=None, allow_incomplete=False, verbosity=1, desc='Reap')[source]

Reap already sown and grown results from this crop.

Parameters:
  • wait (bool, optional) – Whether to wait for results to appear. If false (default) all results need to be in place before the reap.

  • clean_up (bool, optional) – Whether to delete all the batch files once the results have been gathered. If left as None this will be automatically set to not allow_incomplete.

  • allow_incomplete (bool, optional) – Allow only partially completed crop results to be reaped, incomplete results will all be filled-in as nan.

  • verbosity (int, optional) – How much information to show when reaping. 0: no output, 1: progress bar, 2: progress bar with each setting being reaped.

  • desc (str, optional) – Description to show in the progress bar when reaping.

Returns:

results – ‘N-dimensional’ tuple containing the results.

Return type:

nested tuple
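
How clean_up and allow_incomplete interact can be sketched as follows. The helper reap_sketch, the 0-based batch numbering and the one-result-per-batch layout are all assumptions made for illustration.

```python
import math


def reap_sketch(results, num_batches, allow_incomplete=False, clean_up=None):
    """Gather per-batch results, filling gaps with nan (illustrative).

    results maps batch number -> result for the batches grown so far.
    """
    if clean_up is None:
        # documented default: only clean up after a complete reap
        clean_up = not allow_incomplete
    missing = [b for b in range(num_batches) if b not in results]
    if missing and not allow_incomplete:
        raise RuntimeError(f"Batches not yet grown: {missing}")
    reaped = tuple(results.get(b, math.nan) for b in range(num_batches))
    return reaped, clean_up


reaped, clean_up = reap_sketch({0: "a", 2: "c"}, 3, allow_incomplete=True)
print(reaped)    # ('a', nan, 'c')
print(clean_up)  # False
```

Keeping the batch files after an incomplete reap means the remaining batches can still be grown and reaped later.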

reap_combos_to_ds(var_names=None, var_dims=None, var_coords=None, constants=None, attrs=None, parse=True, wait=False, clean_up=None, allow_incomplete=False, to_df=False, verbosity=1, desc='Reap')[source]

Reap a function over sowed combinations and output to a Dataset.

Parameters:
  • var_names (str, sequence of strings, or None) – Variable name(s) of the output(s) of fn, set to None if fn outputs data already labeled in a Dataset or DataArray.

  • var_dims (sequence of either strings or string sequences, optional) – ‘Internal’ names of dimensions for each variable, the values for each dimension should be contained as a mapping in either var_coords (not needed by fn) or constants (needed by fn).

  • var_coords (mapping, optional) – Mapping of extra coords the output variables may depend on.

  • constants (mapping, optional) – Arguments to fn which are not iterated over, these will be recorded either as attributes or coordinates if they are named in var_dims.

  • resources (mapping, optional) – Like constants but they will not be recorded.

  • attrs (mapping, optional) – Any extra attributes to store.

  • wait (bool, optional) – Whether to wait for results to appear. If false (default) all results need to be in place before the reap.

  • clean_up (bool, optional) – Whether to delete all the batch files once the results have been gathered. If left as None this will be automatically set to not allow_incomplete.

  • allow_incomplete (bool, optional) – Allow only partially completed crop results to be reaped; incomplete results will all be filled in as nan.

  • to_df (bool, optional) – Whether to reap to a xarray.Dataset or a pandas.DataFrame.

  • verbosity (int, optional) – How much information to show when reaping. 0: no output, 1: progress bar, 2: progress bar with each setting being reaped.

  • desc (str, optional) – Description to show in the progress bar when reaping.

Returns:

Multidimensional labeled dataset containing all the results.

Return type:

xarray.Dataset or pandas.DataFrame

reap_runner(runner, wait=False, clean_up=None, allow_incomplete=False, to_df=False, verbosity=1, desc='Reap', **kwargs)[source]

Reap a Crop over sowed combos and save to a dataset defined by a Runner.

reap_harvest(harvester, wait=False, sync=True, overwrite=None, clean_up=None, allow_incomplete=False, verbosity=1, desc='Reap', **kwargs)[source]

Reap a Crop over sowed combos and merge with the dataset defined by a Harvester.

reap_samples(sampler, wait=False, sync=True, clean_up=None, allow_incomplete=False, verbosity=1, desc='Reap', **kwargs)[source]

Reap a Crop over sowed combos and merge with the dataframe defined by a Sampler.

reap(wait=False, sync=True, overwrite=None, clean_up=None, allow_incomplete=False, verbosity=1, desc='Reap')[source]

Reap sown and grown combos from disk. Return a dataset if a runner or harvester is set, otherwise, the raw nested tuple.

Parameters:
  • wait (bool, optional) – Whether to wait for results to appear. If false (default) all results need to be in place before the reap.

  • sync (bool, optional) – Immediately sync the new dataset with the on-disk full dataset or dataframe if a harvester or sampler is used.

  • overwrite (bool, optional) – How to compare data when syncing to the on-disk dataset. None (default): merge as long as there are no conflicts. True: overwrite with the new data. False: discard any new conflicting data.

  • clean_up (bool, optional) – Whether to delete all the batch files once the results have been gathered. If left as None this will be automatically set to not allow_incomplete.

  • allow_incomplete (bool, optional) – Allow only partially completed crop results to be reaped; incomplete results will all be filled in as nan.

  • verbosity (int, optional) – How much information to show when reaping. 0: no output, 1: progress bar, 2: progress bar with each setting being reaped.

  • desc (str, optional) – Description to show in the progress bar when reaping.

Return type:

nested tuple or xarray.Dataset
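
The interplay of allow_incomplete and clean_up can be sketched in plain Python. This is an illustration only, not xyzpy's code: reap_sketch and its dict of on-disk results are hypothetical, and raising an error for missing results when allow_incomplete is False is an assumption.

```python
import math


def reap_sketch(expected_ids, results_on_disk, allow_incomplete=False, clean_up=None):
    """Gather one result per batch id, mimicking the documented flag interplay."""
    if clean_up is None:
        # Documented default: clean_up is set to `not allow_incomplete`.
        clean_up = not allow_incomplete
    missing = [i for i in expected_ids if i not in results_on_disk]
    if missing and not allow_incomplete:
        # Assumption: an incomplete crop is an error unless explicitly allowed.
        raise RuntimeError(f"results missing for batches {missing}")
    # Missing entries become nan placeholders.
    results = tuple(results_on_disk.get(i, math.nan) for i in expected_ids)
    return results, clean_up


partial, clean = reap_sketch([1, 2, 3], {1: 0.5, 3: 1.5}, allow_incomplete=True)
```

Here the batch files would be kept (clean is False), so the missing batch can still be grown and reaped later.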

check_bad(delete_bad=True)[source]

Check that the result dumps are not bad: sometimes their length does not match the batch. Optionally delete these so that they can be re-grown.

Parameters:

delete_bad (bool) – Delete bad results as they are encountered.

Returns:

bad_ids – The bad batch numbers.

Return type:

tuple
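
The length check described above might look like the following sketch. It uses in-memory dicts in place of the on-disk batch and result files; find_bad_batches and that layout are assumptions made for illustration.

```python
def find_bad_batches(batches, results, delete_bad=True):
    """Return ids of results whose length differs from their batch's length."""
    bad_ids = []
    for batch_id, cases in batches.items():
        if batch_id in results and len(results[batch_id]) != len(cases):
            bad_ids.append(batch_id)
            if delete_bad:
                # Dropping the result marks the batch as needing to be re-grown.
                del results[batch_id]
    return tuple(bad_ids)


batches = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
results = {1: [0.1, 0.2], 2: [0.3]}  # batch 2 is too short, batch 3 not grown yet
bad_ids = find_bad_batches(batches, results)
```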

_get_fn()[source]
_set_fn(fn)[source]
_del_fn()[source]
fn
property num_sown_batches

Total number of batches to be run/grown.

property num_results
xyzpy.clean_slurm_outputs(job, directory='.', cancel_if_finished=True)[source]
xyzpy.grow(batch_number, crop=None, fn=None, num_workers=None, check_mpi=True, verbosity=2, debugging=False, raise_errors=True)[source]

Automatically process a batch of cases into results. Should be run in an “.xyz-{fn_name}” folder, or crop should be specified.

Parameters:
  • batch_number (int) – Which batch to ‘grow’ into a set of results.

  • crop (xyzpy.Crop) – Description of where and how to store the cases and results.

  • fn (callable, optional) – If specified, the function used to generate the results, otherwise the function will be loaded from disk.

  • num_workers (int, optional) – If specified, grow using a pool of this many workers. This uses joblib.externals.loky to spawn processes.

  • check_mpi (bool, optional) – Whether to check if the process is rank 0 and only save results if so - allows mpi functions to be simply used. Defaults to true, this should only be turned off if e.g. a pool of workers is being used to run different grow instances.

  • verbosity ({0, 1, 2}, optional) – How much information to show.

  • debugging (bool, optional) – Set logging level to DEBUG.

  • raise_errors (bool, optional) – Whether to raise errors that occur during the computation. If growing many batches in parallel, it can be useful to set this to False so a single error doesn’t crash the whole process.

xyzpy.load_crops(directory='.')[source]

Automatically load all the crops found in the current directory.

Parameters:

directory (str, optional) – Which directory to load the crops from, defaults to ‘.’, the current directory.

Returns:

Mapping of the crop name to the Crop.

Return type:

dict[str, Crop]

xyzpy.manage_slurm_outputs(crop, job, wait_time=60)[source]
class xyzpy.Harvester(runner: Runner, data_name=None, chunks=None, engine='h5netcdf', full_ds=None)[source]

Bases: object

Container class for collecting and aggregating data to disk.

Parameters:
  • runner (Runner) – Performs the runs and describes the results.

  • data_name (str, optional) – Base file path to save data to.

  • chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded and merged into with on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.

  • full_ds (xarray.Dataset) – Initialize the Harvester with this dataset as the initial full dataset.

Members:
  • full_ds – Dataset containing all data harvested so far, by default synced to disk.

  • last_ds (xarray.Dataset) – Dataset containing just the data from the last harvesting run.

runner
data_name = None
engine = 'h5netcdf'
chunks = None
_full_ds = None
property fn
__call__(*args, **kwargs)[source]
property last_ds

Dataset containing the last run’s data.

load_full_ds(chunks=None, engine=None)[source]

Load the disk dataset into full_ds.

Parameters:
  • chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded and merged into with on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.

property full_ds

Dataset containing all saved runs.

save_full_ds(new_full_ds=None, engine=None)[source]

Save full_ds onto disk. The old file is moved and kept as a backup in case of errors when writing the new dataset to disk.

Parameters:
  • new_full_ds (xarray.Dataset, optional) – Save this dataset as the new full dataset, else use the current full dataset.

  • engine (str, optional) – Engine to use to save and load datasets.
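
The move-then-write pattern can be sketched with the standard library. This is not how xyzpy itself is implemented: save_with_backup, the ‘.BAK’ suffix, and the JSON format are all assumptions chosen to keep the example self-contained.

```python
import json
import os
import tempfile


def save_with_backup(data, path):
    """Write data to path, keeping any previous file as path + '.BAK'."""
    backup = path + ".BAK"
    had_old = os.path.exists(path)
    if had_old:
        os.replace(path, backup)  # move the old file aside first
    try:
        with open(path, "w") as f:
            json.dump(data, f)
    except Exception:
        # Writing failed: restore the previous file before re-raising.
        if had_old:
            os.replace(backup, path)
        raise
    return backup if had_old else None


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "full.json")
first_backup = save_with_backup({"run": 1}, target)   # nothing to back up yet
second_backup = save_with_backup({"run": 2}, target)  # previous file is kept
```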

delete_ds(backup=False)[source]

Delete the on-disk dataset, optionally backing it up first.

add_ds(new_ds, sync=True, overwrite=None, chunks=None, engine=None)[source]

Merge a new dataset into the in-memory full dataset.

Parameters:
  • new_ds (xr.Dataset or xr.DataArray) – Data to be merged into the full dataset.

  • sync (bool, optional) – If True (default), load and save the disk dataset before and after merging in the new data.

  • overwrite ({None, False, True}, optional) –

    How to combine data from the new run into the current full_ds:

    • None (default): attempt the merge and only raise if data conflicts.

    • True: overwrite conflicting current data with that from the new dataset.

    • False: drop any conflicting data from the new dataset.

  • chunks (int or dict, optional) – If not None, passed to xarray so that the full dataset is loaded and merged into with on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.
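
The three overwrite modes can be illustrated on plain dicts keyed by coordinate tuples. This is a conceptual sketch only; merge_points is a hypothetical helper, and the real merge operates on xarray datasets.

```python
def merge_points(full, new, overwrite=None):
    """Merge mapping `new` into a copy of `full`, keyed by coordinate tuples."""
    merged = dict(full)
    for key, value in new.items():
        if key not in merged:
            merged[key] = value
        elif merged[key] == value:
            continue  # identical data is not a conflict
        elif overwrite is None:
            raise ValueError(f"conflicting data at {key}")
        elif overwrite:
            merged[key] = value  # new data wins
        # overwrite is False: keep the existing value and drop the new one
    return merged


full = {(1, 10): 11, (2, 10): 12}
new = {(2, 10): 99, (3, 10): 13}
kept_new = merge_points(full, new, overwrite=True)
kept_old = merge_points(full, new, overwrite=False)
```

Calling merge_points(full, new) with the default overwrite=None raises, because the point (2, 10) holds different values in the two mappings.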

expand_dims(name, value, engine=None)[source]

Add a new coordinate dimension with name and value. The change is immediately synced with the on-disk dataset. Useful if you want to expand the parameter space along a previously constant argument.

drop_sel(labels=None, *, errors='raise', engine=None, **labels_kwargs)[source]

Drop specific values of coordinates from this harvester and its dataset. See http://xarray.pydata.org/en/latest/generated/xarray.Dataset.drop_sel.html. The change is immediately synced with the on-disk dataset. Useful for tidying unneeded data points.

_maybe_expand_combos(combos)[source]

Expand combos with ellipses into full coordinate values from the current full dataset.
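
As a rough sketch of the idea, with a plain dict standing in for the coordinates of the current full dataset (expand_combos and existing_coords are hypothetical names):

```python
def expand_combos(combos, existing_coords):
    """Replace each `...` value with the coordinate values already stored."""
    return {
        name: list(existing_coords[name]) if values is ... else list(values)
        for name, values in combos.items()
    }


existing_coords = {"x": [1, 2, 3], "y": [10, 20]}
expanded = expand_combos({"x": ..., "y": [30]}, existing_coords)
```

This is how a new value of y can be run against every x that is already present, without listing those x values again.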

harvest_combos(combos, *, cases=None, missing_only=False, sync=True, overwrite=None, chunks=None, engine=None, **runner_settings)[source]

Run combos, automatically merging into an on-disk dataset.

Parameters:
  • combos (dict_like[str, iterable]) – The combos to run. The only difference here is that you can supply an ellipsis ..., meaning that all values for that coordinate will be loaded from the current full dataset.

  • missing_only (bool, optional) – If True, only run combos that are not already present in the on-disk dataset.

  • sync (bool, optional) – If True (default), load and save the disk dataset before and after merging in the new data.

  • overwrite ({None, False, True}, optional) –

    • None (default): attempt the merge and only raise if data conflicts.

    • True: overwrite any conflicting current data with that from the new dataset.

    • False: drop any conflicting data from the new dataset.

  • chunks (bool, optional) – If not None, passed to xarray so that the full dataset is loaded and merged into with on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.

  • runner_settings – Supplied to combo_runner().
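
The effect of missing_only can be pictured as filtering the full product of the combos against what is already stored. The sketch below uses a set of coordinate tuples in place of the on-disk dataset; missing_cases is a hypothetical helper, not part of xyzpy.

```python
import itertools


def missing_cases(combos, done):
    """List every combination of `combos` that is not already in `done`."""
    names = list(combos)
    cases = (
        dict(zip(names, values))
        for values in itertools.product(*(combos[n] for n in names))
    )
    return [case for case in cases if tuple(case[n] for n in names) not in done]


combos = {"x": [1, 2], "y": [10, 20]}
already_done = {(1, 10), (2, 20)}
todo = missing_cases(combos, already_done)
```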

harvest_cases(cases, *, sync=True, overwrite=None, chunks=None, engine=None, **runner_settings)[source]

Run cases, automatically merging into an on-disk dataset.

Parameters:
  • cases (list of dict or tuple) – The cases to run.

  • sync (bool, optional) – If True (default), load and save the disk dataset before and after merging in the new data.

  • overwrite ({None, False, True}, optional) –

    What to do regarding clashes with old data:

    • None (default): attempt the merge and only raise if data conflicts.

    • True: overwrite conflicting current data with that from the new dataset.

    • False: drop any conflicting data from the new dataset.

  • chunks (bool, optional) – If not None, passed to xarray so that the full dataset is loaded and merged into with on-disk dask arrays.

  • engine (str, optional) – Engine to use to save and load datasets.

  • runner_settings – Supplied to case_runner().

Crop(name=None, parent_dir=None, save_fn=None, batchsize=None, num_batches=None)[source]

Return a Crop instance with this Harvester, from which fn will be set, and then combos can be sown, grown, and reaped into the Harvester.full_ds. See Crop.

Return type:

Crop

__repr__()[source]
cultivate(combos=None, cases=None, constants=None, name=None, parent_dir=None, batchsize=None, num_batches=None, missing_only=True, shuffle=True, subprocess='auto', num_workers=None, num_threads=None, gpus=None, affinities=None, log=False, raise_errors=True, verbosity=1, on_existing='ask', on_error='ask', clean_up=None, **grow_kwargs)[source]

Convenience method to run a full cycle of parsing combos into missing cases only, then persistently growing those cases, and finally merging the results into the full dataset.

Parameters:
  • combos (dict_like[str, iterable]) – The combos to run. The only difference here is that you can supply an ellipsis ..., meaning that all values for that coordinate will be loaded from the current full dataset.

  • cases (sequence of mappings or tuples, optional) – A sequence of (partial) individual settings to run. For each case, all settings given by combos will be generated.

  • constants (dict, optional) – Extra constant arguments for this run.

  • name (str, optional) – Name for the crop to be used for on-disk storage of batches, results and logs. You can use different names to grow results for the same dataset concurrently.

  • parent_dir (str, optional) – Parent directory in which to create the crop folder (.xyz-{name}/). Defaults to the current working directory.

  • batchsize (int, optional) – If given, the target number of cases to sow in each batch. This is computed from num_batches if not given and 1 if neither given.

  • num_batches (int, optional) – If given, the target number of batches to sow. This is computed from batchsize if not given and 1 if neither given.

  • missing_only (bool, optional) – If True (default), only run cases that are not already present in the on-disk dataset. If False, the new results will overwrite any existing results.

  • shuffle (bool, optional) – If True (default), shuffle the order of cases before sowing and growing. This can be a useful basic form of load balancing.

  • subprocess ("auto" or bool, optional) – Whether to grow each batch in a fresh subprocess. This adds about 1 second of overhead per batch, but allows the number of threads, cpu affinity and gpu assignment to be controlled. If “auto” (default) subprocesses are used when num_threads, gpus or affinities are specified. See xyzpy.Crop.grow() for details.

  • num_workers (int, optional) – Maximum number of batches to run concurrently. In subprocess mode this caps simultaneous subprocesses (defaults to 1 if not given). In in-process mode this is the joblib loky pool size (None = serial). Forwarded to xyzpy.Crop.grow().

  • num_threads (int, optional) – Number of threads each worker is allowed to use, applied via the standard env vars (OMP_NUM_THREADS, MKL_NUM_THREADS, etc.) in each subprocess. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().

  • gpus (int, str, or sequence of int, optional) – GPU device IDs to assign to subprocesses via CUDA_VISIBLE_DEVICES; the pool also caps concurrency. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().

  • affinities (int, str, or sequence of int, optional) – CPU core IDs to pin subprocesses to via taskset; the pool also caps concurrency. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().

  • log (bool, optional) – Whether to save subprocess stdout and stderr to files in the crop directory under logs/batch-{batch_id}.log. Subprocess-mode only. Forwarded to xyzpy.Crop.grow().

  • raise_errors (bool, optional) – If True (default), raise any errors that occur during growing, otherwise just log them and continue with the next batch.

  • verbosity (int, optional) – The level of logging to print during the sow/grow/reap process. 0: no output, 1: progress bars, 2: progress bars with current setting postfixed.

  • on_existing ({'ask', 'reap', 'delete', 'skip', 'raise'}, optional) – What to do if a crop with the same name already exists on disk. Default is 'ask' (interactive prompt).

  • on_error ({'ask', 'reap', 'delete', 'skip', 'raise'}, optional) – What to do if an error occurs during grow/reap. Default is 'ask' (interactive prompt).

  • clean_up (bool or None, optional) – Whether to delete the on-disk batch, result and log files after successfully reaping.

  • grow_kwargs – Further keyword arguments forwarded to xyzpy.Crop.grow() (e.g. executor, min_wait, …).
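
The subprocess="auto" rule and the environment-variable mechanism can be sketched with the standard library. This is an illustration under assumptions, not xyzpy's code: use_subprocess and worker_env are hypothetical helpers, and only two of the thread-count variables are set.

```python
import os
import subprocess
import sys


def use_subprocess(subprocess_opt="auto", num_threads=None, gpus=None, affinities=None):
    """Resolve the documented 'auto' rule into a plain bool."""
    if subprocess_opt == "auto":
        return any(opt is not None for opt in (num_threads, gpus, affinities))
    return bool(subprocess_opt)


def worker_env(num_threads=None, gpu=None):
    """Build the environment for one worker subprocess."""
    env = dict(os.environ)
    if num_threads is not None:
        env["OMP_NUM_THREADS"] = env["MKL_NUM_THREADS"] = str(num_threads)
    if gpu is not None:
        env["CUDA_VISIBLE_DEVICES"] = str(gpu)
    return env


# Launch a child Python process and read back the value it sees.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"],
    env=worker_env(num_threads=2, gpu=0),
    capture_output=True,
    text=True,
    check=True,
)
seen_threads = child.stdout.strip()
```

Because these variables are read when numerical libraries start up, setting them per subprocess is what makes a fresh process per batch useful for controlling threads and devices.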

class xyzpy.Runner(fn, var_names, fn_args=None, var_dims=None, var_coords=None, constants=None, resources=None, attrs=None, **default_runner_settings)[source]

Bases: object

Container class with all the information needed to systematically run a function over many parameters and capture the output in a dataset.

Parameters:
  • fn (callable) – Function that produces a single instance of a result.

  • var_names (str, sequence of str, or None) – The ordered name(s) of the output variable(s) of fn. Set this explicitly to None if fn outputs already labelled data as a dict, Dataset, or DataArray.

  • fn_args (str, or sequence of str, optional) – The ordered name(s) of the input argument(s) of fn. This is only needed if the cases or combos supplied are not dict-like.

  • var_dims (dict-like, optional) – Mapping of output variables to their named internal dimensions, can be the names of constants.

  • var_coords (dict-like, optional) – Mapping of the output variables’ named internal dimensions to the actual values they take.

  • constants (dict-like, optional) – Constant arguments to be supplied to fn. These can be used as ‘var_dims’, and will be saved as coords if so, otherwise as attributes.

  • resources (dict-like, optional) – Like constants but not saved to the dataset, e.g. if very big.

  • attrs (dict-like, optional) – Any other miscellaneous information to be saved with the dataset.

  • default_runner_settings – These keyword arguments will be supplied as defaults to any runner.

fn
_var_names = (None,)
_fn_args
_var_dims
_var_coords
_constants
_resources
_attrs
_last_ds = None
default_runner_settings
__call__(*args, **kwargs)[source]
_get_fn_args()[source]
_set_fn_args(fn_args)[source]
_del_fn_args()[source]
fn_args
_get_var_names()[source]
_set_var_names(var_names)[source]
_del_var_names()[source]
var_names
_get_var_dims()[source]
_set_var_dims(var_dims, var_names=None)[source]
_del_var_dims()[source]
var_dims
_get_var_coords()[source]
_set_var_coords(var_coords)[source]
_del_var_coords()[source]
var_coords
_get_constants()[source]
_set_constants(constants)[source]
_del_constants()[source]
constants
_get_resources()[source]
_set_resources(resources)[source]
_del_resources()[source]
resources
property last_ds
run_combos(combos, constants=(), **runner_settings)[source]

Run combos using the function map and save to dataset.

Parameters:
  • combos (dict_like[str, iterable]) – The values of each function argument with which to evaluate all combinations.

  • constants (dict, optional) – Extra constant arguments for this run, repeated arguments will take precedence over stored constants but for this run only.

  • runner_settings – Keyword arguments supplied to combo_runner().
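
The precedence rule for constants amounts to a non-mutating dict merge. A minimal sketch, where constants_for_run is a hypothetical helper rather than an xyzpy function:

```python
def constants_for_run(stored, extra=()):
    """Combine stored constants with per-run ones without mutating `stored`."""
    return {**stored, **dict(extra)}


stored = {"tol": 1e-6, "method": "exact"}
this_run = constants_for_run(stored, {"tol": 1e-3})
```

The per-run value of tol is used for this run, while stored is left unchanged for later runs.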

run_cases(cases, constants=(), fn_args=None, **runner_settings)[source]

Run cases using the function and save to dataset.

Parameters:
  • cases (sequence of mappings or tuples) – A sequence of cases.

  • constants (dict (optional)) – Extra constant arguments for this run, repeated arguments will take precedence over stored constants but for this run only.

  • runner_settings – Supplied to case_runner().

Crop(name=None, parent_dir=None, save_fn=None, batchsize=None, num_batches=None)[source]

Return a Crop instance with this runner, from which fn will be set, and then combos can be sown, grown, and reaped into the Runner.last_ds. See Crop.

Return type:

Crop

__repr__()[source]
class xyzpy.Sampler(runner, data_name=None, default_combos=None, full_df=None, engine='pickle')[source]

Like a Harvester, but randomly samples combos and writes the table of results to a pandas.DataFrame.

Parameters:
  • runner (xyzpy.Runner) – Runner describing a labelled function to run.

  • data_name (str, optional) – If given, the on-disk file to sync results with.

  • default_combos (dict_like[str, iterable], optional) – The default combos to sample from (which can be overridden).

  • full_df (pandas.DataFrame, optional) – If given, use this dataframe as the initial ‘full’ data.

  • engine ({'pickle', 'csv', 'json', 'hdf', ...}, optional) – How to save and load the on-disk dataframe. See load_df() and save_df().

full_df

Dataframe describing all data harvested so far.

Type:

pandas.DataFrame

last_df

Dataframe describing the data harvested on the previous run.

Type:

pandas.DataFrame

runner
data_name = None
default_combos
_full_df = None
_last_df = None
engine = 'pickle'
property fn
load_full_df(engine=None)[source]

Load the on-disk full dataframe into memory.

property full_df

The dataframe describing all data harvested so far.

property last_df

The dataframe describing the last set of data harvested.

save_full_df(new_full_df=None, engine=None)[source]

Save full_df onto disk.

Parameters:
  • new_full_df (pandas.DataFrame, optional) – Save this dataframe as the new full dataframe, else use the current full_df.

  • engine (str, optional) – Which engine to save the dataframe with, if None use the default.

delete_df(backup=False)[source]

Delete the on-disk dataframe, optionally backing it up first.

add_df(new_df, sync=True, engine=None)[source]

Merge a new dataframe into the in-memory full dataframe.

Parameters:
  • new_df (pandas.DataFrame or dict) – Data to be appended to the full dataframe.

  • sync (bool, optional) – If True (default), load and save the disk dataframe before and after merging in the new data.

  • engine (str, optional) – Which engine to save the dataframe with.

gen_cases_fnargs(n, combos=None)[source]
sample_combos(n, combos=None, engine=None, **case_runner_settings)[source]

Sample the target function many times, randomly choosing parameter combinations from combos (or Sampler.default_combos).

Parameters:
  • n (int) – How many samples to run.

  • combos (dict_like[str, iterable], optional) – A mapping of function arguments to potential choices. Any keys in here will override default_combos. You can also supply a callable to manually return a random choice, e.g. from a probability distribution.

  • engine (str, optional) – Which method to use to sync with the on-disk dataframe.

  • case_runner_settings – Supplied to case_runner() and so onto combo_runner(). This includes parallel=True etc.
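
Random case generation of this kind can be sketched with the standard library. sample_cases below is a hypothetical stand-in, not the Sampler's implementation; it assumes that each argument is drawn independently and uniformly unless a callable is supplied.

```python
import random


def sample_cases(n, combos, default_combos=None, seed=None):
    """Draw n random cases; callables are invoked, iterables are sampled."""
    rng = random.Random(seed)
    # Keys in combos override those in default_combos.
    merged = {**(default_combos or {}), **combos}
    return [
        {
            name: choice() if callable(choice) else rng.choice(list(choice))
            for name, choice in merged.items()
        }
        for _ in range(n)
    ]


cases = sample_cases(
    5,
    combos={"x": [1, 2, 3], "noise": lambda: random.gauss(0.0, 1.0)},
    default_combos={"x": [0], "y": ["a", "b"]},
    seed=0,
)
```

Each case is a dict of keyword arguments, which is the shape a case runner can consume directly.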

Crop(name=None, parent_dir=None, save_fn=None, batchsize=None, num_batches=None)[source]

Return a Crop instance with this Sampler, from which fn will be set, and then samples can be sown, grown, and reaped into the Sampler.full_df. See Crop.

Return type:

Crop

__repr__()[source]
xyzpy.cultivate(fn, *, var_names=None, data_name=None, runner_opts=None, harvester_opts=None, combos=None, cases=None, constants=None, name=None, parent_dir=None, batchsize=None, num_batches=None, missing_only=True, shuffle=True, subprocess='auto', num_workers=None, num_threads=None, gpus=None, affinities=None, log=False, raise_errors=True, verbosity=1, on_existing='ask', on_error='ask', clean_up=None, **grow_kwargs)[source]

Convenience function to run a full cycle of annotating a function, parsing combos into missing cases only, then persistently growing those cases, and finally merging the results into the full dataset.

Parameters:
  • fn (callable) – The function to run over combos and cases. This will be wrapped in a Runner and Harvester to perform the cultivation process. If var_names is None, it should return a dict, Dataset or DataArray.

  • var_names (str, sequence of str, or None) – The ordered name(s) of the output variable(s) of fn. Set this explicitly to None if fn outputs already labelled data as a dict, Dataset, or DataArray.

  • data_name (str, optional) – If given, the on-disk file to sync results with. If not set there will be no persistent results, since the harvester created in this functional interface is ephemeral.

  • runner_opts (dict, optional) – Keyword arguments to be supplied to Runner.

  • harvester_opts (dict, optional) – Keyword arguments to be supplied to Harvester.

  • combos (dict_like[str, iterable]) – The combos to run. The only difference here is that you can supply an ellipsis ..., meaning that all values for that coordinate will be loaded from the current full dataset.

  • cases (sequence of mappings or tuples, optional) – A sequence of (partial) individual settings to run. For each case, all settings given by combos will be generated.

  • constants (dict, optional) – Extra constant arguments for this run.

  • name (str, optional) – Name for the crop to be used for on-disk storage of batches, results and logs. You can use different names to grow results for the same dataset concurrently.

  • parent_dir (str, optional) – Parent directory in which to create the crop folder (.xyz-{name}/). Defaults to the current working directory.

  • batchsize (int, optional) – If given, the target number of cases to sow in each batch. This is computed from num_batches if not given and 1 if neither given.

  • num_batches (int, optional) – If given, the target number of batches to sow. This is computed from batchsize if not given and 1 if neither given.

  • missing_only (bool, optional) – If True (default), only run cases that are not already present in the on-disk dataset.

  • shuffle (bool, optional) – If True (default), shuffle the order of cases before sowing and growing. This can be a useful basic form of load balancing.

  • subprocess ("auto" or bool, optional) – Whether to grow each batch in a fresh subprocess. This adds about 1 second of overhead per batch, but allows the number of threads, cpu affinity and gpu assignment to be controlled. If “auto” (default) subprocesses are used when num_threads, gpus or affinities are specified. See xyzpy.Crop.grow() for details.

  • num_workers (int, optional) – Maximum number of batches to run concurrently. In subprocess mode this caps simultaneous subprocesses (defaults to 1 if not given). In in-process mode this is the joblib loky pool size (None = serial). Forwarded to xyzpy.Crop.grow().

  • num_threads (int, optional) – Number of threads each worker is allowed to use, applied via the standard env vars (OMP_NUM_THREADS, MKL_NUM_THREADS, etc.) in each subprocess. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().

  • gpus (int, str, or sequence of int, optional) – GPU device IDs to assign to subprocesses via CUDA_VISIBLE_DEVICES; the pool also caps concurrency. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().

  • affinities (int, str, or sequence of int, optional) – CPU core IDs to pin subprocesses to via taskset; the pool also caps concurrency. Implies subprocess=True when subprocess="auto". Forwarded to xyzpy.Crop.grow().

  • log (bool, optional) – Whether to save subprocess stdout and stderr to files in the crop directory under logs/batch-{batch_id}.log. Subprocess-mode only. Forwarded to xyzpy.Crop.grow().

  • raise_errors (bool, optional) – If True (default), raise any errors that occur during growing, otherwise just log them and continue with the next batch.

  • verbosity (int, optional) – The level of logging to print during the sow/grow/reap process. 0: no output, 1: progress bars, 2: progress bars with current setting postfixed.

  • on_existing ({'ask', 'reap', 'delete', 'skip', 'raise'}, optional) – What to do if a crop with the same name already exists on disk. Default is 'ask' (interactive prompt).

  • on_error ({'ask', 'reap', 'delete', 'skip', 'raise'}, optional) – What to do if an error occurs during grow/reap. Default is 'ask' (interactive prompt).

  • clean_up (bool or None, optional) – Whether to delete the on-disk batch, result and log files after successfully reaping.

  • grow_kwargs – Further keyword arguments forwarded to xyzpy.Crop.grow() (e.g. executor, min_wait, …).
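
The relationship between batchsize and num_batches can be sketched as below. This is only an illustration under assumptions: resolve_batching is a hypothetical helper, ceiling division is assumed for the ‘target’ sizes, batchsize is assumed to win if both are given, and the fallback of one case per batch when neither is given is just one reading of the documented default.

```python
import math


def resolve_batching(num_cases, batchsize=None, num_batches=None):
    """Return a consistent (batchsize, num_batches) pair for num_cases cases."""
    if batchsize is None and num_batches is None:
        batchsize = 1  # assumption: one case per batch when nothing is specified
    if batchsize is None:
        batchsize = math.ceil(num_cases / num_batches)
    # Recompute so that the batches exactly cover all cases.
    num_batches = math.ceil(num_cases / batchsize)
    return batchsize, num_batches


by_size = resolve_batching(10, batchsize=4)
by_count = resolve_batching(10, num_batches=4)
```

In both calls the last batch is smaller than the others, which is why the documentation speaks of target rather than exact sizes.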

xyzpy.label(var_names, fn_args=None, var_dims=None, var_coords=None, constants=None, resources=None, attrs=None, harvester=False, sampler=False, engine=None, **default_runner_settings)[source]

Convenient decorator to automatically wrap a function as a Runner or Harvester.

Parameters:
  • var_names (str, sequence of str, or None) – The ordered name(s) of the output variable(s) of fn. Set this explicitly to None if fn outputs already labelled data as a dict, Dataset, or DataArray.

  • fn_args (str, or sequence of str, optional) – The ordered name(s) of the input argument(s) of fn. This is only needed if the cases or combos supplied are not dict-like.

  • var_dims (dict-like, optional) – Mapping of output variables to their named internal dimensions, can be the names of constants.

  • var_coords (dict-like, optional) – Mapping of the output variables’ named internal dimensions to the actual values they take.

  • constants (dict-like, optional) – Constant arguments to be supplied to fn. These can be used as ‘var_dims’, and will be saved as coords if so, otherwise as attributes.

  • resources (dict-like, optional) – Like constants but not saved to the dataset, e.g. if very big.

  • attrs (dict-like, optional) – Any other miscellaneous information to be saved with the dataset.

  • harvester (bool or str, optional) – If True, wrap the runner as a Harvester, if a string, create the harvester with that as the data_name.

  • default_runner_settings – These keyword arguments will be supplied as defaults to any runner.

Examples

Declare a function as a runner directly:

>>> import xyzpy as xyz

>>> @xyz.label(var_names=['sum', 'diff'])
... def foo(x, y):
...     return x + y, x - y
...

>>> foo
<xyzpy.Runner>
    fn: <function foo at 0x7f1fd8e5b1e0>
    fn_args: ('x', 'y')
    var_names: ('sum', 'diff')
    var_dims: {'sum': (), 'diff': ()}

>>> foo(1, 2)  # can still call it normally
(3, -1)
class xyzpy.RayExecutor(*args, default_remote_opts=None, **kwargs)[source]

Basic concurrent.futures like interface using ray.

Example usage:

from xyzpy import RayExecutor

# create a pool that by default requests a single gpu per task
pool = RayExecutor(
    num_cpus=4,
    num_gpus=4,
    default_remote_opts={"num_gpus": 1},
)
default_remote_opts
_maybe_inject_remote_opts(remote_opts=None)[source]

Return the default remote options, possibly overriding some with those supplied by a submit call.

submit(fn, *args, pure=False, remote_opts=None, **kwargs)[source]

Remotely run fn(*args, **kwargs), returning a RayFuture.

map(func, *iterables, remote_opts=None)[source]

Remote map func over arguments iterables.

scatter(data)[source]

Push data into the distributed store, returning an ObjectRef that can be supplied to submit calls for example.

shutdown()[source]

Shut down the parent ray cluster; this RayExecutor instance itself does not need any cleanup.

class xyzpy.RayGPUExecutor(*args, gpus_per_task=1, **kwargs)[source]

Bases: RayExecutor

A RayExecutor that by default requests a single gpu per task.

xyzpy.auto_xyz_ds(x, y_z=None)[source]

Automatically turn an array into a xarray dataset. Transpose y_z if necessary to automatically match dimension sizes.

Parameters:
  • x (array_like) – The x-coordinates.

  • y_z (array_like, optional) – The y-data, possibly varying with coordinate z.

xyzpy.cache_to_disk(fn=None, *, cachedir=_DEFAULT_FN_CACHE_PATH, **kwargs)[source]

Cache this function to disk, using joblib.

xyzpy.check_runs(obj, dim='run', var=None, sel=())[source]

Print out information about the range and any missing values for an integer dimension.

Parameters:
  • obj (xarray object) – Data to check.

  • dim (str (optional)) – Dimension to check, defaults to ‘run’.

  • var (str (optional)) – Subselect this data variable first.

  • sel (mapping (optional)) – Subselect these other coordinates first.
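
The kind of report involved can be sketched on a plain list of integer labels. summarize_runs is a hypothetical helper, and the sketch assumes that ‘missing’ means integers absent between the smallest and largest label; the exact output format of check_runs is not reproduced here.

```python
def summarize_runs(runs):
    """Return (lowest, highest, missing) for a collection of integer labels."""
    present = set(runs)
    lowest, highest = min(present), max(present)
    missing = [r for r in range(lowest, highest + 1) if r not in present]
    return lowest, highest, missing


lowest, highest, missing = summarize_runs([0, 1, 2, 5, 6, 9])
print(f"range: {lowest} to {highest}, missing: {missing}")
```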

xyzpy.load_df(name, engine='pickle', key='df', **kwargs)[source]

Load a dataframe from disk.

Parameters:
  • name (str) – File name to read from.

  • engine ({'pickle', 'csv', 'hdf'}, optional) – Storage backend.

  • key (str, optional) – HDF key when engine='hdf'.

  • **kwargs – Passed through to the pandas reader.

Returns:

Loaded dataframe.

Return type:

pandas.DataFrame

xyzpy.load_ds(file_name, engine='h5netcdf', load_to_mem=None, create_new=False, chunks=None, **kwargs)[source]

Loads a xarray dataset. Basically xarray.open_dataset with some different defaults and convenient behaviour.

Parameters:
  • file_name (str) – Name of file to open.

  • engine ({'h5netcdf', 'netcdf4', 'joblib', 'zarr'}, optional) – Engine used to load file.

  • load_to_mem (bool, optional) – Once opened, load from disk into memory. Defaults to True if chunks=None.

  • create_new (bool, optional) – If no file exists make a blank one.

  • chunks (int or dict) – Passed to xarray.open_dataset so that data is stored using dask.array.

Returns:

ds – Loaded Dataset.

Return type:

xarray.Dataset

xyzpy.merge_sync_conflict_datasets(base_name, engine='h5netcdf', combine_first=False)[source]

Glob files based on base_name, merge them, save this new dataset if it contains new info, then clean up the conflicts.

Parameters:
  • base_name (str) – Base file name to glob on - should include ‘*’.

  • engine (str, optional) – Load and save engine used by xarray.

  • combine_first (bool, optional) – If True, combine datasets sequentially using combine_first, preferring the first dataset in the list, which is assumed to be the original. If False, merge all datasets together using xr.merge, which will raise an error if there are any conflicts.

xyzpy.save_df(df, name, engine='pickle', key='df', **kwargs)[source]

Save a dataframe to disk.

Parameters:
  • df (pandas.DataFrame) – DataFrame to save.

  • name (str) – File name to save to.

  • engine ({'pickle', 'csv', 'hdf'}, optional) – Storage backend.

  • key (str, optional) – HDF key when engine='hdf'.

  • **kwargs – Passed through to the pandas writer.

xyzpy.save_ds(ds, file_name, engine='h5netcdf', **kwargs)[source]

Saves an xarray dataset.

Parameters:
  • ds (xarray.Dataset) – The dataset to save.

  • file_name (str) – Name of the file to save to.

  • engine ({'h5netcdf', 'netcdf4', 'joblib', 'zarr'}, optional) – Engine used to save file with.

Return type:

None

xyzpy.save_merge_ds(ds, fname, overwrite=None, **kwargs)[source]

Save dataset ds, but check for an existing dataset with that name first, and if it exists, merge the two before saving.

Parameters:
  • ds (xarray.Dataset) – The dataset to save.

  • fname (str) – The file name.

  • overwrite ({None, False, True}, optional) –

    How to merge the dataset with the existing dataset.

    • None: the datasets will be merged if there are no conflicts

    • False: data will be taken from old dataset if conflicting

    • True: data will be taken from new dataset if conflicting
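The three overwrite policies can be illustrated with a plain-dict analogy, where each key stands in for a point in the dataset. The function merge_with_policy is hypothetical, and raising an error on conflict in the None case is an assumption; the text above only says that merging happens if there are no conflicts.

```python
def merge_with_policy(old, new, overwrite=None):
    """Hypothetical dict analogy of the three overwrite modes."""
    conflicts = {k for k in old.keys() & new.keys() if old[k] != new[k]}
    if overwrite is None:
        # Strict mode: only merge when nothing disagrees (assumed to raise).
        if conflicts:
            raise ValueError(f"conflicting entries: {sorted(conflicts)}")
        return {**old, **new}
    if overwrite is False:
        return {**new, **old}  # old values win on conflict
    return {**old, **new}      # new values win on conflict


old = {"a": 1, "b": 2}
new = {"b": 20, "c": 3}

keep_old = merge_with_policy(old, new, overwrite=False)
keep_new = merge_with_policy(old, new, overwrite=True)

try:
    merge_with_policy(old, new, overwrite=None)
    strict_failed = False
except ValueError:
    strict_failed = True
```

In every mode, entries present in only one of the two inputs are kept; the policy only decides what happens where both define a value.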

xyzpy.sort_dims(ds)[source]

Reorder variable dimensions to match ds.dims. This is an inplace operation.

Parameters:

ds (xarray.Dataset) – Dataset to reorder in place.

Return type:

None

xyzpy.trimna(obj)[source]

Drop values across dims where all values are NaN.

Parameters:

obj (xarray.Dataset or xarray.DataArray) – Object to trim.

Returns:

Trimmed object.

Return type:

same type as obj
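The trimming rule, dropping any row or column whose values are all NaN, can be shown on a nested list. This is a pure-Python sketch of the idea for the 2D case only; trim_all_nan is a hypothetical helper and does not operate on xarray objects.

```python
import math


def trim_all_nan(rows):
    """Drop rows and columns of a 2D nested list that are entirely NaN."""

    def all_nan(values):
        return all(math.isnan(v) for v in values)

    kept_rows = [row for row in rows if not all_nan(row)]
    kept_cols = [
        j
        for j in range(len(rows[0]))
        if not all_nan([row[j] for row in kept_rows])
    ]
    return [[row[j] for j in kept_cols] for row in kept_rows]


nan = float("nan")
grid = [
    [1.0, nan, 2.0],
    [nan, nan, nan],  # dropped: every value is NaN
    [3.0, nan, nan],  # kept: one real value
]
trimmed = trim_all_nan(grid)  # the middle column is dropped as well
```

Note that a partially filled row or column survives, so NaN values can still appear in the trimmed result.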

xyzpy.cimluv(hue, hue_shift=0.0, sat1=1.0, sat2=0.5, val1=0.8, val2=0.3, N=30, reverse=False)[source]

Creates a color map for single hue, using HSLuv color space.

xyzpy.cimple(hue, sat1=0.4, sat2=1.0, val1=0.95, val2=0.35, hue_shift=0.0, name='cimple', auto_adjust_sat=0.2)[source]

Creates a color map for a single hue.

xyzpy.cimple_bright(hue, sat1=0.8, sat2=0.9, val1=0.97, val2=0.3, hue_shift=0.0, name='cimple_bright')[source]

Creates a color map for a single hue, with bright defaults.

xyzpy.cmoke(hue, hue_shift=0.0, sat1=0.36, sat2=0.5, val1=0.38, val2=0.93, N=51, reverse=False)[source]

Creates a color map for single hue, using OKLCH color space.

xyzpy.convert_colors(cols, outformat, informat='MATPLOTLIB')[source]

Convert lists of colors between formats.

xyzpy.get_neutral_style(draw_color=(0.5, 0.5, 0.5))[source]

xyzpy.infiniplot(ds, x, y=None, z=None, **kwargs)[source]

Helper class for the infiniplot functionality.

Parameters:
  • ds (xarray.Dataset) – Dataset to plot.

  • x (str) – Name of the x coordinate.

  • y (str, optional) – Name of the y coordinate. If not specified, histogram mode is activated and the values of x are binned to produce a density or frequency to use as the y-variable.

  • z (str, optional) – Name of the z coordinate. If specified this turns on the heatmap mode.

  • bins (int or array_like, optional) – If in histogram mode, specify either the number of bins to use or the bin edges. If not specified, a default number of bins is automatically chosen based on the number of data points.

  • bins_density (bool, optional) – If in histogram mode, whether to plot the density (True) or frequency (False) of the data. Default is True.

  • aggregate (str or Sequence[str], optional) – If specified, aggregate over the given dimension(s) using aggregate_method (by default ‘median’). If True aggregate over all unmapped dimensions. If in heatmap mode, this is automatically set to True, since only one plot can be shown per axis.

  • aggregate_method (str, optional) – If aggregate is specified, the method to use for aggregation. Any option available as a method on a DataArray can be used, e.g. ‘mean’, ‘median’, ‘max’. Default is ‘median’.

  • aggregate_err_range (float or str, optional) –

    If aggregate is specified, the range of the error bars or bands to show. The options are:

    • 'std': show the standard deviation of the data

    • 'stderr': show the standard error of the mean

    • float: show the given quantile range, e.g. 0.5 for the interquartile range

  • err (str, optional) – If specified, a data variable to use for error bars or bands. This overrides any derived from aggregate.

  • err_style (str, optional) –

    If specified, the style of error to show. The options are:

    • 'bars': show error bars

    • 'band': show error bands

  • err_kws (dict, optional) – Additional keyword arguments to pass to the error plotting function.

  • xlink (str, optional) – If specified, the name of a dimension to use for linking the x-axis. Used when you are plotting a variable rather than coordinate as x, but want to link each sweep of values as a line.

  • color (str, optional) – If specified, the name of a dimension to use for mapping the color or intensity of each line. If hue is also specified, this controls the intensity of the color. If not a dimension, this is used as a constant color for all lines.

  • colors (sequence, optional) – An explicit sequence of colors to use for the color-mapped dimension.

  • color_order (sequence, optional) – An explicit order of values to use for the color-mapped dimension.

  • color_label (str, optional) – An alternate label to use for the color-mapped dimension.

  • color_ticklabels (dict or sequence, optional) – A mapping from values to tick labels to use for the color-mapped dimension.

  • colormap_start (float, optional) – If using a palette, the starting value of the colormap to use, e.g. 0.2 would skip the first 20% of the colormap.

  • colormap_stop (float, optional) – If using a palette, the stopping value of the colormap to use, e.g. 0.9 would skip the last 10% of the colormap.

  • hue (str, optional) – If specified, the name of a dimension to use for mapping the color or hue of each line. If color is also specified, this controls the hue of the color. If not a dimension, this is used as a constant hue for all lines.

  • hues (sequence, optional) – An explicit sequence of hues to use for the hue-mapped dimension.

  • hue_order (sequence, optional) – An explicit order of values to use for the hue-mapped dimension.

  • hue_label (str, optional) – An alternate label to use for the hue-mapped dimension.

  • hue_ticklabels (dict or sequence, optional) – A mapping from values to tick labels to use for the hue-mapped dimension.

  • palette (str, sequence, or colormap, optional) – If specified, the name of a colormap, or an actual colormap, to use for mapping the color or hue of each line. If both color and hue are specified, you can supply a sequence of palettes here, with hue controlling which palette, and color controlling the intensity within the palette.

  • autohue_start (float, optional) – If not using a palette, the starting hue to use for automatically generating a sequence of hues.

  • autohue_sweep (float, optional) – If not using a palette, the sweep of hues to use for automatically generating a sequence of hues.

  • autohue_opts (dict, optional) – Additional keyword arguments to pass to the automatic hue generator; see xyzpy.color.cmoke.

  • marker (str, optional) – If specified, the name of a dimension to use for mapping the marker style of each line. If not a dimension, this is used as a constant marker style for all lines.

  • markers (sequence, optional) – An explicit sequence of markers to use for the marker-mapped dimension.

  • marker_order (sequence, optional) – An explicit order of values to use for the marker-mapped dimension.

  • marker_label (str, optional) – An alternate label to use for the marker-mapped dimension.

  • marker_ticklabels (dict or sequence, optional) – A mapping from values to tick labels to use for the marker-mapped dimension.

  • markersize (str, optional) – If specified, the name of a dimension to use for mapping the marker size of each line. If not a dimension, this is used as a constant marker size for all lines.

  • markersizes (sequence, optional) – An explicit sequence of marker sizes to use for the markersize-mapped dimension.

  • markersize_order (sequence, optional) – An explicit order of values to use for the markersize-mapped dimension.

  • markersize_label (str, optional) – An alternate label to use for the markersize-mapped dimension.

  • markersize_ticklabels (dict or sequence, optional) – A mapping from values to tick labels to use for the markersize-mapped dimension.

  • markeredgecolor (str, optional) – If specified, the name of a dimension to use for mapping the marker edge color of each line. If not a dimension, this is used as a constant marker edge color for all lines.

  • markeredgecolors (sequence, optional) – An explicit sequence of marker edge colors to use for the markeredgecolor-mapped dimension.

  • markeredgecolor_order (sequence, optional) – An explicit order of values to use for the markeredgecolor-mapped dimension.

  • markeredgecolor_label (str, optional) – An alternate label to use for the markeredgecolor-mapped dimension.

  • markeredgecolor_ticklabels (dict or sequence, optional) – A mapping from values to tick labels to use for the markeredgecolor-mapped dimension.

  • linewidth (str, optional) – If specified, the name of a dimension to use for mapping the line width of each line. If not a dimension, this is used as a constant line width for all lines.

  • linewidths (sequence, optional) – An explicit sequence of line widths to use for the linewidth-mapped dimension.

  • linewidth_order (sequence, optional) – An explicit order of values to use for the linewidth-mapped dimension.

  • linewidth_label (str, optional) – An alternate label to use for the linewidth-mapped dimension.

  • linewidth_ticklabels (dict or sequence, optional) – A mapping from values to tick labels to use for the linewidth-mapped dimension.

  • linestyle (str, optional) – If specified, the name of a dimension to use for mapping the line style of each line. If not a dimension, this is used as a constant line style for all lines.

  • linestyles (sequence, optional) – An explicit sequence of line styles to use for the linestyle-mapped dimension.

  • linestyle_order (sequence, optional) – An explicit order of values to use for the linestyle-mapped dimension.

  • linestyle_label (str, optional) – An alternate label to use for the linestyle-mapped dimension.

  • linestyle_ticklabels (dict or sequence, optional) – A mapping from values to tick labels to use for the linestyle-mapped dimension.

  • text (str, optional) – If specified, the name of a dimension to use for mapping text annotations to each line.

  • text_formatter (callable, optional) – A function to use to format data entries to text annotations. Default is str.

  • text_opts (dict, optional) – Additional keyword arguments to pass to the text plotting function.

  • col (str, optional) – If specified, the name of a dimension to use for mapping the subplot column of each line.

  • col_order (sequence, optional) – An explicit order of values to use for the col-mapped dimension.

  • col_label (str, optional) – An alternate label to use for the col-mapped dimension.

  • col_ticklabels (dict or sequence, optional) – A mapping from values to tick labels to use for the col-mapped dimension.

  • row (str, optional) – If specified, the name of a dimension to use for mapping the subplot row of each line.

  • row_order (sequence, optional) – An explicit order of values to use for the row-mapped dimension.

  • row_label (str, optional) – An alternate label to use for the row-mapped dimension.

  • row_ticklabels (dict or sequence, optional) – A mapping from values to tick labels to use for the row-mapped dimension.

  • alpha (float, optional) – Global alpha value to use for all lines.

  • join_across_missing (bool, optional) – If True, join lines across missing (NaN) data. Default is False.

  • err_band_alpha (float, optional) – Alpha value to use for error bands.

  • err_bar_capsize (float, optional) – Size of the caps on error bars.

  • xlabel (str, optional) – Alternate label to use for the x-axis.

  • ylabel (str, optional) – Alternate label to use for the y-axis.

  • xlim (tuple, optional) – Limits to use for the x-axis.

  • ylim (tuple, optional) – Limits to use for the y-axis.

  • xscale (str, optional) – Scale to use for the x-axis, e.g. ‘log’.

  • yscale (str, optional) – Scale to use for the y-axis, e.g. ‘log’.

  • zscale (str, optional) – Scale to use for a heatmap color dimension, e.g. ‘log’.

  • xbase (float, optional) – If xscale=='log', the log base to use for the x-axis.

  • ybase (float, optional) – If yscale=='log', the log base to use for the y-axis.

  • xticks (sequence[float], optional) – Manual sequence of x-values to use for ticks.

  • yticks (sequence[float], optional) – Manual sequence of y-values to use for ticks.

  • xticklabels (sequence[str], optional) – Manual sequence of x-tick labels to use, requires and should be the same length as xticks.

  • yticklabels (sequence[str], optional) – Manual sequence of y-tick labels to use, requires and should be the same length as yticks.

  • vspans (sequence[float], optional) – Sequence of x-values to use for vertical spans.

  • hspans (sequence[float], optional) – Sequence of y-values to use for horizontal spans.

  • span_color (str or tuple, optional) – Color to use for spans.

  • span_alpha (float, optional) – Alpha value to use for spans.

  • span_linewidth (float, optional) – Line width to use for spans.

  • span_linestyle (str, optional) – Line style to use for spans.

  • grid (bool, optional) – Whether to show grid lines.

  • grid_which (str, optional) – Which grid lines to show, either ‘major’ or ‘minor’.

  • grid_alpha (float, optional) – Alpha value to use for grid lines.

  • legend (bool, optional) – Whether to show a legend.

  • legend_ncol (int, optional) – Number of columns to use for the legend.

  • legend_merge (bool, optional) – If True, combinations of different mapped properties are merged into a list of every combination.

  • legend_reverse (bool, optional) – If True, reverse the order of the legend entries.

  • legend_entries (sequence, optional) – An explicit sequence of legend entries to use.

  • legend_labels (sequence, optional) – An explicit sequence of legend labels to use.

  • legend_extras (sequence, optional) – An explicit sequence of extra legend items to add.

  • legend_opts (dict, optional) – Additional keyword arguments to pass to the legend plotting function.

  • title (str, optional) – A title to use for the plot.

  • axs (sequence[sequence[matplotlib.Axes]], optional) – An explicit array of axes to use for the plot, it should have at least as many rows and columns as there are mapped dimensions.

  • ax (matplotlib.Axes, optional) – Shortcut for supplying a single axes to use for the plot, can only supply if there is a single row and column.

  • format_axs (bool, optional) – Whether to format the axes to use the neutral xyzpy style.

  • figsize (tuple, optional) – Size of the figure to use if creating one (i.e. both ax and axs are None). If not specified it is automatically computed based on the number of rows and columns.

  • height (float, optional) – Height of each subplot. Default is 3.

  • width (float, optional) – Width of each subplot. If not specified, it is automatically set to match height. Default is None.

  • hspace (float, optional) – Spacing between subplots vertically. Default is 0.12.

  • wspace (float, optional) – Spacing between subplots horizontally. Default is 0.12.

  • sharex (bool, optional) – Whether to share the x-axis between subplots. Default is True.

  • sharey (bool, optional) – Whether to share the y-axis between subplots. Default is True.

  • kwargs (dict, optional) – Additional keyword arguments to pass to the main plotting function.

Returns:

  • fig (matplotlib.Figure) – Figure containing the plot (None if ax or axs is specified).

  • axs (sequence[sequence[matplotlib.Axes]]) – Array of axes containing the plot.
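For the float form of aggregate_err_range, the example given is that 0.5 corresponds to the interquartile range. One reading of that, sketched here as an assumption rather than as infiniplot's implementation, is a central range that maps to a symmetric pair of quantiles around the median. The helper central_quantile_bounds and the sample data are hypothetical.

```python
import statistics


def central_quantile_bounds(err_range):
    """Map a central range (e.g. 0.5) to its lower and upper quantiles."""
    lower = 0.5 - err_range / 2
    upper = 0.5 + err_range / 2
    return lower, upper


# A range of 0.5 spans the 0.25 and 0.75 quantiles (the interquartile range).
lo_q, hi_q = central_quantile_bounds(0.5)

# The matching quartiles of some sample data, using the standard library.
samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
print(f"median {median} with error band from {q1} to {q3}")
```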

xyzpy.neutral_style(draw_color=(0.5, 0.5, 0.5), **kwargs)[source]

xyzpy.auto_iheatmap(x, **iheatmap_opts)[source]

Auto version of iheatmap() that accepts array arguments by converting them to a Dataset first.

xyzpy.auto_ilineplot(x, y_z, **lineplot_opts)[source]

Auto version of ilineplot() that accepts array arguments by converting them to a Dataset first.

xyzpy.auto_iscatter(x, y_z, **iscatter_opts)[source]

Auto version of iscatter() that accepts array arguments by converting them to a Dataset first.

xyzpy.iheatmap(ds, x, y, z, **kwargs)[source]

From ds plot variable z as a function of x and y using a 2D heatmap. Interactive.

Parameters:
  • ds (xarray.Dataset) – Dataset to plot from.

  • x (str) – Dimension to plot along the x-axis.

  • y (str) – Dimension to plot along the y-axis.

  • z (str, optional) – Variable to plot as colormap.

  • row (str, optional) – Dimension to vary over as a function of rows.

  • col (str, optional) – Dimension to vary over as a function of columns.

  • plot_opts – See xyzpy.plot.core.PLOTTER_DEFAULTS.

xyzpy.ilineplot(ds, x, y, z=None, y_err=None, x_err=None, **kwargs)[source]

From ds plot lines of y as a function of x, optionally for varying z. Interactive.

Parameters:
  • ds (xarray.Dataset) – Dataset to plot from.

  • x (str) – Dimension to plot along the x-axis.

  • y (str or tuple[str]) – Variable(s) to plot along the y-axis. If tuple, plot each of the variables - instead of z.

  • z (str, optional) – Dimension to plot into the page.

  • y_err (str, optional) – Variable to plot as y-error.

  • x_err (str, optional) – Variable to plot as x-error.

  • row (str, optional) – Dimension to vary over as a function of rows.

  • col (str, optional) – Dimension to vary over as a function of columns.

  • plot_opts – See xyzpy.plot.core.PLOTTER_DEFAULTS.

xyzpy.iscatter(ds, x, y, z=None, y_err=None, x_err=None, **kwargs)[source]

From ds plot a scatter of y against x, optionally for varying z. Interactive.

Parameters:
  • ds (xarray.Dataset) – Dataset to plot from.

  • x (str) – Quantity to plot along the x-axis.

  • y (str or tuple[str]) – Quantity(s) to plot along the y-axis. If tuple, plot each of the variables - instead of z.

  • z (str, optional) – Dimension to plot into the page.

  • y_err (str, optional) – Variable to plot as y-error.

  • x_err (str, optional) – Variable to plot as x-error.

  • row (str, optional) – Dimension to vary over as a function of rows.

  • col (str, optional) – Dimension to vary over as a function of columns.

  • plot_opts – See xyzpy.plot.core.PLOTTER_DEFAULTS.

class xyzpy.AutoHeatMap(x, **heatmap_opts)[source]

Bases: HeatMap

class xyzpy.AutoHistogram(x, **histogram_opts)[source]

Bases: Histogram

class xyzpy.AutoLinePlot(x, y_z, **lineplot_opts)[source]

Bases: LinePlot

class xyzpy.AutoScatter(x, y_z, **scatter_opts)[source]

Bases: Scatter

class xyzpy.HeatMap(ds, x, y, z, **kwargs)[source]

Bases: PlotterMatplotlib, xyzpy.plot.core.AbstractHeatMap

plot_heatmap()[source]

Plot the data as a heatmap.

__call__()[source]

class xyzpy.Histogram(ds, x, z=None, **kwargs)[source]

Bases: PlotterMatplotlib, xyzpy.plot.core.AbstractHistogram

plot_histogram()[source]

__call__()[source]

class xyzpy.LinePlot(ds, x, y, z=None, *, y_err=None, x_err=None, **kwargs)[source]

Bases: PlotterMatplotlib, xyzpy.plot.core.AbstractLinePlot

plot_lines()[source]

__call__()[source]

class xyzpy.Scatter(ds, x, y, z=None, **kwargs)[source]

Bases: PlotterMatplotlib, xyzpy.plot.core.AbstractScatter

plot_scatter()[source]

__call__()[source]

xyzpy.auto_heatmap(x, **heatmap_opts)[source]

Auto version of heatmap() that accepts array arguments by converting them to a Dataset first.

xyzpy.auto_histogram(x, **histogram_opts)[source]

Auto version of histogram() that accepts array arguments by converting them to a Dataset first.

xyzpy.auto_lineplot(x, y_z, **lineplot_opts)[source]

Auto version of lineplot() that accepts array arguments by converting them to a Dataset first.

xyzpy.auto_scatter(x, y_z, **scatter_opts)[source]

Auto version of scatter() that accepts array arguments by converting them to a Dataset first.

xyzpy.heatmap(ds, x, y, z, **kwargs)[source]

From ds plot variable z as a function of x and y using a 2D heatmap.

Parameters:
  • ds (xarray.Dataset) – Dataset to plot from.

  • x (str) – Dimension to plot along the x-axis.

  • y (str) – Dimension to plot along the y-axis.

  • z (str, optional) – Variable to plot as colormap.

  • row (str, optional) – Dimension to vary over as a function of rows.

  • col (str, optional) – Dimension to vary over as a function of columns.

  • plot_opts – See xyzpy.plot.core.PLOTTER_DEFAULTS.

xyzpy.histogram(ds, x, z=None, **plot_opts)[source]

Dataset histogram.

Parameters:
  • ds (xarray.Dataset) – The dataset to plot.

  • x (str, sequence of str) – The variable(s) to plot the probability density of. If sequence, plot a histogram of each instead of using a z coordinate.

  • z (str, optional) – If given, range over this coordinate and plot a histogram for each.

  • row (str, optional) – Dimension to vary over as a function of rows.

  • col (str, optional) – Dimension to vary over as a function of columns.

  • plot_opts – See xyzpy.plot.core.PLOTTER_DEFAULTS.

xyzpy.lineplot(ds, x, y, z=None, y_err=None, x_err=None, **plot_opts)[source]

From ds plot lines of y as a function of x, optionally for varying z.

Parameters:
  • ds (xarray.Dataset) – Dataset to plot from.

  • x (str) – Dimension to plot along the x-axis.

  • y (str or tuple[str]) – Variable(s) to plot along the y-axis. If tuple, plot each of the variables - instead of z.

  • z (str, optional) – Dimension to plot into the page.

  • y_err (str, optional) – Variable to plot as y-error.

  • x_err (str, optional) – Variable to plot as x-error.

  • row (str, optional) – Dimension to vary over as a function of rows.

  • col (str, optional) – Dimension to vary over as a function of columns.

  • plot_opts – See xyzpy.plot.core.PLOTTER_DEFAULTS.

xyzpy.scatter(ds, x, y, z=None, y_err=None, x_err=None, **plot_opts)[source]

From ds plot a scatter of y against x, optionally for varying z.

Parameters:
  • ds (xarray.Dataset) – Dataset to plot from.

  • x (str) – Quantity to plot along the x-axis.

  • y (str or tuple[str]) – Quantity(s) to plot along the y-axis. If tuple, plot each of the variables - instead of z.

  • z (str, optional) – Dimension to plot into the page.

  • y_err (str, optional) – Variable to plot as y-error.

  • x_err (str, optional) – Variable to plot as x-error.

  • row (str, optional) – Dimension to vary over as a function of rows.

  • col (str, optional) – Dimension to vary over as a function of columns.

  • plot_opts – See xyzpy.plot.core.PLOTTER_DEFAULTS.

xyzpy.visualize_matrix(array, max_mag=None, magscale='linear', alpha_map=True, alpha_pow=1 / 2, legend=True, legend_loc='auto', legend_size=0.15, legend_bounds=None, legend_resolution=3, facecolor=None, rasterize=4096, rasterize_dpi=300, figsize=(5, 5), ax=None)[source]

Visualize array as a 2D colormapped image.

Parameters:
  • array (array_like or Sequence[array_like]) – A 2D (or 1D) array or sequence of arrays to visualize.

  • max_mag (float, optional) – The maximum magnitude to use for the color mapping. If not provided, the maximum magnitude in the array will be used.

  • magscale ("linear" or float, optional) – How to scale the magnitude of the array values. If “linear”, then the magnitude is used directly. If a float, then the magnitude is raised to this power before being used, which can help to show variation among small values.

  • alpha_map (bool, optional) – Whether to map the tensor value magnitudes to pixel alpha.

  • alpha_pow (float, optional) – The power to raise the magnitude to when mapping to alpha.

  • legend (bool, optional) – Whether to show a legend (colorbar). If the array has complex dtype then the legend will be a colorwheel.

  • legend_loc (str or tuple[float], optional) – Where to place the legend. If “auto”, then the legend will be placed outside the plot rectangle, otherwise it should be a tuple of (x, y) coordinates in axes space.

  • legend_size (float, optional) – The size of the legend, in relation to the size of the plot axes.

  • legend_bounds (tuple[float], optional) – The bounds of the legend, as (x, y, width, height) in axes space. If not provided, the bounds will be computed from legend_loc and legend_size.

  • legend_resolution (int, optional) – The number of different colors to show in the legend.

  • facecolor (str, optional) – The background color of the plot, by default transparent.

  • rasterize (int or float, optional) – Whether to rasterize the plot. If a number, then rasterize if the number of pixels in the plot is greater than this value.

  • rasterize_dpi (float, optional) – The dpi to use when rasterizing.

  • figsize (tuple[float], optional) – The size of the figure to create, if ax is not provided.

  • ax (matplotlib.Axis, optional) – The axis to draw to. If not provided, a new figure will be created.

  • show_and_close (bool, optional) – If True (the default) then show and close the figure, otherwise return the figure and axis.

Returns:

  • fig (matplotlib.Figure) – The figure containing the plot, or None if ax was provided.

  • ax (matplotlib.Axis) – The axis or axes containing the plot(s).
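The roles of max_mag and alpha_pow can be sketched for a single value. The mapping below is an assumption made for illustration, not the formula visualize_matrix uses: it takes the hue from the complex phase (as on a color wheel) and the alpha from the normalized magnitude raised to alpha_pow. The function value_to_hue_alpha is hypothetical.

```python
import cmath
import math


def value_to_hue_alpha(z, max_mag, alpha_pow=0.5):
    """Hypothetical mapping of one (possibly complex) value to (hue, alpha)."""
    # Normalize the magnitude by max_mag, clipping at 1.
    mag = min(abs(z) / max_mag, 1.0)
    # Phase in [0, 2*pi) mapped to a hue in [0, 1).
    hue = (cmath.phase(z) % (2 * math.pi)) / (2 * math.pi)
    # A power below 1 makes small magnitudes more visible.
    alpha = mag ** alpha_pow
    return hue, alpha


small_positive = value_to_hue_alpha(0.25, max_mag=1.0)
full_negative = value_to_hue_alpha(-1.0, max_mag=1.0)
```

With alpha_pow=0.5, a value at a quarter of the maximum magnitude is drawn at half opacity rather than a quarter, which is the sense in which a fractional power helps to show variation among small values.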

xyzpy.visualize_tensor(array, spacing_factor=1.0, max_projections=None, projection_overlap_spacing=1.05, angles=None, scales=None, skew_angle_factor='auto', skew_scale_factor=0.05, max_mag=None, magscale='linear', size_map=True, size_pow=1 / 2, size_scale=1.0, alpha_map=True, alpha_pow=1 / 2, alpha=0.8, marker='o', linewidths=0, show_lattice=True, lattice_opts=None, compass=False, compass_loc='auto', compass_size=0.1, compass_bounds=None, compass_labels=None, compass_opts=None, legend=True, legend_loc='auto', legend_size=0.15, legend_bounds=None, legend_resolution=3, interleave_projections=False, reverse_projections=False, facecolor=None, rasterize=4096, rasterize_dpi=300, figsize=(5, 5), ax=None)[source]

Visualize all entries of a tensor, with indices mapped into the plane and values mapped into a color wheel.

Parameters:
  • array (numpy.ndarray) – The tensor to visualize.

  • spacing_factor (float, optional) – How to scale the dimensions relative to each other. If 1.0, then each dimension will have the same extent, and smaller dimensions will be sparser. If 0.0, then each dimension will have an extent proportional to its size, with matching density.

  • max_projections (int, optional) – The maximum number of different projection directions / angles to use. If specified and less than the number of dimensions, then multiple dimensions will share the same angle but with different scales.

  • projection_overlap_spacing (float, optional) – When grouping multiple dimensions to the same angle, how much to increase the spacing at each scale so as to emphasize each.

  • angles (sequence[float], optional) – An explicit list of angles to use for each direction, in radians, with zero pointing straight down. If not provided, then the angles will be calculated automatically.

  • scales (sequence[float], optional) – An explicit list of scales to use for each direction. If not provided, then the scales will be calculated automatically.

  • skew_angle_factor (float, optional) – When there are more than two dimensions, a factor to scale the rotations by to avoid overlapping data points. If 0.0 then the angles will be evenly spaced.

  • skew_scale_factor (float, optional) – When there are more than two dimensions, a factor to scale the scales by to avoid overlapping data points, that shortens non-perpendicular directions.

  • max_mag (float, optional) – The maximum magnitude to use for the color mapping. If not provided, the maximum magnitude in the array will be used.

  • magscale ("linear" or float, optional) – How to scale the magnitude of the array values. If “linear”, then the magnitude is used directly. If a float, then the magnitude is raised to this power before being used, which can help to show variation among small values.

  • size_map (bool, optional) – Whether to map the tensor value magnitudes to marker size.

  • size_scale (float, optional) – An overall factor to scale the marker size by.

  • alpha_map (bool, optional) – Whether to map the tensor value magnitudes to marker alpha.

  • alpha_pow (float, optional) – The power to raise the magnitude to when mapping to alpha.

  • alpha (float, optional) – The overall alpha to use for all markers if not alpha_map.

  • marker (str, optional) – The marker to use for the markers.

  • linewidths (float, optional) – The linewidth to use for the markers.

  • show_lattice (bool, optional) – Show a thin grey line connecting adjacent array coordinate points.

  • lattice_opts (dict, optional) – Options to pass to matplotlib.Axis.scatter for the lattice grid.

  • compass (bool, optional) – Whether to show a compass indicating the orientation of each dimension.

  • compass_loc ((float, float), optional) – Where to place the compass.

  • compass_size (float, optional) – The size of the compass.

  • compass_bounds (tuple[float], optional) – Explicit bounds of the compass, as (x, y, width, height).

  • compass_labels (sequence[str], optional) – Explicit labels for the compass, in order of the dimensions.

  • compass_opts (dict, optional) – Extra options for the compass arrows.

  • legend (bool, optional) – Whether to show a legend (colorbar). If the array has complex dtype then the legend will be a colorwheel.

  • legend_loc (str or tuple[float], optional) – Where to place the legend. If “auto”, then the legend will be placed outside the plot rectangle, otherwise it should be a tuple of (x, y) coordinates in axes space.

  • legend_size (float, optional) – The size of the legend, in relation to the size of the plot axes.

  • legend_bounds (tuple[float], optional) – Explicit bounds of the legend, as (x, y, width, height) in axes space.

  • legend_resolution (int, optional) – The number of different colors to show in the legend.

  • interleave_projections (bool, optional) – If True and grouping dimensions, then they are assigned round robin fashion rather than blocks. False matches the behavior of fusing.

  • reverse_projections (bool, optional) – Whether to reverse the order of the projections.

  • facecolor (str, optional) – The background color of the plot, by default transparent.

  • rasterize (int or float, optional) – Whether to rasterize the plot. If a number, then rasterize if the size of the array is greater than this value.

  • rasterize_dpi (float, optional) – The dpi to use when rasterizing.

  • figsize (tuple, optional) – The size of the figure to create, if ax is not provided.

  • ax (matplotlib.Axis, optional) – The axis to draw to. If not provided, a new figure will be created.

Returns:

  • fig (matplotlib.Figure) – The figure containing the plot, or None if ax was provided.

  • ax (matplotlib.Axis) – The axis containing the plot.

class xyzpy.Benchmarker(kernels, setup=None, names=None, benchmark_opts=None, data_name=None)[source]

Compare the performance of various kernels. Internally this makes use of benchmark(), Harvester() and xyzpy's plotting functionality.

Parameters:
  • kernels (sequence of callable) – The functions to compare performance with.

  • setup (callable, optional) – If given, set up each benchmark run by supplying the size argument n to this function first, then feeding its output to each of the functions.

  • names (sequence of str, optional) – Alternate names to give the function, else they will be inferred.

  • benchmark_opts (dict, optional) – Supplied to benchmark().

  • data_name (str, optional) – If given, the file name the internal harvester will use to store results persistently.

harvester

The harvester that runs and accumulates all the data.

Type:

xyz.Harvester

ds

Shortcut to the harvester’s full dataset.

Type:

xarray.Dataset

kernels
names
setup = None
benchmark_opts
runner
harvester
run(ns, kernels=None, **harvest_opts)[source]

Run the benchmarks. Each run accumulates rather than overwriting the results.

Parameters:
  • ns (sequence of int or int) – The sizes to run the benchmarks with.

  • kernels (sequence of str, optional) – If given, only run the kernels with these names.

  • harvest_opts – Supplied to harvest_combos().

property ds
plot(**plot_opts)[source]

Plot the benchmarking results.

lineplot(**plot_opts)[source]

Plot the benchmarking results.

ilineplot(**plot_opts)[source]

Interactively plot the benchmarking results.

class xyzpy.MemoryMonitor(interval: float = 0.1)[source]

Monitor this process’ peak memory usage with specified sampling interval in a daemon thread. This is intended to be used as a context manager for long running and memory intensive processes, not fine grained memory tracking.

Parameters:

interval (float, optional) – Time between memory measurements in seconds. Fluctuations in peak memory between measurements might not be captured.

interval

Time between memory measurements in seconds.

Type:

float

peak

The peak memory usage in gigabytes.

Type:

float

interval = 0.1
peak = None
is_running = False
monitor_thread = None
_monitor()[source]
start()[source]

Start the memory monitoring thread.

stop()[source]

Stop the memory monitoring thread.

__enter__()[source]
__exit__(exc_type, exc_value, traceback)[source]
__del__()[source]
__repr__()[source]
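
The sampling approach described above (a daemon thread that polls memory at a fixed interval and keeps the maximum) can be sketched with the standard library alone. This is an illustrative sketch and not xyzpy's implementation: the class name PeakSamplerSketch and the pluggable sample callable are hypothetical, and a scripted list of readings stands in for a real memory probe so the behaviour is easy to follow.

```python
import threading
import time


class PeakSamplerSketch:
    """Track the largest value returned by `sample` while the context is open.

    Hypothetical stand-in for a memory monitor: `sample` is any zero-argument
    callable returning a number (for example, current memory use in gigabytes).
    """

    def __init__(self, sample, interval=0.1):
        self.sample = sample
        self.interval = interval
        self.peak = None
        self._stop = threading.Event()
        self._thread = None

    def _monitor(self):
        while True:
            value = self.sample()
            if self.peak is None or value > self.peak:
                self.peak = value
            # wait() returns True as soon as stop is requested
            if self._stop.wait(self.interval):
                break

    def __enter__(self):
        self._stop.clear()
        self._thread = threading.Thread(target=self._monitor, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._stop.set()
        self._thread.join()


# Scripted readings in place of a real memory probe (assumption for the demo).
readings = [1.0, 5.0, 2.0]
calls = []


def fake_sample():
    calls.append(None)
    return readings[min(len(calls) - 1, len(readings) - 1)]


with PeakSamplerSketch(fake_sample, interval=0.001) as monitor:
    time.sleep(0.1)  # stands in for the long-running work

peak_seen = monitor.peak
```

Because the thread only observes the value at each tick, a spike that rises and falls between two samples is missed, which is the limitation noted for the interval parameter.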
class xyzpy.RunningCovariance[source]

Running covariance class.

count = 0
xmean = 0.0
ymean = 0.0
C = 0.0
update(x, y)[source]
update_from_it(xs, ys)[source]
property covar

The covariance.

property sample_covar

The covariance with “Bessel’s correction”.
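
As a rough illustration of how a running covariance can be accumulated in a single pass, the sketch below uses the standard online update for the co-moment. It reuses the attribute names listed above (count, xmean, ymean, C), but the update rule shown is an assumption about the general technique, not a copy of xyzpy's code.

```python
class RunningCovarianceSketch:
    """Single-pass covariance accumulator (illustrative only)."""

    def __init__(self):
        self.count = 0
        self.xmean = 0.0
        self.ymean = 0.0
        self.C = 0.0  # running sum of (x - xmean) * (y - ymean)

    def update(self, x, y):
        self.count += 1
        dx = x - self.xmean  # deviation from the old x mean
        self.xmean += dx / self.count
        self.ymean += (y - self.ymean) / self.count
        self.C += dx * (y - self.ymean)  # uses the new y mean

    @property
    def covar(self):
        # population covariance: divide by n
        return self.C / self.count

    @property
    def sample_covar(self):
        # Bessel's correction: divide by n - 1
        return self.C / (self.count - 1)


rc = RunningCovarianceSketch()
for x, y in zip((1, 3, 2), (2, 6, 4)):
    rc.update(x, y)
```

For these three pairs the co-moment is 4, so covar is 4/3 and sample_covar is 2.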

class xyzpy.RunningCovarianceMatrix(n=2)[source]

Running covariance matrix for n variables.

Parameters:

n (int, optional) – Number of variables to track.

n = 2
rcs
update(*x)[source]

Update the covariance matrix with a single observation.

update_from_it(*xs)[source]

Update from iterables of observations for each variable.

property count

Return the number of samples accumulated.

property covar_matrix

Return the population covariance matrix.

property sample_covar_matrix

Return the sample covariance matrix.

to_uncertainties(bias=True)[source]

Convert the accumulated statistics to correlated uncertainties, from which new quantities can be calculated with error automatically propagated.

Parameters:

bias (bool, optional) – If False, use the sample covariance with “Bessel’s correction”.

Returns:

values – The sequence of correlated variables.

Return type:

tuple of uncertainties.ufloat

Examples

Estimate quantities of two perfectly correlated sequences.

>>> rcm = xyz.RunningCovarianceMatrix()
>>> rcm.update_from_it((1, 3, 2), (2, 6, 4))
>>> x, y = rcm.to_uncertainties()

Calculated quantities like sums have the error propagated:

>>> x + y
6.0+/-2.4494897427831783

But the covariance is also taken into account, meaning the ratio here can be estimated with zero error:

>>> x / y
0.5+/-0
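
The two numbers in this example can be checked by hand with first-order error propagation. The sketch below does not use xyzpy or the uncertainties package; it assumes the population (divide by n) moments, which corresponds to bias=True, and recomputes the error on x + y and on x / y directly.

```python
import math

xs, ys = (1, 3, 2), (2, 6, 4)
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Population (biased) second moments, i.e. divide by n.
var_x = sum((x - mx) ** 2 for x in xs) / n
var_y = sum((y - my) ** 2 for y in ys) / n
cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# f = x + y: both partial derivatives are 1.
err_sum = math.sqrt(var_x + var_y + 2 * cov_xy)

# f = x / y: partial derivatives are 1/y and -x/y**2, evaluated at the means.
dfdx, dfdy = 1 / my, -mx / my**2
var_ratio = dfdx**2 * var_x + dfdy**2 * var_y + 2 * dfdx * dfdy * cov_xy
```

Here var_x = 2/3, var_y = 8/3 and cov_xy = 4/3, so the variance of the sum is 6 and its error is sqrt(6), while the three terms for the ratio cancel to zero up to floating-point rounding.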
class xyzpy.RunningStatistics[source]

Running mean & standard deviation using Welford’s algorithm. This is a very efficient way of keeping track of the error on the mean for example.

mean

Current mean.

Type:

float

count

Current count.

Type:

int

std

Current standard deviation.

Type:

float

var

Current variance.

Type:

float

err

Current error on the mean.

Type:

float

rel_err

The current relative error.

Type:

float

Examples

>>> rs = RunningStatistics()
>>> rs.update(1.1)
>>> rs.update(1.4)
>>> rs.update(1.2)
>>> rs.update_from_it([1.5, 1.3, 1.6])
>>> rs.mean
1.3499999046325684
>>> rs.std  # standard deviation
0.17078252585383266
>>> rs.err  # error on the mean
0.06972167422092768
count = 0
mean = 0.0
M2 = 0.0
update(x)[source]

Add a single value x to the statistics.

update_from_it(xs)[source]

Add all values from iterable xs to the statistics.

converged(rtol, atol)[source]

Check if the stats have converged with respect to relative and absolute tolerance rtol and atol.

property var
property std
property err
property rel_err
__repr__()[source]
class xyzpy.Timer[source]

A very simple context manager class for timing blocks.

Examples

>>> from xyzpy import Timer
>>> with Timer() as timer:
...     print('Doing some work!')
...
Doing some work!
>>> timer.t
0.00010752677917480469
__enter__()[source]
__exit__(*args)[source]
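
A context manager of this kind takes only a few lines. The sketch below is an assumption about the general pattern and not xyzpy's code; in particular the use of time.perf_counter and the name TimerSketch are choices made here for illustration.

```python
import time


class TimerSketch:
    """Record the wall-clock duration of a with-block in the attribute t."""

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *args):
        # runs even if the block raised, so t is always set
        self.t = time.perf_counter() - self.start


with TimerSketch() as timer:
    total = sum(range(1000))

elapsed = timer.t
```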
xyzpy.benchmark(fn, setup=None, n=None, min_t=0.2, repeats=2, get='min', starmap=False)[source]

Benchmark the time it takes to run fn.

Parameters:
  • fn (callable) – The function to time.

  • setup (callable, optional) – If supplied, the function that sets up the argument for fn.

  • n (int, optional) – If supplied, the integer to supply to setup or fn.

  • min_t (float, optional) – Aim to repeat function enough times to take up this many seconds.

  • repeats (int, optional) – Repeat the whole procedure (with setup) this many times in order to take the minimum run time.

  • get ({'min', 'mean'}, optional) – Return the minimum or mean time for each run.

  • starmap (bool, optional) – Unpack the arguments from setup, if given.

Returns:

t – The minimum, averaged, time to run fn in seconds.

Return type:

float

Examples

Just a parameter-less function:

>>> import xyzpy as xyz
>>> import numpy as np
>>> xyz.benchmark(lambda: np.linalg.eig(np.random.randn(100, 100)))
0.004726233000837965

The same but with a setup and size parameter n specified:

>>> setup = lambda n: np.random.randn(n, n)
>>> fn = lambda X: np.linalg.eig(X)
>>> xyz.benchmark(fn, setup, 100)
0.0042192734545096755
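
The roles of min_t and repeats can be illustrated with a small standard-library sketch. This is an assumption about the general timing strategy (grow the inner loop until one batch lasts at least min_t, repeat the whole procedure, keep the best per-call time); it is not xyzpy's implementation, and benchmark_sketch is a hypothetical name.

```python
import time


def benchmark_sketch(fn, setup=None, n=None, min_t=0.2, repeats=2):
    """Return the best per-call time of fn in seconds (illustrative only)."""
    best = float("inf")
    for _ in range(repeats):
        # Build the arguments afresh for every repeat.
        if setup is not None:
            args = (setup(n),)
        elif n is not None:
            args = (n,)
        else:
            args = ()

        # Double the number of calls until one batch takes at least min_t.
        loops = 1
        while True:
            start = time.perf_counter()
            for _ in range(loops):
                fn(*args)
            elapsed = time.perf_counter() - start
            if elapsed >= min_t:
                break
            loops *= 2

        best = min(best, elapsed / loops)
    return best


t = benchmark_sketch(sum, setup=lambda n: list(range(n)), n=100, min_t=0.01)
```

Taking the minimum over repeats reduces the influence of background noise, since interference from other processes can only make a run slower.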
xyzpy.estimate_from_repeats(fn, *fn_args, rtol=0.02, tol_scale=1.0, get='stats', verbosity=0, min_samples=5, max_samples=1000000, **fn_kwargs)[source]
Parameters:
  • fn (callable) – The function that estimates a single value.

  • fn_args (optional) – Positional arguments supplied to fn.

  • rtol (float, optional) – Relative tolerance for error on mean.

  • tol_scale (float, optional) – The expected ‘scale’ of the estimate. This modifies the absolute tolerance near zero to rtol * tol_scale. Default: 1.0.

  • get ({'stats', 'samples', 'mean'}, optional) – Just get the RunningStatistics object, or the actual samples too, or just the actual mean estimate.

  • verbosity ({0, 1, 2}, optional) –

    How much information to show:

    • 0: nothing

    • 1: progress bar just with iteration rate,

    • 2: progress bar with running stats displayed.

  • min_samples (int, optional) – Take at least this many samples before checking for convergence.

  • max_samples (int, optional) – Take at maximum this many samples.

  • fn_kwargs (optional) – Keyword arguments supplied to fn.

Returns:

  • rs (RunningStatistics) – Statistics about the random estimation.

  • samples (list[float]) – If get=='samples', the actual samples.

Examples

Estimate the sum of n random numbers:

>>> import numpy as np
>>> import xyzpy as xyz
>>> def fn(n):
...     return np.random.rand(n).sum()
...
>>> stats = xyz.estimate_from_repeats(fn, n=10, verbosity=2)
59: 5.13(12): : 58it [00:00, 3610.84it/s]
RunningStatistics(mean=5.13(12), count=59)
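
The stopping rule implied by rtol, tol_scale, min_samples and max_samples can be sketched as follows. The exact convergence test is an assumption: here sampling stops once the error on the mean is at most rtol * |mean| + rtol * tol_scale. The function name estimate_sketch is hypothetical, and a seeded random.Random replaces NumPy so the sketch has no dependencies.

```python
import math
import random


def estimate_sketch(fn, rtol=0.02, tol_scale=1.0, min_samples=5,
                    max_samples=1_000_000):
    """Repeat fn() until the error on the mean is small (illustrative only)."""
    count, mean, M2 = 0, 0.0, 0.0
    atol = rtol * tol_scale  # assumed absolute tolerance near zero
    err = float("inf")
    while count < max_samples:
        x = fn()
        # Welford update of the running mean and squared deviations.
        count += 1
        delta = x - mean
        mean += delta / count
        M2 += delta * (x - mean)
        err = math.sqrt(M2 / count) / math.sqrt(count)
        if count >= min_samples and err <= rtol * abs(mean) + atol:
            break
    return mean, err, count


rng = random.Random(0)
mean, err, count = estimate_sketch(lambda: sum(rng.random() for _ in range(10)))
```

The sum of ten uniform random numbers has mean 5 and standard deviation of about 0.91, so under this assumed rule a few dozen samples are typically needed, which is the same order as the count shown in the example output.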
xyzpy.format_number_with_error(x, err)[source]

Given x with error err, format a string showing the relevant digits of x with two significant digits of the error bracketed, and overall exponent if necessary.

Parameters:
  • x (float) – The value to print.

  • err (float) – The error on x.

Return type:

str

Examples

>>> format_number_with_error(0.1542412, 0.0626653)
'0.154(63)'
>>> format_number_with_error(-128124123097, 6424)
'-1.281241231(64)e+11'
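
The bracket notation can be reproduced for the simple fixed-point case with a short sketch. This is a deliberately restricted illustration and not xyzpy's implementation: it assumes 0 < err < 1 and never emits an exponent, and the name format_with_error_sketch is hypothetical.

```python
import math


def format_with_error_sketch(x, err):
    """Format x with two significant digits of err in brackets.

    Restricted sketch: only handles 0 < err < 1 and never uses an exponent.
    """
    if not 0 < err < 1:
        raise ValueError("this sketch only handles 0 < err < 1")
    exponent = math.floor(math.log10(err))  # position of the leading error digit
    decimals = 1 - exponent                 # keep two significant error digits
    err_digits = round(err * 10**decimals)
    if err_digits == 100:
        # rounding spilled into a third digit, e.g. 0.0996 -> 0.10
        decimals -= 1
        err_digits = 10
    return f"{x:.{decimals}f}({err_digits})"


first = format_with_error_sketch(0.1542412, 0.0626653)
second = format_with_error_sketch(1.35, 0.0697)
```

For the first call the error 0.0626653 rounds to 63 in the third decimal place, so x is printed to three decimals as well.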
xyzpy.get_peak_memory_usage()[source]

Get the peak memory usage of the current process in gigabytes. This uses the psutil package on Windows, and the resource package on Linux and macOS.
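
On Linux and macOS the peak resident set size is available from the standard resource module. The sketch below shows one way to read it. It is an assumption about the approach rather than xyzpy's code, it does not cover the Windows/psutil branch, and the choice of 1024**3 bytes per gigabyte is made here for illustration.

```python
import resource
import sys


def peak_memory_gb_sketch():
    """Peak resident memory of this process in gigabytes (Unix only)."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS but in kilobytes on Linux.
    scale = 1 if sys.platform == "darwin" else 1024
    return rss * scale / 1024**3


peak_gb = peak_memory_gb_sketch()
```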

xyzpy.getsizeof(obj)[source]

Compute the real size of a Python object in bytes, taken from https://stackoverflow.com/a/30316760/5640201.

Parameters:

obj (object) – Object to measure.

Returns:

Total size in bytes.

Return type:

int

xyzpy.progbar(it=None, nb=False, **kwargs)[source]

Turn any iterable into a progress bar, with a notebook option.

Parameters:
  • it (iterable) – Iterable to wrap with a progress bar.

  • nb (bool) – Whether to display the notebook progress bar.

  • **kwargs (dict-like) – Additional options to send to tqdm.

xyzpy.report_memory()[source]

Return a formatted memory usage summary for the current process.

xyzpy.report_memory_gpu()[source]

Return a formatted GPU memory usage summary for the process.

xyzpy.unzip(its, zip_level=1)[source]

Split a nested iterable at a specified level, i.e. in numpy language transpose the specified ‘axis’ to be the first.

Parameters:
  • its (iterable (of iterables (of iterables ...))) – ‘n-dimensional’ iterable to split

  • zip_level (int) – level at which to split the iterable, default of 1 replicates zip(*its) behaviour.

Example

>>> x = [[(1, True), (2, False), (3, True)],
...      [(7, True), (8, False), (9, True)]]
>>> nums, bools = unzip(x, 2)
>>> nums
((1, 2, 3), (7, 8, 9))
>>> bools
((True, False, True), (True, False, True))
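
One way to picture what happens at zip_level=2 is as a recursive transpose: each inner iterable is unzipped first, then the results are zipped together. The sketch below is a hypothetical reimplementation for illustration, not xyzpy's code.

```python
def unzip_sketch(its, zip_level=1):
    """Split a nested iterable at the given depth (illustrative only)."""
    if zip_level <= 0:
        return its
    if zip_level == 1:
        # plain transpose, equivalent to zip(*its)
        return tuple(zip(*its))
    # unzip every inner iterable one level less deep, then transpose
    return tuple(zip(*(unzip_sketch(it, zip_level - 1) for it in its)))


x = [[(1, True), (2, False), (3, True)],
     [(7, True), (8, False), (9, True)]]
nums, bools = unzip_sketch(x, 2)
flat_nums, flat_bools = unzip_sketch(x[0], 1)
```

With zip_level=1 the first row alone splits into (1, 2, 3) and (True, False, True); with zip_level=2 the same split is applied to every row and the row structure is preserved.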