Basics#

The idea of xyzpy is to ease some of the pain of generating data over a large parameter space. The central aim is that, once you know what a single run of a function looks like, it should be as easy as saying, “run these combinations of parameters, now run these particular cases”, with everything automatically aggregated into a fully self-described dataset.

%config InlineBackend.figure_formats = ['svg']

import xyzpy as xyz
import numpy as np

Combos & Cases#

The main backend function is xyz.combo_runner, which in its simplest form takes a function, say:

def foo(a, b, c):
    return f"{a}-{b}-{c}", np.sin(a)

and combos of the form:

combos = [
    ('a', [1, 2, 3]),
    ('b', ['x', 'y', 'z']),
    ('c', [True, False]),
]

and generates a nested (here 3 dimensional) array of all the outputs of foo with the 3 * 3 * 2 = 18 combinations of input arguments:

xyz.combo_runner(foo, combos)

100%|##########| 18/18 [00:00<00:00, 156959.40it/s]

(((('1-x-True', 0.8414709848078965), ('1-x-False', 0.8414709848078965)),
  (('1-y-True', 0.8414709848078965), ('1-y-False', 0.8414709848078965)),
  (('1-z-True', 0.8414709848078965), ('1-z-False', 0.8414709848078965))),
 ((('2-x-True', 0.9092974268256817), ('2-x-False', 0.9092974268256817)),
  (('2-y-True', 0.9092974268256817), ('2-y-False', 0.9092974268256817)),
  (('2-z-True', 0.9092974268256817), ('2-z-False', 0.9092974268256817))),
 ((('3-x-True', 0.1411200080598672), ('3-x-False', 0.1411200080598672)),
  (('3-y-True', 0.1411200080598672), ('3-y-False', 0.1411200080598672)),
  (('3-z-True', 0.1411200080598672), ('3-z-False', 0.1411200080598672))))

Note the progress bar shown. If the function were slower (generally the target case for xyzpy), this would show the remaining time before completion.
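To make the nesting concrete, here is a minimal pure-Python sketch of the structure that a combos run produces. It is not xyzpy's implementation (there is no parallelization or progress bar), and the helper name `nested_combos` is hypothetical; `math.sin` stands in for `np.sin` so the example only needs the standard library.

```python
import functools
import math


def foo(a, b, c):
    return f"{a}-{b}-{c}", math.sin(a)


def nested_combos(fn, combos):
    """Evaluate fn on every combination, adding one tuple level per argument.

    Hypothetical helper for illustration only, not part of xyzpy.
    """
    if not combos:
        # every argument has been bound, so call the function
        return fn()
    (name, values), rest = combos[0], combos[1:]
    # bind one value of the current argument, then recurse into the rest
    return tuple(
        nested_combos(functools.partial(fn, **{name: value}), rest)
        for value in values
    )


combos = [
    ('a', [1, 2, 3]),
    ('b', ['x', 'y', 'z']),
    ('c', [True, False]),
]

results = nested_combos(foo, combos)
# the nesting follows the order of the combos: a, then b, then c
shape = (len(results), len(results[0]), len(results[0][0]))
```

Indexing `results[i][j][k]` then picks out the output for the i-th value of `a`, the j-th value of `b` and the k-th value of `c`.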

There is also xyz.case_runner for running isolated cases:

cases = [(4, 'z', False), (5, 'y', True)]
xyz.case_runner(foo, fn_args=('a', 'b', 'c'), cases=cases)

100%|##########| 2/2 [00:00<00:00, 26214.40it/s]

(('4-z-False', -0.7568024953079282), ('5-y-True', -0.9589242746631385))
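Conceptually, running cases amounts to calling the function once per listed tuple of arguments. The sketch below shows that idea in plain Python; the helper name `run_cases` is hypothetical and the code is not how xyzpy itself is written.

```python
import math


def foo(a, b, c):
    return f"{a}-{b}-{c}", math.sin(a)


def run_cases(fn, fn_args, cases):
    """Call fn once per case, pairing each positional value with its name.

    Hypothetical helper for illustration only, not part of xyzpy.
    """
    return tuple(fn(**dict(zip(fn_args, case))) for case in cases)


cases = [(4, 'z', False), (5, 'y', True)]
flat = run_cases(foo, ('a', 'b', 'c'), cases)
```

Unlike the combos result, `flat` has one entry per case and no nesting over argument values.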

You can also mix the two, supplying some function arguments as cases and some as combos. In this situation, for each case, all sub combinations are run:

xyz.combo_runner(
    foo,
    cases=[
        {'a': 1, 'c': True},
        {'a': 2, 'c': False},
        {'a': 3, 'c': True},
    ],
    combos={
        'b': ['x', 'y', 'z'],
    },
)

100%|##########| 9/9 [00:00<00:00, 96297.80it/s]

((((array(nan), array(nan)),
   (array(nan), array(nan)),
   (array(nan), array(nan))),
  (('1-x-True', 0.8414709848078965),
   ('1-y-True', 0.8414709848078965),
   ('1-z-True', 0.8414709848078965))),
 ((('2-x-False', 0.9092974268256817),
   ('2-y-False', 0.9092974268256817),
   ('2-z-False', 0.9092974268256817)),
  ((array(nan), array(nan)),
   (array(nan), array(nan)),
   (array(nan), array(nan)))),
 (((array(nan), array(nan)),
   (array(nan), array(nan)),
   (array(nan), array(nan))),
  (('3-x-True', 0.1411200080598672),
   ('3-y-True', 0.1411200080598672),
   ('3-z-True', 0.1411200080598672))))

Note that combo_runner automatically fills missing results with nan (or possibly None, depending on shape and dtype). Note also that supplying the cases as a list of dicts avoids having to specify the function argument order. You can supply both combos and cases to either combo_runner or case_runner; the main difference is:

combo_runner outputs a nested tuple suitable to be turned into an array
case_runner outputs a flat tuple of results suitable to be put into a table
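The mixed output above can be reproduced with a short pure-Python sketch. The layout used here (case arguments first, each with sorted unique coordinates, then the combo arguments) is inferred from the output shown above rather than from xyzpy's internals, and the helper name `run_mixed` is hypothetical. As a simplification, a single `nan` marks each missing entry instead of one `nan` per output.

```python
import math


def foo(a, b, c):
    return f"{a}-{b}-{c}", math.sin(a)


def run_mixed(fn, cases, combos, fill=math.nan):
    """Run every combo for each case, filling grid points never requested.

    Hypothetical helper for illustration only, not part of xyzpy.
    """
    case_args = list(cases[0])
    # assumption: each case argument becomes an axis of sorted unique values
    coords = {k: sorted({case[k] for case in cases}) for k in case_args}
    coords.update(combos)
    # the (a, c) pairs that were actually requested as cases
    wanted = {tuple(case[k] for k in case_args) for case in cases}
    names = list(coords)

    def build(level, kwargs):
        if level == len(names):
            key = tuple(kwargs[k] for k in case_args)
            return fn(**kwargs) if key in wanted else fill
        name = names[level]
        return tuple(
            build(level + 1, {**kwargs, name: value})
            for value in coords[name]
        )

    return build(0, {})


grid = run_mixed(
    foo,
    cases=[
        {'a': 1, 'c': True},
        {'a': 2, 'c': False},
        {'a': 3, 'c': True},
    ],
    combos={'b': ['x', 'y', 'z']},
)
```

Here `grid` is indexed as `grid[a][c][b]`, with `c` ordered as `False, True`, so `grid[0][0]` holds only fill values because the case `a=1, c=False` was never requested.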

You will likely not use these functions in their raw form, but they illustrate the concept of combos and cases and underlie most other functionality.

Aggregating Random samples of data - `Sampler`#

Occasionally, exhaustively iterating through all combinations of arguments is unnecessary. If instead you just want to sample the parameter space sparsely, then the Sampler object allows this with much the same interface as a Harvester. The main difference is that, since the parameters are no longer gridded, the data is stored as a table in a pandas.DataFrame.

import math
import random

@xyz.label(var_names=['out'])
def trig(amp, fn, x, phase):
    return amp * getattr(math, fn)(x - phase)

# these are the default combos/distributions to sample from
default_combos = {
    'amp': [1, 2, 3],
    'fn': ['cos', 'sin'],
    # for distributions we can supply callables
    'x': lambda: 2 * math.pi * random.random(),
    'phase': lambda: random.gauss(0.0, 0.1),
}

sampler = xyz.Sampler(trig, 'trig.pkl', default_combos)
sampler

<xyzpy.Sampler>
Runner: <xyzpy.Runner>
    fn: <function trig at 0x7f33df575760>
    fn_args: ('amp', 'fn', 'x', 'phase')
    var_names: ('out',)
    var_dims: {'out': ()}
Sync file -->
    trig.pkl    [pickle]

Now we can run the sampler many times (and supply any of the usual arguments such as parallel=True etc). This generates a pandas.DataFrame:

sampler.sample_combos(10000);

100%|##########| 10000/10000 [00:00<00:00, 448780.65it/s]

This has also synced the data with the on-disk file:

!ls *.pkl

trig.pkl

You can specify Sampler(..., engine='csv') etc to use formats other than pickle.

As with the Harvester, next time we run combinations, the data is automatically aggregated into the full set:

# here we will override some of the default sampling choices
combos = {
    'fn': ['tan'],
    'x': lambda: random.random() * math.pi / 4
}

sampler.sample_combos(5000, combos);

100%|##########| 5000/5000 [00:00<00:00, 488527.77it/s]
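The two sampling runs above can be pictured with the following standard-library sketch. It assumes that list entries are drawn uniformly at random and that callables are simply called once per sample; the helper names `draw` and `sample_rows` are hypothetical, and nothing is synced to disk here.

```python
import math
import random


def trig(amp, fn, x, phase):
    return amp * getattr(math, fn)(x - phase)


def draw(spec):
    """Draw one value: call a callable, otherwise pick from the list.

    Assumption: list entries are chosen uniformly at random.
    """
    return spec() if callable(spec) else random.choice(spec)


def sample_rows(fn, n, combos):
    """Return n rows, each holding the sampled arguments and the output."""
    rows = []
    for _ in range(n):
        kwargs = {name: draw(spec) for name, spec in combos.items()}
        rows.append({**kwargs, 'out': fn(**kwargs)})
    return rows


default_combos = {
    'amp': [1, 2, 3],
    'fn': ['cos', 'sin'],
    'x': lambda: 2 * math.pi * random.random(),
    'phase': lambda: random.gauss(0.0, 0.1),
}

random.seed(0)
table = sample_rows(trig, 100, default_combos)

# a second batch overrides some defaults and is appended to the same table
overrides = {'fn': ['tan'], 'x': lambda: random.random() * math.pi / 4}
table += sample_rows(trig, 50, {**default_combos, **overrides})
```

Because each row is a flat record rather than a grid point, passing `table` to `pandas.DataFrame` would give one column per argument plus an `out` column.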

We can then use tools such as seaborn to visualize the full data:

import seaborn as sns

sns.relplot(x='x', y='out', hue='fn', size='amp', data=sampler.full_df)

<seaborn.axisgrid.FacetGrid at 0x7f33df1e4910>

[Figure: seaborn relplot of out against x, colored by fn and sized by amp]

Hint

As a convenience, xyzpy.label() can also be used to decorate a function as a xyzpy.Sampler by supplying the sampler kwarg. If it is True, a sampler will be instantiated with data_name=None. If it is a string, that string is used as the data_name.

Summary#

combo_runner() is the core function which outputs a nested tuple and contains the parallelization logic and progress display etc.
Runner and run_combos() are used to describe the function’s output and perform a single set of runs yielding an xarray.Dataset. These internally call combo_runner().
Harvester and harvest_combos() are used to perform many sets of runs, continuously merging the results into one larger xarray.Dataset - Harvester.full_ds, probably synced to disk. These internally call run_combos().
Sampler and sample_combos() are used to sparsely sample from parameter combinations. Unlike a normal Harvester, the data is aggregated automatically into a pandas.DataFrame.

In general, you would only generate data with one of these methods at a time - see the full demonstrations in Examples.

# some cleanup
harvester.delete_ds()
sampler.delete_df()
