1. Generating Data
The idea of xyzpy is to ease some of the pain of generating data over a large parameter space. The central aim is that, once you know what a single run of a function looks like, it should be as easy as saying, “run these combinations of parameters, now run these particular cases”, with everything automatically aggregated into a fully self-described dataset.
[1]:
%matplotlib inline
%config InlineBackend.figure_formats = ['svg']
from xyzpy import *
import numpy as np
1.1. Combos & Cases
The main backend function is combo_runner(), which in its simplest form takes a function, say:
[2]:
def foo(a, b, c):
    return f"{a}-{b}-{c}", np.sin(a)
and combos of the form:
[3]:
combos = [
    ('a', [1, 2, 3]),
    ('b', ['x', 'y', 'z']),
    ('c', [True, False]),
]
and generates a nested (here 3-dimensional) array of the outputs of foo for all 3 * 3 * 2 = 18 combinations of input arguments:
[4]:
combo_runner(foo, combos)
100%|##########| 18/18 [00:00<00:00, 114563.69it/s]
[4]:
(((('1-x-True', 0.8414709848078965), ('1-x-False', 0.8414709848078965)),
(('1-y-True', 0.8414709848078965), ('1-y-False', 0.8414709848078965)),
(('1-z-True', 0.8414709848078965), ('1-z-False', 0.8414709848078965))),
((('2-x-True', 0.9092974268256817), ('2-x-False', 0.9092974268256817)),
(('2-y-True', 0.9092974268256817), ('2-y-False', 0.9092974268256817)),
(('2-z-True', 0.9092974268256817), ('2-z-False', 0.9092974268256817))),
((('3-x-True', 0.1411200080598672), ('3-x-False', 0.1411200080598672)),
(('3-y-True', 0.1411200080598672), ('3-y-False', 0.1411200080598672)),
(('3-z-True', 0.1411200080598672), ('3-z-False', 0.1411200080598672))))
Note the progress bar shown. If the function were slower (generally the target case for xyzpy), this would show the remaining time before completion.
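Conceptually, the nested output above has one level of nesting per entry in combos. The following is a minimal pure-Python sketch of that idea, not xyzpy's implementation (which also handles parallelization and progress display). Here nested_combos is a hypothetical helper, and math.sin stands in for np.sin to keep the sketch dependency-free.

```python
import math


def foo(a, b, c):
    return f"{a}-{b}-{c}", math.sin(a)


def nested_combos(fn, combos, fixed=None):
    """Hypothetical helper: evaluate fn on every combination,
    nesting one tuple level per argument in combos."""
    fixed = fixed or {}
    if not combos:
        # all arguments are fixed: perform a single run
        return fn(**fixed)
    (name, values), rest = combos[0], combos[1:]
    # loop over this argument's values, recursing into the remaining ones
    return tuple(
        nested_combos(fn, rest, {**fixed, name: value}) for value in values
    )


combos = [
    ('a', [1, 2, 3]),
    ('b', ['x', 'y', 'z']),
    ('c', [True, False]),
]
results = nested_combos(foo, combos)

# results[i][j][k] is the output for the i-th 'a', j-th 'b' and k-th 'c'
print(results[2][1][1][0])  # prints 3-y-False
```

Indexing the result with one integer per argument, in the order the combos were given, is what makes the structure suitable for conversion into an array.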
There is also case_runner() for running isolated cases:
[5]:
cases = [(4, 'z', False), (5, 'y', True)]
case_runner(foo, fn_args=('a', 'b', 'c'), cases=cases)
100%|##########| 2/2 [00:00<00:00, 18236.10it/s]
[5]:
(('4-z-False', -0.7568024953079282), ('5-y-True', -0.9589242746631385))
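By contrast with combos, each case is a single tuple of values that is matched positionally against fn_args, and the results stay flat. A rough pure-Python equivalent is sketched below; run_cases is a hypothetical name used only for illustration, and math.sin again replaces np.sin.

```python
import math


def foo(a, b, c):
    return f"{a}-{b}-{c}", math.sin(a)


def run_cases(fn, fn_args, cases):
    """Hypothetical helper: one run per case, results kept in a flat tuple."""
    return tuple(fn(**dict(zip(fn_args, case))) for case in cases)


cases = [(4, 'z', False), (5, 'y', True)]
flat = run_cases(foo, ('a', 'b', 'c'), cases)
print([label for label, _ in flat])  # prints ['4-z-False', '5-y-True']
```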
You can also mix the two, supplying some function arguments as cases and some as combos. In this situation, for each case, all sub combinations are run:
[6]:
combo_runner(
    foo,
    cases=[
        {'a': 1, 'c': True},
        {'a': 2, 'c': False},
        {'a': 3, 'c': True},
    ],
    combos={
        'b': ['x', 'y', 'z'],
    },
)
100%|##########| 9/9 [00:00<00:00, 85019.68it/s]
[6]:
((((array(nan), array(nan)),
(array(nan), array(nan)),
(array(nan), array(nan))),
(('1-x-True', 0.8414709848078965),
('1-y-True', 0.8414709848078965),
('1-z-True', 0.8414709848078965))),
((('2-x-False', 0.9092974268256817),
('2-y-False', 0.9092974268256817),
('2-z-False', 0.9092974268256817)),
((array(nan), array(nan)),
(array(nan), array(nan)),
(array(nan), array(nan)))),
(((array(nan), array(nan)),
(array(nan), array(nan)),
(array(nan), array(nan))),
(('3-x-True', 0.1411200080598672),
('3-y-True', 0.1411200080598672),
('3-z-True', 0.1411200080598672))))
Note that combo_runner() automatically fills missing results with nan (or possibly None, depending on shape and dtype).
Note also that supplying the cases as a list of dicts avoided having to specify the function argument order.
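To see where the nan entries come from, the sketch below expands each case with every sub-combination and then lays the 9 results out on the full (a, c, b) grid, padding the grid points that were never run. This is a plain-Python illustration under the assumption that missing entries are simply padded; runs, grid and the MISSING placeholder are hypothetical names, not part of xyzpy.

```python
import itertools
import math


def foo(a, b, c):
    return f"{a}-{b}-{c}", math.sin(a)


cases = [
    {'a': 1, 'c': True},
    {'a': 2, 'c': False},
    {'a': 3, 'c': True},
]
combos = {'b': ['x', 'y', 'z']}

# each case is combined with every sub-combination: 3 cases * 3 values = 9 runs
runs = [
    {**case, **dict(zip(combos, values))}
    for case in cases
    for values in itertools.product(*combos.values())
]
results = {(r['a'], r['c'], r['b']): foo(**r) for r in runs}

# the dense grid spans every coordinate seen, so 9 of its 18 points are empty
MISSING = (math.nan, math.nan)
grid = tuple(
    tuple(
        tuple(results.get((a, c, b), MISSING) for b in combos['b'])
        for c in (False, True)
    )
    for a in (1, 2, 3)
)
print(grid[0][1][0][0])  # prints 1-x-True
```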
You can supply both combos and cases to either combo_runner() or case_runner(). The main difference is:
- combo_runner() outputs a nested tuple suitable to be turned into an array
- case_runner() outputs a flat tuple of results suitable to be put into a table
You will likely not use these functions in their raw form, but they illustrate the concept of combos and cases and underlie most other functionality.
1.2. Describing the function - Runner
To automatically put the generated data into a labelled xarray.Dataset you need to describe your function using the Runner class. In the simplest case this is just a matter of naming the outputs:
[7]:
runner = Runner(foo, var_names=['a_out', 'b_out'])
runner.run_combos(combos)
100%|##########| 18/18 [00:00<00:00, 135786.82it/s]
[7]:
<xarray.Dataset>
Dimensions:  (a: 3, b: 3, c: 2)
Coordinates:
  * a        (a) int64 1 2 3
  * b        (b) <U1 'x' 'y' 'z'
  * c        (c) bool True False
Data variables:
    a_out    (a, b, c) <U9 '1-x-True' '1-x-False' ... '3-z-True' '3-z-False'
    b_out    (a, b, c) float64 0.8415 0.8415 0.8415 ... 0.1411 0.1411 0.1411
The output dataset is also stored in runner.last_ds and, as can be seen, is completely labelled - see xarray for details of the myriad functionality this allows. See also the Basic Output Example for a more complete example.
Hint
As a convenience, label() can be used to decorate a function, turning it directly into a Runner like so:
@label(var_names=['sum', 'diff'])
def foo(x, y):
    return x + y, x - y
Various other arguments to Runner allow: 1) constant arguments to be specified, 2) each variable to have its own dimensions and 3) the coordinates of those dimensions to be specified.
For example, imagine we have a function bar with signature:
"bar(i, j, k, t) -> (A, B[x], C[x, t])"
Maybe i, j, k index a location and t is a (constant) series of times to compute. There are 3 outputs: (i) the scalar A, (ii) the vector B, which has a dimension x with known coordinates, say [10, 20, 30], and (iii) the 2D-array C, which shares the x dimension but also depends on t. The arguments to Runner to describe this situation would be:
[8]:
var_names = ['A', 'B', 'C']
var_dims = {'B': ['x'], 'C': ['x', 't']}
var_coords = {'x': [10, 20, 30]}
constants = {'t': np.linspace(0, 1, 101)}
Note that 't' doesn’t need to be specified in var_coords as it can be found in constants. Let’s explicitly mock a function with this signature and some combos to run:
[9]:
def bar(i, j, k, t):
    A = np.random.rand()
    B = np.random.rand(3)  # 'B[x]'
    C = np.random.rand(3, len(t))  # 'C[x, t]'
    return A, B, C

# if we are using a runner, combos can be supplied as a dict
combos = {
    'i': [5, 6, 7],
    'j': [0.5, 0.6, 0.7],
    'k': [0.05, 0.06, 0.07],
}
We can then run the combos:
[10]:
r = Runner(bar, constants=constants,
           var_names=var_names,
           var_coords=var_coords,
           var_dims=var_dims)
r.run_combos(combos)
100%|##########| 27/27 [00:00<00:00, 62291.64it/s]
[10]:
<xarray.Dataset>
Dimensions:  (i: 3, j: 3, k: 3, x: 3, t: 101)
Coordinates:
  * i        (i) int64 5 6 7
  * j        (j) float64 0.5 0.6 0.7
  * k        (k) float64 0.05 0.06 0.07
  * x        (x) int64 10 20 30
  * t        (t) float64 0.0 0.01 0.02 0.03 0.04 ... 0.96 0.97 0.98 0.99 1.0
Data variables:
    A        (i, j, k) float64 0.09133 0.001715 0.5488 ... 0.4148 0.8277 0.07938
    B        (i, j, k, x) float64 0.8565 0.492 0.9438 ... 0.2236 0.9373 0.6762
    C        (i, j, k, x, t) float64 0.6665 0.5983 0.1811 ... 0.1725 0.5512
We can see the dimensions 'i', 'j' and 'k' have been generated by the combos for all variables, as well as the ‘internal’ dimensions 'x' and 't' only for 'B' and 'C'. See also the Structured Output with Julia Set Example for a fuller demonstration.
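The dimensions in the output above follow a simple rule: each variable gets the combo dimensions first, followed by any of its own dimensions listed in var_dims. A small plain-Python sketch of that bookkeeping, reusing the values from the cells above, is given here; full_dims and full_shape are hypothetical helpers, and a list comprehension replaces np.linspace.

```python
combos = {
    'i': [5, 6, 7],
    'j': [0.5, 0.6, 0.7],
    'k': [0.05, 0.06, 0.07],
}
var_names = ['A', 'B', 'C']
var_dims = {'B': ['x'], 'C': ['x', 't']}
# coordinates of the internal dimensions: 'x' from var_coords, 't' from constants
coords = {'x': [10, 20, 30], 't': [n / 100 for n in range(101)]}


def full_dims(name):
    """Hypothetical helper: combo dimensions, then the variable's own."""
    return tuple(combos) + tuple(var_dims.get(name, ()))


def full_shape(name):
    """Hypothetical helper: size of each dimension of the variable."""
    sizes = {dim: len(values) for dim, values in {**combos, **coords}.items()}
    return tuple(sizes[dim] for dim in full_dims(name))


for name in var_names:
    print(name, full_dims(name), full_shape(name))
# A ('i', 'j', 'k') (3, 3, 3)
# B ('i', 'j', 'k', 'x') (3, 3, 3, 3)
# C ('i', 'j', 'k', 'x', 't') (3, 3, 3, 3, 101)
```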
Finally, if the function itself returns an xarray.Dataset, then just use var_names=None and all the outputs will be concatenated together automatically. The overhead this incurs is often negligible for anything but very fast functions.
1.3. Aggregating data - Harvester
A common scenario when running simulations is the following:
1. Generate some data
2. Save it to disk
3. Generate a different set of data (maybe after analysis of the first set)
4. Load the old data
5. Merge the new data with the old data
6. Save the new combined data
7. Repeat
The aim of the Harvester is to automate that process. A Harvester is instantiated with a Runner instance and, optionally, a data_name. If a data_name is given, then every time a round of combos/cases is generated, it will be automatically synced with an on-disk dataset of that name. Either way, the harvester will aggregate all runs into the full_ds attribute.
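The cycle that a Harvester automates can be pictured with the following plain-Python sketch. It is only an analogy under simplifying assumptions: results are kept in a dict keyed by parameter values and pickled to a temporary file, whereas xyzpy merges labelled datasets. The harvest function here is hypothetical and is not the library's API.

```python
import itertools
import math
import os
import pickle
import tempfile


def foo(a, b, c):
    return f"{a}-{b}-{c}", math.sin(a)


def harvest(fn, combos, path):
    """Hypothetical stand-in for one load / generate / merge / save round."""
    # load the old data, if any
    full = {}
    if os.path.exists(path):
        with open(path, 'rb') as f:
            full = pickle.load(f)
    # generate the new data and merge it into the old
    names = tuple(combos)
    for values in itertools.product(*combos.values()):
        full[values] = fn(**dict(zip(names, values)))
    # save the combined data
    with open(path, 'wb') as f:
        pickle.dump(full, f)
    return full


path = os.path.join(tempfile.mkdtemp(), 'foo.pkl')
harvest(foo, {'a': [1, 2, 3], 'b': ['x', 'y', 'z'], 'c': [True, False]}, path)
full = harvest(foo, {'a': [4, 5, 6], 'b': ['w', 'v'], 'c': [True, False]}, path)
print(len(full))  # prints 30, i.e. 18 + 12 runs
```

The second call picks up the 18 stored results from disk before adding its own 12, which is the behaviour the steps above describe.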
[11]:
combos = [
    ('a', [1, 2, 3]),
    ('b', ['x', 'y', 'z']),
    ('c', [True, False]),
]
harvester = Harvester(runner, data_name='foo.h5')
harvester.harvest_combos(combos)
100%|##########| 18/18 [00:00<00:00, 114217.05it/s]
Because the file named by data_name did not exist yet, this created it:
[12]:
ls *.h5
foo.h5*
xyzpy.Harvester.harvest_combos() calls xyzpy.Runner.run_combos() itself - this doesn’t need to be done separately.
Now we can run a second set of different combos:
[13]:
combos2 = {
    'a': [4, 5, 6],
    'b': ['w', 'v'],
    'c': [True, False],
}
harvester.harvest_combos(combos2)
100%|##########| 12/12 [00:00<00:00, 42762.66it/s]
Now we can check the total dataset containing all combos and cases run so far:
[14]:
harvester.full_ds
[14]:
<xarray.Dataset>
Dimensions:  (a: 6, b: 5, c: 2)
Coordinates:
  * a        (a) int64 1 2 3 4 5 6
  * b        (b) object 'v' 'w' 'x' 'y' 'z'
  * c        (c) bool True False
Data variables:
    a_out    (a, b, c) object nan nan nan nan '1-x-True' ... nan nan nan nan nan
    b_out    (a, b, c) float64 nan nan nan nan 0.8415 ... nan nan nan nan nan
Note that, since the different runs were disjoint, missing values have automatically been filled in with nan values - see xarray.merge(). The on-disk dataset now contains both runs.
Hint
As a convenience, label() can also be used to decorate a function as a xyzpy.Harvester by supplying the harvester kwarg. If True, a harvester will be instantiated with data_name=None. If a string, it is used as the data_name.
>>> @label(var_names=['sum', 'diff'], harvester='foo.h5')
... def foo(x, y):
...     return x + y, x - y
...
>>> foo
<xyzpy.Harvester>
Runner: <xyzpy.Runner>
fn: <function foo at 0x7f6217a2b550>
fn_args: ('x', 'y')
var_names: ('sum', 'diff')
var_dims: {'sum': (), 'diff': ()}
Sync file -->
foo.h5 [h5netcdf]
1.4. Aggregating Random samples of data - Sampler
Occasionally, exhaustively iterating through all combinations of arguments is unnecessary. If instead you just want to sample the parameter space sparsely, then the Sampler object allows this with much the same interface as a Harvester. The main difference is that, since the parameters are no longer gridded, the data is stored as a table in a pandas.DataFrame.
[16]:
import math
import random

@label(var_names=['out'])
def trig(amp, fn, x, phase):
    return amp * getattr(math, fn)(x - phase)

# these are the default combos/distributions to sample from
default_combos = {
    'amp': [1, 2, 3],
    'fn': ['cos', 'sin'],
    # for distributions we can supply callables
    'x': lambda: 2 * math.pi * random.random(),
    'phase': lambda: random.gauss(0.0, 0.1),
}

sampler = Sampler(trig, 'trig.pkl', default_combos)
sampler
[16]:
<xyzpy.Sampler>
Runner: <xyzpy.Runner>
fn: <function trig at 0x7f72b035a9d0>
fn_args: ('amp', 'fn', 'x', 'phase')
var_names: ('out',)
var_dims: {'out': ()}
Sync file -->
trig.pkl [pickle]
Now we can run the sampler many times (and supply any of the usual arguments such as parallel=True etc). This generates a pandas.DataFrame:
[17]:
sampler.sample_combos(10000);
100%|##########| 10000/10000 [00:00<00:00, 305687.24it/s]
This has also synced the data with the on-disk file:
[18]:
!ls *.pkl
trig.pkl
You can specify Sampler(..., engine='csv') etc to use formats other than pickle.
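Each sample can be thought of as one independent draw per argument. The sketch below mimics this in plain Python under an assumption that the text above does not spell out: list entries are picked uniformly at random, while callables are simply called. The helpers draw and sample_rows are hypothetical, not xyzpy functions, and the rows are plain dicts rather than a pandas.DataFrame.

```python
import math
import random


def trig(amp, fn, x, phase):
    return amp * getattr(math, fn)(x - phase)


default_combos = {
    'amp': [1, 2, 3],
    'fn': ['cos', 'sin'],
    'x': lambda: 2 * math.pi * random.random(),
    'phase': lambda: random.gauss(0.0, 0.1),
}


def draw(spec):
    """Hypothetical helper: call a distribution, or pick from a list
    (uniform choice is an assumption made for this sketch)."""
    return spec() if callable(spec) else random.choice(spec)


def sample_rows(fn, n, combos):
    """Return n table rows, each holding the drawn inputs and the output."""
    rows = []
    for _ in range(n):
        kwargs = {name: draw(spec) for name, spec in combos.items()}
        rows.append({**kwargs, 'out': fn(**kwargs)})
    return rows


rows = sample_rows(trig, 1000, default_combos)
print(sorted(rows[0]))  # prints ['amp', 'fn', 'out', 'phase', 'x']
```

Because every row carries its own parameter values, the samples do not need to lie on a grid, which is why a table is the natural container here.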
As with the Harvester, next time we run combinations, the data is automatically aggregated into the full set:
[19]:
# here we will override some of the default sampling choices
combos = {
    'fn': ['tan'],
    'x': lambda: random.random() * math.pi / 4
}
sampler.sample_combos(5000, combos);
100%|##########| 5000/5000 [00:00<00:00, 299970.25it/s]
We can then use tools such as seaborn to visualize the full data:
[20]:
import seaborn as sns
sns.relplot(x='x', y='out', hue='fn', size='amp', data=sampler.full_df)
[20]:
<seaborn.axisgrid.FacetGrid at 0x7f72b029f8e0>
Hint
As a convenience, label() can also be used to decorate a function as a xyzpy.Sampler by supplying the sampler kwarg. If True, a sampler will be instantiated with data_name=None. If a string, it is used as the data_name.
1.5. Summary
- combo_runner() is the core function which outputs a nested tuple and contains the parallelization logic and progress display etc.
- Runner and xyzpy.Runner.run_combos() are used to describe the function’s output and perform a single set of runs yielding a Dataset. These internally call combo_runner().
- Harvester and xyzpy.Harvester.harvest_combos() are used to perform many sets of runs, continuously merging the results into one larger Dataset - Harvester.full_ds, probably synced to disk. These internally call xyzpy.Runner.run_combos().
- Sampler and xyzpy.Sampler.sample_combos() are used to sparsely sample from parameter combinations. Unlike a normal Harvester, the data is aggregated automatically into a pandas.DataFrame.
In general, you would only generate data with one of these methods at once - see the full demonstrations in Examples.
[21]:
# some cleanup
harvester.delete_ds()
sampler.delete_df()