2. Parallel & Distributed Generation

2.1. Parallel Generation - num_workers and pool
Running a function for many different parameters theoretically allows perfect parallelization since each run is independent. xyzpy
can automatically handle this in a number of different ways:
- Supply parallel=True when calling Runner.run_combos(...) or Harvester.harvest_combos(...) etc. This spawns a ProcessPoolExecutor with the same number of workers as logical cores.
- Supply num_workers=... instead to explicitly control how many workers are used. Since for many numeric codes threading is controlled by the environment variable $OMP_NUM_THREADS, you generally want the product of this and num_workers to be equal to the number of cores.
- Supply executor=... to use any custom parallel pool-executor like object (e.g. a dask.distributed client or mpi4py pool) which has a submit/apply_async method and yields futures with a result/get method. More specifically, this covers pools with an API matching either concurrent.futures or an ipyparallel view. Pools from multiprocessing.pool are also explicitly handled.
- Use a Crop to write combos to disk, which can then be ‘grown’ persistently by any computers with access to the filesystem, such as a distributed cluster - see below.
The first three options can be used on any of the various functions derived from combo_runner()
:
[1]:
import xyzpy as xyz


def slow(a, b):
    import time
    time.sleep(1)
    return a + b, a - b


xyz.case_runner_to_df(
    slow,
    fn_args=['a', 'b'],
    cases=[(i, i + 1) for i in range(8)],
    var_names=['sum', 'diff'],
    parallel=True,
)
100%|##########| 8/8 [00:02<00:00, 3.44it/s]
[1]:
|   | a | b | sum | diff |
|---|---|---|-----|------|
| 0 | 0 | 1 | 1   | -1   |
| 1 | 1 | 2 | 3   | -1   |
| 2 | 2 | 3 | 5   | -1   |
| 3 | 3 | 4 | 7   | -1   |
| 4 | 4 | 5 | 9   | -1   |
| 5 | 5 | 6 | 11  | -1   |
| 6 | 6 | 7 | 13  | -1   |
| 7 | 7 | 8 | 15  | -1   |
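Equivalently, an explicit pool can be supplied via executor= - the following is a minimal sketch using the standard library's ProcessPoolExecutor (the choice of 4 workers is arbitrary, and a dask.distributed client or other compatible pool could be passed in the same way):

from concurrent.futures import ProcessPoolExecutor

import xyzpy as xyz

# assumes the slow(a, b) function from the cell above is defined

with ProcessPoolExecutor(max_workers=4) as pool:
    df = xyz.case_runner_to_df(
        slow,
        fn_args=['a', 'b'],
        cases=[(i, i + 1) for i in range(8)],
        var_names=['sum', 'diff'],
        executor=pool,  # any object with a compatible submit/result style API
    )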
2.2. Batched / Distributed generation - Crop
Running combos using the disk as a persistence mechanism requires one more object - the Crop. These can be instantiated directly or, generally more conveniently, from a parent Runner or Harvester: xyzpy.Runner.Crop() and xyzpy.Harvester.Crop(). Using a Crop requires a number of steps:
1. Creation, with:
   - a unique name to call this set of runs
   - a fn if not creating from a Harvester or Runner
   - other optional settings, such as batchsize, controlling how many runs to group into each batch.
2. ‘Sow’. Use xyzpy.Crop.sow_combos() to write combos into batches on disk.
3. ‘Grow’. Grow each batch. This can be done in a number of ways:
   - In another process, navigate to the same directory and run, for example, python -c "import xyzpy; c = xyzpy.Crop(name=...); xyzpy.grow(i, crop=c)" to grow the ith batch of the crop with the specified name. See grow() for other options. This could be put manually in a script to run on a batch system.
   - Use xyzpy.Crop.grow_cluster() - experimental! This automatically generates and submits a script using SGE, PBS or SLURM. See its options and xyzpy.Crop.gen_cluster_script() for the template scripts.
   - Use xyzpy.Crop.grow() or xyzpy.Crop.grow_missing() to complete some or all of the batches locally. This can be useful to (a) finish up a few missing/errored runs, or (b) run all the combos with persistent progress, so that the runs can be restarted at a completely different time / with updated functions etc.
4. Watch the progress. Crop.__repr__ will show how many batches have been completed of the total sown.
5. ‘Reap’. Once all the batches have completed, run xyzpy.Crop.reap() to collect the results and remove the batches’ temporary directory. If the crop originated from a Runner or Harvester, the data will be labelled, merged and saved accordingly.
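As a rough end-to-end sketch of the sow, grow and reap steps above (growing everything locally for simplicity - the crop name, batchsize and combos here are purely illustrative, and exact call signatures may differ slightly):

import xyzpy as xyz


def slow(a, b):
    import time
    time.sleep(1)
    return a + b, a - b


# a runner describing the function and its outputs
r = xyz.Runner(slow, var_names=['sum', 'diff'])

# 1. create a crop from the runner, grouping runs into batches
crop = r.Crop(name='slow_crop', batchsize=4)

# 2. 'sow' the combos to disk as batches
crop.sow_combos({'a': range(4), 'b': range(4)})

# 3. 'grow' the batches - done locally here, but any process/machine
#    with access to the same filesystem could do this instead
crop.grow_missing()

# 4. 'reap' the results into a labelled dataset, cleaning up the batches
ds = crop.reap()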
See the full demonstrations in Examples.
Note
You can reap an unfinished Crop, as long as there is at least one result, by passing the allow_incomplete=True option to reap(). Note that missing results will be represented by numpy.nan, which might affect the eventual dtype of the harvested results. To avoid this, consider also setting sync=False, so that nothing is written to disk until the full Crop is finished.
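For example (a sketch, reusing the crop object from the workflow sketch above):

# collect whatever results exist so far; missing entries become numpy.nan,
# and sync=False means nothing is written back to disk yet
ds_partial = crop.reap(allow_incomplete=True, sync=False)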
Hint
You can automatically load all crops in the current directory (or a specific one) into a dictionary by calling the function xyzpy.load_crops().
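For instance (a minimal sketch):

import xyzpy as xyz

# a dictionary of all the crops found in the current directory
crops = xyz.load_crops()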