2. Parallel & Distributed Generation#

2.1. Parallel Generation - num_workers and pool#

Running a function for many different parameters is, in principle, perfectly parallelizable, since each run is independent. xyzpy can handle this automatically in a number of different ways:

  1. Supply parallel=True when calling Runner.run_combos(...) or Harvester.harvest_combos(...) etc. This spawns a ProcessPoolExecutor with the same number of workers as logical cores.

  2. Supply num_workers=... instead to explicitly control how many workers are used. Since for many numeric codes threading is controlled by the environment variable $OMP_NUM_THREADS, you generally want the product of this and num_workers to equal the number of cores.

  3. Supply executor=... to use any custom parallel pool-executor-like object (e.g. a dask.distributed client or mpi4py pool) which has a submit/apply_async method and yields futures with a result/get method. More specifically, this covers pools with an API matching either concurrent.futures or an ipyparallel view. Pools from multiprocessing.pool are also explicitly handled.

  4. Use a Crop to write combos to disk, which can then be ‘grown’ persistently by any computers with access to the filesystem, such as the nodes of a distributed cluster - see below.
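The balance described in option 2 is simple arithmetic; the following is an illustrative sketch only, assuming a hypothetical 16-core machine (the numbers here are not from the source):

```python
# Illustrative only: balance threads-per-worker against the number of
# process workers so their product matches the core count.
import os

total_cores = 16                       # hypothetical machine
threads_per_worker = 4                 # value $OMP_NUM_THREADS would be set to
os.environ["OMP_NUM_THREADS"] = str(threads_per_worker)

# choose num_workers so num_workers * threads_per_worker == total_cores
num_workers = total_cores // threads_per_worker
print(num_workers)  # 4
```

Here each of the 4 workers runs 4 OpenMP threads, saturating all 16 cores without oversubscription.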

The first three options can be used on any of the various functions derived from combo_runner():

[1]:
import xyzpy as xyz

def slow(a, b):
    import time
    time.sleep(1)
    return a + b, a - b

xyz.case_runner_to_df(
    slow,
    fn_args=['a', 'b'],
    cases=[(i, i + 1) for i in range(8)],
    var_names=['sum', 'diff'],
    parallel=True,
)
100%|##########| 8/8 [00:02<00:00,  3.44it/s]
[1]:
a b sum diff
0 0 1 1 -1
1 1 2 3 -1
2 2 3 5 -1
3 3 4 7 -1
4 4 5 9 -1
5 5 6 11 -1
6 6 7 13 -1
7 7 8 15 -1
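The executor=... option (3 above) only requires the pool to duck-type a small interface - a submit (or apply_async) method returning futures with a result (or get) method. A minimal sketch of that interface, using the standard library's ThreadPoolExecutor as a stand-in for any such pool:

```python
# Sketch of the pool interface that option 3 relies on: submit()
# returning future-like objects exposing result(). The stdlib
# ThreadPoolExecutor is used here purely as an example of such a pool.
from concurrent.futures import ThreadPoolExecutor

def slow(a, b):
    return a + b, a - b

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(slow, i, i + 1) for i in range(4)]
    results = [f.result() for f in futures]

print(results)  # [(1, -1), (3, -1), (5, -1), (7, -1)]
```

Any object satisfying this pattern - including a dask.distributed Client - can be passed straight through as executor=....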

2.2. Batched / Distributed generation - Crop#

Running combos using the disk as a persistence mechanism requires one more object - the Crop. These can be instantiated directly or, more conveniently, from a parent Runner or Harvester: xyzpy.Runner.Crop() and xyzpy.Harvester.Crop(). Using a Crop involves a number of steps:

  1. Creation with:

    • a unique name to call this set of runs

    • a fn if not creating from a Harvester or Runner

    • other optional settings such as batchsize, controlling how many runs to group into each batch.

  2. ‘Sow’. Use xyzpy.Crop.sow_combos() to write combos into batches on disk.

  3. ‘Grow’. Grow each batch. This can be done a number of ways:

    • In a different process, navigate to the same directory and run, for example, python -c "import xyzpy; c = xyzpy.Crop(name=...); xyzpy.grow(i, crop=c)" to grow the ith batch of the crop with the specified name. See grow() for other options. This could be put in a script manually to run on a batch system.

    • Use xyzpy.Crop.grow_cluster() - experimental! This automatically generates and submits a script using SGE, PBS or SLURM. See its options and xyzpy.Crop.gen_cluster_script() for the template scripts.

    • Use xyzpy.Crop.grow() or xyzpy.Crop.grow_missing() to complete some or all of the batches locally. This can be useful to (a) finish up a few missing or errored runs, or (b) run all the combos with persistent progress, so that the runs can be restarted at a completely different time, with updated functions, etc.

  4. Watch the progress. Crop.__repr__ will show how many batches have been completed of the total sown.

  5. ‘Reap’. Once all the batches have completed, run xyzpy.Crop.reap() to collect the results and remove the batches’ temporary directory. If the crop originated from a Runner or Harvester, the data will be labelled, merged and saved accordingly.
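The sow/grow/reap cycle above can be pictured as a directory acting as a work queue. The following is a conceptual stdlib-only sketch of that pattern - it does not reflect xyzpy's actual on-disk format or API, only the idea:

```python
# Conceptual sketch of sow -> grow -> reap using only the stdlib.
# NOT xyzpy's real file format: just the disk-as-persistence idea.
import os
import pickle
import tempfile

def fn(a, b):
    return a + b

workdir = tempfile.mkdtemp()

# 'sow': write each case to disk as its own batch file
cases = [(i, i + 1) for i in range(4)]
for i, case in enumerate(cases):
    with open(os.path.join(workdir, f"batch{i}.in"), "wb") as f:
        pickle.dump(case, f)

# 'grow': any process with filesystem access can pick up and run a batch
for i in range(len(cases)):
    with open(os.path.join(workdir, f"batch{i}.in"), "rb") as f:
        args = pickle.load(f)
    with open(os.path.join(workdir, f"batch{i}.out"), "wb") as f:
        pickle.dump(fn(*args), f)

# 'reap': collect all the results back together
results = []
for i in range(len(cases)):
    with open(os.path.join(workdir, f"batch{i}.out"), "rb") as f:
        results.append(pickle.load(f))

print(results)  # [1, 3, 5, 7]
```

Because each 'grow' step only needs the batch file, it can happen in any process, on any machine, at any time - which is exactly what makes the Crop suitable for cluster batch systems.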

See the full demonstrations in Examples.

Note

You can reap an unfinished Crop, as long as there is at least one result, by passing the allow_incomplete=True option to reap(). Note that missing results will be represented by numpy.nan, which might affect the eventual dtype of harvested results. To avoid this, consider also setting sync=False to avoid writing anything to disk until the full Crop is finished.
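The dtype effect mentioned above is plain numpy behaviour and easy to see directly:

```python
# Why missing results can change the dtype: mixing integers with nan
# forces numpy to upcast the array to floating point.
import numpy as np

complete = np.array([1, 2, 3])
partial = np.array([1, 2, np.nan])   # one result still missing

print(complete.dtype)  # an integer dtype
print(partial.dtype)   # float64
```

So an integer-valued output harvested before the Crop is complete may come back as floats.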

Hint

You can automatically load all crops in the current directory (or a specific one) to a dictionary by calling the function xyzpy.load_crops().