xyzpy.gen.cropping
Functions
calc_clean_up_default_res – Logic for choosing whether to automatically clean up a crop, and what, if any, the default all-nan result should be.
gen_cluster_script – Generate a cluster script to grow a Crop.
gen_qsub_script – Generate a qsub script to grow a Crop.
grow – Automatically process a batch of cases into results.
grow_cluster – Automagically submit SGE, PBS, or slurm jobs to grow all missing results.
load_crops – Automatically load all the crops found in the current directory.
parse_crop_details – Work out how to structure the sowed data.
qsub_grow – Automagically submit SGE or PBS jobs to grow all missing results.
Classes
Crop – Encapsulates all the details describing a single 'crop', that is, its location, name, and batch size/number.
Reaper – Class that acts as a stateful function to retrieve already sown and grown results.
Sower – Class for sowing a 'crop' of batched combos to then 'grow' (on any number of workers sharing the filesystem) and then reap.
- class xyzpy.gen.cropping.Crop(*, fn=None, name=None, parent_dir=None, save_fn=None, batchsize=None, num_batches=None, shuffle=False, farmer=None, autoload=True)
Encapsulates all the details describing a single ‘crop’, that is, its location, name, and batch size/number. Also allows tracking of the crop’s progress and, experimentally, automatic submission of workers to a grid engine to complete un-grown cases. Can also be instantiated directly from a Runner, Harvester or Crop instance.
- Parameters
fn (callable, optional) – Target function - Crop name will be inferred from this if not given explicitly. If given, Sower will also default to saving a version of fn to disk for cropping.grow to use.
name (str, optional) – Custom name for this set of runs - must be given if fn is not.
parent_dir (str, optional) – If given, alternative directory to put the “.xyz-{name}/” folder in with all the cases and results.
save_fn (bool, optional) – Whether to save the function to disk for cropping.grow to use. Will default to True if fn is given.
batchsize (int, optional) – How many cases to group into a single batch per worker. By default, batchsize=1. Cannot be specified if num_batches is.
num_batches (int, optional) – How many total batches to aim for, cannot be specified if batchsize is.
farmer ({xyzpy.Runner, xyzpy.Harvester, xyzpy.Sampler}, optional) – A Runner, Harvester or Sampler instance, from which the fn can be inferred and which can also allow the Crop to reap itself straight to a dataset or dataframe.
autoload (bool, optional) – If True, check for the existence of a Crop written to disk with the same location, and if found, load it.
See also
Runner.Crop, Harvester.Crop, Sampler.Crop
- property all_nan_result
Get a stand-in result for cases which are still missing.
- check_bad(delete_bad=True)
Check that the result dumps are not bad: sometimes their length does not match the batch. Optionally delete these so that they can be re-grown.
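As a rough illustration of the check described above, the sketch below flags batches whose results differ in length from the batch of cases they were grown from. The `find_bad_batches` helper and the in-memory dictionaries are hypothetical stand-ins; the actual method inspects result dumps on disk and may work differently.

```python
def find_bad_batches(batches, results):
    """Return the sorted ids of batches whose results have the wrong length.

    batches: dict mapping batch id -> sequence of cases (hypothetical layout)
    results: dict mapping batch id -> sequence of results
    """
    bad = []
    for batch_id, result in results.items():
        if len(result) != len(batches[batch_id]):
            bad.append(batch_id)
    return sorted(bad)


batches = {1: [{"a": 1}, {"a": 2}], 2: [{"a": 3}, {"a": 4}]}
results = {1: (10, 20), 2: (30,)}  # batch 2 is one result short

bad_ids = find_bad_batches(batches, results)

# Mimic delete_bad=True: drop the bad results so they can be re-grown.
for batch_id in bad_ids:
    del results[batch_id]
```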
- choose_batch_settings(*, combos=None, cases=None)
Work out how to divide all cases into batches, i.e. ensure that batchsize * num_batches >= num_cases.
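The constraint batchsize * num_batches >= num_cases can be met with a ceiling division. The following is a minimal sketch under that assumption, not the library's implementation; `batch_settings` is a hypothetical helper, and the default of one case per batch follows the documented batchsize default.

```python
import math


def batch_settings(num_cases, batchsize=None, num_batches=None):
    """Return (batchsize, num_batches) covering all cases."""
    if batchsize is not None and num_batches is not None:
        raise ValueError("Specify only one of batchsize and num_batches.")
    if num_batches is not None:
        # Aim for this many batches, rounding the size up so no case is lost.
        batchsize = max(1, math.ceil(num_cases / num_batches))
    elif batchsize is None:
        batchsize = 1  # documented default
    # Recompute the batch count from the final batch size.
    num_batches = math.ceil(num_cases / batchsize)
    return batchsize, num_batches


by_size = batch_settings(10, batchsize=3)
by_count = batch_settings(10, num_batches=4)
default = batch_settings(10)
```

Note that when num_batches is given, the resulting number of batches can be smaller than requested, since the batch size is rounded up first.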
- property fn
Function to save with the Crop for automatic loading and running. Default crop name will be inferred from this if not given explicitly as well.
- gen_cluster_script(scheduler, batch_ids=None, *, hours=None, minutes=None, seconds=None, gigabytes=2, num_procs=1, num_threads=None, num_nodes=1, launcher='python', setup='#', shell_setup='', mpi=False, temp_gigabytes=1, output_directory=None, extra_resources=None, debugging=False)
Generate a cluster script to grow a Crop.
- Parameters
crop (Crop) – The crop to grow.
scheduler ({'sge', 'pbs', 'slurm'}) – Whether to use a SGE, PBS or slurm submission script template.
batch_ids (int or tuple[int]) – Which batch numbers to grow, defaults to all missing batches.
hours (int) – How many hours to request, default=0.
minutes (int, optional) – How many minutes to request, default=20.
seconds (int, optional) – How many seconds to request, default=0.
gigabytes (int, optional) – How much memory to request, default: 2.
num_procs (int, optional) – How many processes to request (threaded cores or MPI), default: 1.
launcher (str, optional) – How to launch the script, default: 'python'. But could for example be 'mpiexec python' for an MPI program.
setup (str, optional) – Python script to run before growing, for things that shouldn't be put in the crop function itself, e.g. one-time imports with side-effects like: "import tensorflow as tf; tf.enable_eager_execution()".
shell_setup (str, optional) – Commands to be run by the shell before the python script is executed. E.g. conda activate my_env.
mpi (bool, optional) – Request MPI processes not threaded processes.
temp_gigabytes (int, optional) – How much temporary on-disk memory.
output_directory (str, optional) – What directory to write output to. Defaults to “$HOME/Scratch/output”.
extra_resources (str, optional) – Extra “#$ -l” resources, e.g. ‘gpu=1’
debugging (bool, optional) – Set the python log level to debugging.
- Return type
- gen_qsub_script(batch_ids=None, *, scheduler='sge', **kwargs)
Generate a qsub script to grow a Crop. Deprecated in favour of gen_cluster_script and will be removed in the future.
- grow_cluster(scheduler, batch_ids=None, *, hours=None, minutes=None, seconds=None, gigabytes=2, num_procs=1, num_threads=None, num_nodes=1, launcher='python', setup='#', shell_setup='', mpi=False, temp_gigabytes=1, output_directory=None, extra_resources=None, debugging=False)
Automagically submit SGE, PBS, or slurm jobs to grow all missing results.
- Parameters
crop (Crop) – The crop to grow.
scheduler ({'sge', 'pbs', 'slurm'}) – Whether to use a SGE, PBS or slurm submission script template.
batch_ids (int or tuple[int]) – Which batch numbers to grow, defaults to all missing batches.
hours (int) – How many hours to request, default=0.
minutes (int, optional) – How many minutes to request, default=20.
seconds (int, optional) – How many seconds to request, default=0.
gigabytes (int, optional) – How much memory to request, default: 2.
num_procs (int, optional) – How many processes to request (threaded cores or MPI), default: 1.
launcher (str, optional) – How to launch the script, default: 'python'. But could for example be 'mpiexec python' for an MPI program.
setup (str, optional) – Python script to run before growing, for things that shouldn't be put in the crop function itself, e.g. one-time imports with side-effects like: "import tensorflow as tf; tf.enable_eager_execution()".
shell_setup (str, optional) – Commands to be run by the shell before the python script is executed. E.g. conda activate my_env.
mpi (bool, optional) – Request MPI processes not threaded processes.
temp_gigabytes (int, optional) – How much temporary on-disk memory.
output_directory (str, optional) – What directory to write output to. Defaults to “$HOME/Scratch/output”.
extra_resources (str, optional) – Extra “#$ -l” resources, e.g. ‘gpu=1’
debugging (bool, optional) – Set the python log level to debugging.
- load_function()
Load the saved function from disk, and try to re-insert it back into Harvester or Runner if present.
- property num_sown_batches
Total number of batches to be run/grown.
- prepare(combos=None, cases=None, fn_args=None)
Write information about this crop and the supplied combos to disk. Typically done at the start of sowing, not when the Crop is instantiated.
- qsub_grow(batch_ids=None, *, scheduler='sge', **kwargs)
Automagically submit SGE or PBS jobs to grow all missing results. Deprecated in favour of grow_cluster and will be removed in the future.
- reap(wait=False, sync=True, overwrite=None, clean_up=None, allow_incomplete=False)
Reap sown and grown combos from disk. Return a dataset if a runner or harvester is set, otherwise, the raw nested tuple.
- Parameters
wait (bool, optional) – Whether to wait for results to appear. If false (default) all results need to be in place before the reap.
sync (bool, optional) – Immediately sync the new dataset with the on-disk full dataset or dataframe if a harvester or sampler is used.
overwrite (bool, optional) – How to compare data when syncing to on-disk dataset. If None (default), merge as long as there are no conflicts. True: overwrite with the new data. False: discard any new conflicting data.
clean_up (bool, optional) – Whether to delete all the batch files once the results have been gathered. If left as None this will be automatically set to not allow_incomplete.
allow_incomplete (bool, optional) – Allow only partially completed crop results to be reaped; incomplete results will all be filled in as nan.
- Return type
nested tuple or xarray.Dataset
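The three overwrite modes can be pictured with plain dictionaries standing in for the on-disk and newly reaped data. This is only a sketch: `sync_results` is a hypothetical helper, the real sync operates on datasets or dataframes, and raising an error on a conflict when overwrite is None is an assumption about what "merge as long as no conflicts" implies.

```python
def sync_results(on_disk, new, overwrite=None):
    """Merge `new` into a copy of `on_disk` following the three modes."""
    merged = dict(on_disk)
    for key, value in new.items():
        conflict = key in merged and merged[key] != value
        if not conflict:
            merged[key] = value
        elif overwrite is None:
            # Assumed behaviour: refuse to merge conflicting data.
            raise ValueError(f"Conflicting data for {key!r}.")
        elif overwrite:
            merged[key] = value
        # overwrite is False: keep the on-disk value, discard the new one.
    return merged


on_disk = {("a", 1): 0.5}
new = {("a", 1): 0.7, ("a", 2): 0.9}

overwritten = sync_results(on_disk, new, overwrite=True)
kept = sync_results(on_disk, new, overwrite=False)
```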
- reap_combos(wait=False, clean_up=None, allow_incomplete=False)
Reap already sown and grown results from this crop.
- Parameters
wait (bool, optional) – Whether to wait for results to appear. If false (default) all results need to be in place before the reap.
clean_up (bool, optional) – Whether to delete all the batch files once the results have been gathered. If left as None this will be automatically set to not allow_incomplete.
allow_incomplete (bool, optional) – Allow only partially completed crop results to be reaped; incomplete results will all be filled in as nan.
- Returns
results – ‘N-dimensional’ tuple containing the results.
- Return type
nested tuple
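How clean_up and allow_incomplete interact can be sketched as follows. Both helpers are hypothetical, batch ids are assumed to start at 1, and all batches are assumed to have the same size; the library's own bookkeeping may differ.

```python
import math


def resolve_clean_up(clean_up, allow_incomplete):
    """If clean_up is left as None, clean up only after a complete reap."""
    if clean_up is None:
        clean_up = not allow_incomplete
    return clean_up


def gather(results_by_batch, num_batches, batchsize, allow_incomplete=False):
    """Concatenate batch results, filling missing batches with nan."""
    gathered = []
    for batch_id in range(1, num_batches + 1):
        if batch_id in results_by_batch:
            gathered.extend(results_by_batch[batch_id])
        elif allow_incomplete:
            # Stand-in values for cases that have not been grown yet.
            gathered.extend([math.nan] * batchsize)
        else:
            raise KeyError(f"Batch {batch_id} has not been grown yet.")
    return tuple(gathered)


# Batch 2 of 3 is missing, so its two cases are filled with nan.
partial = gather({1: (1.0, 2.0), 3: (5.0, 6.0)}, 3, 2, allow_incomplete=True)
```

Keeping the batch files after an incomplete reap means the missing batches can still be grown and reaped later.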
- reap_combos_to_ds(var_names=None, var_dims=None, var_coords=None, constants=None, attrs=None, parse=True, wait=False, clean_up=None, allow_incomplete=False, to_df=False)
Reap a function over sowed combinations and output to a Dataset.
- Parameters
var_names (str, sequence of strings, or None) – Variable name(s) of the output(s) of fn, set to None if fn outputs data already labelled in a Dataset or DataArray.
var_dims (sequence of either strings or string sequences, optional) – ‘Internal’ names of dimensions for each variable, the values for each dimension should be contained as a mapping in either var_coords (not needed by fn) or constants (needed by fn).
var_coords (mapping, optional) – Mapping of extra coords the output variables may depend on.
constants (mapping, optional) – Arguments to fn which are not iterated over, these will be recorded either as attributes or coordinates if they are named in var_dims.
resources (mapping, optional) – Like constants but they will not be recorded.
attrs (mapping, optional) – Any extra attributes to store.
wait (bool, optional) – Whether to wait for results to appear. If false (default) all results need to be in place before the reap.
clean_up (bool, optional) – Whether to delete all the batch files once the results have been gathered. If left as None this will be automatically set to not allow_incomplete.
allow_incomplete (bool, optional) – Allow only partially completed crop results to be reaped; incomplete results will all be filled in as nan.
to_df (bool, optional) – Whether to reap to an xarray.Dataset or a pandas.DataFrame.
- Returns
Multidimensional labelled dataset containing all the results.
- Return type
xarray.Dataset or pandas.DataFrame
- reap_harvest(harvester, wait=False, sync=True, overwrite=None, clean_up=None, allow_incomplete=False)
Reap a Crop over sowed combos and merge with the dataset defined by a Harvester.
- reap_runner(runner, wait=False, clean_up=None, allow_incomplete=False, to_df=False)
Reap a Crop over sowed combos and save to a dataset defined by a Runner.
- reap_samples(sampler, wait=False, sync=True, clean_up=None, allow_incomplete=False)
Reap a Crop over sowed combos and merge with the dataframe defined by a Sampler.
- sow_cases(fn_args, cases, combos=None, constants=None, verbosity=1, batchsize=None, num_batches=None)
Sow cases to disk to be later grown, potentially in batches.
- Parameters
fn_args (iterable[str] or str) – The names and order of the function arguments, can be None if each case is supplied as a dict.
cases (iterable or mappings, optional) – Sequence of individual cases to sow for all or some function arguments.
combos (dict_like[str, iterable]) – Combinations to sow for some or all function arguments.
constants (mapping, optional) – Provide additional constant function values to use when sowing.
verbosity (int, optional) – How much information to show when sowing.
batchsize (int, optional) – If specified, set a new batchsize for the crop.
num_batches (int, optional) – If specified, set a new num_batches for the crop.
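The relationship between fn_args and cases can be sketched as a normalisation step that turns every case into a dict of keyword arguments. `normalize_cases` is a hypothetical helper written for illustration; it is not part of the documented API, and letting case values take precedence over constants is an assumption.

```python
def normalize_cases(fn_args, cases, constants=None):
    """Turn each case into a dict of keyword arguments."""
    if isinstance(fn_args, str):
        fn_args = (fn_args,)
    constants = dict(constants or {})
    normalized = []
    for case in cases:
        if isinstance(case, dict):
            # Cases supplied as dicts already name their arguments.
            kwargs = dict(case)
        else:
            if fn_args is None:
                raise ValueError("fn_args is required unless cases are dicts.")
            if len(fn_args) == 1 and not isinstance(case, (tuple, list)):
                case = (case,)  # allow bare values for a single argument
            kwargs = dict(zip(fn_args, case))
        normalized.append({**constants, **kwargs})
    return normalized


from_tuples = normalize_cases(("a", "b"), [(1, 2), (3, 4)], constants={"c": 0})
from_dicts = normalize_cases(None, [{"a": 1, "b": 2}])
from_values = normalize_cases("a", [1, 2])
```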
- sow_combos(combos, cases=None, constants=None, shuffle=False, verbosity=1, batchsize=None, num_batches=None)
Sow combos to disk to be later grown, potentially in batches. Note if you have already sown this Crop, as long as the number of batches hasn’t changed (e.g. you have just tweaked the function or a constant argument), you can safely resow and only the batches will be overwritten, i.e. the results will remain.
- Parameters
combos (dict_like[str, iterable]) – The combinations to sow for all or some function arguments.
cases (iterable or mappings, optional) – Optionally provide a sequence of individual cases to sow for some or all function arguments.
constants (mapping, optional) – Provide additional constant function values to use when sowing.
shuffle (bool or int, optional) – If given, sow the combos in a random order (using random.seed and random.shuffle), which can be helpful for distributing resources when not all cases are computationally equal.
verbosity (int, optional) – How much information to show when sowing.
batchsize (int, optional) – If specified, set a new batchsize for the crop.
num_batches (int, optional) – If specified, set a new num_batches for the crop.
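Conceptually, sowing combos expands them into the full product of argument values, optionally shuffles the resulting cases, and groups them into batches. The sketch below illustrates that idea with hypothetical helpers; it uses a local random.Random instance rather than the global random.seed, and treating an integer shuffle value as the seed is an assumption.

```python
import itertools
import random


def expand_combos(combos, shuffle=False):
    """Expand a dict of argument values into a list of keyword-argument dicts."""
    names = tuple(combos)
    cases = [
        dict(zip(names, values))
        for values in itertools.product(*combos.values())
    ]
    if shuffle is not False:
        # True shuffles without a fixed seed; an int is used as the seed.
        seed = None if shuffle is True else shuffle
        random.Random(seed).shuffle(cases)
    return cases


def batched(cases, batchsize):
    """Group cases into consecutive batches of at most `batchsize`."""
    return [cases[i:i + batchsize] for i in range(0, len(cases), batchsize)]


combos = {"a": [1, 2], "b": ["x", "y", "z"]}
cases = expand_combos(combos)
batches = batched(cases, 4)
shuffled = expand_combos(combos, shuffle=42)
```

Shuffling before batching spreads expensive cases across batches, so no single worker is likely to receive all of the slow ones.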
- class xyzpy.gen.cropping.Reaper(crop, num_batches, wait=False, default_result=None)
Class that acts as a stateful function to retrieve already sown and grown results.
- class xyzpy.gen.cropping.Sower(crop)
Class for sowing a ‘crop’ of batched combos to then ‘grow’ (on any number of workers sharing the filesystem) and then reap.
- xyzpy.gen.cropping.calc_clean_up_default_res(crop, clean_up, allow_incomplete)
Logic for choosing whether to automatically clean up a crop, and what, if any, the default all-nan result should be.
- xyzpy.gen.cropping.gen_cluster_script(crop, scheduler, batch_ids=None, *, hours=None, minutes=None, seconds=None, gigabytes=2, num_procs=1, num_threads=None, num_nodes=1, launcher='python', setup='#', shell_setup='', mpi=False, temp_gigabytes=1, output_directory=None, extra_resources=None, debugging=False)
Generate a cluster script to grow a Crop.
- Parameters
crop (Crop) – The crop to grow.
scheduler ({'sge', 'pbs', 'slurm'}) – Whether to use a SGE, PBS or slurm submission script template.
batch_ids (int or tuple[int]) – Which batch numbers to grow, defaults to all missing batches.
hours (int) – How many hours to request, default=0.
minutes (int, optional) – How many minutes to request, default=20.
seconds (int, optional) – How many seconds to request, default=0.
gigabytes (int, optional) – How much memory to request, default: 2.
num_procs (int, optional) – How many processes to request (threaded cores or MPI), default: 1.
launcher (str, optional) – How to launch the script, default: 'python'. But could for example be 'mpiexec python' for an MPI program.
setup (str, optional) – Python script to run before growing, for things that shouldn't be put in the crop function itself, e.g. one-time imports with side-effects like: "import tensorflow as tf; tf.enable_eager_execution()".
shell_setup (str, optional) – Commands to be run by the shell before the python script is executed. E.g. conda activate my_env.
mpi (bool, optional) – Request MPI processes not threaded processes.
temp_gigabytes (int, optional) – How much temporary on-disk memory.
output_directory (str, optional) – What directory to write output to. Defaults to “$HOME/Scratch/output”.
extra_resources (str, optional) – Extra “#$ -l” resources, e.g. ‘gpu=1’
debugging (bool, optional) – Set the python log level to debugging.
- Return type
- xyzpy.gen.cropping.gen_qsub_script(crop, batch_ids=None, *, scheduler='sge', **kwargs)
Generate a qsub script to grow a Crop. Deprecated in favour of gen_cluster_script and will be removed in the future.
- xyzpy.gen.cropping.grow(batch_number, crop=None, fn=None, check_mpi=True, verbosity=2, debugging=False)
Automatically process a batch of cases into results. Should be run in an “.xyz-{fn_name}” folder.
- Parameters
batch_number (int) – Which batch to ‘grow’ into a set of results.
crop (xyzpy.Crop) – Description of where and how to store the cases and results.
fn (callable, optional) – If specified, the function used to generate the results, otherwise the function will be loaded from disk.
check_mpi (bool, optional) – Whether to check if the process is rank 0 and only save results if so - allows mpi functions to be simply used. Defaults to true, this should only be turned off if e.g. a pool of workers is being used to run different grow instances.
verbosity ({0, 1, 2}, optional) – How much information to show.
debugging (bool, optional) – Set logging level to DEBUG.
- xyzpy.gen.cropping.grow_cluster(crop, scheduler, batch_ids=None, *, hours=None, minutes=None, seconds=None, gigabytes=2, num_procs=1, num_threads=None, num_nodes=1, launcher='python', setup='#', shell_setup='', mpi=False, temp_gigabytes=1, output_directory=None, extra_resources=None, debugging=False)
Automagically submit SGE, PBS, or slurm jobs to grow all missing results.
- Parameters
crop (Crop) – The crop to grow.
scheduler ({'sge', 'pbs', 'slurm'}) – Whether to use a SGE, PBS or slurm submission script template.
batch_ids (int or tuple[int]) – Which batch numbers to grow, defaults to all missing batches.
hours (int) – How many hours to request, default=0.
minutes (int, optional) – How many minutes to request, default=20.
seconds (int, optional) – How many seconds to request, default=0.
gigabytes (int, optional) – How much memory to request, default: 2.
num_procs (int, optional) – How many processes to request (threaded cores or MPI), default: 1.
launcher (str, optional) – How to launch the script, default: 'python'. But could for example be 'mpiexec python' for an MPI program.
setup (str, optional) – Python script to run before growing, for things that shouldn't be put in the crop function itself, e.g. one-time imports with side-effects like: "import tensorflow as tf; tf.enable_eager_execution()".
shell_setup (str, optional) – Commands to be run by the shell before the python script is executed. E.g. conda activate my_env.
mpi (bool, optional) – Request MPI processes not threaded processes.
temp_gigabytes (int, optional) – How much temporary on-disk memory.
output_directory (str, optional) – What directory to write output to. Defaults to “$HOME/Scratch/output”.
extra_resources (str, optional) – Extra “#$ -l” resources, e.g. ‘gpu=1’
debugging (bool, optional) – Set the python log level to debugging.
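To give a feel for how the hours, minutes and seconds parameters might translate into a scheduler request, the sketch below builds the wall-time directive for each of the three schedulers. The directive prefixes are the standard ones for SGE, PBS and slurm, but `time_directive` itself is a hypothetical helper; it uses the documented defaults (0 hours, 20 minutes, 0 seconds) and is not taken from the library's script templates.

```python
def time_directive(scheduler, hours=0, minutes=20, seconds=0):
    """Return the wall-time line for a submission script header."""
    # Normalise overflow, e.g. 90 minutes becomes 1 hour 30 minutes.
    total = 3600 * hours + 60 * minutes + seconds
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    walltime = f"{h:02d}:{m:02d}:{s:02d}"

    prefixes = {
        "sge": "#$ -l h_rt=",
        "pbs": "#PBS -l walltime=",
        "slurm": "#SBATCH --time=",
    }
    if scheduler not in prefixes:
        raise ValueError(f"Unknown scheduler: {scheduler!r}")
    return prefixes[scheduler] + walltime


slurm_line = time_directive("slurm")
sge_line = time_directive("sge", hours=1, minutes=90)
```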
- xyzpy.gen.cropping.load_crops(directory='.')
Automatically load all the crops found in the current directory.
- xyzpy.gen.cropping.parse_crop_details(fn, crop_name, crop_parent)
Work out how to structure the sowed data.
- Parameters
- Returns
crop_location (str) – Full path to the crop-folder.
crop_name (str) – Name of the crop.
crop_parent (str) – Parent folder of the crop.
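A minimal sketch of how these three values might be derived, based on the Crop parameter descriptions: the name is inferred from fn when not given, and the crop lives in a ".xyz-{name}" folder inside the parent directory. The `crop_details` helper is hypothetical, and falling back to the current working directory when no parent is given is an assumption.

```python
import os


def crop_details(fn, crop_name, crop_parent):
    """Return (crop_location, crop_name, crop_parent)."""
    if crop_name is None:
        if fn is None:
            raise ValueError("Either fn or crop_name must be given.")
        crop_name = fn.__name__  # infer the name from the function
    if crop_parent is None:
        crop_parent = os.getcwd()  # assumed default location
    crop_location = os.path.join(crop_parent, f".xyz-{crop_name}")
    return crop_location, crop_name, crop_parent


def simulate(x):
    return x ** 2


location, name, parent = crop_details(simulate, None, "/tmp/runs")
```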