xyzpy.manage

Manage datasets — loading, saving, merging etc.

Attributes

_DEFAULT_FN_CACHE_PATH

_engine_extensions

Functions

cache_to_disk([fn, cachedir])

Cache this function to disk, using joblib.

auto_add_extension(file_name, engine)

Ensure file_name has an extension matching engine.

save_ds(ds, file_name[, engine])

Saves an xarray dataset.

load_ds(file_name[, engine, load_to_mem, create_new, ...])

Loads an xarray dataset. Basically xarray.open_dataset with some different defaults and convenient behaviour.

save_merge_ds(ds, fname[, overwrite])

Save dataset ds, but check for an existing dataset with that name first, and if it exists, merge the two before saving.

trimna(obj)

Drop values across dims where all values are NaN.

sort_dims(ds)

Reorder variable dimensions to match ds.dims. This is an in-place operation.

post_fix(ds, postfix)

Append "_{postfix}" to each data variable name.

check_runs(obj[, dim, var, sel])

Print out information about the range and any missing values for an integer dimension.

auto_xyz_ds(x[, y_z])

Automatically turn an array into an xarray dataset. Transpose y_z if necessary to automatically match dimension sizes.

merge_sync_conflict_datasets(base_name[, engine, ...])

Glob files based on base_name, merge them, save this new dataset if it contains new info, then clean up the conflicts.

save_df(df, name[, engine, key])

Save a dataframe to disk.

load_df(name[, engine, key])

Load a dataframe from disk.

Module Contents

xyzpy.manage._DEFAULT_FN_CACHE_PATH = '__xyz_cache__'
xyzpy.manage.cache_to_disk(fn=None, *, cachedir=_DEFAULT_FN_CACHE_PATH, **kwargs)[source]

Cache this function to disk, using joblib.
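The signature above, with fn=None followed by keyword-only options, suggests the decorator can be applied either bare or with arguments. The following is a minimal sketch of that calling pattern only. It uses pickle from the standard library rather than joblib, and disk_cache, slow_square and the cache file layout are hypothetical names and choices made for illustration, not xyzpy's implementation.

```python
import functools
import hashlib
import os
import pickle
import tempfile


def disk_cache(fn=None, *, cachedir="__sketch_cache__"):
    """Hypothetical stand-in for a disk cache; uses pickle, not joblib."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            os.makedirs(cachedir, exist_ok=True)
            # Key the cache entry on the function name and its arguments.
            raw = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            path = os.path.join(cachedir, hashlib.sha256(raw).hexdigest() + ".pkl")
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = func(*args, **kwargs)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result

        return wrapper

    # Bare use (@disk_cache) passes the function directly;
    # configured use (@disk_cache(cachedir=...)) arrives with fn=None.
    return decorator if fn is None else decorator(fn)


calls = []
demo_dir = tempfile.mkdtemp()


@disk_cache(cachedir=demo_dir)
def slow_square(x):
    calls.append(x)  # record each real evaluation
    return x * x


first = slow_square(4)
second = slow_square(4)  # read back from disk; the body is not re-run
```

In this sketch the second call never appends to calls, which is the behaviour a disk cache is meant to provide.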

xyzpy.manage._engine_extensions
xyzpy.manage.auto_add_extension(file_name, engine)[source]

Ensure file_name has an extension matching engine.

Parameters:
  • file_name (str) – File name to normalize.

  • engine ({'h5netcdf', 'netcdf4', 'joblib', 'zarr'}) – Engine determining the extension.

Returns:

File name with an appropriate extension appended.

Return type:

str
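One plausible reading of this behaviour is sketched below. The extension values are assumptions made for illustration (the real mapping lives in xyzpy.manage._engine_extensions and may differ), and add_extension_sketch is a hypothetical name, not part of xyzpy.

```python
# Assumed mapping, for illustration only; the real values are defined in
# xyzpy.manage._engine_extensions and may differ.
ASSUMED_EXTENSIONS = {
    "h5netcdf": ".h5",
    "netcdf4": ".nc",
    "joblib": ".dmp",
    "zarr": ".zarr",
}


def add_extension_sketch(file_name, engine):
    """Append the engine's extension unless file_name already ends with it."""
    ext = ASSUMED_EXTENSIONS[engine]
    if file_name.endswith(ext):
        return file_name
    return file_name + ext


with_ext = add_extension_sketch("results", "zarr")
unchanged = add_extension_sketch("results.zarr", "zarr")
```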

xyzpy.manage.save_ds(ds, file_name, engine='h5netcdf', **kwargs)[source]

Saves an xarray dataset.

Parameters:
  • ds (xarray.Dataset) – The dataset to save.

  • file_name (str) – Name of the file to save to.

  • engine ({'h5netcdf', 'netcdf4', 'joblib', 'zarr'}, optional) – Engine used to save file with.

Return type:

None

xyzpy.manage.load_ds(file_name, engine='h5netcdf', load_to_mem=None, create_new=False, chunks=None, **kwargs)[source]

Loads an xarray dataset. Basically xarray.open_dataset with some different defaults and convenient behaviour.

Parameters:
  • file_name (str) – Name of file to open.

  • engine ({'h5netcdf', 'netcdf4', 'joblib', 'zarr'}, optional) – Engine used to load file.

  • load_to_mem (bool, optional) – Once opened, load from disk into memory. Defaults to True if chunks=None.

  • create_new (bool, optional) – If no file exists make a blank one.

  • chunks (int or dict, optional) – Passed to xarray.open_dataset so that data is stored using dask.array.

Returns:

ds – Loaded Dataset.

Return type:

xarray.Dataset
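The interaction between load_to_mem and chunks can be sketched as a small default-resolution step. This assumes the documented rule is the whole story: an explicit load_to_mem is used as given, and None falls back to whether chunks is None. resolve_load_to_mem is a hypothetical helper, not an xyzpy function.

```python
def resolve_load_to_mem(load_to_mem=None, chunks=None):
    """Hypothetical helper mirroring the documented default."""
    if load_to_mem is None:
        # No explicit choice: load eagerly only when dask chunks are not requested.
        return chunks is None
    return load_to_mem


eager = resolve_load_to_mem()                  # no chunks, so load into memory
lazy = resolve_load_to_mem(chunks={"x": 10})   # chunked, so stay lazy
forced = resolve_load_to_mem(True, chunks=10)  # explicit value is respected
```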

xyzpy.manage.save_merge_ds(ds, fname, overwrite=None, **kwargs)[source]

Save dataset ds, but check for an existing dataset with that name first, and if it exists, merge the two before saving.

Parameters:
  • ds (xarray.Dataset) – The dataset to save.

  • fname (str) – The file name.

  • overwrite ({None, False, True}, optional) –

    How to merge the dataset with the existing dataset.

    • None: the datasets will be merged if there are no conflicts

    • False: data will be taken from old dataset if conflicting

    • True: data will be taken from new dataset if conflicting
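The three overwrite modes can be illustrated with plain dictionaries standing in for datasets. This is a sketch of the semantics only: merge_sketch is a hypothetical function, and raising an error on conflict when overwrite is None is an assumption about what "merged if there are no conflicts" implies.

```python
def merge_sketch(old, new, overwrite=None):
    """Merge two flat dicts following the three documented overwrite modes."""
    conflicts = {k for k in old.keys() & new.keys() if old[k] != new[k]}
    if overwrite is None:
        if conflicts:
            # Assumption: conflicting values are treated as an error in this mode.
            raise ValueError(f"conflicting keys: {sorted(conflicts)}")
        return {**old, **new}
    if overwrite:
        return {**old, **new}  # new data wins on conflict
    return {**new, **old}  # old data wins on conflict


old = {"a": 1, "b": 2}
new = {"b": 20, "c": 3}

keep_old = merge_sketch(old, new, overwrite=False)
keep_new = merge_sketch(old, new, overwrite=True)
try:
    merge_sketch(old, new)
    strict_failed = False
except ValueError:
    strict_failed = True
```

Here the key "b" conflicts, so the strict mode fails while the other two modes differ only in which value of "b" survives.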

xyzpy.manage.trimna(obj)[source]

Drop values across dims where all values are NaN.

Parameters:

obj (xarray.Dataset or xarray.DataArray) – Object to trim.

Returns:

Trimmed object.

Return type:

same type as obj
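To make "all values are NaN" concrete, the sketch below applies the same idea to a two-dimensional nested list: a row or column is dropped only when every entry in it is NaN. It is a standard-library illustration, and trim_all_nan is a hypothetical name; the real function operates on xarray objects.

```python
import math


def trim_all_nan(grid):
    """Drop rows and columns of a 2D list whose entries are all NaN."""
    keep_rows = [
        i for i, row in enumerate(grid)
        if not all(math.isnan(v) for v in row)
    ]
    n_cols = len(grid[0]) if grid else 0
    keep_cols = [
        j for j in range(n_cols)
        if not all(math.isnan(grid[i][j]) for i in range(len(grid)))
    ]
    return [[grid[i][j] for j in keep_cols] for i in keep_rows]


nan = float("nan")
grid = [
    [1.0, nan, 2.0],
    [nan, nan, nan],  # all NaN: this row is dropped
    [3.0, nan, nan],  # partly NaN: this row is kept
]
trimmed = trim_all_nan(grid)  # the middle column is all NaN and is dropped too
```

Isolated NaN values survive, since only fully empty rows and columns are removed.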

xyzpy.manage.sort_dims(ds)[source]

Reorder variable dimensions to match ds.dims. This is an in-place operation.

Parameters:

ds (xarray.Dataset) – Dataset to reorder in place.

Return type:

None

xyzpy.manage.post_fix(ds, postfix)[source]

Append "_{postfix}" to each data variable name.

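Parameters:
  • ds (xarray.Dataset) – Dataset whose data variables are renamed.

  • postfix (str) – String appended, after an underscore, to each data variable name.
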
Returns:

Renamed dataset.

Return type:

xarray.Dataset

xyzpy.manage.check_runs(obj, dim='run', var=None, sel=())[source]

Print out information about the range and any missing values for an integer dimension.

Parameters:
  • obj (xarray object) – Data to check.

  • dim (str, optional) – Dimension to check, defaults to 'run'.

  • var (str, optional) – Subselect this data variable first.

  • sel (mapping, optional) – Subselect these other coordinates first.
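The kind of report described here, the range of an integer dimension plus any gaps in it, can be sketched with the standard library. summarize_runs and the printed format are hypothetical; the real function works on xarray objects and its output format is not specified in this reference.

```python
def summarize_runs(values):
    """Return (lowest, highest, missing) for a collection of integer labels."""
    present = set(values)
    lo, hi = min(present), max(present)
    # Any integer inside the range that never appears counts as missing.
    missing = [i for i in range(lo, hi + 1) if i not in present]
    return lo, hi, missing


lo, hi, missing = summarize_runs([0, 1, 2, 5, 6, 9])
print(f"range: {lo} to {hi}, missing: {missing}")  # hypothetical report format
```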

xyzpy.manage.auto_xyz_ds(x, y_z=None)[source]

Automatically turn an array into an xarray dataset. Transpose y_z if necessary to automatically match dimension sizes.

Parameters:
  • x (array_like) – The x-coordinates.

  • y_z (array_like, optional) – The y-data, possibly varying with coordinate z.
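The transposition step can be sketched with nested lists. The sketch assumes the target layout is one row per z value with each row as long as x. That orientation is an assumption made for illustration, and orient_to_x is a hypothetical helper, not the library's code.

```python
def orient_to_x(x, y_z):
    """Return y_z as rows of length len(x), transposing if needed."""
    if y_z and len(y_z[0]) == len(x):
        return [list(row) for row in y_z]
    transposed = [list(col) for col in zip(*y_z)]
    if transposed and len(transposed[0]) == len(x):
        return transposed
    raise ValueError("no axis of y_z matches the length of x")


x = [0.0, 0.5, 1.0]
# Given as 3 rows of 2 values; only the transposed form has rows matching x.
needs_flip = [[1, 10], [2, 20], [3, 30]]
flipped = orient_to_x(x, needs_flip)
already_ok = orient_to_x(x, [[1, 2, 3], [10, 20, 30]])
```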

xyzpy.manage.merge_sync_conflict_datasets(base_name, engine='h5netcdf', combine_first=False)[source]

Glob files based on base_name, merge them, save this new dataset if it contains new info, then clean up the conflicts.

Parameters:
  • base_name (str) – Base file name to glob on; should include '*'.

  • engine (str, optional) – Load and save engine used by xarray.

  • combine_first (bool, optional) – If True, combine datasets sequentially using combine_first, preferring the first dataset in the list, which is assumed to be the original. If False, merge all datasets together using xr.merge, which will raise an error if there are any conflicts.
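The sequential combine_first=True behaviour can be sketched as a left fold in which earlier datasets take priority and later ones only fill gaps. Dictionaries with None for missing entries stand in for datasets here, and combine_first_sketch is a hypothetical helper rather than the xarray method of the same name.

```python
from functools import reduce


def combine_first_sketch(first, other):
    """Keep values from first; take from other only where first has none."""
    out = dict(first)
    for key, value in other.items():
        if out.get(key) is None:
            out[key] = value
    return out


datasets = [
    {"run0": 1.0, "run1": None},  # assumed original: its values take priority
    {"run1": 2.0, "run2": 3.0},   # fills the gap at run1 and adds run2
    {"run0": 99.0, "run3": 4.0},  # conflicting run0 is ignored; run3 is added
]
merged = reduce(combine_first_sketch, datasets)
```

With combine_first=False the documented behaviour is a single xr.merge over all datasets, which raises an error on the conflicting run0 value instead of resolving it silently.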

xyzpy.manage.save_df(df, name, engine='pickle', key='df', **kwargs)[source]

Save a dataframe to disk.

Parameters:
  • df (pandas.DataFrame) – DataFrame to save.

  • name (str) – File name to save to.

  • engine ({'pickle', 'csv', 'hdf'}, optional) – Storage backend.

  • key (str, optional) – HDF key when engine='hdf'.

  • **kwargs – Passed through to the pandas writer.

xyzpy.manage.load_df(name, engine='pickle', key='df', **kwargs)[source]

Load a dataframe from disk.

Parameters:
  • name (str) – File name to read from.

  • engine ({'pickle', 'csv', 'hdf'}, optional) – Storage backend.

  • key (str, optional) – HDF key when engine='hdf'.

  • **kwargs – Passed through to the pandas reader.

Returns:

Loaded dataframe.

Return type:

pandas.DataFrame