All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- add ability to derive variables from input datasets #34, @ealerskans, @mafdmi
- add GitHub PR template to guide the development process on GitHub #44, @leifdenby
- add support for zarr 3.0.0 and above #51, @kashif
This release adds support for an optional `extra` section in the config file (for user-defined extra information that is ignored by `mllam-data-prep`) and fixes a few minor issues. Note that to use the `extra` section in the config file, the schema version in the config file must be increased to `v0.5.0`.
- Add optional section called `extra` to config file to allow for user-defined extra information that is ignored by `mllam-data-prep` but can be used by downstream applications, @leifdenby
- remove f-string from `name_format` in config examples #35
- replace global config for `dataclass_wizard` on `mllam_data_prep.config.Config` with config specific to that dataclass (to avoid conflicts with other uses of `dataclass_wizard`) #36
- Schema version bumped to `v0.5.0` to match release version that supports optional `extra` section in config #18
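As a rough illustration of how a downstream application might read the `extra` section, the sketch below works on a config that has already been loaded into a plain Python dict. The key name `schema_version`, the contents of `extra`, and the helper names `parse_version` and `get_extra` are all assumptions made for this example; none of them are part of the `mllam-data-prep` API.

```python
# Hypothetical config, shown as the dict a YAML loader might produce.
# The key name `schema_version` and everything inside `extra` are assumptions.
config = {
    "schema_version": "v0.5.0",
    "extra": {"experiment_name": "demo", "contact": "someone@example.org"},
}


def parse_version(tag):
    """Turn a tag such as 'v0.5.0' into a comparable tuple (0, 5, 0)."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def get_extra(config, minimum="v0.5.0"):
    """Return the user-defined `extra` section, or an empty dict if absent.

    Raises ValueError when `extra` is present but the schema version is
    older than the version that introduced the section.
    """
    version = config.get("schema_version", "v0.0.0")
    if "extra" in config and parse_version(version) < parse_version(minimum):
        raise ValueError(
            f"`extra` requires schema version {minimum} or newer, got {version}"
        )
    return config.get("extra", {})


extra = get_extra(config)
```

The version check mirrors the note above that the schema version must be at least `v0.5.0` before `extra` can be used; how the library itself enforces this is not shown here.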
This release adds support for defining the output path in the command line interface and addresses bugs around optional dependencies for `dask.distributed`.
- fix bug by making dependency `distributed` optional
- change config example to call validation split `val` instead of `validation` #28
- fix typo in install dependency `distributed`
- add missing `psutil` requirement. #21.
- add support for parallel processing using `dask.distributed` with command line flags `--dask-distributed-local-core-fraction` and `--dask-distributed-local-memory-fraction` to control the number of cores and memory to use on the local machine.
- add support for creating dataset splits (e.g. train, validation, test) through the `output.splitting` section in the config file, and support for optionally computing statistics for a given split (with `output.splitting.splits.{split_name}.compute_statistics`).
- include `units` and `long_name` attributes for all stacked variables as `{output_variable}_units` and `{output_variable}_long_name`.
- split dataset creation and storage to zarr into separate functions `mllam_data_prep.create_dataset(...)` and `mllam_data_prep.create_dataset_zarr(...)` respectively
- changes to spec from v0.1.0:
  - the `architecture` section has been renamed `output` to make it clearer that this section defines the properties of the output of `mllam-data-prep`
  - `sampling_dim` removed from the `output` (previously `architecture`) section of the spec; this is not needed to create the training data
  - the variables (and their dimensions) of the output definition have been renamed from `architecture.input_variables` to `output.variables`
  - coordinate value ranges for the dimensions of the output (i.e. what the architecture expects as input) have been renamed from `architecture.input_ranges` to `output.coord_ranges` to make the use clearer
  - selection on variable coordinate values is now set with `inputs.{dataset_name}.variables.{variable_name}.values` rather than `inputs.{dataset_name}.variables.{variable_name}.sel`
  - when the dimension-mapping method `stack_variables_by_var_name` is used, the formatting string for the new variable is now called `name_format` rather than `name`
  - when dimension-mapping is done by simply renaming a dimension, this configuration now needs to be set by providing the named method (`rename`) explicitly through the `method` key, i.e. rather than `{to_dim}: {from_dim}` it is now `{to_dim}: {method: rename, dim: {from_dim}}` to match the signature of the other dimension-mapping methods.
  - the `inputs.{dataset_name}.name` attribute has been removed; with the key `dataset_name` this is superfluous
- relax minimum Python version requirement to `>3.8` to simplify downstream usage
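The spec changes from v0.1.0 listed above can be read as a set of key renames. The following is a minimal sketch of such a migration for a config already loaded into a Python dict. The function `migrate_v0_1_0_config` is hypothetical and not part of `mllam-data-prep`, and the location of the dimension mappings under an `inputs.{dataset_name}.dim_mapping` key is an assumption of this example; the remaining key paths follow the changelog entries literally.

```python
import copy


def migrate_v0_1_0_config(old):
    """Return a copy of a v0.1.0-style config dict using the newer key names.

    Hypothetical helper for illustration only.
    """
    new = copy.deepcopy(old)  # leave the caller's dict untouched

    # `architecture` -> `output`, drop `sampling_dim`, rename sub-keys
    if "architecture" in new:
        output = new.pop("architecture")
        output.pop("sampling_dim", None)
        if "input_variables" in output:
            output["variables"] = output.pop("input_variables")
        if "input_ranges" in output:
            output["coord_ranges"] = output.pop("input_ranges")
        new["output"] = output

    for dataset in new.get("inputs", {}).values():
        # the dataset key already names the dataset, so `name` is superfluous
        dataset.pop("name", None)

        # coordinate selection: `sel` -> `values`
        variables = dataset.get("variables")
        if isinstance(variables, dict):
            for selection in variables.values():
                if isinstance(selection, dict) and "sel" in selection:
                    selection["values"] = selection.pop("sel")

        # ASSUMPTION: dimension mappings live under a `dim_mapping` key
        dim_mapping = dataset.get("dim_mapping", {})
        for to_dim, mapping in list(dim_mapping.items()):
            if isinstance(mapping, str):
                # `{to_dim}: {from_dim}` -> explicit `rename` method
                dim_mapping[to_dim] = {"method": "rename", "dim": mapping}
            elif (
                isinstance(mapping, dict)
                and mapping.get("method") == "stack_variables_by_var_name"
                and "name" in mapping
            ):
                mapping["name_format"] = mapping.pop("name")

    return new
```

The function works on a deep copy so the original config can be kept for comparison. It does not validate the result against the schema.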
First tagged release of `mllam-data-prep`, which includes functionality to
declaratively (in a yaml-config file) describe how the variables and
coordinates of a set of zarr-based source datasets are mapped to a new set of
variables with new coordinates in a single training dataset, and to write this
resulting single dataset to a new zarr dataset. This explicit mapping gives the
flexibility to target different model architectures (which may
require different inputs with different shapes between architectures).