Add ability to derive variables and add selected derived forcings #34

Open
wants to merge 83 commits into
base: main
Contributor

@ealerskans ealerskans commented Nov 6, 2024

Implements the ability to derive fields from the input datasets, as discussed in Deriving forcings #29.

At the moment, I have only added the possibility to derive the following forcings:

  • top-of-atmosphere radiation
  • hour of day (cyclically encoded)
  • day of year (cyclically encoded)
  • time of year (cyclically encoded)

Additional variables, such as boundary and land-sea masks, should also be added, but I think that belongs in a separate PR.

- Update the configuration file so that we list the dependencies
and the method used to calculate the derived variable instead
of having a flag to say that the variables should be derived.
This approach is temporary and might be revised soon.
- Add a new class in mllam_data_prep/config.py for derived
variables to distinguish them from non-derived variables.
- Updates to mllam_data_prep/ops/loading.py to distinguish
between derived and non-derived variables.
- Move all functions related to forcing derivations to a new
and renamed module (mllam_data_prep/ops/forcings.py).
@leifdenby leifdenby mentioned this pull request Nov 18, 2024
13 tasks
@leifdenby leifdenby modified the milestones: v0.4.0, v0.6.0 Nov 18, 2024
@ealerskans ealerskans changed the title WIP: Add selected derived forcings Add ability to derive variables and add selected derived forcings Dec 10, 2024
…doesn't have all dimensions. This way we don't need to broadcast these variables explicitly to all dimensions.
dataset

- Output dataset is created in 'create_dataset' instead of in the
  'subset_dataset' and 'derive_variables' functions.
- Rename dataset variables to make it clearer what they are and also
  make them more consistent between 'subset_dataset' and
  'derive_variables'.
- Add function for aligning the derived variables to the correct
  output dimensions.
- Move the 'derived_variables' from their own dataset in the example
  config file to the 'danra_surface' dataset, as it is now possible to
  combine them.
…hat either 'variables' or 'derived_variables' are included and that if both are included, they don't contain the same variable names
…ariables' in an input dataset in the config file
…am_data_prep/ops/derive_variable/dispatcher.py'
Member

@leifdenby leifdenby left a comment

Really looking great! I will find a time in our calendar tomorrow so we can have a look together :)

@@ -301,6 +328,54 @@ class Config(dataclass_wizard.JSONWizard, dataclass_wizard.YAMLWizard):
    class _(JSONWizard.Meta):
        raise_on_unknown_json_key = True

    @staticmethod
    def load_config(*args, **kwargs):
Member

Great that you've put this validation code all together. Instead of creating a new method called load_config, though, I think it would be better to use the __post_init__() method (https://docs.python.org/3/library/dataclasses.html#dataclasses.__post_init__) that dataclasses can implement. I would then make your validation code here a function that isn't a member of the Config class but just sits in the global scope of this module (you could maybe call it validate_config?). Using __post_init__ will ensure this function is automatically called whatever the dataclass instance is initialised from (be it a yaml file, yaml formatted string, json or directly as a Config object).
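
A minimal sketch of that suggestion, using plain `dataclasses` rather than dataclass-wizard. The `InputDataset` fields and the `validate_config` name are hypothetical stand-ins, not the project's actual classes, and the checks only illustrate the kind of rule described in this PR (each input needs `variables` or `derived_variables`, and the two must not share names):

```python
from dataclasses import dataclass, field
from typing import Dict, List


def validate_config(inputs: Dict[str, "InputDataset"]) -> None:
    """Module-level validation (hypothetical name), kept outside the class."""
    for name, ds in inputs.items():
        if not ds.variables and not ds.derived_variables:
            raise ValueError(f"'{name}' needs 'variables' or 'derived_variables'")
        shared = set(ds.variables) & set(ds.derived_variables)
        if shared:
            raise ValueError(f"'{name}' lists {sorted(shared)} in both sections")


@dataclass
class InputDataset:
    # Simplified stand-in for the real input-dataset config class.
    variables: List[str] = field(default_factory=list)
    derived_variables: List[str] = field(default_factory=list)


@dataclass
class Config:
    inputs: Dict[str, InputDataset]

    def __post_init__(self):
        # The generated __init__ calls this, so validation runs on
        # every path that constructs a Config instance.
        validate_config(self.inputs)
```

With this layout, constructing `Config(inputs={"a": InputDataset(variables=["x"], derived_variables=["x"])})` raises a `ValueError` without any separate loading method.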

Contributor Author

Really nice method! I have tried to implement it as suggested.

@@ -103,7 +103,7 @@ The package can also be used as a python module to create datasets directly, for
import mllam_data_prep as mdp

config_path = "example.danra.yaml"
- config = mdp.Config.from_yaml_file(config_path)
+ config = mdp.Config.load_config(config_path)
Member

If you make the change I suggested to use __post_init__ on the Config dataclass then you can still use from_yaml_file here, which makes it clear we are just using that method as implemented by dataclass-wizard :)

@@ -175,6 +175,18 @@ inputs:
variables:
# use surface incoming shortwave radiation as forcing
- swavr0m
derived_variables:
Member

beautiful!

@@ -120,23 +125,50 @@ def create_dataset(config: Config):

output_config = config.output
output_coord_ranges = output_config.coord_ranges
chunking_config = config.output.chunking

dataarrays_by_target = defaultdict(list)

for dataset_name, input_config in config.inputs.items():
    path = input_config.path
    variables = input_config.variables
Member

maybe we should rename this selected_variables? What do you think? It is a bit confusing to have both variables and derived_variables, maybe calling it selected_variables will make it clear that this part of the code deals with the ones that are "selected" from the input dataset

Contributor Author

@ealerskans ealerskans Jan 15, 2025

Yes, I think that is a good idea. However, I think it makes sense to still have it named variables in the config file and not rename it there as well. What do you think?

@@ -255,7 +286,7 @@ def create_dataset_zarr(fp_config, fp_zarr: str = None):
The path to the zarr file to write the dataset to. If not provided, the zarr file will be written
to the same directory as the config file with the extension changed to '.zarr'.
"""
- config = Config.from_yaml_file(file=fp_config)
+ config = Config.load_config(file=fp_config)
Member

same here for the .from_yaml_file call :)

# Where TOA radiation is negative, set to 0
toa_radiation = xr.where(solar_constant * cos_sza < 0, 0, solar_constant * cos_sza)

if isinstance(toa_radiation, xr.DataArray):
Member

perfect!
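
A scalar-only sketch of the clamp shown above, computing the product once instead of twice. The function name is hypothetical, and the PR's own code also handles `xr.DataArray` inputs, which this sketch does not:

```python
def clamp_toa_radiation(solar_constant: float, cos_sza: float) -> float:
    """Top-of-atmosphere radiation, set to 0 where the product is negative."""
    toa = solar_constant * cos_sza
    # A negative cosine of the solar zenith angle means the sun is below
    # the horizon, so the incoming radiation is clamped to zero.
    return toa if toa > 0 else 0.0
```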


Returns
-------
hour_of_day_cos: Union[xr.DataArray, float]
Member

It might be simpler (and more explicit, which is often good) to change the config to explicitly call these features hour_of_day_cos and hour_of_day_sin, so that each function only ever returns one derived variable (which makes the rest of the code simpler, I think).

Sine part of cyclically encoded input data
"""

data_sin = np.sin((data / data_max) * 2 * np.pi)
Member

You could split this into two functions, or make the component (cos or sin, as a string) an argument to this function, say cyclically_encode_values(values, max_value, component='cos'). But inlining might be better to make this more explicit (since it is only one line).
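
A rough sketch of the single-function option, restricted to scalar inputs and using only the standard library. The signature follows the one proposed in the comment; the error handling is an assumption, and the PR's version works on `xr.DataArray` inputs via `np.sin`, which this sketch does not cover:

```python
import math


def cyclically_encode_values(values: float, max_value: float, component: str = "cos") -> float:
    """Return one component (sin or cos) of a cyclic encoding of a scalar."""
    # Map the value onto an angle so that 0 and max_value coincide.
    angle = (values / max_value) * 2 * math.pi
    if component == "sin":
        return math.sin(angle)
    if component == "cos":
        return math.cos(angle)
    raise ValueError(f"component must be 'sin' or 'cos', got {component!r}")
```

For example, `cyclically_encode_values(6, 24, component="sin")` encodes 06:00 on a 24-hour cycle, and hours 0 and 24 map to the same point up to floating-point rounding.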

chunks = {
    dim: chunking.get(dim, int(ds_subset[dim].count())) for dim in ds_subset.dims
}
ds_subset = ds_subset.chunk(chunks)
Member

if the chunking is done here already (during load), why do we need to do it again later when deriving variables?
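
For reference, the fallback in the dictionary comprehension above can be illustrated without xarray: a dimension listed in the chunking config gets the configured size, and any other dimension gets a single chunk spanning its full length. The helper name and the sizes below are made up for illustration; with a real dataset the sizes would presumably come from something like `ds_subset.sizes`:

```python
def resolve_chunks(dim_sizes: dict, chunking: dict) -> dict:
    """Use the configured chunk size per dimension, else one chunk spanning it."""
    return {dim: chunking.get(dim, size) for dim, size in dim_sizes.items()}


dim_sizes = {"time": 48, "x": 100, "y": 80}  # illustrative sizes only
chunks = resolve_chunks(dim_sizes, chunking={"time": 6})
print(chunks)  # {'time': 6, 'x': 100, 'y': 80}
```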

Member

I don't quite understand the mocking, patching and tests in here :) but let's talk about it so I can learn!

…guments that are not data extracted from the input dataset. Update the 'hour_of_day' variable so that it is now specified in the config file which cyclically encoded component is to be derived (sin or cos) and make 'calculate_hour_of_day' only return one component, based on the extra_kwargs 'component' supplied.

3 participants