Add ability to derive variables and add selected derived forcings #34

ealerskans · 2024-11-06T12:20:52Z

Implements the ability to derive fields from the input datasets, as discussed in Deriving forcings #29.

At the moment, I have only added the possibility to derive the following forcings:

- top-of-atmosphere radiation
- hour of day (cyclically encoded)
- day of year (cyclically encoded)
- ~~time of year (cyclically encoded)~~

Additional variables, such as boundary and land-sea masks, should also be added, but I think that is for another PR.

- Update the configuration file so that we list the dependencies and the method used to calculate the derived variable instead of having a flag to say that the variables should be derived. This approach is temporary and might be revised soon.
- Add a new class in mllam_data_prep/config.py for derived variables to distinguish them from non-derived variables.
- Update mllam_data_prep/ops/loading.py to distinguish between derived and non-derived variables.
- Move all functions related to forcing derivations to a new and renamed module (mllam_data_prep/ops/forcings.py).
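
As a rough illustration of the config-driven approach described above, the sketch below resolves a function from its fully qualified name and calls it with the listed dependencies. The config keys (`dependencies`, `function`), the helper names `resolve_function` and `derive_variable`, and the use of `math.hypot` as a stand-in derivation are all assumptions made for illustration; they are not the layout used in mllam_data_prep.

```python
import importlib
from typing import Any, Callable, Dict, List


def resolve_function(full_name: str) -> Callable[..., Any]:
    """Import and return a callable given its full dotted path."""
    module_name, _, func_name = full_name.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.function', got {full_name!r}")
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


def derive_variable(spec: Dict[str, Any], data: Dict[str, Any]) -> Any:
    """Call the configured function on the listed dependencies, in order."""
    func = resolve_function(spec["function"])
    args: List[Any] = [data[name] for name in spec["dependencies"]]
    return func(*args)


# Hypothetical config entry: wind speed derived from two input fields.
wind_speed_spec = {"dependencies": ["u", "v"], "function": "math.hypot"}
wind_speed = derive_variable(wind_speed_spec, {"u": 3.0, "v": 4.0})
```

Listing the function by its full namespace keeps the config explicit about which code runs, at the cost of tying the config to module paths.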

dataset
- Output dataset is created in 'create_dataset' instead of in the 'subset_dataset' and 'derive_variables' functions.
- Rename dataset variables to make it clearer what they are and also make them more consistent between 'subset_dataset' and 'derive_variables'.
- Add function for aligning the derived variables to the correct output dimensions.
- Move the 'derived_variables' from their own dataset in the example config file to the 'danra_surface' dataset, as it is now possible to combine them.

leifdenby

Really looking great! I will find a time in our calendar tomorrow so we can have a look together :)

leifdenby · 2025-01-09T17:30:16Z

mllam_data_prep/config.py

@@ -301,6 +328,54 @@ class Config(dataclass_wizard.JSONWizard, dataclass_wizard.YAMLWizard):
    class _(JSONWizard.Meta):
        raise_on_unknown_json_key = True

+    @staticmethod
+    def load_config(*args, **kwargs):


Great that you've put this validation code all together. Instead of creating a new method called load_config, though, I think it would be better to use the __post_init__() method (https://docs.python.org/3/library/dataclasses.html#dataclasses.__post_init__) that dataclasses can implement. I would then make your validation code a function that isn't a member of the Config class but is just put in the global scope of this module (you could maybe call it validate_config?). Using __post_init__ will ensure this function is automatically called whatever the dataclass instance is initialised from (be it a yaml file, yaml-formatted string, json or directly as a Config object).

Really nice method! I have tried to implement it as suggested.
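
For readers following the thread, here is a minimal sketch of the suggested pattern, using plain standard-library dataclasses rather than dataclass-wizard. The `InputDataset` class and its simplified list-of-names fields are assumptions for illustration; only the `validate_config` name and the two checks (at least one of 'variables' or 'derived_variables' present, no shared names) come from this PR.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class InputDataset:
    # Simplified stand-in for the real input-dataset config (assumption).
    path: str
    variables: Optional[List[str]] = None
    derived_variables: Optional[List[str]] = None


def validate_config(inputs: Dict[str, InputDataset]) -> None:
    """Module-level validation, kept outside the Config class."""
    for name, dataset in inputs.items():
        selected = set(dataset.variables or [])
        derived = set(dataset.derived_variables or [])
        if not selected and not derived:
            raise ValueError(
                f"Input '{name}' needs 'variables' and/or 'derived_variables'"
            )
        overlap = selected & derived
        if overlap:
            raise ValueError(
                f"Input '{name}' lists {sorted(overlap)} as both selected and derived"
            )


@dataclass
class Config:
    inputs: Dict[str, InputDataset] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Runs on every construction path that goes through __init__.
        validate_config(self.inputs)
```

Because `__post_init__` is called by the generated `__init__`, any loader that builds the object through the normal constructor triggers the validation without a separate entry point.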

leifdenby · 2025-01-09T17:31:52Z

README.md

@@ -103,7 +103,7 @@ The package can also be used as a python module to create datasets directly, for
 import mllam_data_prep as mdp

 config_path = "example.danra.yaml"
-config = mdp.Config.from_yaml_file(config_path)
+config = mdp.Config.load_config(config_path)


If you make the change I suggested to use __post_init__ on the Config dataclass, then you can still use from_yaml_file here, which makes it clear we are just using that method as implemented by dataclass-wizard :)

leifdenby · 2025-01-09T17:32:20Z

README.md

@@ -175,6 +175,18 @@ inputs:
    variables:
      # use surface incoming shortwave radiation as forcing
      - swavr0m
+    derived_variables:


leifdenby · 2025-01-09T17:34:16Z

mllam_data_prep/create_dataset.py

@@ -120,23 +125,50 @@ def create_dataset(config: Config):

    output_config = config.output
    output_coord_ranges = output_config.coord_ranges
+    chunking_config = config.output.chunking

    dataarrays_by_target = defaultdict(list)

    for dataset_name, input_config in config.inputs.items():
        path = input_config.path
        variables = input_config.variables


maybe we should rename this selected_variables? What do you think? It is a bit confusing to have both variables and derived_variables, maybe calling it selected_variables will make it clear that this part of the code deals with the ones that are "selected" from the input dataset

Yes, I think that is a good idea. However, I think it makes sense to still have it named variables in the config file and not rename it there as well. What do you think?

leifdenby · 2025-01-09T17:35:37Z

mllam_data_prep/create_dataset.py

@@ -255,7 +286,7 @@ def create_dataset_zarr(fp_config, fp_zarr: str = None):
        The path to the zarr file to write the dataset to. If not provided, the zarr file will be written
        to the same directory as the config file with the extension changed to '.zarr'.
    """
-    config = Config.from_yaml_file(file=fp_config)
+    config = Config.load_config(file=fp_config)


same here for the .from_yaml_file call :)

leifdenby · 2025-01-09T20:15:18Z

mllam_data_prep/ops/derived_variables.py

+    # Where TOA radiation is negative, set to 0
+    toa_radiation = xr.where(solar_constant * cos_sza < 0, 0, solar_constant * cos_sza)
+
+    if isinstance(toa_radiation, xr.DataArray):


leifdenby · 2025-01-09T20:25:03Z

mllam_data_prep/ops/derived_variables.py

+
+    Returns
+    -------
+    hour_of_day_cos: Union[xr.DataArray, float]


It might be simpler (and more explicit, which is often good) to change the config to explicitly call these features hour_of_day_cos and hour_of_day_sin, so that each function only ever returns one derived variable (which makes the rest of the code simpler, I think).
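
One way to picture this suggestion is a lookup from derived-variable name to a function with a single output. The sketch below uses plain floats in place of xr.DataArray, and the `DERIVED_VARIABLES` registry is a hypothetical name, not something from the PR.

```python
import math
from typing import Callable, Dict

HOURS_PER_DAY = 24.0


def hour_of_day_sin(hour: float) -> float:
    """Sine component of the cyclically encoded hour of day."""
    return math.sin(hour / HOURS_PER_DAY * 2 * math.pi)


def hour_of_day_cos(hour: float) -> float:
    """Cosine component of the cyclically encoded hour of day."""
    return math.cos(hour / HOURS_PER_DAY * 2 * math.pi)


# Hypothetical registry: one config name maps to exactly one output.
DERIVED_VARIABLES: Dict[str, Callable[[float], float]] = {
    "hour_of_day_sin": hour_of_day_sin,
    "hour_of_day_cos": hour_of_day_cos,
}

# Derive both components for 06:00.
derived = {name: func(6.0) for name, func in DERIVED_VARIABLES.items()}
```

With one output per name, the calling code never has to unpack tuples or guess how many arrays a derivation returns.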

leifdenby · 2025-01-09T20:27:39Z

mllam_data_prep/ops/derived_variables.py

+        Sine part of cyclically encoded input data
+    """
+
+    data_sin = np.sin((data / data_max) * 2 * np.pi)


You could split this into two functions, or make 'cos' or 'sin' (as strings) an argument to this function, say cyclically_encode_values(values, max_value, component='cos'). But inlining might be better to make this more explicit (since it is only one line).
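
A minimal sketch of the `component` variant mentioned in this comment, written with the standard-library `math` module and plain sequences instead of NumPy or xarray (an assumption made to keep the example dependency-free). The signature follows the one proposed above; the error handling is an addition for illustration.

```python
import math
from typing import List, Sequence


def cyclically_encode_values(
    values: Sequence[float], max_value: float, component: str = "cos"
) -> List[float]:
    """Return the sine or cosine part of cyclically encoded values."""
    if component == "sin":
        func = math.sin
    elif component == "cos":
        func = math.cos
    else:
        raise ValueError(f"component must be 'sin' or 'cos', got {component!r}")
    # Map each value onto the unit circle: one full period per max_value.
    return [func(value / max_value * 2 * math.pi) for value in values]


# Hours 0, 6 and 12 of a 24-hour cycle.
hours_cos = cyclically_encode_values([0, 6, 12], max_value=24, component="cos")
hours_sin = cyclically_encode_values([0, 6, 12], max_value=24, component="sin")
```

The encoding wraps around, so a value of 0 and a value equal to `max_value` map to the same point, which is the reason for encoding hours and days this way.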

leifdenby · 2025-01-09T20:28:27Z

mllam_data_prep/ops/subsetting.py

+    chunks = {
+        dim: chunking.get(dim, int(ds_subset[dim].count())) for dim in ds_subset.dims
+    }
+    ds_subset = ds_subset.chunk(chunks)


if the chunking is done here already (during load), why do we need to do it again later when deriving variables?

leifdenby · 2025-01-09T20:29:18Z

tests/test_derived_variables.py

I don't quite understand the mocking, patching and tests in here :) but let's talk about it so I can learn!

…guments that are not data extracted from the input dataset. Update the 'hour_of_day' variable so that it is now specified in the config file which cyclically encoded component is to be derived (sin or cos) and make 'calculate_hour_of_day' only return one component, based on the extra_kwargs 'component' supplied.

ealerskans added 12 commits October 28, 2024 14:34

First attempt at adding derived forcings

981d676

Add derivation of cyclic encoded hour of day and day of year

f37161c

Add derivation of cyclic encoded time of year

71afd3a

Update and add docstrings

abb626b

Remove time_of_year

8b1f18e

Provide the full namespace of the function

7854013

Rename the module with derived variables

7fa90bf

Rename the function used for deriving variables

48c9e3e

Redefine the config file for derived variables and how they are calculated

8de9404

Remove derived variables from 'load_and_subset_dataset'

ffc030c

Add try/except for derived variables when loading the dataset

692cdd3

leifdenby mentioned this pull request Nov 18, 2024

Roadmap #5

Closed

13 tasks

leifdenby modified the milestones: v0.4.0, v0.6.0 Nov 18, 2024

ealerskans added 13 commits December 5, 2024 08:54

Chunk the input data with the defined output chunks

c0cd875

Update toa_radiation function name

55224f3

Correct kwargs usage, add back dropped coordinates and return correct dataset

678ea52

Prepare for hour_of_day and day_of_year

9d2db07

Add optional 'attributes' to the config of 'derived_variables' and check the attributes of the derived variable data-array

26455bc

Add dummy function for getting lat,lon (preparation for mllam#33)

fbb6065

Add function for chunking data and checking the chunk size

3a12f48

Add back coordinates on the subset instead of for each derived variable individually

3ace219

Add 'hour_of_day' to example config

a6b61b0

Merge branch 'main' into feature/derive_forcings

1814297

Rename derived variables dataset section in the example config

9dcace6

Remove f-string from 'name_format'

aba6757

Update README

143edb6

ealerskans changed the title ~~WIP: Add selected derived forcings~~ Add ability to derive variables and add selected derived forcings Dec 10, 2024

ealerskans added 21 commits December 17, 2024 13:29

Do not add 'attributes' to docstring

dbd5bfd

Remove unnecessary exception handling

474a83d

Move 'subset_dataset' to 'ops.subsetting'

1da66e2

Move 'derived_variables' to 'ops'

dc7dc5e

Move chunk size check to 'chunking' module

c9e96af

Add module docstring

47b8411

Update tests

5ae772f

Add global REQUIRED_FIELD_ATTRIBUTES var and updated check for required attributes

2c0bdf8

Update long name for toa_radiation

f1ce6d1

Update README

58d8af6

Return dropped coordinates to the data-arrays instead

f87b954

Adds dims to the dataset to make it work with derived variables that doesn't have all dimensions. This way we don't need to broadcast these variables explicitly to all dimensions.

80cf058

Update README

f61a3b6

Add 'load_config' function, which wraps 'from_yaml_file' and checks that either 'variables' or 'derived_variables' are included and that if both are included, they don't contain the same variable names

554f869

Update README

085aae3

Move 'chunk_dataset' to the chunking module

980e511

Update error message for when missing both 'variables' and 'derived_variables' in an input dataset in the config file

b6e80d5

Move the deriving-functions to a separate module

d6c1b36

Update tests

f1e67bc

Rename (and move): 'mllam_data_prep/ops/derived_variables.py' -> 'mllam_data_prep/ops/derive_variable/dispatcher.py'

89e9ad8

leifdenby reviewed Jan 9, 2025

ealerskans added 8 commits January 15, 2025 06:44

Use the __post_init__() method to validate the config

bdf3466

Loop over 'variables' in 'create_dataset'

d3c8693

Update file structure

0fc31bf

Add comment as to why chunking of coordinates is needed

6a7a1e3

Loop over 'derived_variables' in 'create_dataset'

92ad379

Update 'calculate_day_of_year' to only return one component (sin or cos)

e158a6c

Do not modify the arguments in the function for checking (and now getting) attrs for derived variables

ff9acc7
