
Output predictions as zarr dataset #89

Open
khintz opened this issue Nov 21, 2024 · 13 comments
@khintz
Contributor

khintz commented Nov 21, 2024

When running in evaluation mode, figures are created, but currently there is no option to output the predictions themselves.
I would like to implement the option to output predictions as a Zarr dataset.
(WIP: https://github.com/khintz/neural-lam/tree/feature/output).

This should be possible by converting the predictions tensor from the function test_step to a Zarr dataset.

    def test_step(self, batch, batch_idx):
        """
        Run test on single batch
        """
        prediction, target, pred_std, batch_times = self.common_step(batch)

This could be done by utilising the existing function create_dataarray_from_tensor
https://github.com/mllam/neural-lam/blob/main/neural_lam/weather_dataset.py#L509
and then simply using xarray to output to zarr.

@sadamov
Collaborator

sadamov commented Nov 21, 2024

This is a great idea. Some thoughts that come to mind:

  • Given we have implemented a flexible datastore setup for various input datatypes, it might make sense to give the same flexibility in the output. So basically the mdp datastore returns a zarr, the npy datastore returns a numpy-array, by default. The user can however always override this default and output to any given format.
  • There actually is a way to output predictions right now, hardcoded as pytorch tensors (the `# Save pred and target as .pt files` block). I don't think it's good to have this inside plot_example; rather, it could be replaced by your new functionality (i.e. output as tensor without any metadata, for debugging).
  • We are currently running inference and auto-regressive rollouts within pytorch's on_test_... logic. But actually pytorch would provide a on_predict_... set of functions. This PR suggested changing this and also making sure that predict_step() is properly used. https://github.com/mllam/neural-lam/pull/45/files Currently, neural-lam overwrites predict_step and uses it during training.
  • Here a first attempt at prediction output: https://github.com/mllam/neural-lam/pull/44/files It is based on GRIB specifically, but maybe still somewhat useful.

So to summarize, this is clearly required. I suggest making the output format flexible with sensible defaults, adhering to pytorch's predict_... logic (maybe in a separate PR), and finally getting rid of the two draft PRs listed above.

@joeloskarsson
Collaborator

Yes this would be great! (And is very much needed for our ongoing research work, we just haven't had time to work on it.) I agree that it could be nice to allow for different output formats, but I think we should start with zarr+xarray. That can also easily be converted to other things.

The only saving of actual forecasts right now is indeed the `# Save pred and target as .pt files` block. But note that it only saves one single example, so it was not intended for saving all forecasts in the test set for later evaluation. I'm not sure if there is a need / interest in saving pytorch tensors for that.

I have a script that saves from neural-lam to zarr on the global branch: https://github.com/mllam/neural-lam/blob/prob_model_global/neural_lam/forecast_to_xarr.py It's a bit specific to that setting, but could maybe be useful to start from. In there I introduced arguments for saving and filtering what to save (https://github.com/mllam/neural-lam/blob/ac1bd6a5026104b58f4946c3310e639fa5c8a9a2/train_model.py#L239-259), but other approaches are definitely possible.

@leifdenby leifdenby self-assigned this Nov 26, 2024
@sadamov sadamov mentioned this issue Dec 3, 2024
@sadamov
Collaborator

sadamov commented Dec 5, 2024

Summarizing a discussion with @leifdenby @joeloskarsson @TomasLandelius about this issue. We should prioritize a bit, as @sadamov's previous comment asked for changes that are too broad. For now we need three things to run experiments, which we suggest tackling in this PR. Let us know @khintz if you agree:

  • Generate the same xarray objects that go into the model - after they come out of the model (here Leif has already done work on the visualisation side of things)
  • Perform some stacking along ar_step and batch (e.g. each time-sample now has a forecast dim lead_time)
  • Write the stacked object to disk, sensibly chunked and compressed

Since everyone is already tagged in this message, remind me of anything I forgot.

@elbdmi elbdmi self-assigned this Dec 5, 2024
@joeloskarsson joeloskarsson added this to the v 0.5.0 (proposed) milestone Dec 10, 2024
@elbdmi

elbdmi commented Dec 11, 2024

I plan to implement this using the existing create_dataarray_from_tensor function. Here's the detailed plan:

  1. Prediction Conversion: In the test_step method, predictions will be converted to xarray.DataArray objects using create_dataarray_from_tensor.
  2. Zarr Saving: Each batch will be saved under a unique group in the Zarr dataset (e.g., batch_0_predictions, batch_1_predictions).
  3. Testing: After implementation, I will test the changes to confirm correct functionality and proper organization of the Zarr dataset. I will also implement just a simple test to confirm that it is working correctly.

I have a couple of questions before implementing the changes. Maybe @leifdenby @joeloskarsson or @sadamov can help me with this.

  1. Output Path for Zarr File: Where should the Zarr file be saved? Should it be a fixed location specified in the code, or do we want to make this configurable via a command-line argument or configuration file?
  2. Chunking Strategy: Could you clarify how you’d like the chunking to be implemented? Should the chunks be predefined with specific dimensions (e.g., time, latitude, longitude), or should the chunk sizes adapt based on the dataset?

@joeloskarsson
Collaborator

A thing to keep in mind when working on this is that we want to preserve all the current functionality to run the whole evaluation in the test step, with saving a zarr dataset as an alternative. So in practice you then have two versions of the test loop:

  1. Compute metrics for forecasts and log these
  2. Save forecasts to zarr

I don't think there is a need to do 1 above when you are saving forecasts. I don't know if it's a bad idea to have these dual purposes for the test step? It is a bit unintuitive that the test step is used for saving forecasts, as that is not what testing is. But it might be the best solution. It is also possible to use predict_step, as mentioned by @sadamov above, with something like

for batch in tqdm(dataloader):
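Spelled out, such a manual loop over `predict_step` might look like the sketch below. `DummyModel` and the dataloader are hypothetical stand-ins; a real version would run the trained neural-lam model and convert each prediction to xarray before appending it to a zarr store:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


# Hypothetical stand-in for a trained model exposing predict_step
class DummyModel(torch.nn.Module):
    def predict_step(self, batch, batch_idx):
        (x,) = batch
        return x * 2  # placeholder for an auto-regressive rollout


# Toy dataset: 4 samples of 2 features each, batched in pairs
dataset = TensorDataset(torch.arange(8, dtype=torch.float32).reshape(4, 2))
dataloader = DataLoader(dataset, batch_size=2)

model = DummyModel()
model.eval()
predictions = []
with torch.no_grad():
    for batch_idx, batch in enumerate(dataloader):
        predictions.append(model.predict_step(batch, batch_idx))
        # here each prediction could instead be wrapped in an
        # xr.DataArray and appended to a zarr archive
```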

Output Path for Zarr File: Where should the Zarr file be saved? Should it be a fixed location specified in the code, or do we want to make this configurable via a command-line argument or configuration file?

I think being able to give a path using e.g. --forecast_save_path would be nice.

Chunking Strategy: Could you clarify how you’d like the chunking to be implemented? Should the chunks be predefined with specific dimensions (e.g., time, latitude, longitude), or should the chunk sizes adapt based on the dataset?

Don't have any strong opinions about this, but probably enough to chunk over time. I think we could just fix a reasonable chunking strategy. Another question is if this should save an xarray.DataArray or xarray.Dataset (practically, whether variables should be in different files or not). I think Dataset would make most sense. Another consideration here is if the returned zarr should have 1 spatial dimension (grid-nodes) or 2 (x, y). We can't always guarantee that 2 spatial dimensions can be created (only if the datastore is BaseRegularGridDatastore I think?). Much of this depends on the further use of the zarr output for evaluation. I think @sadamov has some useful input with regard to that.

@sadamov
Collaborator

sadamov commented Dec 12, 2024

Thanks a lot for looking into this @elbdmi! I agree with many things you and @joeloskarsson have discussed above. Below are a few points that came to mind after reading through it:

  1. Zarr Saving: Each batch will be saved under a unique group in the Zarr dataset (e.g., batch_0_predictions, batch_1_predictions).

Instead of creating a zarr group per batch, could we just append to the zarr dataset along dim=time, since the batching is done across time? So the first batch creates the zarr archive (mode="w") and the following batches are appended along time (mode="a").

  2. I agree that an xarray Dataset would fit perfectly for this output (better than DataArray). Below is the layout of a zarr dataset that I currently use for verification. Since this is Danra data it's probably a good standard to use for output as well:
<xarray.Dataset> Size: 751MB
Dimensions:  (time: 100, y: 589, x: 789)
Coordinates:
    lat      (y, x) float64 4MB dask.array<chunksize=(589, 789), meta=np.ndarray>
    lon      (y, x) float64 4MB dask.array<chunksize=(589, 789), meta=np.ndarray>
  * time     (time) datetime64[ns] 800B 2020-09-01 ... 2020-09-13T09:00:00
  * x        (x) float64 6kB -1.999e+06 -1.997e+06 ... -3.175e+04 -2.925e+04
  * y        (y) float64 5kB -6.095e+05 -6.07e+05 ... 8.58e+05 8.605e+05
Data variables:
    u10      (time, y, x) float64 372MB dask.array<chunksize=(20, 589, 789), meta=np.ndarray>
    v10      (time, y, x) float64 372MB dask.array<chunksize=(20, 589, 789), meta=np.ndarray>
Attributes:
    description:  All prognostic variables for 10-year period on reduced levels

As Joel correctly mentions, this is only possible for BaseRegularGridDatastore. So maybe it's easiest for now to always return 1 spatial dimension (grid-nodes) - stacked. We just need to make sure that we have all the required information to unstack the output in the verification-notebook. I think we can again use existing functionality from the datastores here to stack/unstack?

  3. I would chunk across time and, in the future, also across ensemble_member once we allow that. There is an argument to be made whether we should also chunk the vertical level, as levels are treated as independent 2D fields during verification. The optimal chunking will depend on the xy-resolution. For now I would only chunk by chunk({"time": 1}) and make this our defined chunking for the output. Chunking by time is also very natural when you create the zarr archive by appending to the time dim as mentioned in (1.)

  4. With regards to "overloading" the test_step vs. re-writing the logic with a predict_step: I leave that up to you, and it highly depends on how much time you have to make the code "beautiful" vs. "functional". Both options work for me. 👍 I would choose the one that doesn't take weeks to implement.

@elbdmi

elbdmi commented Dec 12, 2024

Thank you, @joeloskarsson and @sadamov, for the detailed feedback! Based on your input, here’s how I’m planning to proceed:

Zarr Saving:

  • Append to Time Dimension: Instead of creating separate groups for each batch, I’ll append the data along the time dimension. This means the first batch will create the Zarr archive (mode="w"), and subsequent batches will append to it (mode="a").
  • Output as xarray.Dataset: I will write the predictions and related data as an xarray.Dataset, which aligns with the DANRA standard mentioned. This format provides better organization and compatibility for multi-variable outputs.

As for chunking, I will define a chunking strategy where the data is chunked along the time dimension, using chunk({"time": 1}) for now. This should make appending straightforward and ensure manageable chunk sizes. In the future, chunking may extend to additional dimensions, such as ensemble_member or vertical levels, but for now, I’ll focus on time only.

As for test step vs. predict step, while overloading the test_step is a bit unconventional, it allows us to preserve current functionality while adding Zarr saving as an alternative. I will ensure the two modes are clearly separated in the logic to avoid confusion. Alternatively, if time permits, I could implement the logic in a predict_step to align better with its intended purpose. This approach might make the codebase cleaner, depending on the broader requirements.

I will also add a command-line argument (e.g., --forecast_save_path) to allow users to specify the output location for the Zarr archive.
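Wiring up such a flag could look like the sketch below; the argument name follows the suggestion above, but the help text and default are assumptions, not the final neural-lam interface:

```python
import argparse

# Hypothetical CLI wiring for the proposed flag
parser = argparse.ArgumentParser()
parser.add_argument(
    "--forecast_save_path",
    type=str,
    default=None,
    help="If set, save test-set forecasts to this zarr archive "
    "instead of computing evaluation metrics",
)

# Example invocation (normally parse_args() would read sys.argv)
args = parser.parse_args(["--forecast_save_path", "forecasts.zarr"])
```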

@leifdenby
Member

As Joel correctly mentions, this is only possible for BaseRegularGridDatastore. So maybe it's easiest for now to always return 1 spatial dimension (grid-nodes) - stacked.

In anticipation that we might want to return to the grid layout (unstacked), variables etc. that the data originally came from, I have started work on functionality in mllam-data-prep to completely invert the transformation operations that it does (my WIP work is on the following branch: https://github.com/leifdenby/mllam-data-prep/tree/feat/inverse-ops). To me it would make sense to have that functionality there (since we can then also test there that this functionality actually works as expected). So, I support writing stacked (transformed) xr.DataArrays as they are created from WeatherDataset.create_dataarray_from_tensor() for now, and then each datastore could (eventually) implement functionality to invert the transformed data back to the structure it came from. We could then have a second argument (something like --invert-predictions-data-structure, though that is too long...). Does that make sense?

@joeloskarsson
Collaborator

Hmm, @leifdenby mentions some good points above and I think I was thinking a bit too quickly about this.

The input data (in the case of mdp, as Xarray+zarr) really exists in two possible versions:

  1. The original zarr-file(s), typically as xr.Dataset, before being processed by mdp.
  2. The training-optimized zarrs output from mdp, as xr.DataArray.

For saving predictions then: if we were to immediately save an xr.Dataset as I hand-wavingly described above, it might not perfectly "match" either of these formats. This means that when you want to do the evaluation there will always be extra re-shaping/formatting steps to make your predictions format match the dataset.

So it seems that the best output format would be something that conforms to one of the two formats above (as closely as possible; e.g. the original data can come from multiple zarrs, but I don't think we should invert back to saving predictions in multiple zarrs). We could have an option that specifies which of the two formats we want to output (i.e. --invert-predictions-data-structure), but I would imagine that you practically always want format 1 above, so it might be enough to allow only for that?

I agree that it makes the most sense to implement the "inversion" of data preparation also in mdp. What I would really like to avoid is that you have to first save predictions to disk in format 2 above, and then run another script to invert that back to format 1. It should very much be possible to write to format 1 directly using append writes to an xr.Dataset. As saved predictions can get really huge, having the same predictions saved twice (even temporarily) can take up prohibitively much disk space.

@leifdenby
Member

leifdenby commented Dec 13, 2024

I agree that it makes the most sense to implement the "inversion" of data preparation also in mdp. What I would really like to avoid is that you have to first save predictions to disk in format 2 above, and then run another script to invert that back to format 1. It should very much be possible to write to format 1 directly using append writes to an xr.Dataset. As saved predictions can get really huge, having the same predictions saved twice (even temporarily) can take up prohibitively much disk space.

Yes, I agree. Going directly to format 1 would be ideal. Since we name the inputs in mdp, I was thinking we could maybe use these names to come up with the filenames of the zarr datasets to write to. I say datasets because a training-optimised dataset might have been made from multiple source zarr datasets. We could also try to write all the inputs to a single zarr dataset, but there is the risk of collisions between variables (for example DANRA has temperature on both pressure and height levels, and they have the same variable name t). So that is why I thought starting out by splitting into separate datasets might be the easiest. Another thing the path for the zarr output should maybe include is a reference to the model used. So we could have something like {model_identifier}.{datastore_name}.{mdp_input_name}.zarr, e.g. fiery-frog-42.danra.height_levels.zarr.

For what @elbdmi is doing though I think it might be good to aim for implementing format 2 (the training optimised zarrs) for now since just getting there will be quite a bit of work. Then in the meantime I (or someone else) can complete the work on implementing the inversion in mdp and then we can add a second PR for that. Does that sound ok? I would like to go directly to format 1, but that might be quite a lot to do all at once (and would depend on upstream work in mllam-data-prep).

When it comes to appending to the dataset: xarray has native support for writing only part of the full dataset using the region argument to xr.Dataset.to_zarr(..., region=...) (https://docs.xarray.dev/en/stable/generated/xarray.Dataset.to_zarr.html). Using that, one could write different batches into their own subset of a single zarr dataset that covers the whole dataset being used. I was going to suggest this once you have writing each batch to its own zarr dataset working @elbdmi, but you might want to go for it directly if the example in the xarray documentation is ok for you.

Assuming that we will only write the state category of data, that would only be a single xr.DataArray, but we also want to save the start time to disk I think, so the contents would be something like a state variable with coordinates [sample_number, ar_step, grid_index, state_feature] and maybe also an analysis_time variable with coordinate [sample_number]. We could alternatively drop sample_number and have just a state variable with coordinates [analysis_time, ar_step, grid_index, state_feature]. Finally, of course, we could drop ar_step and simply have elapsed_forecast_duration, but I think that including ar_step and sample_number might be good to make things as explicit as possible.

Regarding chunking I would chunk along the sample_number / analysis_time dimension (the name depending on what we choose above)

@joeloskarsson
Collaborator

Sounds reasonable, just keep in mind for both the work here and in mdp that we will want this functionality to write a batch of predictions at a time by appending to an xr.Dataset zarr.

Finally, of course we could drop ar_step and simply have elapsed_forecast_duration, but I think that maybe including ar_step and sample_number might be good to make things as explicit as possible.

I don't understand the point of having ar_step and sample_number present in the saved output. They are just the indices of elapsed_forecast_duration and analysis_time respectively, so if one wants to use them that's already possible with .isel(...).

@leifdenby
Member

I don't understand the point of having ar_step and sample_number present in the saved output. They are just the indices of elapsed_forecast_duration and analysis_time respectively, so if one wants to use them that's already possible with .isel(...).

Yes, it probably makes most sense to use [analysis_time, elapsed_forecast_duration] as dimensions

@leifdenby leifdenby modified the milestones: v0.5.0 (proposed), v0.5.0 Jan 13, 2025