Output predictions as zarr dataset #89
This is a great idea. Some thoughts that come to mind:
So to summarize, this is clearly required. I suggest making the output format flexible, with sensible defaults. Also to adhere to pytorch's
Yes, this would be great! (And it is very much needed for our ongoing research work; we just haven't had time to put time into it.) I agree that it could be nice to allow for different output formats, but I think we should start with zarr+xarray. That can also easily be converted to other things. The only saving of actual forecasts right now is indeed in `neural_lam/models/ar_model.py` (line 491 at c3c1722).
I have a script that saves from neural-lam to zarr on the global branch: https://github.com/mllam/neural-lam/blob/prob_model_global/neural_lam/forecast_to_xarr.py It is a bit specific to that setting, but could maybe be a useful starting point. In there I introduced arguments for saving and for filtering what to save (https://github.com/mllam/neural-lam/blob/ac1bd6a5026104b58f4946c3310e639fa5c8a9a2/train_model.py#L239-259), but other approaches are definitely possible.
Summarizing a discussion with @leifdenby @joeloskarsson @TomasLandelius about this PR. We should prioritize a bit, and @sadamov's previous comment was asking for changes that are too broad. For now we need three things to run experiments, which we suggest tackling in this PR. Let us know @khintz if you agree:
Since everyone is already tagged in this message, remind me of anything I forgot.
I plan to implement this using the existing `create_dataarray_from_tensor` function. Here's the detailed plan:
I have a couple of questions before implementing the changes. Maybe @leifdenby @joeloskarsson or @sadamov can help me with this.
A thing to keep in mind when working on this is that we want to preserve all the current functionality for running the whole evaluation in the test step, with saving a zarr dataset as an alternative. So in practice you then have two versions of the test loop:
I don't think there is a need to do 1 above when you are saving forecasts. I don't know if it's a bad idea to have these dual purposes for the test step? It is a bit unintuitive that the test step is for saving forecasts, as that is not what testing is. But it might be the best solution. It is also possible to use `neural_lam/forecast_to_xarr.py` (line 341 at ac1bd6a).
I think being able to give a path using e.g.
Don't have any strong opinions about this, but it is probably enough to chunk over time. I think we could just fix a reasonable chunking strategy. Another question is whether this should save an
Thanks a lot for looking into this @elbdmi! I agree with many things you and @joeloskarsson have discussed above. Below are a few points that came to mind after reading through it:
Could we, instead of creating a zarr group for each batch, just append to the zarr dataset along the
As Joel correctly mentions, this is only possible for
Thank you, @joeloskarsson and @sadamov, for the detailed feedback! Based on your input, here's how I'm planning to proceed:

Zarr saving:
As for chunking, I will define a chunking strategy where the data is chunked along the

As for the test step vs. the predict step: while overloading the

I will also add a command-line argument (e.g.,
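A possible shape for such a command-line argument, sketched with argparse (the flag names below are hypothetical, since the actual argument name is elided above and not settled in this thread):

```python
import argparse

# Flag names here are hypothetical illustrations, not the actual
# neural-lam CLI; they only show the intended toggle behaviour
parser = argparse.ArgumentParser()
parser.add_argument(
    "--save-predictions",
    action="store_true",
    help="write forecasts to a zarr store instead of running full evaluation",
)
parser.add_argument(
    "--prediction-output-path",
    type=str,
    default="predictions.zarr",
    help="target zarr store for saved forecasts",
)

args = parser.parse_args(["--save-predictions"])
```

With a boolean toggle like this, the existing evaluation behaviour stays the default and the zarr-saving path is strictly opt-in, which matches the goal of preserving current test-step functionality.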
In anticipation that we might want to return to the grid layout (unstacked), variables, etc. that the data originally came from, I have started work on functionality in
Hmm, @leifdenby mentions some good points above and I think I was thinking a bit too quickly about this. The input data (in the case of mdp, as xarray+zarr) really exists in two possible versions:
For saving predictions then, if we would immediately save an

So it seems that the best output format would be something that conforms to either of the two formats above (as closely as possible; e.g. the original data can come from multiple zarrs, but I don't think we should invert back to saving predictions in multiple zarrs). We could have an option that specifies which of the two formats we want to output as (i.e.

I agree that it makes the most sense to implement the "inversion" of data preparation also in mdp. What I would really like to avoid is having to first save predictions to disk in format 2 above, and then run another script to invert that back to format 1. It should very much be possible to write to format 1 directly using append writes to an
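For the stacked-grid part of that inversion, a minimal illustration of going from the flattened training layout back towards the original 2D grid, using xarray's `set_index`/`unstack` (all names below are illustrative, not mdp's actual API):

```python
import numpy as np
import xarray as xr

# Assumed flattened layout: a single grid_index dimension carrying the
# original (x, y) positions as non-index coordinates (names illustrative)
da = xr.DataArray(
    np.arange(12, dtype=np.float32),
    dims=("grid_index",),
    coords={
        "x": ("grid_index", np.repeat(np.arange(3), 4)),
        "y": ("grid_index", np.tile(np.arange(4), 3)),
    },
)

# Invert the stacking: promote (x, y) to a multi-index, then unstack
# back to the original two spatial dimensions
unstacked = da.set_index(grid_index=("x", "y")).unstack("grid_index")
```

This only covers the spatial unstacking; a full inversion in mdp would presumably also have to undo variable concatenation and any normalisation applied during data preparation.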
Yes, I agree. Going directly to format 1 would be ideal. Since we name the inputs in

For what @elbdmi is doing, though, I think it might be good to aim for implementing format 2 (the training-optimised zarrs) for now, since just getting there will be quite a bit of work. Then in the meantime I (or someone else) can complete the work on implementing the inversion in

When it comes to appending to the dataset: xarray has native support for only writing part of the full dataset using the

Regarding chunking, I would chunk along the
Sounds reasonable. Just keep in mind, both for the work here and in mdp, that we will want this functionality to write a batch of predictions at a time by appending to a
I don't understand the point of having
Yes, it probably makes the most sense to use
When running in evaluation mode, figures are being created. Currently, there is no option to output the predictions.
I would like to implement the option to output predictions as a Zarr dataset.
(WIP: https://github.com/khintz/neural-lam/tree/feature/output).
This should be possible by converting the `predictions` tensor from the `test_step` function into a zarr dataset. This could be done by utilising the existing function `create_dataarray_from_tensor` (https://github.com/mllam/neural-lam/blob/main/neural_lam/weather_dataset.py#L509) and then simply using xarray to output to zarr.