Script to update e-picsa from R-Instat #9289

Open
rdstern opened this issue Dec 2, 2024 · 10 comments

@rdstern
Collaborator

rdstern commented Dec 2, 2024

@lilyclements I think this could be a very simple update to what I understand is your current R script for updating the Google bucket from Climsoft.

The reason I hope it is a simple "tweak" to your current script is that I suggest it should be a two-step process, as follows:

a) Step 1 is to import all the data for the update into R-Instat. This could be for the whole of Zambia, or it could be province by province, if there are so many stations that doing it all in one step is a bit scary for the ZMD staff. Or it could even be for just a few (new) stations.
b) Then they run your updating script.

I assume it is working already - from Climsoft - so all you need to do is adapt it to run from data in R-Instat?

Neat, eh? If so, then this builds on all the changes you have been making in R-Instat recently, and justifies that they were made, at least partly, for e-picsa: namely, the ability to import packages from GitHub easily, including the specific R package for e-picsa. Could you then add the new code into that package?

This would then all fit very sweetly into our advanced workshop, which starts 9 December. We are also teaching about using scripts for the out-filling process. So it is perfect that they need to be able to run a script for e-picsa updates!

And it fits perfectly into my general plan for R-Instat, namely that we have demolished the steep learning curve for a large group of potential users who find the usual "learn R first" approach daunting. It shows that, for them, using scripts is pretty easy, and much simpler than starting with R by writing them. This facility would be an excellent example!

And we can test it all next week! It is one important step towards making e-picsa a smooth process by the end of this contract.

@rdstern rdstern added this to the 0.8.1 milestone Dec 2, 2024
@jkmusyoka
Contributor

I support. This is a sensible way of making it easy for ZMD. @lilyclements let me know when you have done the "tweak" and I will do the testing.

@lilyclements
Contributor

@rdstern @jkmusyoka happy to assist. However, I am unsure where you'd like me to make this tweak. You refer to a "current script" - what or where is this current script?

You say:
"Step 1 is to import all the data for the update into R-Instat."
My understanding was that currently we have step 1 as importing from Climsoft into R-Instat.
What data are you referring to here, and where do you want us to import it from into R-Instat? (Perhaps, on reflection, you mean to read into R-Instat the data that's currently in the Google buckets.)

@rdstern
Collaborator Author

rdstern commented Dec 2, 2024

@lilyclements I thought you had an R script from before, to update the data in the Google buckets when another year has passed, and you just want the new year included, with the stations and events all the same as last year?

That's the one - if I'm not imagining things - that we suggested could be done from within R-Instat instead?

@lilyclements
Contributor

lilyclements commented Dec 6, 2024

@rdstern @jkmusyoka I have written a new function which currently updates only the annual rainfall summaries and the monthly/annual temperature summaries. I can get to the other bits now, but thought I should share this for now.

There are three parts:

  1. Importing from Climsoft
  2. Updating the Summaries from Definitions
  3. Exporting to Google buckets.

Only step 2 involves new code.

1. Importing from Climsoft

# Get Climsoft data using Climatic > Import from Climsoft
# I tested with Lundazi data, but you can test with any. I can share that script privately because it contains information on importing from Climsoft.

data_book$database_connect(dbname="...", host="...", port=..., user="...")

# Dialog: Import From Climsoft
data_book$import_climsoft_data(table="observationfinal", station_filter_column="stationId", stations="LUNDAZ01", element_filter_column="elementName", elements=c("Precip  daily","Temp  daily min","Temp  daily max"))

# You then need to rearrange the data - pivot_wider and create relevant columns, like DOY and Year. 
# Is this something we want to be automated?
# Dialog: Unstack (Pivot Wider)
observations_data <- data_book$get_data_frame(data_name="observations_data")
observations_data_unstacked <- tidyr::pivot_wider(data=observations_data, names_from=element_abbrv, values_from=value)
data_book$import_data(data_tables=list(observations_data_unstacked=observations_data_unstacked))
rm(list=c("observations_data_unstacked", "observations_data"))

# Dialog: Use Date
data_book$split_date(data_name="observations_data_unstacked", col_name="date", year_val=TRUE, month_val=TRUE, day_in_year_366=TRUE, s_start_month=1)

2. Updating the Summaries from Definitions

This is the new bit! You need to update your epicsawrap package for this to work, since it contains the new functions.

# read in token to access bucket
gcs_auth_file(file = "tests/testthat/testdata/epicsa_token.json")

# run the updating functions on the daily data
annual_summaries_data <- data_book$get_data_frame("observations_data_unstacked")
annual_summaries_data <- update_rainfall_summaries_from_definition(country = "zm_workshops", station_id = "Lundazi Met", daily_data = annual_summaries_data)
data_book$import_data(data_tables=list(annual_summaries_data = annual_summaries_data))

# and for our temperature summaries; the unstacked data frame was removed from
# the global environment above (rm), so fetch it from the data book first
observations_data_unstacked <- data_book$get_data_frame("observations_data_unstacked")
monthly_temperature_summaries <- update_monthly_temperature_summaries_from_definition(country = "zm_workshops", station_id = "Lundazi Met", daily_data = observations_data_unstacked)
annual_temperature_summaries <- update_annual_temperature_summaries_from_definition(country = "zm_workshops", station_id = "Lundazi Met", daily_data = observations_data_unstacked)
data_book$import_data(data_tables=list(monthly_temperature_summaries = monthly_temperature_summaries))
data_book$import_data(data_tables=list(annual_temperature_summaries = annual_temperature_summaries))

3. Exporting to Google Buckets
This is just using our "Export to Google Buckets" dialog from before, but now our data_by_year is the annual_summaries_data.

annual_rain <- epicsawrap::reformat_annual_summaries(data=annual_summaries_data,
                                                     station_col="station_id"
                                                     year_col="year",
                                                     start_rains_doy_col="start_rains_doy",
                                                     start_rains_date_col="start_rains_date",
                                                     end_season_doy_col = "end_season_doy",
                                                     end_season_date_col = "end_season_date",
                                                     season_length_col = "season_length",
                                                     n_rain_col = "n_rain")

# similarly make changes for reformat_temperature_summaries

epicsawrap::export_r_instat_to_bucket(data_by_year = "annual_summaries_data",
                                      rain = "PRECIP",
                                      station = "station_id",
                                      year="year",
                                      month="month_val",
                                      summaries=c("annual_rainfall"),
                                      station_id = "station_id",
                                      definitions_id="999",
                                      country="zm_workshop",
                                      include_summary_data=TRUE,
                                      annual_rainfall_data = annual_rain,
                                      start_rains_column = "start_rains_doy",
                                      end_season_column = "end_season_doy",
                                      seasonal_length_column = "seasonal_length")

# amend epicsawrap::export_r_instat_to_bucket to include our changes to the temperature summaries too 

TODO

  1. I know one thing to do is to add the ability for step 2 to be repeated for multiple stations. This means reading in multiple definitions files and creating multiple summaries (a rough sketch of how that loop might look is below).
  2. I need this to work for other areas - such as crop probability summaries. This should be straightforward to do, but I am not sure what the preference is: would we rather iron out the kinks in this system first, or should I set it up now?
  3. Data manipulations to the Climsoft data (unstacking and creating new columns) - do we want this to be something you do in R-Instat, or more automated in the updating functions? The former would mean fewer errors. I don't know how similar the different Climsoft datasets are to each other (i.e., is it always called PRECIP, TMPMAX, etc.; do they not usually read in the DOY?).
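
To make TODO 1 concrete, here is a rough (untested) sketch of how that multi-station loop might look. It assumes the unstacked daily data holds several stations in a station_id column, and simply reuses update_rainfall_summaries_from_definition per station; the column name and the rbind of the results are assumptions.

daily_data <- data_book$get_data_frame("observations_data_unstacked")
station_ids <- unique(daily_data$station_id)  # assumed column name

all_summaries <- lapply(station_ids, function(id) {
  # each call fetches that station's definitions and summarises its daily data
  station_daily <- daily_data[daily_data$station_id == id, ]
  update_rainfall_summaries_from_definition(country = "zm_workshops",
                                            station_id = id,
                                            daily_data = station_daily)
})
annual_summaries_data <- do.call(rbind, all_summaries)
data_book$import_data(data_tables = list(annual_summaries_data = annual_summaries_data))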

@rdstern
Collaborator Author

rdstern commented Dec 7, 2024

@lilyclements this seems great. For now I am also OK that it works just one station at a time. If the updates are to be done by the individual provinces, then there are relatively few stations per update.
But I wonder if it currently works, or could work, for a new station for which we want the same definitions? Ideally also for a subset of the definitions? The main example would be extending e-picsa in a province to a rainfall-only station. Then we would not be able to include the temperature definitions, but the rainfall ones would be exactly the same as for the main stations in the province?

@lilyclements
Contributor

Great! Good suggestion! I've implemented that now (see point 2 below).

  1. You should now be able to update using the update_rainfall_summaries_from_definition, update_annual_temperature_summaries_from_definition, update_monthly_temperature_summaries_from_definition, and update_season_start_probabilities_from_definition functions.
  2. You can now add in a new station, as long as you give a valid definitions file. To do this, pass definition_id = "VALUE OF ID" rather than station_id = "VALUE OF ID". (If you pass station_id = "VALUE OF ID" it will look up the definitions ID value for you.)

E.g.,

rainfall_summaries <- update_rainfall_summaries_from_definition(country = "zm_workshops", definition_id = "002", daily_data = observations_data_unstacked)
data_book$import_data(data_tables=list(rainfall_summaries = rainfall_summaries))
  3. Now a bigger one! At the moment, we run the updates with a version of the R code I wrote in rpicsa. However, I'm not 100% confident in what I have written. I much prefer the code written in the calculation system, for many reasons, but partly because it has been tested solidly and so is much more stable.
    It would be a big task, and might not even work, but perhaps I could update our rpicsa functions to use the calculation system. It would also be a way to test out our instatCalculations and databook R packages, and it is presumably not something urgently needed now. @rdstern, what are your thoughts on this?
    The rpicsa functions are fine otherwise. But they are not perfect: for example, the changes we've been making to the status variable in the start and end of rains dialogs are not in the rpicsa code yet.

@rdstern
Collaborator Author

rdstern commented Dec 8, 2024

@lilyclements I think using the calculation system may be what we are also suggesting? Currently, when we use the start of the rains (which uses the calculation system), it also generates a (sort of mysterious) definition, and this is exported to Google buckets. Could the definition be less mysterious, and simply become a "definition object" or "e-picsa object", which is added to the existing objects attached to the data sheet? So, like a graph object, which can be all sorts of graphs, or a filter object, etc.

Then we have the Prepare > R-objects menu to View, Rename, Reorder and Delete them.

Now I assume this would make the updating much more flexible and simpler to follow. The updating procedure simply (maybe) has a dialog to get (import?) the definition objects from Google buckets, rather than specifying them in a dialog. So it could be in the File menu as Import from Google Buckets, a partner to the existing Export to Google Buckets dialog. I'm also liking the idea because we could also import the summary data. Then a new person could also check, in R-Instat, the graphs etc. that will be used by the app?

Now the import from Google buckets needs either a daily data file, or it imports the summary data corresponding to those definitions. If given a daily data file, it runs the definitions, attaches the objects, and then has summary data ready for the export. Of course these could be different stations, or updates of existing stations.

If importing definitions together with summary data from Google buckets, the design is that you can prepare e-picsa-type graphs in R-Instat, to confirm they are sensible and that you can support them.
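
As a speculative sketch of what such an import might involve (no import function exists yet; the object path and bucket name below are invented for illustration), definitions stored as JSON in a bucket could be fetched with googleCloudStorageR:

library(googleCloudStorageR)
library(jsonlite)

gcs_auth("tests/testthat/testdata/epicsa_token.json")
# assumed layout: <country>/definitions/<definitions_id>.json
gcs_get_object("zm_workshops/definitions/002.json",
               bucket = "epicsa_data",        # placeholder bucket name
               saveToDisk = "002.json", overwrite = TRUE)
definitions <- fromJSON("002.json")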

@jkmusyoka may wish to add?

@lilyclements I'm quite liking this scheme. We are teaching the export to Google buckets on Wednesday. I would like us to progress our thinking and discussion by then, but would not expect any coding. And there isn't any rush after that. I'm hoping you agree that adding an Import from Google Buckets dialog is a sensible addition; then we could explain that it is coming. I also like the definition objects, but that is too detailed to be discussed in the workshop.

@rdstern
Collaborator Author

rdstern commented Dec 9, 2024

@lilyclements looking again at your message above (your point 3), I think my request to save a definition is then maybe the same as saving a calculation! We could check with @volloholic or @dannyparsons, because we don't yet save calculations, but I think they always had in mind that we should.

That would be great, because your definitions would then not be a special climatic e-picsa feature, but simply a special case of saving a calculation. And that fits perfectly with the whole idea of the data sheets and data book.

@lilyclements
Contributor

lilyclements commented Dec 9, 2024

@rdstern I think I'm a few steps behind you. I'd like to catch up, as what you're saying is very exciting! I think I'm on the right page after writing, rewriting, and re-rewriting this message. But here is my summary, and some questions:

The "Import from Google Buckets" dialog:

We can have two buttons at the top, something like: "Import/Update Definitions" and "Import Summaries"

Import/Update Definitions
You need to give a daily data file to import the definitions for.
You also need to give either your station ID or a definitions ID: if you give your station ID, you get the corresponding definitions ID for your station (you are updating an existing station); if you give a definitions ID, you are adding a new station (a tiny sketch of this id logic is below).
The definitions file is then combined with the daily data you gave, and your data is updated.
This is what the R code I have written in the last few days does.
However, this R code currently runs our definitions (like SoR) using the tidyverse, not the calculation system. Do we want it to use our calculation system instead?
(I think we do. This shouldn't be a big job, but it is new territory at first in using our databook and instatCalculations R packages. To do this, I suggest the R code I have written in the rpicsa package should be built from the calculation system instead of the tidyverse.)
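
A tiny illustrative sketch of that id logic (all names hypothetical; "metadata" stands for whatever table links stations to their definitions):

resolve_definitions_id <- function(metadata, station_id = NULL, definition_id = NULL) {
  # new station: the definitions id is given directly
  if (!is.null(definition_id)) return(definition_id)
  # existing station: look up the definitions id registered for it
  metadata$definitions_id[metadata$station_id == station_id]
}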

Import Summaries
You don't need to give a daily data file; instead, you just give a definitions ID and station ID to get the summary data corresponding to that definition/station combination.
That then reads into R-Instat for you to view again. Nice.

(Out of interest, if we had a function which used the calculation system, is this something we would want to use to replace our current code in the SoR/EoR dialogs?)

Definition Object bits:

  • You're suggesting our definition becomes a "definition object", which, I agree, is the same as saving a calculation! Nice!
  • Currently our "definition object" contains a lot of definitions (e.g., SoR, EoR, EoS, LoS, Crops, etc.). I assume we would see these as separate definitions - that each "press of OK" = "1 definition", as it were?
  • So then we export all of our definitions as a "bundle" to Google buckets (a toy example of what a bundle might look like is sketched below).
  • Then you're suggesting we have a dialog to "Import from Google Buckets" - in this you can import your "bundle" of definitions (or summaries).
  • What if you want to import your definitions from elsewhere? Let's say the user saves their definitions from a previous session as a file?
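
To make the "bundle" idea concrete, here is a purely illustrative shape for one definition and a bundle. The field names and values are guesses, not an agreed format:

start_rains_definition <- list(
  type = "start_rains",
  threshold = 0.85,      # mm counted as a rainy day (illustrative value)
  total_rainfall = 25,   # mm required over the rolling window (illustrative)
  over_days = 3,
  start_day = 1,
  end_day = 366
)
# a "bundle" is then just a named list of such definitions:
definitions_bundle <- list(start_rains = start_rains_definition)
cat(jsonlite::toJSON(definitions_bundle, auto_unbox = TRUE, pretty = TRUE))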

@rdstern
Collaborator Author

rdstern commented Dec 10, 2024

@lilyclements I think you are only saying you are behind - actually you have moved ahead. We have continued discussions here, and @jkmusyoka should also reply to your message above. Also, there is no rush, so we have time to reflect. On the timescale, I'm hoping we might have a workshop for the staff from the provinces in perhaps April - not before. If we do, then this would be an excellent workshop for you to be part of.

If these become calculation (or definition) objects, then I was assuming there would be multiple objects, with each definition (e.g. start of rains with no dry spell) being an object. That's like each filter, or each graph.

I'm assuming we shouldn't do much more before involving at least David? With the calculation system being such a selling point of the R-Instat databook etc., how come he and Danny didn't include calculation objects from the start? We are nearly 10 years in, and we have lots of other objects in the data book, but not yet calculations.

And should we link this to the need for the General Summaries dialog to be "completed", i.e. to be able to edit the parts of a calculation? And should we call that General Calculations, as it is more general than a summary?
