Improving efficiency in Crop Definitions #9372

Merged
merged 13 commits into from
Jan 23, 2025


lilyclements
Contributor

@lilyclements lilyclements commented Jan 17, 2025

Linked to PR #9052
After discussions with EPICSA on this, we wanted to be able to have this code (a) run quicker and (b) have the option to only produce probability (risk) tables. This is because we might have an instance where the user looks for a large number of combinations and at the moment this does not work in R-Instat. This was reported first by @Vitalis95 here

This code improves the efficiency:

  1. Added a loop so that the amount of rainfall in each year is calculated only once for each "plant length by plant date" combination. Previously it was recalculated for every plant length x plant day x total rainfall combination. We can check against total rainfall later, so we do not need to recalculate it for each total rainfall value.
  2. The R code now returns its results as lists and combines the lists together at the end. This is much more efficient than the previous system.
  3. New option (for us to put in the Crops Definitions dialog) to return the crop definitions or not. The parameter for this is return_crop_definitions, and it defaults to TRUE since that is how R-Instat works at the moment.
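
The first two points can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the function name `crop_success_sketch`, the toy data and its column names (`year`, `doy`, `rain`) are all assumptions made for the example. The yearly rainfall total is computed once per planting day and length pair, the same totals are then compared against every rainfall threshold, and the per-pair results are collected in a list that is bound together once at the end.

```r
# Hypothetical sketch: names and data layout are assumptions, not the PR's code.
crop_success_sketch <- function(daily, plant_days, plant_lengths, rain_totals) {
  pairs <- expand.grid(plant_day = plant_days, plant_length = plant_lengths)
  years <- sort(unique(daily$year))
  results <- vector("list", nrow(pairs))
  for (i in seq_len(nrow(pairs))) {
    day <- pairs$plant_day[i]
    len <- pairs$plant_length[i]
    in_window <- daily$doy >= day & daily$doy < day + len
    # One rainfall total per year, computed once for this (day, length) pair
    year_f <- factor(daily$year[in_window], levels = years)
    totals <- tapply(daily$rain[in_window], year_f, sum)
    # Reuse the same totals for every rainfall threshold
    grid <- expand.grid(year = years, rain_total = rain_totals)
    grid$plant_day <- day
    grid$plant_length <- len
    grid$actual_rain <- as.numeric(totals[as.character(grid$year)])
    grid$rain_cond <- grid$actual_rain >= grid$rain_total
    results[[i]] <- grid
  }
  # Combine the list of data frames in a single step
  do.call(rbind, results)
}

# Toy data: 10 days in each of two years, 1 mm per day in 2000 and 2 mm in 2001
daily <- expand.grid(doy = 1:10, year = 2000:2001)
daily$rain <- ifelse(daily$year == 2000, 1, 2)
out <- crop_success_sketch(daily, plant_days = c(1, 3),
                           plant_lengths = c(3, 5), rain_totals = c(2, 4, 8))
```

Binding once with `do.call(rbind, ...)` avoids growing a data frame inside the loop, which is usually the slow part of this kind of code.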

Follow up to this is to introduce a checkbox for "Return Crop Definitions" in the dialog in R-Instat. @rdstern does this sound like a good suggestion to you? I am hoping this will help us have the option to just return those probability tables, and then exporting to EPICSA will be much quicker.

Run time for 3213 combinations was 5 minutes 37 seconds.
Run time with this code is now 59 seconds.

I'm currently trying 39442 combinations. Previously this would time out, so I am not sure how long it used to take, but it was at least an hour. I will update with the new time when I have it.

@lilyclements lilyclements marked this pull request as ready for review January 17, 2025 18:03
@lilyclements
Contributor Author

lilyclements commented Jan 17, 2025

In the PR on this (#9052) we have the following suggestions:

plant_days = seq(from = 300, to = 370, by = 5)
plant_lengths = seq(from = 50, to = 220, by = 5)
rain_totals = seq(from = 200, to = 750, by = 25)

This set of combinations took 7 minutes 58 seconds to run with the new R code (it timed out previously):

[1] "2025-01-17 18:13:58 GMT"
[1] "2025-01-17 18:21:56 GMT"
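
For a sense of scale, the grid suggested above can be counted directly. This is only a sketch: it reuses the three `seq()` calls from the comment, and the variable names `n_pairs` and `n_combinations` are introduced here for illustration.

```r
plant_days <- seq(from = 300, to = 370, by = 5)
plant_lengths <- seq(from = 50, to = 220, by = 5)
rain_totals <- seq(from = 200, to = 750, by = 25)

# Rainfall totals are only needed once per (plant day, plant length) pair
n_pairs <- length(plant_days) * length(plant_lengths)
# Every pair is then checked against every rainfall threshold
n_combinations <- n_pairs * length(rain_totals)

print(c(pairs = n_pairs, combinations = n_combinations))
```

This gives 525 pairs against 12075 full combinations, so calculating the rainfall per pair rather than per combination cuts that step by a factor of 23 for this grid.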

I've tried to run it three different times on the old code. On this third time it's been running for over an hour so I'm quite satisfied this has helped somewhat.
I'm sure there are more ways we can optimise further in the future.

@rdstern
Collaborator

rdstern commented Jan 20, 2025

@lilyclements here is what I tried, on your new version - with the Zambia data - I made it 312 years. I got this error.

image

I have not yet checked the old version with these data - I'll do that now.

@lilyclements
Contributor Author

lilyclements commented Jan 20, 2025

@rdstern thanks for this! The issue is that I had hardcoded "year" as the name of the year variable! Oops. Your year variable is called "s_year", so the code can't find "year". I've now changed this, so hopefully that error should be fixed :)
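
As a rough illustration of this kind of fix (not the actual change in this PR), the column names can be passed in as arguments and looked up with `[[`, instead of assuming a column called `year`. The helper `yearly_rain_sketch` and the toy data frame are hypothetical.

```r
# Hypothetical helper: the function and column names are assumptions.
yearly_rain_sketch <- function(data, year, rain) {
  missing_cols <- setdiff(c(year, rain), names(data))
  if (length(missing_cols) > 0) {
    stop("Column(s) not found: ", paste(missing_cols, collapse = ", "))
  }
  # Look the columns up by the names passed in, rather than assuming data$year
  totals <- tapply(data[[rain]], data[[year]], sum)
  out <- data.frame(names(totals), as.numeric(totals))
  names(out) <- c(year, "total_rain")
  out
}

# The year column is called "s_year" here, as in the data described above
zambia_like <- data.frame(s_year = c(1990, 1990, 1991), rain = c(5, 10, 2))
res <- yearly_rain_sketch(zambia_like, year = "s_year", rain = "rain")
print(res)
```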

@rdstern
Collaborator

rdstern commented Jan 20, 2025

@lilyclements ok, very well done, it now works. I tried - for comparison - with the same values as for 0.8.1 above. Now it works. Great. But:
a) There is a lot more in the output window that I assume is for testing, and you will soon not display:

Uploading image.png…

b) And just a teensie inconvenience in the summary data frame! First I really like that you have the with and without in the same summary sheet. That's a great improvement.
I'm less keen that the summary has "lost" the stations. I wondered why the old version had 225 rows and you just had 45! That's because there are 5 stations and 225 = 45 * 5. I think Graham and Peter just might like to keep the stations in the summaries!!!

@lilyclements
Contributor Author

@rdstern thanks for this - the problems are now hopefully fixed and so it is ready for you to look at again

@rdstern
Collaborator

rdstern commented Jan 21, 2025

@lilyclements it still works, but I have a few buts:
a) I thought you were going to get rid of that second checkbox in the dialog. Of course, happy if you get one of the VB team to do that.
b) It now gives the 225 results as before. But I prefer the order of the rows in the previous version. Usually we want to know what happens within a station, for those 3 factors, so have the station at the top level, i.e. 45 for Chipata, etc. Not so clear after that, but earlier we had planting day, then length of crop and lastly the rainfall requirement. If you keep the same order, then it will be easier to compare.
c) I may have got it wrong, but I tried to reorder and then copied over one of the conditions from the earlier run (2 elements only, into V1, and I may have done it wrong?)

image

The results are not the same! With the 3 conditions I think you always get zero and that doesn't seem quite right either!

@lilyclements
Contributor Author

a) I've put it into the VB code now

b) Previously the "crops_def" and "crop_props" data frames had different column orders, so I've now given them both the same column order (station - day - length - rain, but this can be very easily changed)

c) Agreed on the station point. It used to be station then year then planting day, length, rain. (So you would have all your Saltpond - 1944's, then all your Saltpond 1945's, etc) This feels less intuitive to me, but, I haven't seen this dialog in use so you have a much better understanding!
In this version, I've ordered it to be by station, day, length, rain now too in terms of row ordering. What do you think?
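
The ordering described here can be sketched as below. This is an illustration under assumptions, not the PR's code: the toy table and the column names `station`, `plant_day`, `plant_length` and `rain_total` are made up for the example.

```r
# Toy summary table; the column names here are assumptions for illustration.
crop_props <- expand.grid(
  rain_total = c(200, 300),
  plant_length = c(100, 120),
  plant_day = c(310, 320),
  station = c("Lundazi", "Chipata"),
  stringsAsFactors = FALSE
)

# Sort rows by station first, then planting day, length and rainfall requirement
row_order <- with(crop_props, order(station, plant_day, plant_length, rain_total))
# Put the columns in the same station - day - length - rain order
col_order <- c("station", "plant_day", "plant_length", "rain_total")
crop_props <- crop_props[row_order, col_order]
rownames(crop_props) <- NULL
print(head(crop_props))
```

Changing the priority later only means reordering the arguments to `order()` and the names in `col_order`.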

d) You're absolutely right on the issue that it is no longer giving the same results as the previous version. That is something I accidentally added earlier today when fixing something else. I've now sorted that, so this should now be giving the same results as the previous version.

@rdstern this is now ready for review again

@rdstern
Collaborator

rdstern commented Jan 22, 2025

@lilyclements this is great. I have spent a long time however, looking into the results. They are often the same and they are similar when they are different. I wonder why?
The Zambia data set is pretty awkward. There are the 5 stations and quite a lot of missing values. But, in addition, there are some years when the start is present but the end is missing, and vice versa too. I suspect these years may be causing the differences in the results. Indeed I'm not sure how those years are dealt with, either before or now?

I suspect any year when either is missing (or even when the seasonal total is missing) should simply be omitted from the calculations - indeed maybe you do that now? I'll check tomorrow what that does to the calculations.
In the future, for the end, we will have an option to include the last day for the censored ends of the season, and I suggest we do that. We may want to include the censored start years as a fail - for the 3 conditions anyway.
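
The suggestion above, to omit any year where the start, the end or the seasonal total is missing, could be sketched like this. The data frame and its column names (`s_year`, `start_rains`, `end_rains`, `seasonal_total`) are assumptions for illustration only, and this is not a description of what the dialog currently does.

```r
# Toy yearly summaries; column names are assumptions for illustration.
season <- data.frame(
  s_year = 2001:2005,
  start_rains = c(310, NA, 305, 320, NA),
  end_rains = c(100, 95, NA, 110, NA),
  seasonal_total = c(800, 650, 700, NA, NA)
)

# Keep only years where start, end and seasonal total are all present
needed <- c("start_rains", "end_rains", "seasonal_total")
complete_years <- season[complete.cases(season[, needed]), ]
print(complete_years$s_year)
```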

@rdstern
Collaborator

rdstern commented Jan 22, 2025

@lilyclements it seems to work (almost) fine now! And it is faster. It took 23 seconds for 45 conditions on 5 stations compared to over 50 seconds using version 0.8.1.
It often gives the same results, but not always. I wondered whether missing values might be the problem, so I changed the summary dataset to have both the start and end missing if either was missing. There are still some small differences. See below for Chipata, which is usually the same but occasionally slightly different.

image

Lundazi is interesting. It is always slightly different, and is the worst of the stations in terms of current missing data:

image

There are two other points perhaps for later:

a) When the start and end are finished we should assess how to use them in this dialog. What should we do if there is no start? We could think of that as a failed year, so include that year, and it always fails. Or we could report that issue separately and omit those years from these combined risks. I assume that's what we do now.
For the end I suggest we use the filled-in year, when if it doesn't end then we take the last day we used.
b) The dialog is a bit messy when returning to it. You go back to the daily data and that fills the 4 receivers automatically. Then you go to the summary data and it does not. Maybe it could? Alternatively, this is the only dialog where the selector is used in this way. It would be much simpler if we gave the selector twice - as we do elsewhere. We could also consider that perhaps.

@lilyclements
Contributor Author

@rdstern thanks for outlining where you are seeing these differences - this difference in the values should now be fixed! The issue was occurring when the rainfall in the entire year for that station was missing. My mistake, and fixed it now! This is ready for re-review.

I agree it would be very good to have a discussion on point (a). On point (b), I'm not sure how fixable it is in its current state. Perhaps someone who knows the VB code better would be able to say if this is just a very easy fix, or if we do need to have two data frames as you suggest.

A side note, if you were interested:
One reason why it is much quicker now is that before we would calculate the "actual rainfall amount" from the Planting Date to the "Planting Date + Length".
We would previously repeat this for every "plant_date", "plant_length", "rain_total" combination. However, we did not need to calculate this for every "rain_total", since that is a calculation which comes later. We only need to calculate for every "plant_date" and "plant_length" combination.
This did mean being fancy with some code later on, which is where I accidentally missed out the cases where the entire year is NA (I was checking whether any values were NA, not whether all or any were NA).
This meant that the new code was giving these as FALSE, but the old code was giving them as NA.
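
The general pitfall can be illustrated with a small sketch. This is not the code from this PR: `rain_condition_sketch` is a hypothetical helper, and summing partially missing years with `na.rm = TRUE` is purely an assumption of the example. The point is that a year with no observations at all should come out as NA rather than FALSE.

```r
# Hypothetical helper illustrating the pitfall; not the code from this PR.
rain_condition_sketch <- function(rain, threshold) {
  if (all(is.na(rain))) {
    return(NA)  # nothing observed this year: the outcome is unknown, not a failure
  }
  # Assumption for this sketch only: partially missing years are summed as-is
  sum(rain, na.rm = TRUE) >= threshold
}

all_missing <- c(NA_real_, NA_real_, NA_real_)
naive <- sum(all_missing, na.rm = TRUE) >= 200      # FALSE: looks like a failed year
checked <- rain_condition_sketch(all_missing, 200)  # NA: the year is flagged as unknown
print(c(naive = naive, checked = checked))
```

The naive version returns FALSE because `sum()` of an all-NA vector with `na.rm = TRUE` is 0, which is how a fully missing year can silently turn into a failed year.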

Collaborator

@rdstern rdstern left a comment

@lilyclements that looks great now. I am approving.
@N-thony please can you check and merge if ok.

@N-thony when you are checking, there is one small item that maybe for you to change. Here is the output window after running the dialog:
image

Now I like the fact there is a brief report in the output window, when the main results are the 2 new data frames. But there are now also 2 lines at the bottom about undo, and I wonder if those are needed? They seem to me to be a distraction. If they are needed, then I have no problem with them coming in the log window, but do they need also to be here?

Thanks

@N-thony N-thony merged commit 2eb584e into IDEMSInternational:master Jan 23, 2025
2 checks passed
@N-thony
Collaborator

N-thony commented Jan 23, 2025

@rdstern can you confirm this?
When I open the PICSA Crop Dialogue, the autofill works well, and when I click on Start Receiver, it clears the selector, I think because it can find the corresponding column and dataset? But now when I go back to the Rain receiver, my dataset is back but without the list of variables, though I had it the first time.
image
image
image

@rdstern
Collaborator

rdstern commented Jan 23, 2025

@N-thony you are right in your observations above and that it is currently a pain, especially when the dialog is re-opened. I have made issue #9384 to improve the working of that dialog. I am happy that this is already merged, and that we make the cosmetic improvements in a separate pull request.

3 participants