-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Patch transient CI failures due to busy file system #214
base: develop
Are you sure you want to change the base?
Patch transient CI failures due to busy file system #214
Conversation
It's a relief that this approach seems to resolve the issue with the CI file system. Maybe go for 10ms for safety? There was one remaining failure in the CI of the PR. |
After some testing of different sleeping times, I would propose to use 50 milliseconds since in my tests 10 milliseconds sometimes did not suffice but in yesterday's tests 5 milliseconds were fine. Given this variability of the occurrence of the CI fails, 50 milliseconds is maybe the more future proof option, which still results in reasonable CI runtimes. |
I think it would make sense to wait with merging this PR until tomorrow because the CI's behavior was so different today compared to yesterday. Hence, I would like to test again tomorrow to get an better idea of a suitable sleeping time. |
What about merging now? Have you tested how this sc update works with the p4est CI, and if there's more to fix there (such as the patching etc.)? |
Sounds good to me. After more tests, I increased the I/O sleeping time again (now 0.5 seconds). Since my tests showed the behavior of the CI's file system is very much non-deterministic, I think the increase provides a more reliable CI and the impact on the overall CI time is still small.
This PR only changes functions that are not used in the p4est CI since the only parallel IO functions in p4est's CI are the legacy functions |
Ok, hoping the overall run time of the CI does not increase out of proportion due to the sleep calls. Sounds good going after the p4est related functions next. |
Patch transient CI failures due to busy file system
Following up on issue #213.
As discussed in the referenced issue, we observed transient CI failures due to
test_scda.c
. The failures appear due to I/O operations that are according to MPI I/O's error code successful but have an unexpected MPI count of 0, indicating that no actual I/O operation was performed. During the investigation of this issue, I observed that calling a sleep between I/O operations reduces the frequency of the transient failures or even seems to overcome them, suggesting that the underlying cause may be a busy file system on the CI runners.This PR applies a patch in the CI to wait 5 milliseconds if a successful I/O operation unexpectedly resulted in count 0 and then retry the I/O operation once. The sleeping time can be adjusted in the CI by adjusting the variable
IO_SLEEP_TIME
(cf. eb190eb). In my tests 5 milliseconds ensured a reliable CI. However, past experiences indicate that the extent of the issue can vary significantly over time. Hence, it may be necessary to adjust the choice of the sleeping time in the future.