
Patch transient CI failures due to busy file system #214

Open · wants to merge 12 commits into develop
Conversation

@tim-griesbach (Collaborator) commented Jan 15, 2025

Following up on issue #213.

As discussed in the referenced issue, we observed transient CI failures caused by test_scda.c. The failing I/O operations return a successful MPI error code but report an unexpected MPI count of 0, indicating that no actual I/O was performed. While investigating, I observed that sleeping between I/O operations reduces the frequency of the transient failures or even seems to eliminate them entirely, suggesting that the underlying cause may be a busy file system on the CI runners.

This PR applies a patch in the CI that waits 5 milliseconds whenever a successful I/O operation unexpectedly results in a count of 0 and then retries the operation once. The sleep time can be adjusted in the CI via the variable IO_SLEEP_TIME (cf. eb190eb). In my tests, 5 milliseconds ensured a reliable CI. However, past experience indicates that the extent of the issue can vary significantly over time, so the sleep time may need to be adjusted in the future.
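
For illustration, the retry logic could look roughly like the C sketch below. This is a minimal sketch under my reading of the PR description, not the actual diff: the wrapper name `io_write_at_with_retry` is hypothetical, `MPI_File_write_at` stands in for whichever blocking I/O call is patched, and `IO_SLEEP_TIME` mirrors the CI variable of the same name (cf. eb190eb).

```c
#include <mpi.h>
#include <unistd.h>

#ifndef IO_SLEEP_TIME
#define IO_SLEEP_TIME 5000      /* sleep time in microseconds (5 ms) */
#endif

/* Hypothetical wrapper illustrating the patch's idea: if MPI reports
 * success but a count of 0, sleep briefly and retry exactly once. */
static int
io_write_at_with_retry (MPI_File fh, MPI_Offset offset,
                        const void *buf, int count, MPI_Datatype type)
{
  MPI_Status          status;
  int                 mpiret, ocount;

  mpiret = MPI_File_write_at (fh, offset, buf, count, type, &status);
  if (mpiret != MPI_SUCCESS) {
    return mpiret;
  }
  mpiret = MPI_Get_count (&status, type, &ocount);
  if (mpiret == MPI_SUCCESS && ocount == 0 && count > 0) {
    /* Success was reported but nothing was written: assume a busy
     * file system, wait IO_SLEEP_TIME microseconds and retry once. */
    usleep (IO_SLEEP_TIME);
    mpiret = MPI_File_write_at (fh, offset, buf, count, type, &status);
  }
  return mpiret;
}
```

After the second attempt, the caller would check the count again and report the failure as before if it is still 0, so a genuinely broken file system is not masked.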

@cburstedde (Owner)

It's a relief that this approach seems to resolve the issue with the CI file system. Maybe go for 10ms for safety? There was one remaining failure in the CI of the PR.

@tim-griesbach (Collaborator, Author)

> It's a relief that this approach seems to resolve the issue with the CI file system. Maybe go for 10ms for safety? There was one remaining failure in the CI of the PR.

After testing different sleep times, I would propose 50 milliseconds: in my tests, 10 milliseconds sometimes did not suffice, while in yesterday's tests 5 milliseconds were fine. Given this variability in the occurrence of the CI failures, 50 milliseconds is probably the more future-proof option, and it still results in reasonable CI runtimes.

@tim-griesbach (Collaborator, Author)

I think it would make sense to wait until tomorrow before merging this PR, because the CI's behavior today was so different from yesterday's. I would like to test again tomorrow to get a better idea of a suitable sleep time.

@cburstedde (Owner)

What about merging now? Have you tested how this sc update works with the p4est CI, and if there's more to fix there (such as the patching etc.)?

@tim-griesbach (Collaborator, Author)

> What about merging now?

Sounds good to me. After more tests, I increased the I/O sleep time again (now 0.5 seconds). Since my tests showed that the behavior of the CI's file system is highly non-deterministic, I think the increase makes the CI more reliable, and the impact on the overall CI time is still small.

> Have you tested how this sc update works with the p4est CI, and if there's more to fix there (such as the patching etc.)?

This PR only changes functions that are not used in the p4est CI, since the only parallel I/O functions in p4est's CI are the legacy functions sc_mpi_{read,write}. I would propose a separate PR for those, since first tests in p4est indicated that the higher I/O intensity of the p4est tests may pose an additional challenge.
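
Purely for illustration, the same one-shot retry could later wrap the legacy read path roughly as sketched below; sc_mpi_read itself is not reproduced here, and the wrapper name `io_read_with_retry` as well as the 0.5-second default are only stand-ins for whatever a follow-up PR settles on.

```c
#include <mpi.h>
#include <unistd.h>

#ifndef IO_SLEEP_TIME
#define IO_SLEEP_TIME 500000    /* microseconds; 0.5 s as in this PR */
#endif

/* Hypothetical read-side counterpart of the retry wrapper above. */
static int
io_read_with_retry (MPI_File fh, void *buf, int count, MPI_Datatype type)
{
  MPI_Status          status;
  int                 mpiret, ocount;

  mpiret = MPI_File_read (fh, buf, count, type, &status);
  if (mpiret != MPI_SUCCESS) {
    return mpiret;
  }
  mpiret = MPI_Get_count (&status, type, &ocount);
  if (mpiret == MPI_SUCCESS && ocount == 0 && count > 0) {
    /* Successful error code but zero count: wait once and retry. */
    usleep (IO_SLEEP_TIME);
    mpiret = MPI_File_read (fh, buf, count, type, &status);
  }
  return mpiret;
}
```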

@cburstedde (Owner)

Ok, hoping the overall run time of the CI does not increase out of proportion due to the sleep calls.

Sounds good to go after the p4est-related functions next.
