
Patch transient CI failures due to busy file system #214

Open · wants to merge 12 commits into develop
Conversation

@tim-griesbach (Collaborator) commented Jan 15, 2025

Following up on issue #213.

As discussed in the referenced issue, we observed transient CI failures caused by test_scda.c. The failing I/O operations return a successful MPI error code but report an unexpected MPI count of 0, indicating that no actual I/O was performed. While investigating, I observed that sleeping between I/O operations reduces the frequency of the transient failures or even seems to eliminate them entirely, suggesting that the underlying cause may be a busy file system on the CI runners.

This PR applies a patch in the CI that waits 5 milliseconds whenever a successful I/O operation unexpectedly results in a count of 0 and then retries the operation once. The sleep time can be adjusted in the CI via the variable IO_SLEEP_TIME (cf. eb190eb). In my tests, 5 milliseconds ensured a reliable CI. However, past experience indicates that the extent of the issue can vary significantly over time, so the sleep time may need to be adjusted in the future.
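
For illustration, the retry logic could look roughly like the C sketch below. This is a minimal sketch under my reading of the PR description, not the actual diff: the wrapper name `io_write_at_with_retry` is hypothetical, `MPI_File_write_at` stands in for whichever blocking I/O call is patched, and `IO_SLEEP_TIME` mirrors the CI variable of the same name (cf. eb190eb).

```c
#include <mpi.h>
#include <unistd.h>

#ifndef IO_SLEEP_TIME
#define IO_SLEEP_TIME 5000      /* sleep time in microseconds (5 ms) */
#endif

/* Hypothetical wrapper illustrating the patch's idea: if MPI reports
 * success but a count of 0, sleep briefly and retry exactly once. */
static int
io_write_at_with_retry (MPI_File fh, MPI_Offset offset,
                        const void *buf, int count, MPI_Datatype type)
{
  MPI_Status          status;
  int                 mpiret, ocount;

  mpiret = MPI_File_write_at (fh, offset, buf, count, type, &status);
  if (mpiret != MPI_SUCCESS) {
    return mpiret;
  }
  mpiret = MPI_Get_count (&status, type, &ocount);
  if (mpiret == MPI_SUCCESS && ocount == 0 && count > 0) {
    /* Success was reported but nothing was written: assume a busy
     * file system, wait IO_SLEEP_TIME microseconds and retry once. */
    usleep (IO_SLEEP_TIME);
    mpiret = MPI_File_write_at (fh, offset, buf, count, type, &status);
  }
  return mpiret;
}
```

After the second attempt, the caller would check the count again and report the failure as before if it is still 0, so a genuinely broken file system is not masked.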

@cburstedde (Owner)

It's a relief that this approach seems to resolve the issue with the CI file system. Maybe go for 10ms for safety? There was one remaining failure in the CI of the PR.

@tim-griesbach (Collaborator, Author)

> It's a relief that this approach seems to resolve the issue with the CI file system. Maybe go for 10ms for safety? There was one remaining failure in the CI of the PR.

After testing different sleep times, I would propose 50 milliseconds: in my tests, 10 milliseconds sometimes did not suffice, while in yesterday's tests 5 milliseconds were fine. Given this variability in the occurrence of the CI failures, 50 milliseconds is probably the more future-proof option, and it still results in reasonable CI runtimes.

@tim-griesbach (Collaborator, Author)

I think it would make sense to wait until tomorrow before merging this PR, because the CI's behavior today was so different from yesterday's. I would like to test again tomorrow to get a better idea of a suitable sleep time.

@cburstedde (Owner)

What about merging now? Have you tested how this sc update works with the p4est CI, and if there's more to fix there (such as the patching etc.)?

@tim-griesbach (Collaborator, Author)

> What about merging now?

Sounds good to me. After more tests, I increased the I/O sleep time again (now 0.5 seconds). Since my tests showed that the behavior of the CI's file system is highly non-deterministic, I think the increase makes the CI more reliable, and the impact on the overall CI time is still small.

> Have you tested how this sc update works with the p4est CI, and if there's more to fix there (such as the patching etc.)?

This PR only changes functions that are not used in the p4est CI, since the only parallel I/O functions in p4est's CI are the legacy functions sc_mpi_{read,write}. I would propose a separate PR for those, since first tests in p4est indicated that the higher I/O intensity of the p4est tests may pose an additional challenge.
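
Purely for illustration, the same one-shot retry could later wrap the legacy read path roughly as sketched below; sc_mpi_read itself is not reproduced here, and the wrapper name `io_read_with_retry` as well as the 0.5-second default are only stand-ins for whatever a follow-up PR settles on.

```c
#include <mpi.h>
#include <unistd.h>

#ifndef IO_SLEEP_TIME
#define IO_SLEEP_TIME 500000    /* microseconds; 0.5 s as in this PR */
#endif

/* Hypothetical read-side counterpart of the retry wrapper above. */
static int
io_read_with_retry (MPI_File fh, void *buf, int count, MPI_Datatype type)
{
  MPI_Status          status;
  int                 mpiret, ocount;

  mpiret = MPI_File_read (fh, buf, count, type, &status);
  if (mpiret != MPI_SUCCESS) {
    return mpiret;
  }
  mpiret = MPI_Get_count (&status, type, &ocount);
  if (mpiret == MPI_SUCCESS && ocount == 0 && count > 0) {
    /* Successful error code but zero count: wait once and retry. */
    usleep (IO_SLEEP_TIME);
    mpiret = MPI_File_read (fh, buf, count, type, &status);
  }
  return mpiret;
}
```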

@cburstedde (Owner)

Ok, hoping the overall run time of the CI does not increase out of proportion due to the sleep calls.

Sounds good to go after the p4est-related functions next.
