Merge #4383
4383: Allow slight tolerance for restart health check r=rafal-ch a=rafal-ch

This PR relaxes the NCTL health check that validates the number of node restarts. The goal is to guard against flakiness: cases where the test finishes correctly and all assertions hold (for example, the network is correctly upgraded), but slightly more node restarts occurred during the process than expected.

For example, assuming that there are 10 restarts allowed, NCTL will:
* report an error if there are more than 10 + 50% = 15 restarts:
```
NCTL :: Adjusted restarts allowed: 15
NCTL :: ERROR: ALLOWED: 15 < TOTAL: 16
```

* warn if there are more than 10, but no more than 10 + 50% = 15 restarts:
```
NCTL :: WARN: Test would fail without allowed restart adjustment
NCTL :: SUCCESS: ALLOWED: 15 > TOTAL: 13
```

* finish successfully otherwise
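
The three outcomes above can be sketched as a small standalone function. This is only an illustration under stated assumptions, not the NCTL code itself: `classify_restarts` and the strings it prints are hypothetical, and it returns a status code instead of calling `exit`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NCTL): classify a restart count against
# an allowed count with a 50% tolerance, using integer arithmetic.
classify_restarts() {
    local allowed=$1
    local total=$2
    local adjusted=$((allowed + allowed / 2))

    # Strictly more restarts than the adjusted limit: hard failure.
    if [ "$total" -gt "$adjusted" ]; then
        echo "error"
        return 1
    fi

    # Above the original limit but within the tolerance: warn, still pass.
    if [ "$total" -gt "$allowed" ]; then
        echo "warn"
        return 0
    fi

    echo "ok"
}
```

With 10 restarts allowed, `classify_restarts 10 16` prints `error`, `classify_restarts 10 13` prints `warn`, and `classify_restarts 10 10` prints `ok`. A count of exactly 15 falls in the warning band, because only counts strictly greater than the adjusted limit fail.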

Closes #4360 

Co-authored-by: Rafał Chabowski <rafal@casperlabs.io>
casperlabs-bors-ng[bot] and Rafał Chabowski authored Oct 30, 2023
2 parents fefd185 + 71144cb commit be8e93c
Showing 1 changed file with 11 additions and 3 deletions.
14 changes: 11 additions & 3 deletions utils/nctl/sh/scenarios/common/health_checks.sh
@@ -199,6 +199,7 @@ function assert_crash_count() {
 function assert_restart_count() {
     local COUNT
     local TOTAL
+    local ADJUSTED_RESTARTS_ALLOWED

     log_step "Looking for restarts in logs..."

@@ -218,12 +219,19 @@ function assert_restart_count() {
         return 0
     fi

-    if [ "$RESTARTS_ALLOWED" != "$TOTAL" ]; then
-        log "ERROR: ALLOWED: $RESTARTS_ALLOWED != TOTAL: $TOTAL"
+    ADJUSTED_RESTARTS_ALLOWED=$((RESTARTS_ALLOWED + RESTARTS_ALLOWED / 2))
+    log "Adjusted restarts allowed: $ADJUSTED_RESTARTS_ALLOWED"
+
+    if [ "$TOTAL" -gt "$ADJUSTED_RESTARTS_ALLOWED" ]; then
+        log "ERROR: ALLOWED: $ADJUSTED_RESTARTS_ALLOWED < TOTAL: $TOTAL"
         exit 1
     fi

-    log "SUCCESS: ALLOWED: $RESTARTS_ALLOWED = TOTAL: $TOTAL"
+    if [ "$TOTAL" -gt "$RESTARTS_ALLOWED" ]; then
+        log "WARN: Test would fail without allowed restart adjustment"
+    fi
+
+    log "SUCCESS: ALLOWED: $ADJUSTED_RESTARTS_ALLOWED > TOTAL: $TOTAL"
 }

 ##########################################################
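
Shell arithmetic is integer-only, so the `RESTARTS_ALLOWED / 2` term in the diff rounds down. The tolerance is therefore slightly under 50% for odd values and is zero when only one restart is allowed. A minimal sketch of that arithmetic follows; `adjusted_limit` is a hypothetical helper name used purely for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the same expression as in the diff, isolated so the
# rounding behaviour of integer division is easy to see.
adjusted_limit() {
    local allowed=$1
    echo $((allowed + allowed / 2))
}

# Print the adjusted limit for a few sample values.
for allowed in 0 1 3 10; do
    echo "allowed=$allowed adjusted=$(adjusted_limit "$allowed")"
done
```

This prints adjusted limits of 0, 1, 4 and 15 for allowed values of 0, 1, 3 and 10. Scenarios that allow a single restart get no extra tolerance from this formula.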