Fix for Health State Handling in Storage Health Check #774

Open · itsouvalas wants to merge 1 commit into develop

Conversation

itsouvalas
Contributor

Issue Summary

The previous implementation of the storage health check had multiple issues in two separate files:

  • In bin/shield-pipe, the storage plugin test (test-store) failed on its first attempt because there was no delay between asset creation and verification, and the storage backend needs time to become consistent. A short delay (sleep 5) was added in this file to resolve that issue.
  • In core/scheduler/chore.go, the health check logic prematurely marked the storage system as unhealthy after the first failure, even when subsequent retries succeeded. This was because store.Healthy was set to false on the first failure and never corrected on a later success (see the sketch below).
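
A minimal, self-contained Go sketch of that failure mode is shown below. It is an illustration only, not the previous chore.go code: the variable names are assumed, and it simply simulates one failed attempt followed by successful retries to show why a flag that is only ever set to false stays false.

package main

import "fmt"

// Illustrative reconstruction of the failure mode described above; the
// names here are assumptions and do not come from the original chore.go.
func main() {
    // Outcome of each attempt: the first fails, the retries succeed.
    attempts := []bool{false, true, true, true}
    healthy := true

    for i, ok := range attempts {
        if !ok {
            // The old logic flipped the flag on the first failure...
            healthy = false
        }
        // ...but never set it back to true on a successful retry.
        fmt.Printf("attempt %d ok=%v -> healthy=%v\n", i, ok, healthy)
    }

    // The final state stays false even though the last attempts succeeded.
    fmt.Printf("final: healthy=%v\n", healthy)
}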

Symptoms Observed

  • The system was marking the storage as unhealthy after the first failure, despite successful retries later in the process.
  • The log entries reflected contradictory statuses, showing a successful retry but still marking the storage as unhealthy.
  • The issue was particularly evident in cases where transient failures were resolved within the retry window.

Fix Implemented

The corrections were made as follows:

  • In bin/shield-pipe: A delay (sleep 5) was added between asset creation and retrieval to ensure storage consistency.
# Added to bin/shield-pipe
test-store {
    ...
    say "waiting 5 seconds to account for storage consistency..."
    sleep 5
    ...
}
  • In core/scheduler/chore.go:
    • The logic was updated to ensure the last retry attempt dictates the final health state, rather than the first failure.
    • The health state is determined by the final parsed result from the output.
    • Added defensive handling to parse the last line of the output in case of multi-line results.
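// Excerpt from core/scheduler/chore.go: the surrounding function supplies
// w, task, chore, store, and v, and the file imports encoding/json,
// strings, and time for the calls below.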
// To account for multiple retries, we'll parse the last line of the output to determine the health of the store
lines := strings.Split(strings.TrimSpace(output), "\n")
lastLine := lines[len(lines)-1]

// Attempt to parse health from the last line
err = json.Unmarshal([]byte(lastLine), &v)
if err != nil {
    w.db.UpdateTaskLog(task.UUID, "\nTEST-STORE: unable to parse script output; marking storage system as UNHEALTHY.\n")
    store.Healthy = false
    w.db.UpdateStoreHealth(store)
    w.db.FailTask(chore.TaskUUID, time.Now())
    return
}

// Final health state determination
if v.Healthy {
    if !store.Healthy {
        w.db.UpdateTaskLog(task.UUID, "\nTEST-STORE: marking storage system as HEALTHY (recovery).\n")
    } else {
        w.db.UpdateTaskLog(task.UUID, "\nTEST-STORE: storage is still HEALTHY.\n")
    }
    store.Healthy = true
    w.db.CompleteTask(chore.TaskUUID, time.Now())
} else {
    w.db.UpdateTaskLog(task.UUID, "\nTEST-STORE: marking storage system as UNHEALTHY.\n")
    store.Healthy = false
    w.db.FailTask(chore.TaskUUID, time.Now())
}

Validation and Testing

  • The change was validated by simulating failure scenarios where the S3 NoSuchKey error was triggered, followed by successful retries.
  • The corrected behavior set the health state based on the final retry result, as shown in the updated logs (a standalone sketch of the last-line parsing follows this list).
  • The delay resolved the issue where the first retrieve operation would fail due to storage consistency delays.
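
For reference, the following standalone Go sketch exercises the same last-line parsing against output shaped like the DEBUG lines in the logs below. The healthResult type is an assumed stand-in for illustration, not the struct used in chore.go.

package main

import (
    "encoding/json"
    "fmt"
    "strings"
)

// healthResult is a stand-in for the struct unmarshalled in chore.go.
type healthResult struct {
    Healthy bool `json:"healthy"`
}

func main() {
    // Output with one failed attempt followed by successful retries,
    // mirroring the DEBUG lines in the logs below.
    output := "{\"healthy\":false}\n{\"healthy\":true}\n{\"healthy\":true}\n{\"healthy\":true}\n"

    // Only the last line decides the final health state.
    lines := strings.Split(strings.TrimSpace(output), "\n")
    lastLine := lines[len(lines)-1]

    var v healthResult
    if err := json.Unmarshal([]byte(lastLine), &v); err != nil {
        fmt.Println("unable to parse script output; would mark storage UNHEALTHY")
        return
    }
    fmt.Printf("parsed %q -> healthy=%v\n", lastLine, v.Healthy)
}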

Logs (Before and After Fix)

Before Fix:

FAILED: unable to read from storage
EXITING 2

------
RETRY: `1`
connecting to 192.168.x.x:5444 (tcp/ipv4)
... (success)
EXITING 0

DEBUG: Attempting to parse output:
{"healthy":false}
{"healthy":true}
{"healthy":true}
{"healthy":true}

TEST-STORE: unable to parse script output; marking storage system as UNHEALTHY.

After Fix:

FAILED: unable to read from storage
EXITING 2

------
RETRY: `1`
connecting to 192.168.x.x:5444 (tcp/ipv4)
... (success)
EXITING 0

DEBUG: Attempting to parse output:
{"healthy":false}
{"healthy":true}
{"healthy":true}
{"healthy":true}

DEBUG: Successfully parsed JSON. v.Healthy = true
TEST-STORE: storage is still HEALTHY.

Impact and Benefits

  • Accuracy: Ensures the final retry attempt dictates the health state, preventing premature marking as unhealthy.
  • Clarity: The health check logic now reflects the latest state, reducing operational confusion.
  • Stability: Adding a delay between asset creation and verification improves system reliability.

- Added a delay (`sleep 5`) in `bin/shield-pipe` to address eventual consistency issues in the storage backend.
- Updated `core/scheduler/chore.go` to ensure the final retry attempt determines the health state, preventing premature unhealthy state marking.
- Improved health check parsing by ensuring only the last line of the output is evaluated.
- Adjusted error handling to avoid marking storage as unhealthy on transient failures.
- Added debug logging for improved visibility during health state evaluation.

Resolves: Issue with health state persisting as unhealthy after successful retries,
and test-store failure due to eventual consistency issues.
itsouvalas requested a review from krutten on January 10, 2025 at 06:28
krutten (Contributor) left a comment:


LGTM
