Fix for Health State Handling in Storage Health Check #774

Open · itsouvalas wants to merge 1 commit into develop

Conversation

itsouvalas
Contributor

Issue Summary

The previous implementation of the storage health check had multiple issues in two separate files:

  • In bin/shield-pipe, the storage plugin test (test-store) failed on its first attempt because there was no delay between asset creation and verification, and the storage backend needs time to become consistent. A short delay (sleep 5) was added in this file to resolve that issue.
  • In core/scheduler/chore.go, the health check logic prematurely marked the storage system as unhealthy after the first failure, even when subsequent retries succeeded. This was because store.Healthy was set to false on the first failure and never corrected on a later success (see the sketch below).
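
A minimal, self-contained Go sketch of that failure mode is shown below. It is an illustration only, not the previous chore.go code: the variable names are assumed, and it simply simulates one failed attempt followed by successful retries to show why a flag that is only ever set to false stays false.

package main

import "fmt"

// Illustrative reconstruction of the failure mode described above; the
// names here are assumptions and do not come from the original chore.go.
func main() {
    // Outcome of each attempt: the first fails, the retries succeed.
    attempts := []bool{false, true, true, true}
    healthy := true

    for i, ok := range attempts {
        if !ok {
            // The old logic flipped the flag on the first failure...
            healthy = false
        }
        // ...but never set it back to true on a successful retry.
        fmt.Printf("attempt %d ok=%v -> healthy=%v\n", i, ok, healthy)
    }

    // The final state stays false even though the last attempts succeeded.
    fmt.Printf("final: healthy=%v\n", healthy)
}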

Symptoms Observed

  • The system was marking the storage as unhealthy after the first failure, despite successful retries later in the process.
  • The log entries reflected contradictory statuses, showing a successful retry but still marking the storage as unhealthy.
  • The issue was particularly evident in cases where transient failures were resolved within the retry window.

Fix Implemented

The corrections were made as follows:

  • In bin/shield-pipe: A delay (sleep 5) was added between asset creation and retrieval to ensure storage consistency.
# Added to bin/shield-pipe
test-store {
    ...
    say "waiting 5 seconds to account for storage consistency..."
    sleep 5
    ...
}
  • In core/scheduler/chore.go:
    • The logic was updated to ensure the last retry attempt dictates the final health state, rather than the first failure.
    • The health state is determined by the final parsed result from the output.
    • Added defensive handling to parse the last line of the output in case of multi-line results.
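// Excerpt from core/scheduler/chore.go: the surrounding function supplies
// w, task, chore, store, and v, and the file imports encoding/json,
// strings, and time for the calls below.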
// To account for multiple retries, we'll parse the last line of the output to determine the health of the store
lines := strings.Split(strings.TrimSpace(output), "\n")
lastLine := lines[len(lines)-1]

// Attempt to parse health from the last line
err = json.Unmarshal([]byte(lastLine), &v)
if err != nil {
    w.db.UpdateTaskLog(task.UUID, "\nTEST-STORE: unable to parse script output; marking storage system as UNHEALTHY.\n")
    store.Healthy = false
    w.db.UpdateStoreHealth(store)
    w.db.FailTask(chore.TaskUUID, time.Now())
    return
}

// Final health state determination
if v.Healthy {
    if !store.Healthy {
        w.db.UpdateTaskLog(task.UUID, "\nTEST-STORE: marking storage system as HEALTHY (recovery).\n")
    } else {
        w.db.UpdateTaskLog(task.UUID, "\nTEST-STORE: storage is still HEALTHY.\n")
    }
    store.Healthy = true
    w.db.CompleteTask(chore.TaskUUID, time.Now())
} else {
    w.db.UpdateTaskLog(task.UUID, "\nTEST-STORE: marking storage system as UNHEALTHY.\n")
    store.Healthy = false
    w.db.FailTask(chore.TaskUUID, time.Now())
}

Validation and Testing

  • The change was validated by simulating failure scenarios where the S3 NoSuchKey error was triggered, followed by successful retries.
  • The corrected behavior set the health state based on the final retry result, as shown in the updated logs (a standalone sketch of the last-line parsing follows this list).
  • The delay resolved the issue where the first retrieve operation would fail due to storage consistency delays.
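
For reference, the following standalone Go sketch exercises the same last-line parsing against output shaped like the DEBUG lines in the logs below. The healthResult type is an assumed stand-in for illustration, not the struct used in chore.go.

package main

import (
    "encoding/json"
    "fmt"
    "strings"
)

// healthResult is a stand-in for the struct unmarshalled in chore.go.
type healthResult struct {
    Healthy bool `json:"healthy"`
}

func main() {
    // Output with one failed attempt followed by successful retries,
    // mirroring the DEBUG lines in the logs below.
    output := "{\"healthy\":false}\n{\"healthy\":true}\n{\"healthy\":true}\n{\"healthy\":true}\n"

    // Only the last line decides the final health state.
    lines := strings.Split(strings.TrimSpace(output), "\n")
    lastLine := lines[len(lines)-1]

    var v healthResult
    if err := json.Unmarshal([]byte(lastLine), &v); err != nil {
        fmt.Println("unable to parse script output; would mark storage UNHEALTHY")
        return
    }
    fmt.Printf("parsed %q -> healthy=%v\n", lastLine, v.Healthy)
}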

Logs (Before and After Fix)

Before Fix:

FAILED: unable to read from storage
EXITING 2

------
RETRY: `1`
connecting to 192.168.x.x:5444 (tcp/ipv4)
... (success)
EXITING 0

DEBUG: Attempting to parse output:
{"healthy":false}
{"healthy":true}
{"healthy":true}
{"healthy":true}

TEST-STORE: unable to parse script output; marking storage system as UNHEALTHY.

After Fix:

FAILED: unable to read from storage
EXITING 2

------
RETRY: `1`
connecting to 192.168.x.x:5444 (tcp/ipv4)
... (success)
EXITING 0

DEBUG: Attempting to parse output:
{"healthy":false}
{"healthy":true}
{"healthy":true}
{"healthy":true}

DEBUG: Successfully parsed JSON. v.Healthy = true
TEST-STORE: storage is still HEALTHY.

Impact and Benefits

  • Accuracy: Ensures the final retry attempt dictates the health state, preventing premature marking as unhealthy.
  • Clarity: The health check logic now reflects the latest state, reducing operational confusion.
  • Stability: Adding a delay between asset creation and verification improves system reliability.

- Added a delay (`sleep 5`) in `bin/shield-pipe` to address eventual consistency issues in the storage backend.
- Updated `core/scheduler/chore.go` to ensure the final retry attempt determines the health state, preventing premature unhealthy state marking.
- Improved health check parsing by ensuring only the last line of the output is evaluated.
- Adjusted error handling to avoid marking storage as unhealthy on transient failures.
- Added debug logging for improved visibility during health state evaluation.

Resolves: Issue with health state persisting as unhealthy after successful retries,
and test-store failure due to eventual consistency issues.
itsouvalas requested a review from krutten on January 10, 2025 at 06:28
krutten (Contributor) left a comment:


LGTM
