Include healthcheck logic for helper scripts running as sidecars #1842
base: alpha
Conversation
FWIW, here are my preview network pool and cncli containers showing healthy once the script was copied in and the healthcheck interval was reached:
Looks good!
- cp the script into cncli sync, validate, leaderlog, pt-send-slots, pt-send-tip containers.
- Execute the script with docker exec.
- Monitor the containers until the healthcheck interval occurs and confirm that they are marked healthy.
Adjusted RETRIES
Adjusted CPU_THRESHOLD
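The copy, execute, and monitor steps from this comment can be sketched as a small helper. This is only an illustration: `copy_and_check` is a hypothetical name, the container name is supplied by the caller, and the target path assumes the script lives at `/home/guild/.scripts/healthcheck.sh` inside the container.

```bash
#!/usr/bin/env bash
# Hypothetical helper: copies healthcheck.sh into a container, runs it once,
# and prints the health status Docker currently reports for that container.
# Assumption: the script belongs in /home/guild/.scripts/ inside the container.
copy_and_check() {
  container="$1"
  script="${2:-./healthcheck.sh}"
  docker cp "$script" "$container:/home/guild/.scripts/healthcheck.sh" || return 1
  # Run the script by hand and report its exit code (0 means healthy).
  docker exec "$container" /home/guild/.scripts/healthcheck.sh
  echo "healthcheck.sh exit code: $?"
  # Status as tracked by Docker; it only updates after the healthcheck interval.
  docker inspect --format '{{.State.Health.Status}}' "$container"
}
```

Calling `copy_and_check <container-name>` right after the copy may still print `starting` or the previous status, since Docker only refreshes it when its own healthcheck interval fires.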
Further testing... I was able to test with higher cpu load after deleting the cncli db and re-syncing. Result
Line 44 of healthcheck.sh: This seems to fix it...
With the above change, when cpu load is higher than CPU_THRESHOLD, this is the result:
Yeah, there are rare instances where the cncli CPU percentage can be high, but this tends to be when resyncing the entire db and/or a cncli init is running. Occasionally, if there is an issue with the node process itself, like if it gets stuck in chainsync/blockfetch and never completes, I have also seen cncli get a high percentage, but otherwise it's quite rare to see it increase. I figured with mithril-signer or db-sync, it might be more useful.
@adamsthws Feel free to submit suggestions to adjust the Thanks for the testing.
Testing revealed that setting RETRIES=0 results in script exit 1 without running the loop... it would be preferable to run the loop once when RETRIES=0. Suggestion - Modify the loop condition to handle RETRIES=0 by changing line 39 to the following:
Or...
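A minimal sketch of a loop that still performs one check when RETRIES=0, assuming attempts are counted from 0 up to and including RETRIES. The names `check_once`, `run_with_retries`, `CHECK_RESULT`, and `DELAY` are hypothetical and stand in for whatever healthcheck.sh actually does.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the real check; succeeds when CHECK_RESULT is "ok".
check_once() {
  [ "${CHECK_RESULT:-ok}" = "ok" ]
}

# Runs check_once up to RETRIES + 1 times, so RETRIES=0 still performs one check.
# DELAY is the pause between attempts in seconds (assumed default of 3).
run_with_retries() {
  retries="${RETRIES:-20}"
  delay="${DELAY:-3}"
  attempt=0
  while [ "$attempt" -le "$retries" ]; do
    if check_once; then
      return 0
    fi
    attempt=$((attempt + 1))
    # Only sleep when another attempt will follow.
    if [ "$attempt" -le "$retries" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

With this shape, RETRIES counts the extra attempts after the first one, so the script never exits non-zero without having checked at least once.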
I started thinking about a cncli specific check. The following function is an idea to check cncli status...
It could perhaps be improved further by also checking if sync is incrementing, so the healthcheck doesn't fail during initial sync. How would you feel about adding me as a commit co-author if you decide to use this?
@adamsthws I'm happy to make you a co-author even for something simple, for example if you know how to submit a suggestion go ahead and apply one for
In regards to the larger block for cncli checks, first it is clear lots of thought went into it. This portion:
Sleeps of 180 exceed the current timeout period of 100. Options:
With container settings of 3 retries and a 5 minute interval with a 100 second timeout, it is 15 minutes from the last healthy response, or 10 minutes from the first unhealthy response, before the container exhausts its retries and is marked unhealthy. I think this covers the two 180 second sleeps, even if the operator reduces the interval and timeouts when not running the node. Separately, conversations outside of this PR and thread have pointed to some of the logic used in KOIOS for db-sync, and suggested that it could also be used for checking the sqlite DB for cncli.
I haven't examined what the common drift might be for a db-sync instance from the last block produced, and for cncli I suspect we could make it shorter than 1 hour. These are just my thoughts. If you think that I overlooked some aspect, please don't hesitate to continue the discussion.
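The 15 and 10 minute figures follow from the interval and retry count alone. A small sketch of that arithmetic, under the simplifying assumption that each check returns immediately so only the interval contributes (`seconds_until_unhealthy` is a hypothetical helper name):

```bash
#!/usr/bin/env bash
# Values taken from the discussion: 3 retries and a 5 minute (300 s) interval.
# Simplification (assumption): check run time is ignored.
seconds_until_unhealthy() {
  interval="$1"   # seconds between healthchecks
  retries="$2"    # consecutive failures needed to mark the container unhealthy
  echo $((interval * retries))
}

from_last_healthy=$(seconds_until_unhealthy 300 3)
# The first failure already happens one interval after the last healthy response.
from_first_failure=$((from_last_healthy - 300))
echo "from last healthy response:    $((from_last_healthy / 60)) minutes"
echo "from first unhealthy response: $((from_first_failure / 60)) minutes"
```

Checks that run for a long time before failing, up to the 100 second timeout, would stretch these numbers somewhat.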
Thank you for the feedback. It seems the limitation of using 'cncli status' in the healthcheck is that it can't be used to determine if sync is progressing. Rather, it only indicates whether it is synced or not synced. The long timeout allows time for sync to complete, but I see that it has introduced observability shortcomings. Instead, checking for cncli sync progress via the sqlite db would seem to make more sense.
Description
Enhances the healthcheck.sh script to work for checking permissions on sidecar containers (helper scripts) via the ENTRYPOINT_PROCESS.
Where should the reviewer start?
/home/guild/.scripts/.
Testing different CPU Usage values
80%) at a value you want to mark a container unhealthy when it is exceeded.
Testing different amounts of retries (internal to healthcheck.sh script).
20) at a number of retries you want to perform if the CPU usage is above the CPU_THRESHOLD value before exiting non-zero.
Currently there is a 3 second delay between checks, so 20 retries results in up to 60 seconds before the healthcheck exits as unhealthy due to CPU load.
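The threshold comparison itself can be sketched as follows. How healthcheck.sh actually samples CPU usage is not shown here, so `cpu_above_threshold` is a hypothetical helper that simply takes the usage as an argument; awk is used because shell arithmetic only handles integers.

```bash
#!/usr/bin/env bash
# Returns success when the CPU usage (possibly fractional, e.g. 85.3)
# is strictly above the threshold. The threshold defaults to 80 (percent).
cpu_above_threshold() {
  usage="$1"
  threshold="${2:-80}"
  # awk exits 0 when usage > threshold, 1 otherwise.
  awk -v u="$usage" -v t="$threshold" 'BEGIN { exit !(u > t) }'
}

if cpu_above_threshold 85.3 80; then
  echo "CPU usage is above CPU_THRESHOLD, retry after the delay"
fi
```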
Testing different healthcheck values (external to healthcheck.sh script).
The current HEALTHCHECK of the container image is:
Reducing the start period and intervals to something more appropriate for the sidecar script will result in a much shorter period to determine the sidecar container's health.
Make sure to keep RETRIES * 3 (the RETRIES environment variable times the 3 second delay) below the container healthcheck timeout, so that the container is not marked unhealthy before the script has returned during periods of high CPU load.
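A quick way to sanity-check a chosen combination before deploying it, as a sketch: `retries_fit_timeout` is a hypothetical helper, and the 3 second delay and 100 second timeout are the values mentioned in this PR.

```bash
#!/usr/bin/env bash
# Returns success when RETRIES * delay is strictly below the container
# healthcheck timeout (all values in seconds).
retries_fit_timeout() {
  retries="$1"
  timeout="$2"
  delay="${3:-3}"
  [ $((retries * delay)) -lt "$timeout" ]
}

# 20 retries * 3 s = 60 s, which fits inside a 100 s timeout.
if retries_fit_timeout 20 100; then
  echo "RETRIES=20 fits within the healthcheck timeout"
fi
```

By the same arithmetic, RETRIES=40 would need 120 seconds and would not fit within a 100 second timeout.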
Motivation and context
Issue #1841
Which issue it fixes?
Closes #1841
How has this been tested?
docker cp the script into preview network cncli sync, validate and leaderlog containers and waiting until the interval runs the script
docker exec to confirm it reports healthy
Additional Details
There is a SLEEPING_SCRIPTS array which is used for validate and leaderlog to still check for the cncli binary, but not consider a sleep period for validate and leaderlog to be unhealthy. Not 100% sure this is the best approach, but with sleep periods being variable I felt it was likely an acceptable middle ground.
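A minimal sketch of how such an array lookup might look. The array contents and the helper name `is_sleeping_script` are assumptions based on the description above, not the code in healthcheck.sh.

```bash
#!/usr/bin/env bash
# Assumed contents: validate and leaderlog sleep between runs.
SLEEPING_SCRIPTS=("validate" "leaderlog")

# Returns success when the given script name is listed in SLEEPING_SCRIPTS.
is_sleeping_script() {
  local name="$1" entry
  for entry in "${SLEEPING_SCRIPTS[@]}"; do
    [ "$entry" = "$name" ] && return 0
  done
  return 1
}

if is_sleeping_script "leaderlog"; then
  echo "leaderlog may be sleeping: only check that the cncli binary is present"
fi
```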
Please do not hesitate to suggest an alternative approach to handling sleeping sidecars healthchecks if you think you have an improvement.
@adamsthws if you could please copy this into your sidecar containers (and your pool) and report back any results. I am marking this as a draft PR for the time being until testing is completed, after which if things look good I will mark it for review and get feedback from others.
Thanks