Nodes and dead services remaining in the consul catalog #2065
Comments
We have much the same situation when nodes are upgraded in GKE. Our use case might be a bit different, because we don't have the K8s catalog sync enabled: registration and deregistration are both handled exclusively by our apps. However, we also have a timeout after which a service is forcefully deregistered if it isn't terminated correctly or the app fails to deregister itself during shutdown. What we see in common with this issue, though, is that after a k8s upgrade we have unhealthy nodes, plus unhealthy services that the timeout cannot deregister because the node is red. Our chart version is 1.0.4.
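For context, the app-side deregistration timeout described above corresponds to Consul's `DeregisterCriticalServiceAfter` check field. A minimal sketch of such a registration follows; the service name, port, TTL, and agent address are hypothetical, and note that (as observed above) this agent-side cleanup cannot fire when the node itself is dead:

```shell
#!/bin/sh
# Sketch: register a service with a TTL check that Consul automatically
# deregisters after it has been critical for 30 minutes.
CONSUL_ADDR="http://127.0.0.1:8500"

# Build the /v1/agent/service/register payload for service "$1" on port "$2".
registration_payload() {
  cat <<EOF
{
  "Name": "$1",
  "Port": $2,
  "Check": {
    "TTL": "15s",
    "DeregisterCriticalServiceAfter": "30m"
  }
}
EOF
}

# registration_payload web 8080 | curl --request PUT --data @- \
#   "$CONSUL_ADDR/v1/agent/service/register"
```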
This is a bigger problem than it seems. If a dead service's IP overlaps with a live service's IP, Envoy will create a passthrough cluster for the dead service and drop every connection. This is common in kubenet setups where pods use an internal network. Every time a node is removed we see dead services in Consul: in some cases as described in this issue, in others as in #2085. This makes Consul very difficult to work with in spot environments and makes every cluster upgrade problematic. @andrewnazarov suggested tweaking
Closing, as the PR that addresses this is now merged: #2571. This should be released in the 1.2.x, 1.1.x, and 1.0.x lines by the mid-August timeframe with our next set of patch releases.
Version 1.1.4 didn't solve our problem; we are still experiencing dead nodes and services.
1.2.x did not solve it for us either. We have 2000+ virtual nodes while our cluster currently has only 28 nodes.
Could you test with the latest versions of Consul as well? If this is still an issue we should perhaps reopen, but we need more information on how to reproduce it. I believe K8s 1.26 and above actually helps solve this with graceful node shutdown: https://kubernetes.io/blog/2023/08/16/kubernetes-1-28-non-graceful-node-shutdown-ga/#:~:text=The%20Graceful%20Node%20Shutdown%20feature,a%20non%2Dgraceful%20node%20shutdown.
Facing the same issue with k8s 1.27 as well, using Consul 1.14.10.
Still facing the same issue with k8s 1.28 as well, using Consul 1.16.6. This is a never-ending issue. @david-yu please reopen this issue. Apart from deleting the node that no longer exists, what helps me is scaling the impacted service down to zero and back up again, which removes the duplicate or bad entries. Below is how I can consistently reproduce the issue:
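The scale-to-zero workaround mentioned above can be sketched as follows; the namespace and deployment name are placeholders, not from the thread, and the commands are printed rather than executed so the sequence can be reviewed first:

```shell
#!/bin/sh
# Sketch of the scale-down/scale-up workaround for a hypothetical
# deployment. Adjust NS and DEPLOY to the impacted service.
NS="default"
DEPLOY="my-impacted-service"

# Print the kubectl scale command for a given replica count.
scale_cmd() {
  printf 'kubectl -n %s scale deployment %s --replicas=%s\n' "$NS" "$DEPLOY" "$1"
}

# Typical sequence: scale to zero, wait for the rollout, scale back up.
scale_cmd 0
# kubectl -n "$NS" rollout status deployment "$DEPLOY"
scale_cmd 3   # restore the original replica count
```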
This is crazy.
Still facing this issue using Consul 1.19.1 with AWS ECS cluster nodes.
Community Note
Overview of the Issue
We noticed that the issue began to occur in our dev environment, and later in QA, after we adopted spot instances in our Kubernetes cluster (AKS). We can't rule out that the problem has always occurred, but we didn't notice it as often as we do now.
Because spot instances are machines with idle capacity, some of our nodes are occasionally killed when the machine is reclaimed by Azure. Whenever our environment detects these node deaths, it recreates the nodes under new names to keep a minimum number of nodes available.
Our issue in Consul started to appear precisely when these node deaths occurred. When a node is killed, Consul does not deregister it from the catalog even though it no longer exists in the cluster. As a result, "ghost" services remain in the Consul catalog with a failing status even though they no longer exist.
This results in dead service instances in the Consul catalog and, consequently, health check issues.
Palliative solution
As a palliative workaround, we manually remove dead services from the catalog. To do this, we open a shell in one of the Consul server pods and execute the request below.
After making sure that the node in fact no longer exists in the cluster, we run the command below to deregister all the services present on that node:
curl --request PUT -d '{"Node":"<node-name>"}' http://127.0.0.1:8500/v1/catalog/deregister
This solves the problem temporarily, but every time a node is killed or recreated the problem recurs and we need to perform the steps above again.
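The manual steps above can be sketched as a small script; the Consul address and example node name are assumptions, and the actual `curl` call is left commented out so nothing is deregistered by accident:

```shell
#!/bin/sh
# Sketch automating the manual cleanup above; assumes the Consul HTTP API
# is reachable at CONSUL_ADDR from inside a server pod.
CONSUL_ADDR="http://127.0.0.1:8500"

# Build the /v1/catalog/deregister body for a node. Omitting ServiceID
# deregisters the node together with every service registered on it.
deregister_payload() {
  printf '{"Node":"%s"}' "$1"
}

# After confirming the node is really gone from the cluster:
# deregister_payload "aks-nodepool1-12345" | curl --request PUT \
#   --data @- "$CONSUL_ADDR/v1/catalog/deregister"
```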
Expected behavior
We expected Consul to deregister all nodes and services that stop responding to health checks, thereby removing dead services from its catalog.
Environment details
The issue appears in our environment from consul-k8s chart v1.0.4 (app version 1.14.4) through chart v1.1.1 (app version 1.15.1): we first identified the error on chart 1.0.4, and it persists after upgrading to 1.1.1.
The Kubernetes versions we used range from 1.24.6 to 1.25.5.
Additional Context
Issue #1817 describes a case very similar to what we see in our environment.