Nodes and dead services remaining in the consul catalog #2065

Closed
Stephani0106 opened this issue Apr 18, 2023 · 9 comments
Labels
type/bug Something isn't working

Comments

@Stephani0106

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

We noticed that the issue began to occur in our dev environment, and later in QAS, after we adopted spot instances in our Kubernetes cluster (AKS). We do not rule out the possibility that the problem has always existed, but we did not notice it as often as we do now.

Because spot instances run on surplus capacity, some of our nodes were occasionally killed when the machine was reclaimed ("stolen") by Azure. Whenever our environment detected these node deaths, it recreated the nodes with new names so that a minimum number of nodes was always available.

Our Consul issue started showing up exactly when these node deaths occurred. When a node is killed, Consul does not deregister it from the catalog even though it no longer exists in the cluster. This leaves "ghost" services in the Consul catalog with a failing status even though they no longer exist.

This results, for example, in dead service instances in the Consul catalog and health check issues.

(screenshot of the dead service instances in the Consul UI omitted)
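
To confirm that the entries are stale, we query the catalog and health APIs from a Consul server pod (a minimal sketch using the standard Consul HTTP API; the jq filters are only for readability):

# List every node Consul still has in its catalog
curl -s http://127.0.0.1:8500/v1/catalog/nodes | jq '.[].Node'

# List all health checks currently in the "critical" state
curl -s http://127.0.0.1:8500/v1/health/state/critical | jq '.[] | {Node, ServiceName, Status}'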

Palliative solution

As a palliative solution to work around the problem, we manually remove dead services from the catalog. To do this, we open a shell in one of the Consul server pods and execute the request below:

curl http://127.0.0.1:8500/v1/catalog/service/<service-name> | jq

After making sure that the node in fact no longer exists in the cluster, we run the command below to deregister all services registered on that node:

curl --request PUT -d '{"Node":"<node-name>"}' http://127.0.0.1:8500/v1/catalog/deregister

This solves the problem temporarily, but every time a node is killed or recreated the problem recurs and we have to perform the steps above again.
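
In principle this cleanup could be scripted. The sketch below is only an illustration of the idea; it assumes that kubectl and the Consul HTTP API are both reachable from the same shell and that Consul node names match the Kubernetes node names, which may not hold in every setup (consul-k8s may register nodes under different or virtual names):

#!/bin/sh
# Deregister every Consul catalog node that no longer exists in Kubernetes.
CONSUL_HTTP_ADDR=${CONSUL_HTTP_ADDR:-http://127.0.0.1:8500}

# Names of the nodes Kubernetes currently knows about
k8s_nodes=$(kubectl get nodes -o name | sed 's|^node/||')

for consul_node in $(curl -s "$CONSUL_HTTP_ADDR/v1/catalog/nodes" | jq -r '.[].Node'); do
  if ! echo "$k8s_nodes" | grep -qx "$consul_node"; then
    echo "Deregistering stale node: $consul_node"
    curl -s --request PUT -d "{\"Node\":\"$consul_node\"}" "$CONSUL_HTTP_ADDR/v1/catalog/deregister"
  fi
done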

Expected behavior

We expected Consul to deregister all nodes and services that stopped responding properly to health checks, thereby removing dead services from its catalog.

Environment details

The consul-k8s versions in our environment that present the issue range from chart v1.0.4 (app version 1.14.4) to chart v1.1.1 (app version 1.15.1): we first identified the error on v1.0.4, and after upgrading to v1.1.1 the error persists.

The Kubernetes versions we used range from 1.24.6 to 1.25.5.

Additional Context

Issue #1817 describes a case very similar to what we see in our environment.

@andrewnazarov

andrewnazarov commented Jun 1, 2023

We have roughly the same situation when nodes are upgraded in GKE. Our use case might be a bit different, because we don't have the K8s sync enabled; registration and deregistration are handled exclusively by our apps. However, we also have a timeout after which a service is forcefully deregistered if it isn't terminated correctly or the app wasn't able to deregister itself during shutdown. What we do see in common with this issue is that after a K8s upgrade we end up with unhealthy nodes, and with unhealthy services that cannot be deregistered by that timeout because the node is red.

I was wondering whether terminationGracePeriodSeconds could be configured for the Consul client pods to mitigate this, if that were the cause, but I don't see such a setting in the chart.

Our chart version is 1.0.4.
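
In case it helps anyone, a rough sketch of what I had in mind, with the caveat that the DaemonSet name (consul-client) and namespace (consul) are assumptions about the deployment, and that newer chart versions no longer run client pods at all:

# Patch the grace period directly on the client DaemonSet instead of through chart values
kubectl -n consul patch daemonset consul-client --type merge -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":120}}}}'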

@FelipeEmerim

FelipeEmerim commented Jun 20, 2023

This is a bigger problem than it seems. If a dead service's IP overlaps with a live service's IP, Envoy will create a passthrough cluster for the dead service and drop every connection. This is common in kubenet setups, where pods use an internal network. Every time a node is removed we see dead services in Consul: in some cases as described in this issue, in other cases as in #2085.

This makes Consul very difficult to work with in spot environments, and it makes every cluster upgrade problematic. @andrewnazarov suggested tweaking terminationGracePeriodSeconds on the client nodes, but consul-k8s dropped client nodes after v1.0.0. Is there any other workaround for this issue?

@david-yu
Contributor

Closing, as the PR that addresses this is now merged: #2571. This should be released in 1.2.x, 1.1.x, and 1.0.x by the mid-August timeframe with our next set of patch releases.

@andrewnazarov

Version 1.1.4 didn't solve our problem. Still experiencing dead nodes and services.

@komapa

komapa commented Nov 3, 2023

1.2.x did not solve it for us either. We have 2000+ virtual nodes even though our cluster only has 28 nodes right now.

@david-yu
Contributor

david-yu commented Nov 3, 2023

Could you test with the latest versions of Consul as well? If this is still an issue, we should perhaps re-open, but we need more information on how to reproduce it. I believe K8s 1.26 and above actually helps solve this issue with graceful node shutdown: https://kubernetes.io/blog/2023/08/16/kubernetes-1-28-non-graceful-node-shutdown-ga/#:~:text=The%20Graceful%20Node%20Shutdown%20feature,a%20non%2Dgraceful%20node%20shutdown.
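
For reference, graceful node shutdown is driven by two kubelet settings; a minimal sketch of the KubeletConfiguration fields is below (the values are placeholders, and managed offerings such as AKS or GKE may configure or expose this differently):

# KubeletConfiguration fragment (for example /var/lib/kubelet/config.yaml)
shutdownGracePeriod: 30s                 # total time the kubelet delays node shutdown
shutdownGracePeriodCriticalPods: 10s     # portion of that time reserved for critical pods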

@MageshSrinivasulu

MageshSrinivasulu commented Mar 1, 2024

Facing the same issue with K8s 1.27 as well, using Consul 1.14.10.

@MageshSrinivasulu

MageshSrinivasulu commented Jul 31, 2024

Still facing the same issue with K8s 1.28 as well, using Consul 1.16.6. This is a never-ending issue. @david-yu please reopen this issue.

Apart from deleting the node that no longer exists, what helps me is scaling the impacted service down to zero and back up again, which removes the duplicate or bad entries.
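
A quick sketch of that workaround (the Deployment name my-service and the replica count are placeholders for the impacted workload):

kubectl scale deployment my-service --replicas=0
# wait for the pods to terminate and the stale catalog entries to clear, then scale back up
kubectl scale deployment my-service --replicas=2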

Below is how I can consistently reproduce the issue (a kubectl sketch of these steps follows the list):

  1. Pod A is running on node A.
  2. Cordon node A.
  3. Let pod A be rescheduled onto node B.
  4. This leaves two entries for the instance in the Consul catalog, one with the old pod A IP and one with the new pod A IP. The health of the new pod A flips between healthy and unhealthy, while the old pod A entry is always unhealthy.
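
A minimal sketch of those steps (node and pod names are placeholders; deleting the pod after the cordon is one way to force the reschedule):

kubectl cordon node-a          # step 2: mark node A unschedulable
kubectl delete pod pod-a       # force pod A to be recreated; it lands on node B
kubectl get pod -o wide        # confirm the new pod is running on node B
# from a Consul server pod: both the old and the new instance now show up
curl -s http://127.0.0.1:8500/v1/catalog/service/<service-name> | jq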

This is Crazy

@akrymets

akrymets commented Nov 27, 2024

Still facing this issue using Consul 1.19.1 with AWS ECS cluster nodes.
I remember it since version 1.4.4, with no changes.
