Nodes and dead services remaining in the consul catalog #2065

Closed
Stephani0106 opened this issue Apr 18, 2023 · 9 comments
Labels
type/bug Something isn't working

Comments

@Stephani0106

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

We noticed that the issue began to occur in our dev environment, and later in QAS, after we adopted spot instances in our Kubernetes cluster (AKS). We do not rule out the possibility that the problem has always existed, but we did not notice it as often as we do now.

Because spot instances run on surplus capacity, some of our nodes were occasionally killed when the machine was reclaimed ("stolen") by Azure. Whenever our environment detected these node deaths, it recreated the nodes with new names so that a minimum number of nodes was always available.

Our Consul issue started showing up exactly when these node deaths occurred. When a node is killed, Consul does not deregister it from the catalog even though it no longer exists in the cluster. This leaves "ghost" services in the Consul catalog with a failing status even though they no longer exist.

This results, for example, in dead service instances in the Consul catalog and health check issues.

(screenshot of the dead service instances in the Consul UI omitted)
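
To confirm that the entries are stale, we query the catalog and health APIs from a Consul server pod (a minimal sketch using the standard Consul HTTP API; the jq filters are only for readability):

# List every node Consul still has in its catalog
curl -s http://127.0.0.1:8500/v1/catalog/nodes | jq '.[].Node'

# List all health checks currently in the "critical" state
curl -s http://127.0.0.1:8500/v1/health/state/critical | jq '.[] | {Node, ServiceName, Status}'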

Palliative solution

As a palliative solution to work around the problem, we manually remove dead services from the catalog. To do this, we open a shell in one of the Consul server pods and execute the request below:

curl http://127.0.0.1:8500/v1/catalog/service/<service-name> | jq

After making sure that the node in fact no longer exists in the cluster, we run the command below to deregister all services registered on that node:

curl --request PUT -d '{"Node":"<node-name>"}' http://127.0.0.1:8500/v1/catalog/deregister

This solves the problem temporarily, but every time a node is killed or recreated the problem recurs and we have to perform the steps above again.
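
In principle this cleanup could be scripted. The sketch below is only an illustration of the idea; it assumes that kubectl and the Consul HTTP API are both reachable from the same shell and that Consul node names match the Kubernetes node names, which may not hold in every setup (consul-k8s may register nodes under different or virtual names):

#!/bin/sh
# Deregister every Consul catalog node that no longer exists in Kubernetes.
CONSUL_HTTP_ADDR=${CONSUL_HTTP_ADDR:-http://127.0.0.1:8500}

# Names of the nodes Kubernetes currently knows about
k8s_nodes=$(kubectl get nodes -o name | sed 's|^node/||')

for consul_node in $(curl -s "$CONSUL_HTTP_ADDR/v1/catalog/nodes" | jq -r '.[].Node'); do
  if ! echo "$k8s_nodes" | grep -qx "$consul_node"; then
    echo "Deregistering stale node: $consul_node"
    curl -s --request PUT -d "{\"Node\":\"$consul_node\"}" "$CONSUL_HTTP_ADDR/v1/catalog/deregister"
  fi
done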

Expected behavior

We expected Consul to deregister all nodes and services that stopped responding properly to health checks, thereby removing dead services from its catalog.

Environment details

The consul-k8s versions in our environment that present the issue range from chart v1.0.4 (app version 1.14.4) to chart v1.1.1 (app version 1.15.1): we first identified the error on v1.0.4, and after upgrading to v1.1.1 the error persists.

The Kubernetes versions we used range from 1.24.6 to 1.25.5.

Additional Context

Issue #1817 describes a case very similar to what we see in our environment.

@andrewnazarov

andrewnazarov commented Jun 1, 2023

We have roughly the same situation when nodes are upgraded in GKE. Our use case might be a bit different, because we don't have the K8s sync enabled; registration and deregistration are handled exclusively by our apps. However, we also have a timeout after which a service is forcefully deregistered if it isn't terminated correctly or the app wasn't able to deregister itself during shutdown. What we do see in common with this issue is that after a K8s upgrade we end up with unhealthy nodes, and with unhealthy services that cannot be deregistered by that timeout because the node is red.

I was wondering whether terminationGracePeriodSeconds could be configured for the Consul client pods to mitigate this, if that were the cause, but I don't see such a setting in the chart.

Our chart version is 1.0.4.
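
In case it helps anyone, a rough sketch of what I had in mind, with the caveat that the DaemonSet name (consul-client) and namespace (consul) are assumptions about the deployment, and that newer chart versions no longer run client pods at all:

# Patch the grace period directly on the client DaemonSet instead of through chart values
kubectl -n consul patch daemonset consul-client --type merge -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":120}}}}'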

@FelipeEmerim

FelipeEmerim commented Jun 20, 2023

This is a bigger problem than it seems. If a dead service's IP overlaps with a live service's IP, Envoy will create a passthrough cluster for the dead service and drop every connection. This is common in kubenet setups, where pods use an internal network. Every time a node is removed we see dead services in Consul: in some cases as described in this issue, in other cases as in #2085.

This makes Consul very difficult to work with in spot environments, and it makes every cluster upgrade problematic. @andrewnazarov suggested tweaking terminationGracePeriodSeconds on the client nodes, but consul-k8s dropped client nodes after v1.0.0. Is there any other workaround for this issue?

@david-yu
Contributor

Closing, as the PR that addresses this is now merged: #2571. This should be released in 1.2.x, 1.1.x, and 1.0.x by the mid-August timeframe with our next set of patch releases.

@andrewnazarov

Version 1.1.4 didn't solve our problem. Still experiencing dead nodes and services.

@komapa

komapa commented Nov 3, 2023

1.2.x did not solve it for us either. We have 2000+ virtual nodes even though our cluster only has 28 nodes right now.

@david-yu
Contributor

david-yu commented Nov 3, 2023

Could you test with the latest versions of Consul as well? If this is still an issue, we should perhaps re-open, but we need more information on how to reproduce it. I believe K8s 1.26 and above actually helps solve this issue with graceful node shutdown: https://kubernetes.io/blog/2023/08/16/kubernetes-1-28-non-graceful-node-shutdown-ga/#:~:text=The%20Graceful%20Node%20Shutdown%20feature,a%20non%2Dgraceful%20node%20shutdown.
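
For reference, graceful node shutdown is driven by two kubelet settings; a minimal sketch of the KubeletConfiguration fields is below (the values are placeholders, and managed offerings such as AKS or GKE may configure or expose this differently):

# KubeletConfiguration fragment (for example /var/lib/kubelet/config.yaml)
shutdownGracePeriod: 30s                 # total time the kubelet delays node shutdown
shutdownGracePeriodCriticalPods: 10s     # portion of that time reserved for critical pods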

@MageshSrinivasulu

MageshSrinivasulu commented Mar 1, 2024

Facing the same issue with K8s 1.27 as well, using Consul 1.14.10.

@MageshSrinivasulu

MageshSrinivasulu commented Jul 31, 2024

Still facing the same issue with K8s 1.28 as well, using Consul 1.16.6. This is a never-ending issue. @david-yu please reopen this issue.

Apart from deleting the node that no longer exists, what helps me is scaling the impacted service down to zero and back up again, which removes the duplicate or bad entries.
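
A quick sketch of that workaround (the Deployment name my-service and the replica count are placeholders for the impacted workload):

kubectl scale deployment my-service --replicas=0
# wait for the pods to terminate and the stale catalog entries to clear, then scale back up
kubectl scale deployment my-service --replicas=2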

Below is how I can consistently reproduce the issue (a kubectl sketch of these steps follows the list):

  1. Pod A is running on node A.
  2. Cordon node A.
  3. Let pod A be rescheduled onto node B.
  4. This leaves two entries for the instance in the Consul catalog, one with the old pod A IP and one with the new pod A IP. The health of the new pod A flips between healthy and unhealthy, while the old pod A entry is always unhealthy.
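
A minimal sketch of those steps (node and pod names are placeholders; deleting the pod after the cordon is one way to force the reschedule):

kubectl cordon node-a          # step 2: mark node A unschedulable
kubectl delete pod pod-a       # force pod A to be recreated; it lands on node B
kubectl get pod -o wide        # confirm the new pod is running on node B
# from a Consul server pod: both the old and the new instance now show up
curl -s http://127.0.0.1:8500/v1/catalog/service/<service-name> | jq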

This is Crazy

@akrymets

akrymets commented Nov 27, 2024

Still facing this issue using Consul 1.19.1 with AWS ECS cluster nodes.
I remember it since version 1.4.4, with no changes.
