Race conditions between pods and cni enable tproxy to be bypassed #3060
Comments
I came across a great, detailed write-up of the issue from the Istio maintainers, which also mentions how it is exacerbated when nodes start up: https://docs.google.com/document/d/1SQzrFxtcn3o_79OtJYbSHMPuqZNhR3EaEhbkpBVMXAw/edit#
Not sure why it closed it...
Hi @mr-miles! When I was building consul-cni, startup ordering was definitely a problem. That is why I am trying to set all of the startup priorities that kubernetes offers. As you know, you can still get out-of-order problems.
Thanks for this idea. I think failing the init container would be a suitable way to work around the problem and force kubernetes to restart the pod. We have all the information, and the check could be quite isolated... I'll talk to @david-yu when he is back from HashiConf.
Thanks @curtbushko, that's great. I wanted to draw your attention to one detail: I believe failing the init container will cause k8s to restart that container, but not respawn the whole pod. Because the cni runs when the pod sandbox is set up for the containers to run within, failing the init container won't clear the problem completely; it will only prevent broken pods from completing their startup.
Also, most of my issues have surfaced from our use of an autoscaler (karpenter in our case), and I was wondering if I could suggest adding that to your test setup, since it seems very effective at surfacing race conditions. I'm also happy to try out a pre-release if that is useful.
Thanks for your help on this!
@mr-miles I just thought of something and I am throwing out ideas. You are using karpenter for scaling your nodes, but what are you using for actual deployments? ArgoCD? A couple of ideas:
Thanks @curtbushko - we are investigating startup taints and will report back. Definitely agree that consul-cni should not need anything karpenter-specific. I think the issues can be recreated in vanilla k8s environments by deleting a node (so that there is a long queue of pods waiting to be scheduled) and then adding a new node, so that the queued pods are all scheduled in parallel as soon as the node comes up.
Hi @curtbushko. We had success in using the startupTaint to make sure pods did not start before the CNI installation was completed - pods bypassing cni dropped to zero immediately. However, getting the taint removed turns out to be quite heavyweight (we needed to write our own controller to handle it). Prior art you might be interested in:
I'm wondering if the controller could remove the taint, since it has permission and subscribes to the right events? That's on the basis that this issue is generic (to all cnis and environments) rather than karpenter-specific, which you might disagree with. I also noted that there is no way to tell from labels which nodes have the cni installed successfully; having that would also help make orchestration easier.
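For illustration, here is a minimal sketch of the kind of startupTaint configuration being described, assuming Karpenter's older v1alpha5 Provisioner API (newer NodePool versions nest the field under spec.template.spec); the taint key is a made-up placeholder that the taint-removing controller would need to match:

```yaml
# Sketch only: the taint key is an arbitrary example, not a real consul-k8s or
# Karpenter convention. New nodes come up with this taint, so regular pods are
# not scheduled until a controller removes it after confirming the CNI plugin
# is installed on the node.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  startupTaints:
    - key: example.com/consul-cni-not-ready
      effect: NoSchedule
```

The cni daemonset itself must tolerate the taint (CNI daemonsets are typically given a broad toleration), so the controller only has to watch for the cni pod becoming ready on each node and then strip the taint.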
Hi @mr-miles, thanks for the information. Could you clarify the taints aspect you were using? If I am understanding correctly, you put a startupTaint on the node as part of Karpenter, and then had your controller remove it once the cni daemonset was running?
I believe the consul-cni daemonset would have enough permissions to remove a taint from the node, but I am not fond of requiring the administrator to add a taint to the nodes as an extra step. The way Istio does it is probably the least heavy-handed, and I believe I already have the beginnings of that with some of my annotations in CNI. consul-cni is already waiting on the init container to set the consul.hashicorp.com/redirect-traffic-config annotation, so I see no reason that the init container cannot wait on CNI to set 'waiting'. It would be an easy way to retry, wait, and fail the pod if CNI never sets it to 'waiting'. Then you could re-schedule the pod.
Hi @curtbushko. To clarify - yes, that is exactly how it works. The karpenter nodeprofile includes a startupTaint that stops pods being scheduled onto the node. The controller notices that the cni install is complete and removes the taint, and then the pods relying on cni can start.
I definitely agree that taints seem quite invasive and low-level, but the approach has been 100% successful, which should not be discounted! IMO it would be a plus if the option were available for unfortunate admins in this situation without them having to write their own controller... but it is too heavy to be part of a standard setup that all installations must follow.
Waiting on the annotation will give safety that pods don't start up with broken networking, which will be great. But I believe it will still require the admin to create their own controller to reschedule, since k8s will respond by restarting the failed containers within the pod but won't recreate the pod network sandbox.
Beyond the earlier suggestions (many thanks!), I'd be interested to hear whether you've found any other examples or workarounds while you've been thinking about it.
I've been talking to my peers and I think we have a simple way to work around this that does not require taints and a controller.
The downside of using a taint is that it could disrupt all non-mesh services and cause problems if something goes wrong at the CNI install level.
Hi @curtbushko, I think this would be a great improvement and will definitely prevent this race condition in most cases - thanks! Presumably that extra nodeSelector gets added to any existing node selector a pod might define?
I'm not totally certain whether it completely replaces the taints for karpenter/auto-scaling, since the docs are quite hazy. Karpenter uses the labels to work out what sort of node to provision, so if the label is not there to start with, it's not clear whether it just gets stuck by never creating a node. There are other ways to handle that though, not least explicitly requiring a particular nodepool.
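To make the proposal concrete (the label key below is a hypothetical placeholder, not something consul-k8s actually sets): the cni daemonset would label the node once the plugin is installed, and the injector would merge a matching nodeSelector into every mesh pod, alongside whatever nodeSelector the pod already defines:

```yaml
# Hypothetical sketch: "example.com/consul-cni-ready" is a placeholder label.
# The cni daemonset (or a small controller) would add it to a node once the
# plugin binary and config are in place; the connect injector would then add
# the matching nodeSelector shown here to injected pods.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
  annotations:
    consul.hashicorp.com/connect-inject: "true"
spec:
  nodeSelector:
    example.com/consul-cni-ready: "true"   # injected selector
    workload-type: general                 # pre-existing selector, preserved
  containers:
    - name: app
      image: nginx:1.25
```

Whether an autoscaler will provision a node for a pending pod that selects on a label no existing node carries yet is the open question raised above; with Karpenter that generally means the label has to be declared on the provisioner/nodepool so it knows it can satisfy it.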
Overview of the Issue
We observe pods in our k8s cluster periodically unable to connect to external resources outside the service mesh. They are able to resolve DNS fine and the Envoy config is correct; however, the outbound traffic is not routed to the Envoy outbound listener.
In all cases we consistently see that the pods have an annotation
consul.hashicorp.com/transparent-proxy-status=enabled
whereas pods that function normally have
consul.hashicorp.com/transparent-proxy-status=completed
From looking in the logs, it appears that this is occurring because the pods are starting up before the cni daemonset has run, and so the traffic rules that make the tproxy setup work are not being added to the pod. I believe that if the cni has been installed and responds to the "ADD" command to set up the networking, then it will also update the annotation to completed - so the fact that it stays at enabled indicates the cni is not in place at that point.
I also observe that killing the pod so it is rescheduled works consistently, provided the number of pods starting simultaneously is not high enough to trigger a new node.
There are some specifics here from other service mesh implementations that use cni:
https://istio.io/latest/docs/setup/additional-setup/cni/#race-condition--mitigation
linkerd/linkerd2#8120 (race condition 2)
We are using the karpenter autoscaler in the cluster, and I believe this exacerbates the issue: with the intent of optimising the cluster, it shuffles pods around at the same time and often starts and drains nodes.
It is problematic because the pods starting without networking are then able to bypass the mesh for inbound and outbound traffic.
Reproduction Steps
Being a race condition, it is hard to reproduce deterministically. Two routes:
destroy nodes and recreate them, so that multiple pods and the cni daemonset are started at the same time
run a cluster with karpenter and wait; from experience, it will only take a few cycles of optimisation to trigger it
Expected behavior
From reading around, ensuring the cni is running before the other pods start is quite a complicated problem and not one easily supported by k8s. Other cni-based meshes fall back to blocking pod startup plus a controller to reschedule broken pods.
In this case, a minimal "fix" that would be useful would be for the init container to fail if the annotation is not set to completed - at least then we can detect the failing pods and take action. Istio has a controller that looks for failed pods so it can "repair" (reschedule) them.
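As an illustration of that kind of gate (not the actual consul-k8s implementation - the container name, image, timeout, and exact annotation value are assumptions), the pod's own annotations could be exposed through a downward API volume, with an init container polling the file and failing if the CNI plugin never flips the status:

```yaml
# Hypothetical sketch: fail an init container if consul-cni never updates
# consul.hashicorp.com/transparent-proxy-status. Downward API volume files are
# refreshed by the kubelet when annotations change, so polling the file works.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  volumes:
    - name: podinfo
      downwardAPI:
        items:
          - path: annotations
            fieldRef:
              fieldPath: metadata.annotations
  initContainers:
    - name: wait-for-cni-redirect        # illustrative name
      image: busybox:1.36
      volumeMounts:
        - name: podinfo
          mountPath: /etc/podinfo
      command:
        - sh
        - -c
        - |
          # Give the CNI plugin ~2 minutes to mark traffic redirection done,
          # then exit non-zero so the failure is visible and actionable.
          for i in $(seq 1 60); do
            grep -q 'transparent-proxy-status="completed"' /etc/podinfo/annotations && exit 0
            sleep 2
          done
          echo "consul-cni never confirmed traffic redirection" >&2
          exit 1
  containers:
    - name: app
      image: nginx:1.25
```

As discussed earlier in the thread, failing the init container only stops a broken pod from serving traffic; something still has to delete and reschedule the pod so the sandbox is recreated and the CNI ADD runs again.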
Environment details
EKS 1.27
Consul 1.16.2
Consul-k8s 1.2.2
AWS-CNI + consul cni