Race conditions between pods and cni enables tproxy to be bypassed #3060

Open
mr-miles opened this issue Oct 6, 2023 · 11 comments
Labels
type/bug Something isn't working

Comments

@mr-miles
Contributor

mr-miles commented Oct 6, 2023

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

We periodically observe pods in our k8s cluster that are unable to connect to external resources outside the service mesh - they resolve DNS fine and the Envoy config is correct, but the outbound traffic is not routed to the Envoy outbound listener.

In all cases we consistently see that the pods have an annotation
consul.hashicorp.com/transparent-proxy-status=enabled
whereas pods that function normally have
consul.hashicorp.com/transparent-proxy-status=completed

From looking in the logs, it appears this is occurring because the pods are starting up before the cni daemonset has run, so the traffic rules that make the tproxy setup work are never added to the pod. I believe that if the cni has been installed and responds to the "ADD" command to set up the networking, it also updates the annotation to completed - so the fact it stays at enabled indicates the cni was not in place at that point.
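
For reference, this is roughly what we see on an affected pod (excerpt of the pod metadata; on healthy pods the value reads completed once the cni has run):

  metadata:
    annotations:
      consul.hashicorp.com/transparent-proxy-status: enabled   # stuck here; never becomes "completed"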

I also observe that killing the pod so it is rescheduled works consistently, provided the number of pods starting simultaneously is not high enough to trigger a new node.

There are some specifics here from other service mesh implementations that use cni:
https://istio.io/latest/docs/setup/additional-setup/cni/#race-condition--mitigation
linkerd/linkerd2#8120 (race condition 2)

We are using the karpenter autoscaler in the cluster and I believe this exacerbates the issue, as it shuffles pods around to optimise the cluster and frequently starts and drains nodes.

It is problematic because the pods starting without networking are then able to bypass the mesh for inbound and outbound traffic.

Reproduction Steps

As a race condition, it is hard to reproduce. Two routes:

  • destroy nodes and recreate, so that multiple pods and the cni daemonset are started at the same time

  • run a cluster with karpenter and wait ... from experience, it will only take a few cycles of optimisation to do it

Expected behavior

From reading around, ensuring the cni is running before the other pods is quite a complicated issue and not one easily supported by k8s. Other cni-based meshes fall back to blocking pod startup and a controller to reschedule broken pods.

In this case, a minimal "fix" that would be useful would be for the init-container to fail to start if the annotation is not set to completed - at least then we can detect the failing pods and take action. Istio has a controller looking for failed pods so it can "repair" (reschedule) them.
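
For illustration, a minimal sketch of such a check (not the real consul-k8s init container - the container name and timings here are made up), using the downward API to expose the pod's own annotations to an extra init container that retries and then fails:

  # Hypothetical extra init container; fails the pod's startup if the cni never
  # flips consul.hashicorp.com/transparent-proxy-status to "completed".
  volumes:
    - name: podinfo
      downwardAPI:
        items:
          - path: annotations
            fieldRef:
              fieldPath: metadata.annotations
  initContainers:
    - name: wait-for-consul-cni
      image: busybox:1.36
      command:
        - sh
        - -c
        - |
          # the downward API file contains lines like: key="value"
          for i in $(seq 1 30); do
            grep -q 'transparent-proxy-status="completed"' /podinfo/annotations && exit 0
            sleep 2
          done
          echo "consul-cni never completed traffic redirection" >&2
          exit 1
      volumeMounts:
        - name: podinfo
          mountPath: /podinfo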

Environment details

EKS 1.27
Consul 1.16.2
Consul-k8s 1.2.2
AWS-CNI + consul cni

@mr-miles mr-miles added the type/bug Something isn't working label Oct 6, 2023
@mr-miles
Contributor Author

mr-miles commented Oct 7, 2023

I came across a great, detailed write-up of the issue from the Istio maintainers, which also covers how it is exacerbated when nodes start up:

https://docs.google.com/document/d/1SQzrFxtcn3o_79OtJYbSHMPuqZNhR3EaEhbkpBVMXAw/edit#

@mr-miles mr-miles closed this as completed Oct 7, 2023
@mr-miles
Contributor Author

mr-miles commented Oct 9, 2023

Not sure why it closed it...

@mr-miles mr-miles reopened this Oct 9, 2023
@curtbushko
Contributor

Hi @mr-miles!

When I was building consul-cni, startup ordering was definitely a problem. That is why I am trying to set all of the startup priorities that Kubernetes offers. As you know, you can still get out-of-order problems.

In this case, a minimal "fix" that would be useful would be for the init-container to fail to start if the annotation is not set to completed - at least then we can detect the failing pods and take action. Istio has a controller looking for failed pods so it can "repair" (reschedule) them.

Thanks for this idea. I think failing the init container would be a suitable way to work around the problem and force kubernetes to restart the pod. We have all the information and the check could be quite isolated...

I'll talk to @david-yu when he is back from HashiConf.

@mr-miles
Contributor Author

mr-miles commented Oct 13, 2023

Thanks @curtbushko that's great.

I wanted to draw your attention to one detail: I believe a failing init container will cause k8s to restart that container, but not respawn the whole pod. Because the cni runs when the pod sandbox is set up for the containers to run within, failing the init container won't clear the problem completely - it will only prevent broken pods from completing their startup.

Also, most of my issues have surfaced from our use of an autoscaler (karpenter in our case), and I was wondering if I could suggest adding one to your test setup, since it seems very effective at surfacing race conditions.

Also happy to try out a pre-release if that is useful

Thanks for your help on this!

@curtbushko
Contributor

@mr-miles I just thought of something and I am throwing out ideas.

You are using karpenter for scaling your nodes, but what are you using for actual deployments? ArgoCD?

A couple of ideas:

  • I was wondering if you could add a pre-hook to your deployments that makes sure the cni daemonset is running before deploying?
  • I have just started reading about Karpenter, but is it possible to add the daemonset to the node template itself?
  • I see that the Karpenter provisioner has a post start taint:
  # Provisioned nodes will have these taints, but pods do not need to tolerate these taints to be provisioned by this
  # provisioner. These taints are expected to be temporary and some other entity (e.g. a DaemonSet) is responsible for
  # removing the taint after it has finished initializing the node.
  startupTaints:
    - key: example.com/another-taint
      effect: NoSchedule
  • with the above taint, I wonder if you could run another daemonset that makes sure everything is running first? I don't think we would want to add any karpenter-specific checks to our consul-cni...
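
If we went that route, whichever daemonset is responsible for lifting the taint (consul-cni or another one) would also need to tolerate the startup taint, otherwise it could never be scheduled onto the still-tainted node. A rough sketch, reusing the placeholder taint key from the snippet above:

  # Pod template fragment for that daemonset; the key must match whatever taint is configured
  spec:
    tolerations:
      - key: example.com/another-taint
        operator: Exists
        effect: NoSchedule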

@mr-miles
Contributor Author

Thanks @curtbushko - we are investigating startup taints and will report back.

Definitely agree that consul-cni should not need anything karpenter-specific. I think the issue can be recreated in vanilla k8s environments by deleting a node (so that there is a long queue of pods waiting to be scheduled) and then adding a new node, so that the queued pods are all scheduled in parallel as soon as the node comes up.

@mr-miles
Contributor Author

Hi @curtbushko

We had success using a startupTaint to make sure pods did not start before the CNI installation was complete - pods bypassing the cni dropped to zero immediately. However, getting the taint removed turns out to be quite heavyweight (we needed to write our own controller to handle it).

Prior art you might be interested in is:

  • one of my colleagues noticed that istio checks whether the cni is installed in its webhook, which can introduce enough of a delay to prevent the problem
  • Cilium controls for the same issue with a taint that it removes once the cni is installed - https://docs.cilium.io/en/stable/installation/taints/

I'm wondering if the controller could remove the taint, since it has permission and subscribes to the right events? That's on the basis that this issue is generic (to all cnis and environments) rather than karpenter specific, which you might disagree with.

I also noted that there is no label indicating which nodes have the cni installed successfully; adding one might also make orchestration easier.

@curtbushko
Contributor

Hi @mr-miles,

Thanks for the information.

Could you clarify the taints aspect you were using? If I understand correctly, you put a startupTaint on the node via Karpenter, and then had your controller remove it once the cni daemonset was running?

I believe that the consul-cni daemonset would have enough permissions to remove a taint from the node but I am not fond of requiring the Administrator to add a taint to the nodes as an extra step.

The way Istio does it is probably the least heavy-handed, and I believe I already have the beginnings of that with some of my annotations in CNI. consul-cni is already waiting on the init container to set the consul.hashicorp.com/redirect-traffic-config annotation, so I see no reason the init container cannot wait on CNI to set 'waiting'. It would be an easy way to retry, wait and fail the pod if CNI never sets it to 'waiting'. Then you could re-schedule the pod.

@mr-miles
Contributor Author

Hi @curtbushko

To clarify - yes, that is exactly how it works. The karpenter nodeprofile includes a startupTaint that stops pods being scheduled onto the node. The controller notices that the cni install is complete and removes the taint, and then the pods relying on the cni can start.
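
Roughly, it looks like the following (the taint key is just one we chose, nothing standard):

  # Karpenter provisioner fragment with a startup taint of our own choosing
  startupTaints:
    - key: example.com/consul-cni-not-ready
      effect: NoSchedule
  # Our controller watches nodes and, once it sees consul-cni has finished installing
  # on a node, removes the taint - the equivalent of:
  #   kubectl taint node <node> example.com/consul-cni-not-ready:NoSchedule-
  # At that point the mesh pods can be scheduled and start normally.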

I definitely agree that taints seem quite invasive and low-level. But it has been 100% successful which should not be discounted! IMO it would be a plus if the option were available for unfortunate admins in this situation without having to write their own controller ... but it is too heavy to be part of a standard setup that all installations must follow.

Waiting on the annotation will ensure pods don't start up with broken networking, which will be great. But I believe it will still require the admin to create their own controller to reschedule them, since k8s responds by restarting the failed containers within the pod but won't recreate the pod's network sandbox.

From the earlier suggestions (many thanks!)

  • we deploy with helm rather than argo currently
  • karpenter runs separately from the deployment pipeline so we can't "pre-target" nodes like that
  • customising the nodetemplate/base image/startup script looks very involved in EKS, and means we would have to manage a release cycle for the customisations rather than just using the AWS-published artefacts

Interested to hear whether you've found any other examples or workarounds while you've been thinking about it.

@curtbushko
Contributor

@mr-miles

I've been talking to my peers and I think we have a simple way to work around this that does not require taints and a controller.

  • We can use nodeSelector on all injected pods with something like consul.hashicorp.com/cni=ready
  • Once the consul cni daemonset is finished running it could set the label on the node
  • We'd only set this when CNI is enabled in the helm chart.

The downside of using a taint is that it could disrupt all non-mesh services and cause problems if something goes wrong at the CNI install level.
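
A rough sketch of how that could look (the label key and value are the ones proposed above, not something consul-k8s ships today):

  # Node, as labelled by the consul-cni daemonset once its install on that node has finished
  apiVersion: v1
  kind: Node
  metadata:
    name: ip-10-0-0-1.example.internal   # placeholder node name
    labels:
      consul.hashicorp.com/cni: "ready"
  ---
  # Injected pod spec fragment: the injector would merge this nodeSelector with any
  # selector the pod already defines, so mesh pods only land on cni-ready nodes
  spec:
    nodeSelector:
      consul.hashicorp.com/cni: "ready"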

@mr-miles
Contributor Author

Hi @curtbushko

I think this would be a great improvement and will definitely prevent this race condition in most cases - thanks! Presumably that extra nodeSelector gets added to any existing node selector a pod might define?

I'm not totally certain whether it completely replaces the taints for karpenter/auto-scaling, since the docs are quite hazy. They use the labels to work out what sort of node to provision, so if the label is not there to start with, it's not clear whether it just gets stuck by never creating a node. There are other ways to handle this though, not least explicitly requiring a particular node pool.
