EKS nodes getting into NotReady state with VPC CNI Addon v1.15.5-eksbuild.1 #2730

Closed

slavhate opened this issue Dec 28, 2023 · 6 comments
@slavhate

What happened:

Current setup

  • EKS 1.25 (us-west-2)
  • Node management via Karpenter (allowed to use ~100 selected instance types)
  • VPC CNI Addon: v1.14.1-eksbuild.1 with advanced configuration (shown below; a verification sketch follows this list)

{ "eniConfig": { "create": true, "region": "us-west-2", "subnets": { "us-west-2a": { "id": "subnet-xxx", "securityGroups": [ "sg-xxx" ] }, "us-west-2b": { "id": "subnet-xxx", "securityGroups": [ "sg-xxx" ] }, "us-west-2c": { "id": "subnet-xxx", "securityGroups": [ "sg-xxx" ] } } }, "env": { "AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG": "true", "ENABLE_POD_ENI": "true", "ENABLE_PREFIX_DELEGATION": "true", "ENI_CONFIG_LABEL_DEF": "topology.kubernetes.io/zone", "MINIMUM_IP_TARGET": "15", "WARM_IP_TARGET": "8" }, "init": { "env": { "DISABLE_TCP_EARLY_DEMUX": "true" } } }

  • EKS cluster nodes: 40+ (All Ready)
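Not part of the original report, but as context for the config above: with AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG set and eniConfig.create enabled, the addon is expected to create one ENIConfig custom resource per availability zone. A minimal verification sketch, assuming the kubernetes Python client and the crd.k8s.amazonaws.com/v1alpha1 API group/version commonly used for ENIConfig (both are assumptions, not taken from the issue):

# Hypothetical check: list the ENIConfig custom resources that custom
# networking relies on. Assumes a working kubeconfig and that the CRD is
# served as crd.k8s.amazonaws.com/v1alpha1 (may differ by CNI version).
from kubernetes import client, config

config.load_kube_config()
crds = client.CustomObjectsApi()

eniconfigs = crds.list_cluster_custom_object(
    group="crd.k8s.amazonaws.com",
    version="v1alpha1",
    plural="eniconfigs",
)

for item in eniconfigs.get("items", []):
    name = item["metadata"]["name"]  # expected to match the AZ name, per ENI_CONFIG_LABEL_DEF
    spec = item.get("spec", {})
    print(name, spec.get("subnet"), spec.get("securityGroups"))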

We upgraded the cluster to 1.26 along with the addon version updates, and recycled all the nodes.

Upgraded setup

  • EKS 1.26
  • VPC CNI Addon: v1.15.5-eksbuild.1
  • EKS cluster nodes: ~20 Ready, the rest NotReady.

Following the upgrade, after the old nodes were removed and Karpenter brought up new ones, approximately 20 nodes transitioned to the Ready state, while the rest stayed NotReady. On the NotReady nodes, the aws-node pods never reached the Running state.

Working with AWS Premium support, we identified that setting "ENABLE_PREFIX_DELEGATION" to "false" resolved the issue, with all nodes transitioning to the Ready state. Nevertheless, this adjustment comes at the cost of losing the ability to support a maximum of 110 pods on a given node. For instance, an m6i.large node can now only allocate around 30 IPs to pods, leaving the remaining pods stuck in the ContainerCreating state due to a shortage of available IPs.
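To make that trade-off concrete, here is a rough back-of-the-envelope estimate (not from the issue; it uses the commonly documented max-pods formula and assumes m6i.large limits of 3 ENIs with 10 IPv4 addresses each, so treat the numbers as approximate):

# Illustrative pod-capacity estimate only; real limits also depend on custom
# networking and the EKS max-pods recommendation for the instance type.
ENIS = 3            # m6i.large: up to 3 ENIs (assumed)
IPS_PER_ENI = 10    # m6i.large: up to 10 IPv4 addresses per ENI (assumed)
PREFIX_SIZE = 16    # a /28 prefix holds 16 addresses

def max_pods(prefix_delegation: bool) -> int:
    # Commonly documented formula: each ENI keeps one address as its primary IP,
    # and 2 is added for host-network pods (kube-proxy, aws-node).
    slots = ENIS * (IPS_PER_ENI - 1)
    ips = slots * PREFIX_SIZE if prefix_delegation else slots
    return ips + 2

print("without prefix delegation:", max_pods(False))        # 29, roughly the 30 mentioned above
print("with prefix delegation:", min(max_pods(True), 110))  # capped at the 110-pod recommendation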

We were wondering what changed between VPC CNI 1.14 and 1.15 that leads to this behaviour.

Attach logs

What you expected to happen:

After upgrading to EKS 1.26 with VPC CNI Addon v1.15.5-eksbuild.1, all new nodes in the cluster should reach the Ready state.

How to reproduce it (as minimally and precisely as possible):

It can be reproduced by applying VPC CNI Addon v1.15.5-eksbuild.1 with the custom config given above to an EKS cluster that has more than 25 nodes, then recycling all the nodes to verify the behaviour.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.26
  • CNI Version: v1.15.5-eksbuild.1
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a):
@slavhate added the bug label Dec 28, 2023
@jdn5126 (Contributor) commented Dec 28, 2023

@slavhate this is difficult to debug without the node logs showing why the aws-node pods did not transition to the Running state. We should focus on that in the support case itself.

As an aside, for your comment on available IPs, we are working to improve this. For example, #2714 should remove the main reason that anyone uses Custom Networking.

@orsenthil (Member)

Focusing on the support case, since you have already opened it, will be best. The reason for these errors/behavior is usually found in the ipamd logs, which can be gathered as described in our troubleshooting section here: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#collecting-node-level-tech-support-bundle-for-offline-troubleshooting
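(Not from the troubleshooting doc itself, but as a quick first look before pulling the full bundle, ipamd also exposes a local introspection endpoint on each node. A minimal sketch to run on the node, assuming the default port 61679 and the /v1/enis and /v1/pods paths, which may differ between CNI versions:)

# Hypothetical quick check of ipamd state, run on the node itself. Assumes
# ipamd's default introspection endpoint on localhost:61679; the port and
# paths may differ between CNI versions.
import json
import urllib.request

for path in ("/v1/enis", "/v1/pods"):
    with urllib.request.urlopen(f"http://localhost:61679{path}", timeout=5) as resp:
        data = json.load(resp)
    print(path, json.dumps(data, indent=2)[:500])  # truncated view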

@slavhate (Author)

I agree that resolving this issue requires examining the logs from the EKS nodes for a more in-depth analysis. Despite engaging AWS premium support, the root cause remains inconclusive. Since the problem manifests with 1.15.x while 1.14.x works, AWS premium support advised me to file it as a bug report here so that the developers can investigate further.

I am willing to provide the logs for scrutiny (I hope they do not contain sensitive information). Alternatively, one can attempt to replicate the issue on a cluster of 25 or more nodes, as detailed in the 'how to reproduce' section. Also, can anyone here confirm that VPC CNI Addon v1.15.5-eksbuild.1 (with the prefix delegation config) works correctly on a cluster with 25+ nodes, particularly after the nodes are recycled following the addon installation or upgrade?

@slavhate (Author)

Update

We conducted some more tests and determined that the issue does not stem from the 1.15 version itself. The problem might have arisen from recycling nodes in batches; we were doing 8-9 nodes simultaneously. This approach results in a huge volume of EC2 API calls, especially when prefix delegation is enabled.
Quoting from the documentation:

The reason to be careful with this setting is that it will increase the number of EC2 API calls that ipamd has to do to attach and detach IPs to the instance. If the number of calls gets too high, they will get throttled and no new ENIs or IPs can be attached to any instance in the cluster.

It's possible that this high volume triggered rate limits, leading to API request throttling. When we tested the process by recycling a single node at a time, the operation proceeded without any issues. Does anyone have information on the number of API calls generated during the deletion and creation of an EKS node? Additionally, insights into EC2 rate limits would be appreciated.
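For reference, a minimal sketch of the one-node-at-a-time approach (hypothetical code, not what we actually ran; it assumes the kubernetes Python client and that the node lifecycle controller, Karpenter in our case, terminates and replaces a node once its Node object is deleted):

# Illustrative one-node-at-a-time recycle loop; verify your node lifecycle
# tooling before using anything like this, and add a proper drain/eviction
# step, which is omitted here for brevity.
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

PAUSE_SECONDS = 300  # spread node churn out to keep the EC2 API call rate low

for node in v1.list_node().items:
    name = node.metadata.name
    v1.patch_node(name, {"spec": {"unschedulable": True}})  # cordon first
    v1.delete_node(name)
    print(f"deleted node {name}, waiting {PAUSE_SECONDS}s before the next one")
    time.sleep(PAUSE_SECONDS)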

@slavhate closed this as not planned Dec 29, 2023

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@orsenthil (Member)

Thanks for the update. That makes sense.

Does anyone have information on the number of API calls generated during the deletion and creation of an EKS node? Additionally, insights into EC2 rate limits would be appreciated.

If we have any readily available data, I will share. I don't have anything offhand.

You can get this information by using CloudWatch metrics: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/monitor.html
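For example, a minimal boto3 sketch that pulls EC2 API call counts from CloudWatch (the AWS/Usage namespace, the CallCount metric, and the dimension values are assumptions based on the standard CloudWatch usage metrics; adjust the Resource dimension to the EC2 APIs you care about):

# Illustrative sketch: sum EC2 API call counts around the node-recycle window.
# Assumes boto3 credentials and the standard AWS/Usage "CallCount" usage
# metrics; dimension values may need adjusting for your account.
import datetime
import boto3

cw = boto3.client("cloudwatch", region_name="us-west-2")
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=3)

for api in ("AttachNetworkInterface", "AssignPrivateIpAddresses", "DescribeNetworkInterfaces"):
    resp = cw.get_metric_statistics(
        Namespace="AWS/Usage",
        MetricName="CallCount",
        Dimensions=[
            {"Name": "Service", "Value": "EC2"},
            {"Name": "Type", "Value": "API"},
            {"Name": "Resource", "Value": api},
            {"Name": "Class", "Value": "None"},
        ],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Sum"],
    )
    total = sum(dp["Sum"] for dp in resp["Datapoints"])
    print(f"{api}: ~{int(total)} calls in the last 3 hours")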
