EKS nodes getting into NotReady state with VPC CNI Addon v1.15.5-eksbuild.1 #2730

Closed

slavhate opened this issue Dec 28, 2023 · 6 comments
@slavhate

What happened:

Current setup

  • EKS 1.25 (us-west-2)
  • Node management via Karpenter (allowed to use ~100 selected instance types)
  • VPC CNI Addon: v1.14.1-eksbuild.1 with advanced configuration (shown below; a verification sketch follows this list)

{ "eniConfig": { "create": true, "region": "us-west-2", "subnets": { "us-west-2a": { "id": "subnet-xxx", "securityGroups": [ "sg-xxx" ] }, "us-west-2b": { "id": "subnet-xxx", "securityGroups": [ "sg-xxx" ] }, "us-west-2c": { "id": "subnet-xxx", "securityGroups": [ "sg-xxx" ] } } }, "env": { "AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG": "true", "ENABLE_POD_ENI": "true", "ENABLE_PREFIX_DELEGATION": "true", "ENI_CONFIG_LABEL_DEF": "topology.kubernetes.io/zone", "MINIMUM_IP_TARGET": "15", "WARM_IP_TARGET": "8" }, "init": { "env": { "DISABLE_TCP_EARLY_DEMUX": "true" } } }

  • EKS cluster nodes: 40+ (All Ready)
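Not part of the original report, but as context for the config above: with AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG set and eniConfig.create enabled, the addon is expected to create one ENIConfig custom resource per availability zone. A minimal verification sketch, assuming the kubernetes Python client and the crd.k8s.amazonaws.com/v1alpha1 API group/version commonly used for ENIConfig (both are assumptions, not taken from the issue):

# Hypothetical check: list the ENIConfig custom resources that custom
# networking relies on. Assumes a working kubeconfig and that the CRD is
# served as crd.k8s.amazonaws.com/v1alpha1 (may differ by CNI version).
from kubernetes import client, config

config.load_kube_config()
crds = client.CustomObjectsApi()

eniconfigs = crds.list_cluster_custom_object(
    group="crd.k8s.amazonaws.com",
    version="v1alpha1",
    plural="eniconfigs",
)

for item in eniconfigs.get("items", []):
    name = item["metadata"]["name"]  # expected to match the AZ name, per ENI_CONFIG_LABEL_DEF
    spec = item.get("spec", {})
    print(name, spec.get("subnet"), spec.get("securityGroups"))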

We upgraded the cluster to 1.26 along with the addon version updates, and recycled all the nodes.

Upgraded setup

  • EKS 1.26
  • VPC CNI Addon: v1.15.5-eksbuild.1
  • EKS cluster nodes: ~20 Ready, the rest NotReady.

Following the upgrade, after the old nodes were removed and Karpenter brought up new ones, approximately 20 nodes transitioned to the Ready state, while the rest stayed NotReady. On the NotReady nodes, the aws-node pods never reached the Running state.

Working with AWS Premium support, we identified that setting "ENABLE_PREFIX_DELEGATION" to "false" resolved the issue, with all nodes transitioning to the Ready state. Nevertheless, this adjustment comes at the cost of losing the ability to support a maximum of 110 pods on a given node. For instance, an m6i.large node can now only allocate around 30 IPs to pods, leaving the remaining pods stuck in the ContainerCreating state due to a shortage of available IPs.
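To make that trade-off concrete, here is a rough back-of-the-envelope estimate (not from the issue; it uses the commonly documented max-pods formula and assumes m6i.large limits of 3 ENIs with 10 IPv4 addresses each, so treat the numbers as approximate):

# Illustrative pod-capacity estimate only; real limits also depend on custom
# networking and the EKS max-pods recommendation for the instance type.
ENIS = 3            # m6i.large: up to 3 ENIs (assumed)
IPS_PER_ENI = 10    # m6i.large: up to 10 IPv4 addresses per ENI (assumed)
PREFIX_SIZE = 16    # a /28 prefix holds 16 addresses

def max_pods(prefix_delegation: bool) -> int:
    # Commonly documented formula: each ENI keeps one address as its primary IP,
    # and 2 is added for host-network pods (kube-proxy, aws-node).
    slots = ENIS * (IPS_PER_ENI - 1)
    ips = slots * PREFIX_SIZE if prefix_delegation else slots
    return ips + 2

print("without prefix delegation:", max_pods(False))        # 29, roughly the 30 mentioned above
print("with prefix delegation:", min(max_pods(True), 110))  # capped at the 110-pod recommendation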

We were wondering what changed between VPC CNI 1.14 and 1.15 that leads to this behaviour.

Attach logs

What you expected to happen:

After upgrading to EKS 1.26 with VPC CNI Addon v1.15.5-eksbuild.1, all new nodes in the cluster should reach the Ready state.

How to reproduce it (as minimally and precisely as possible):

It can be reproduced by applying VPC CNI Addon v1.15.5-eksbuild.1 with the custom config given above to an EKS cluster that has more than 25 nodes, then recycling all the nodes to verify the behaviour.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.26
  • CNI Version: v1.15.5-eksbuild.1
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a):
@slavhate added the bug label Dec 28, 2023
@jdn5126 (Contributor) commented Dec 28, 2023

@slavhate this is difficult to debug without the node logs showing why the aws-node pods did not transition to the Running state. We should focus on that in the support case itself.

As an aside, for your comment on available IPs, we are working to improve this. For example, #2714 should remove the main reason that anyone uses Custom Networking.

@orsenthil (Member)

Focusing on the support case, since you have already opened it, will be best. The reason for these errors/behavior is usually found in the ipamd logs, which can be gathered as described in our troubleshooting section here: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#collecting-node-level-tech-support-bundle-for-offline-troubleshooting
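(Not from the troubleshooting doc itself, but as a quick first look before pulling the full bundle, ipamd also exposes a local introspection endpoint on each node. A minimal sketch to run on the node, assuming the default port 61679 and the /v1/enis and /v1/pods paths, which may differ between CNI versions:)

# Hypothetical quick check of ipamd state, run on the node itself. Assumes
# ipamd's default introspection endpoint on localhost:61679; the port and
# paths may differ between CNI versions.
import json
import urllib.request

for path in ("/v1/enis", "/v1/pods"):
    with urllib.request.urlopen(f"http://localhost:61679{path}", timeout=5) as resp:
        data = json.load(resp)
    print(path, json.dumps(data, indent=2)[:500])  # truncated view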

@slavhate (Author)

I agree that resolving this issue requires examining the logs from the EKS nodes for a more in-depth analysis. Despite engaging AWS premium support, the root cause remains inconclusive. Since the problem manifests with 1.15.x while 1.14.x works, AWS premium support advised me to file it as a bug report here so that the developers can investigate further.

I am willing to provide the logs for scrutiny (I hope they do not contain sensitive information). Alternatively, one can attempt to replicate the issue on a cluster of 25 or more nodes, as detailed in the 'how to reproduce' section. Also, can anyone here confirm that VPC CNI Addon v1.15.5-eksbuild.1 (with the prefix delegation config) works correctly on a cluster with 25+ nodes, particularly after the nodes are recycled following the addon installation or upgrade?

@slavhate (Author)

Update

We conducted some more tests and determined that the issue does not stem from the 1.15 version itself. The problem might have arisen from recycling nodes in batches; we were doing 8-9 nodes simultaneously. This approach results in a huge volume of EC2 API calls, especially when prefix delegation is enabled.
Quoting from the documentation:

The reason to be careful with this setting is that it will increase the number of EC2 API calls that ipamd has to do to attach and detach IPs to the instance. If the number of calls gets too high, they will get throttled and no new ENIs or IPs can be attached to any instance in the cluster.

It's possible that this high volume triggered rate limits, leading to API request throttling. When we tested the process by recycling a single node at a time, the operation proceeded without any issues. Does anyone have information on the number of API calls generated during the deletion and creation of an EKS node? Additionally, insights into EC2 rate limits would be appreciated.
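For reference, a minimal sketch of the one-node-at-a-time approach (hypothetical code, not what we actually ran; it assumes the kubernetes Python client and that the node lifecycle controller, Karpenter in our case, terminates and replaces a node once its Node object is deleted):

# Illustrative one-node-at-a-time recycle loop; verify your node lifecycle
# tooling before using anything like this, and add a proper drain/eviction
# step, which is omitted here for brevity.
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

PAUSE_SECONDS = 300  # spread node churn out to keep the EC2 API call rate low

for node in v1.list_node().items:
    name = node.metadata.name
    v1.patch_node(name, {"spec": {"unschedulable": True}})  # cordon first
    v1.delete_node(name)
    print(f"deleted node {name}, waiting {PAUSE_SECONDS}s before the next one")
    time.sleep(PAUSE_SECONDS)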

@slavhate closed this as not planned Dec 29, 2023

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@orsenthil (Member)

Thanks for the update. That makes sense.

Does anyone have information on the number of API calls generated during the deletion and creation of an EKS node? Additionally, insights into EC2 rate limits would be appreciated.

If we have any readily available data, I will share. I don't have anything offhand.

You can get this information by using CloudWatch metrics: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/monitor.html
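For example, a minimal boto3 sketch that pulls EC2 API call counts from CloudWatch (the AWS/Usage namespace, the CallCount metric, and the dimension values are assumptions based on the standard CloudWatch usage metrics; adjust the Resource dimension to the EC2 APIs you care about):

# Illustrative sketch: sum EC2 API call counts around the node-recycle window.
# Assumes boto3 credentials and the standard AWS/Usage "CallCount" usage
# metrics; dimension values may need adjusting for your account.
import datetime
import boto3

cw = boto3.client("cloudwatch", region_name="us-west-2")
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=3)

for api in ("AttachNetworkInterface", "AssignPrivateIpAddresses", "DescribeNetworkInterfaces"):
    resp = cw.get_metric_statistics(
        Namespace="AWS/Usage",
        MetricName="CallCount",
        Dimensions=[
            {"Name": "Service", "Value": "EC2"},
            {"Name": "Type", "Value": "API"},
            {"Name": "Resource", "Value": api},
            {"Name": "Class", "Value": "None"},
        ],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Sum"],
    )
    total = sum(dp["Sum"] for dp in resp["Datapoints"])
    print(f"{api}: ~{int(total)} calls in the last 3 hours")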
