EKS nodes getting into NotReady state with VPC CNI Addon v1.15.5-eksbuild.1 #2730
Comments
@slavhate this is difficult to debug without the node logs showing why the nodes are stuck in `NotReady`. As an aside, for your comment on available IPs, we are working to improve this. For example, #2714 should remove the main reason that anyone uses Custom Networking.
Focusing on the support case, since you have already opened it, will be best. The reason for these errors/behavior is usually found in the ipamd logs, which can be gathered as described in our troubleshooting section here: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#collecting-node-level-tech-support-bundle-for-offline-troubleshooting
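For reference, a minimal sketch of collecting that bundle on an affected node, assuming the EKS-optimized AMI layout where the VPC CNI ships its log collector at `/opt/cni/bin/aws-cni-support.sh`:

```bash
# Run on the affected node (via SSH or SSM); writes a support bundle
# (ipamd logs, CNI config, networking state) as a tarball under /var/log.
sudo bash /opt/cni/bin/aws-cni-support.sh
```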
I agree that resolving this issue requires examination of the logs from the EKS nodes for a more in-depth analysis. Despite engaging AWS premium support, the root cause remains inconclusive. Since the problem manifests in the 15.x versions while everything works in the 14.x family, AWS premium support advised me to escalate the matter as a bug report here so that the developers can investigate further. I am willing to provide the logs for scrutiny (I hope they do not contain sensitive information). Alternatively, one can attempt to replicate the issue on a cluster comprising 25 or more nodes, as detailed in the 'how to reproduce' section. Also, can anyone here verify the healthy functionality of VPC CNI Addon `v1.15.5-eksbuild.1`?
Update: We conducted some more tests and determined that the issue does not stem from the 1.15 version. The problem might have arisen from recycling nodes in batches; we were doing 8-9 nodes simultaneously. This approach results in a huge volume of EC2 API calls, especially when that many nodes are provisioning ENIs at the same time.
It's possible that this high volume is triggering rate limits, leading to API request throttling. When we tested the process with a single node recycled at a time, the operation proceeded without any issues. Does anyone have information on the number of API calls generated during the deletion and creation of an EKS node? Additionally, insights into EC2 rate limits would be appreciated.
Thanks for the update. That makes sense.
If we have any readily available data, I will share it; I don't have anything offhand. You can get this information by using CloudWatch metrics: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/monitor.html
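For what it's worth, a hedged sketch of pulling EC2 API call counts from the CloudWatch `AWS/Usage` namespace (an assumption that these usage metrics are available in the account; the time window and the `CreateNetworkInterface` action are only examples of the calls ipamd makes):

```bash
# Sum EC2 API call counts for one action over a window; swap Resource for other
# actions ipamd issues, e.g. AttachNetworkInterface, AssignPrivateIpAddresses.
aws cloudwatch get-metric-statistics \
  --namespace AWS/Usage \
  --metric-name CallCount \
  --dimensions Name=Type,Value=API Name=Service,Value=EC2 \
               Name=Resource,Value=CreateNetworkInterface Name=Class,Value=None \
  --start-time 2024-01-15T00:00:00Z --end-time 2024-01-15T06:00:00Z \
  --period 300 --statistics Sum
```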
What happened:
Current setup
VPC CNI Addon `1.14.1-eksbuild.1` with the following advanced configuration:

```json
{
  "eniConfig": {
    "create": true,
    "region": "us-west-2",
    "subnets": {
      "us-west-2a": { "id": "subnet-xxx", "securityGroups": ["sg-xxx"] },
      "us-west-2b": { "id": "subnet-xxx", "securityGroups": ["sg-xxx"] },
      "us-west-2c": { "id": "subnet-xxx", "securityGroups": ["sg-xxx"] }
    }
  },
  "env": {
    "AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG": "true",
    "ENABLE_POD_ENI": "true",
    "ENABLE_PREFIX_DELEGATION": "true",
    "ENI_CONFIG_LABEL_DEF": "topology.kubernetes.io/zone",
    "MINIMUM_IP_TARGET": "15",
    "WARM_IP_TARGET": "8"
  },
  "init": {
    "env": {
      "DISABLE_TCP_EARLY_DEMUX": "true"
    }
  }
}
```

All nodes were in the `Ready` state. We upgraded the cluster to 1.26 with Addon version updates and recycled all the nodes.
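For context (my addition, not from the original report), this advanced configuration is what gets passed to the managed addon as configuration values; a rough sketch of applying it with the AWS CLI, assuming a hypothetical cluster name `my-cluster` and the JSON above saved as `cni-config.json`:

```bash
# Hypothetical cluster name and file path; --configuration-values carries the
# "advanced configuration" JSON shown above.
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --addon-version v1.15.5-eksbuild.1 \
  --configuration-values file://cni-config.json
```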
Upgraded setup
VPC CNI Addon `v1.15.5-eksbuild.1` with the same advanced configuration: only some of the new nodes reach the `Ready` state, the rest stay `NotReady`.

Following the upgrade, upon removal of the old nodes and the introduction of new nodes by Karpenter, approximately 20 nodes successfully transitioned to the `Ready` state. However, the remaining nodes stayed in the `NotReady` state. It was noted that the `aws-node` pods failed to reach the `Running` state on these `NotReady` nodes.

Working with AWS Premium support, we identified that setting `ENABLE_PREFIX_DELEGATION` to `"false"` resolved the issue, leading to all nodes transitioning to the `Ready` state. Nevertheless, this adjustment comes at the cost of losing the ability to support a maximum of 110 pods on a given node. For instance, a `m6i.large` node can now only allocate 30 IPs to pods, causing the remaining pods to be stuck in a `ContainerCreating` state due to a shortage of available IPs.

We were wondering what changed from 1.14 to 1.15 of the VPC CNI that leads to this behaviour.
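As an illustration of that trade-off (my addition, not from the report), the `max-pods-calculator.sh` script from the `awslabs/amazon-eks-ami` repository can show the recommended max pods for an instance type with and without prefix delegation; a sketch assuming the script has been downloaded and AWS CLI credentials are configured:

```bash
# Compare recommended max pods for m6i.large without and with prefix delegation.
./max-pods-calculator.sh --instance-type m6i.large --cni-version 1.15.5-eksbuild.1
./max-pods-calculator.sh --instance-type m6i.large --cni-version 1.15.5-eksbuild.1 \
  --cni-prefix-delegation-enabled
```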
Attach logs
What you expected to happen:
After upgrading to EKS 1.26 with VPC CNI Addon `v1.15.5-eksbuild.1`, all new nodes on the cluster should be in the `Ready` state.

How to reproduce it (as minimally and precisely as possible):
It can be reproduced by applying VPC CNI Addon `v1.15.5-eksbuild.1` with the custom config given above to an EKS cluster that has more than 25 nodes, then recycling all the nodes to verify the behaviour.

Anything else we need to know?:
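A few standard commands (my addition) that may help observe the behaviour while the nodes are being recycled:

```bash
# Watch node readiness as Karpenter replaces nodes.
kubectl get nodes -w

# Check whether the aws-node (VPC CNI) pods reach Running on the new nodes.
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide

# Inspect conditions/events on a NotReady node (replace the placeholder).
kubectl describe node <node-name>
```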
Environment:
- Kubernetes version (use `kubectl version`): 1.26
- OS (e.g. `cat /etc/os-release`): Amazon Linux 2
- Kernel (e.g. `uname -a`):