CNI ipamd reconciliation of in-use addresses #3109
Comments
Today the CNI reconciles its datastore with existing pods only at startup and never again. IPAMD can go out of sync with the kubelet's view of the pods running on the node if it fails, or is temporarily unreachable by the CNI plugin, while handling the DelNetwork call from the container runtime. In such cases the CNI continues to consider the pod's IP allocated and never frees it, because it will never see another DelNetwork for that pod. As a result, the CNI fails to assign IPs to new pods. This change adds a reconcile loop that periodically (once a minute) reconciles the allocated IPs against the existence of each pod's veth device. If the veth device is not found, the loop frees the corresponding allocation, making the IP available for reuse. Fixes aws#3109
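To make the proposed behavior concrete, here is a minimal sketch of a veth-based reconcile pass. It is not the code from the linked change: the `allocation` and `datastore` types, the field names, and the `reconcile`/`run` helpers are all hypothetical stand-ins, and the only real API used for the host check is Go's `net.InterfaceByName`.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"sync"
	"time"
)

// allocation is a hypothetical record tying an assigned pod IP to the
// host-side veth device created for that pod.
type allocation struct {
	IP       string
	HostVeth string
}

// datastore is a simplified stand-in for ipamd's datastore (assumption).
type datastore struct {
	mu     sync.Mutex
	allocs map[string]allocation // keyed by IP
}

// linkExists reports whether an interface with this name exists on the host.
// Any lookup error is treated as "missing", which a real implementation
// would probably want to distinguish from transient failures.
func linkExists(name string) bool {
	_, err := net.InterfaceByName(name)
	return err == nil
}

// reconcile frees every allocation whose veth is gone and returns the freed IPs.
func (d *datastore) reconcile(exists func(string) bool) []string {
	d.mu.Lock()
	defer d.mu.Unlock()
	var freed []string
	for ip, a := range d.allocs {
		if !exists(a.HostVeth) {
			delete(d.allocs, ip) // deleting during range is safe in Go
			freed = append(freed, ip)
		}
	}
	return freed
}

// run calls reconcile on every tick until ctx is cancelled.
func (d *datastore) run(ctx context.Context, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			d.reconcile(linkExists)
		}
	}
}

func main() {
	d := &datastore{allocs: map[string]allocation{
		"10.0.0.5": {IP: "10.0.0.5", HostVeth: "eniA"},
		"10.0.0.6": {IP: "10.0.0.6", HostVeth: "eniB"},
	}}
	// Fake checker: pretend only eniA still exists on the host.
	freed := d.reconcile(func(name string) bool { return name == "eniA" })
	fmt.Println("freed:", freed) // prints "freed: [10.0.0.6]"
}
```

In a daemon, something like `go d.run(ctx, time.Minute)` would give the once-a-minute cadence described above. Passing the existence check in as a function keeps the pass testable without touching real interfaces.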
Hi @hbhasker, thank you for this report. Could you confirm this by looking at ipamd.log? If you share the logs with k8s-awscni-triage@amazon.com, we can look them over too.
I also confirmed this by looking at the JSON for the datastore. It clearly contained pods that had already terminated on the node. I will see if I run into another occurrence and capture more information.
@orsenthil - We will have to check the plugin logs to see whether the delete request reached the CNI. Since kubelet is the source of truth, I don't think we should add more reconcilers; rather, we should check why the event was missed or not received.
Any update on this?
We just hit this on another node, where IPAMD thinks there are 64 pods running when in fact there are fewer than 50. It assumes all 64 IPs are allocated and refuses to allocate any more, so pod startup fails on the node.
@hbhasker - Could you collect the logs on this node and share them via the email mentioned above? See https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#collecting-node-level-tech-support-bundle-for-offline-troubleshooting
@orsenthil I have opened an AWS support case and shared the relevant information via that channel, as I prefer not to send logs from our system via email. I have asked our AWS partners to coordinate with you so that they can get the relevant information to you.
Thank you, @hbhasker. I will review the logs.
Something we observed today: this might also be a bug in prefix delegation. We noticed that the CNI wasn't attaching new prefixes after the first 4, even when all IPs were in use. I will write some tests to check the logic that determines whether the datastore is low on IPs.
I looked at the initial logs that you provided. I see that the assigned IPs did reach 64 in the logs; I noticed the IP assignment going from 55 to 58, IPs being reallocated, and then the count increasing. In general, I couldn't see any errors with the DelNetwork calls either. Something else might be happening here. I have requested log bundles (from a working and a non-working node) through the support case.
So I was looking through the code and I realized getPrefixesNeeded as written is buggy. If WARM_IP_TARGET is defined, it returns the number of IPs required instead of the number of prefixes. See: amazon-vpc-cni-k8s/pkg/ipamd/ipamd.go Line 2253 in a121a8a
@hbhasker - You can check these lines - amazon-vpc-cni-k8s/pkg/ipamd/ipamd.go Lines 2247 to 2249 in a121a8a
amazon-vpc-cni-k8s/pkg/ipamd/ipamd.go Lines 1860 to 1872 in a121a8a
Ah, thanks for that. Maybe a cleanup would be to stop overloading the function to handle both the prefix and non-prefix cases, and instead have two functions and be explicit about which one is called. It would make the code a lot easier to reason about.
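As a rough illustration of that cleanup idea, the two cases could be split along these lines. This does not mirror the actual getPrefixesNeeded: the function names `ipsShort` and `prefixesShort` are hypothetical, the warm-target arithmetic is simplified, and it assumes each delegated IPv4 prefix is a /28 holding 16 addresses.

```go
package main

import "fmt"

// ipsPerPrefix is the number of addresses in a /28 IPv4 prefix (assumption:
// prefix delegation hands out /28 blocks).
const ipsPerPrefix = 16

// ipsShort returns how many additional IPs are needed so that warmTarget
// IPs stay free, given the assigned and total counts. Hypothetical helper.
func ipsShort(assigned, total, warmTarget int) int {
	short := assigned + warmTarget - total
	if short < 0 {
		return 0
	}
	return short
}

// prefixesShort converts the IP shortfall into whole prefixes, rounding up,
// so callers in prefix mode never receive an IP count by mistake.
func prefixesShort(assigned, total, warmTarget int) int {
	short := ipsShort(assigned, total, warmTarget)
	return (short + ipsPerPrefix - 1) / ipsPerPrefix
}

func main() {
	fmt.Println(ipsShort(60, 64, 5))       // 1 more IP needed
	fmt.Println(prefixesShort(60, 64, 5))  // 1 prefix covers it
	fmt.Println(prefixesShort(64, 64, 20)) // 20 IPs short, so 2 prefixes
}
```

With separate names, the unit of the return value (IPs or prefixes) is visible at every call site.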
So I was looking again at what might have happened and why Prefix Delegation wasn't attaching new IPs. I think the reason is that we set maxPods to 62, and this check causes it to return without attaching new IPs: amazon-vpc-cni-k8s/pkg/ipamd/ipamd.go Line 2161 in a121a8a
So there is no bug in prefix delegation, just an inconsistency between the ipamd datastore and the container runtime's view of the pods running on the node.
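The arithmetic behind that conclusion can be sketched as follows. This assumes the referenced check amounts to "stop attaching once the pool already holds at least maxPods IPs" and that each prefix is a /28 with 16 addresses; `poolCoversMaxPods` and `freeIPs` are hypothetical names, not functions from ipamd.

```go
package main

import "fmt"

// ipsPerPrefix assumes /28 prefixes, i.e. 16 addresses each.
const ipsPerPrefix = 16

// poolCoversMaxPods reports whether the attached prefixes already provide at
// least maxPods addresses, in which case no further prefix would be attached
// (assumed behavior of the guard discussed above).
func poolCoversMaxPods(prefixes, maxPods int) bool {
	return prefixes*ipsPerPrefix >= maxPods
}

// freeIPs returns how many addresses remain unassigned in the pool.
func freeIPs(prefixes, assigned int) int {
	return prefixes*ipsPerPrefix - assigned
}

func main() {
	// 4 prefixes give 64 IPs, which covers maxPods = 62: no fifth prefix.
	fmt.Println(poolCoversMaxPods(4, 62)) // true
	// If stale entries push the assigned count to 64, nothing is left.
	fmt.Println(freeIPs(4, 64)) // 0
	// With only 3 prefixes (48 IPs) the guard would not trigger.
	fmt.Println(poolCoversMaxPods(3, 62)) // false
}
```

Under these assumptions the pool stops growing at 4 prefixes by design, so any stale allocations directly reduce the IPs available to real pods.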
What happened:
We noticed that on some hosts the CNI believed 64 IPs were in use, some of them by pods that had long since terminated. When we checked, the node had only 59 pods (including ones using host networking), but the CNI clearly thought there were 64 pods running and failed to allocate IPs to new pods because all IPs were in use (we set max pods to 64 for the node). We spent some time trying to figure out how that happens; I guess it can happen if the CNI somehow misses the delete event or fails to process it.
I was reading the code to see whether a delete and a create can race, causing the CNI to incorrectly reject the delete and then proceed to add the IP to ipamd as allocated. In that case the IP remains in use even though the pod is gone. (It's possible I am misunderstanding what kubelet/crio do when a pod is terminated and the CNI fails the DelNetwork request with an error.)
Mostly I am looking to understand whether this is a known issue. It looks like the CNI does reconcile its database on restart, but maybe it needs to reconcile periodically to prevent this?
Environment:
Kubernetes version (kubectl version): 1.26
OS (cat /etc/os-release): Amazon Linux 2023
Kernel (uname -a): 6.1