Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CNI ipamd reconcilation of in-use addresses #3109

Open
hbhasker opened this issue Nov 8, 2024 · 14 comments · May be fixed by #3113
Open

CNI ipamd reconcilation of in-use addresses #3109

hbhasker opened this issue Nov 8, 2024 · 14 comments · May be fixed by #3113

Comments

@hbhasker
Copy link

hbhasker commented Nov 8, 2024

What happened:
We noticed that on some hosts the CNI was thinking 64 IPs were in use with pods that had long terminated. When we checked the node had only 59 pods (including ones that were using host networking) but CNI clearly thought there were 64 pods running and failed allocating new IPs to pods because all IPs were in use (we set the max pods to 64 for the node). We spent sometime trying to figure out how that happens but I guess it can happen if somehow the CNI missing the delete event or fails to process it.

I was trying to read the code and see if there is some race where a delete and create can race causing CNI to incorrectly reject the delete and then proceed to add the IP to ipamd as allocated. In which case the IP remains in use even though the pod is gone. (Its possible I am misunderstanding what kubelet /crio do when a pod is terminated and if the CNI fails the DelNetwork request with an error).

Mostly looking to understand if this is a known issue? Looks like CNI does reconcile its database on a restart but maybe it needs to reconcile it periodically to prevent this?

Environment:

  • Kubernetes version (use kubectl version): 1.26
  • CNI Version: 1.18.2
  • OS (e.g: cat /etc/os-release): Amazon Linux 2023
  • Kernel (e.g. uname -a): 6.1
hbhasker pushed a commit to hbhasker/amazon-vpc-cni-k8s that referenced this issue Nov 13, 2024
The CNI today only reconciles its datastore with existing pods at
startup but never again. Sometimes its possible that IPAMD goes
out of sync with the kubelet's view of the pods running on the
node if it fails or is temporarily unreachable by the CNI plugin
handling the DelNetwork call from the contrainer runtime.

In such cases the CNI continues to consider the pods IP allocated
and will not free it as it will never see a DelNetwork again. This
results in CNI failing to assign IP's to new pods.

This change adds a reconcile loop which periodically (once a minute)
reconciles its allocated IPs with existence of pod's veth devices. If
the veth device is not found then it free's up the corresponding
allocation making the IP available for reuse.

Fixes aws#3109
@hbhasker hbhasker linked a pull request Nov 13, 2024 that will close this issue
@orsenthil
Copy link
Member

Hi @hbhasker , thank you for this report. Could you confirm this by looking at the ipamd.log If you share the logs in k8s-awscni-triage@amazon.com, we can look over it too.

@hbhasker
Copy link
Author

I did confirm by looking at the json for the datastore as well. It clearly had pods in there that had already terminated on the node. I will see if i run into another occurence of the same and capture more information.

hbhasker pushed a commit to hbhasker/amazon-vpc-cni-k8s that referenced this issue Nov 13, 2024
The CNI today only reconciles its datastore with existing pods at
startup but never again. Sometimes its possible that IPAMD goes
out of sync with the kubelet's view of the pods running on the
node if it fails or is temporarily unreachable by the CNI plugin
handling the DelNetwork call from the contrainer runtime.

In such cases the CNI continues to consider the pods IP allocated
and will not free it as it will never see a DelNetwork again. This
results in CNI failing to assign IP's to new pods.

This change adds a reconcile loop which periodically (once a minute)
reconciles its allocated IPs with existence of pod's veth devices. If
the veth device is not found then it free's up the corresponding
allocation making the IP available for reuse.

Fixes aws#3109
@jayanthvn
Copy link
Contributor

@orsenthil - We will have to check plugin logs if the delete request landed on CNI. Since kubelet is the source of truth. I don't think we should add more reconcilers rather check why the event was missed or not received..

@hbhasker
Copy link
Author

Any update on this ?

@hbhasker
Copy link
Author

hbhasker commented Dec 2, 2024

We just hit this on another node where IPAMD thinks there are 64 pods running when in fact there are less than 50. It assumes all 64 IPs are allocated and refuses to allocate any IPs failing pod startup on the node.

@orsenthil
Copy link
Member

@hbhasker
Copy link
Author

hbhasker commented Dec 3, 2024

@orsenthil I have opened an AWS support case and shared relevant information via that channel as I prefer not to send logs from our system via email. I have asked our AWS partners to coordinate with you so that they can get the relevant information to you.

@orsenthil
Copy link
Member

Thank you, @hbhasker. I will review the logs.

@hbhasker
Copy link
Author

hbhasker commented Dec 4, 2024

Something we observed today. It might also be a bug in prefix delegation. We noticed that the CNI wasn't attaching new prefixes after 4 of them even when all IP were in use. I will write some tests to check the logic where it checks if data store is low on IPs.

hbhasker pushed a commit to hbhasker/amazon-vpc-cni-k8s that referenced this issue Dec 4, 2024
The CNI today only reconciles its datastore with existing pods at
startup but never again. Sometimes its possible that IPAMD goes
out of sync with the kubelet's view of the pods running on the
node if it fails or is temporarily unreachable by the CNI plugin
handling the DelNetwork call from the contrainer runtime.

In such cases the CNI continues to consider the pods IP allocated
and will not free it as it will never see a DelNetwork again. This
results in CNI failing to assign IP's to new pods.

This change adds a reconcile loop which periodically (once a minute)
reconciles its allocated IPs with existence of pod's veth devices. If
the veth device is not found then it free's up the corresponding
allocation making the IP available for reuse.

Fixes aws#3109
@orsenthil
Copy link
Member

When we checked the node had only 59 pods (including ones that were using host networking) but CNI clearly thought there were 64 pods running and failed allocating new IPs to pods because all IPs were in use (we set the max pods to 64 for the node).

I looked the initial logs that you had updated. I see that assigned IPs did go to 64 in the logs, I noticed the ip assignment going from 55 to 58 and reallocating ips, and then increasing. In general, I couldn't see an error with del network calls too. Something else might be happening here. I have requested to log bundles (on a working and non-working) nodes through the support case.

@hbhasker
Copy link
Author

hbhasker commented Dec 6, 2024

So I was looking through the code and I realized getPrefixesNeeded as written is buggy. If WARM_IP_TARGET is defined it returns # of IPs required instead of # of prefixes.

See:

toAllocate = max(toAllocate, short)
that should return toAllocate/16 + 1. That said this may or may not be the issue, I am still trying to understand the code flow a bit.

@jayanthvn
Copy link
Contributor

@hbhasker - You can check these lines -

// datastoreTargetState already has complex math so adding Prefix target will make it even more complex.
short, _, warmIPTargetsDefined := c.datastoreTargetState(nil)
shortPrefixes, warmPrefixTargetDefined := c.datastorePrefixTargetState()

datastoreTargetState we would compute the needed prefixes -

// Over will have number of IPs more than needed but with PD we would have allocated in chunks of /28
// Say assigned = 1, warm ip target = 16, this will need 2 prefixes. But over will return 15.
// Hence we need to check if 'over' number of IPs are needed to maintain the warm targets
prefixNeededForWarmIP := datastore.DivCeil(stats.AssignedIPs+c.warmIPTarget, numIPsPerPrefix)
prefixNeededForMinIP := datastore.DivCeil(c.minimumIPTarget, numIPsPerPrefix)
// over will be number of prefixes over than needed but could be spread across used prefixes,
// say, after couple of pod churns, 3 prefixes are allocated with 1 IP each assigned and warm ip target is 15
// (J : is this needed? since we have to walk thru the loop of prefixes)
freePrefixes := c.dataStore.GetFreePrefixes()
overPrefix := max(min(freePrefixes, stats.TotalPrefixes-prefixNeededForWarmIP), 0)
overPrefix = max(min(overPrefix, stats.TotalPrefixes-prefixNeededForMinIP), 0)
return shortPrefix, overPrefix, true

@hbhasker
Copy link
Author

hbhasker commented Dec 6, 2024

Ah thanks for that. Maybe a cleanup would be to not overload the function to handle both prefix and non prefix and make two functions and be explicit about which one is called. It would make the code a lot more easier to reason about.

@hbhasker
Copy link
Author

hbhasker commented Dec 6, 2024

So I was looking again at what might have happened and why Prefix Delegation wasn't attaching new IPs and I think the reason is we set maxPods at 62 and this check causes it to return and not attach new IPs

if stats.TotalIPs >= c.maxPods {

So there is no bug in Prefix delegation but just a bug in inconsistency between ipamd datastore and the container runtime's view of pods running on the node.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants