Pod stuck in ContainerCreating after upgrading cluster to 1.29 #2980

Open
zendesk-yumingdeng opened this issue Jul 5, 2024 · 11 comments

@zendesk-yumingdeng

What happened:

We are experiencing something similar to #2970, after upgrading our in-house clusters to 1.29.
After a new node is brought up (note: this does not happen on every node), some pods scheduled to the node get stuck in the ContainerCreating status with the event message below:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "4e174229a28e7e3df61ece1a4320cc6581304664ea39186ab52281a283113a3a": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
  • No error messages can be found in the aws-cni pod logs on the node
  • Not many details can be found in /var/log/aws-routed-eni/plugin.log
  • Found the errors below in /var/log/aws-routed-eni/ipamd.log:
{"level":"error","ts":"2024-07-05T04:28:59.406Z","caller":"eventrecorder/eventrecorder.go:67","msg":"Cached client failed GET pod (aws-cni-9w9vm)"}
{"level":"error","ts":"2024-07-05T04:28:59.406Z","caller":"aws-k8s-agent/main.go:63","msg":"Failed to find host aws-node pod: Pod \"aws-cni-9w9vm\" not found"}
{"level":"error","ts":"2024-07-05T04:31:02.334Z","caller":"datastore/data_store.go:652","msg":"DataStore has no available IP/Prefix addresses"}
{"level":"warn","ts":"2024-07-05T04:31:02.352Z","caller":"ipamd/rpc_handler.go:230","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/16b5c95f6cb3ab32266a00048d184aff67a36d5ab730a4b3af3296b92ddff514/unknown"}
{"level":"warn","ts":"2024-07-05T04:34:37.660Z","caller":"ipamd/rpc_handler.go:230","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/12808abe62be1050ba4da91b52e7339d4926e93ee9dc02989dc8653415610a5d/unknown"}
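Since ipamd.log is one JSON object per line, the exhaustion entries can be picked out mechanically. The following is a minimal sketch, not part of the CNI tooling: the function name `find_ip_exhaustion` is hypothetical, and it assumes every record carries the `ts` and `msg` fields seen in the lines above.

```python
import json


def find_ip_exhaustion(lines):
    """Return (timestamp, message) pairs for records reporting an empty IP pool."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log record
        msg = entry.get("msg", "")
        # Matches both "DataStore has no available IP/Prefix addresses" and
        # "AssignPodIPv4Address: no available IP/Prefix addresses".
        if "no available IP/Prefix addresses" in msg:
            hits.append((entry.get("ts"), msg))
    return hits


# Two of the log lines quoted above, used as sample input.
sample = [
    '{"level":"error","ts":"2024-07-05T04:28:59.406Z","caller":"eventrecorder/eventrecorder.go:67","msg":"Cached client failed GET pod (aws-cni-9w9vm)"}',
    '{"level":"error","ts":"2024-07-05T04:31:02.334Z","caller":"datastore/data_store.go:652","msg":"DataStore has no available IP/Prefix addresses"}',
]
hits = find_ip_exhaustion(sample)
for ts, msg in hits:
    print(ts, msg)
```

On a node, the same function could be fed `open("/var/log/aws-routed-eni/ipamd.log")` instead of the sample list.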

Environment:

  • Kubernetes version (use kubectl version): v1.29.6
  • CNI Version: v1.14.1
  • OS (e.g: cat /etc/os-release): Ubuntu 22.04.4 LTS
  • Kernel (e.g. uname -a): Linux 6.5.0-1022-aws #22~22.04.1-Ubuntu SMP Fri Jun 14 19:23:09 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
@yash97
Contributor

yash97 commented Jul 10, 2024

Hey @zendesk-yumingdeng,

I noticed the log message: {"level":"error","ts":"2024-07-05T04:31:02.334Z","caller":"datastore/data_store.go:652","msg":"DataStore has no available IP/Prefix addresses"}

It looks like all IPs are exhausted. Could you let me know how many pods are running on the new node that was brought up? Also, what kind of node is it in terms of capacity? TIA!

@pdallegrave

Hi,

We are facing the same issue. In a specific environment we have 3 nodes running c6g/c7g medium instances. VPC CNI is used with Security Groups per pod.
Kubernetes version is 1.30 and VPC CNI is v1.18.2-eksbuild.1.

If we have 22 pods running, the 23rd cannot start with the same error:

Warning FailedCreatePodSandBox 4s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "83654e4c84178f6bafd9a9150c93eeb49de21c4bc03fd675d88a31c82b982b26": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container

Logs from ipam:

{"level":"error","ts":"2024-07-25T21:17:01.523Z","caller":"datastore/data_store.go:607","msg":"DataStore has no available IP/Prefix addresses"}
{"level":"info","ts":"2024-07-25T21:17:01.523Z","caller":"rpc/rpc.pb.go:863","msg":"Send AddNetworkReply: IPv4Addr: , IPv6Addr: , DeviceNumber: -1, err: AssignPodIPv4Address: no available IP/Prefix addresses"}
{"level":"info","ts":"2024-07-25T21:17:01.544Z","caller":"rpc/rpc.pb.go:881","msg":"Received DelNetwork for Sandbox eb144583b3a9c07a29e909a932fcf09deb0eb9caaaeb7e5df081551f79cc0433"}
{"level":"debug","ts":"2024-07-25T21:17:01.544Z","caller":"rpc/rpc.pb.go:881","msg":"DelNetworkRequest: K8S_POD_NAME:\"coredns-58488c5db-c4ssp\" K8S_POD_NAMESPACE:\"kube-system\" K8S_POD_INFRA_CONTAINER_ID:\"eb144583b3a9c07a29e909a932fcf09deb0eb9caaaeb7e5df081551f79cc0433\" Reason:\"PodDeleted\" ContainerID:\"eb144583b3a9c07a29e909a932fcf09deb0eb9caaaeb7e5df081551f79cc0433\" IfName:\"eth0\" NetworkName:\"aws-cni\""}
{"level":"debug","ts":"2024-07-25T21:17:01.544Z","caller":"ipamd/rpc_handler.go:261","msg":"UnassignPodIPAddress: IP address pool stats: total 3, assigned 3, sandbox aws-cni/eb144583b3a9c07a29e909a932fcf09deb0eb9caaaeb7e5df081551f79cc0433/eth0"}
{"level":"debug","ts":"2024-07-25T21:17:01.544Z","caller":"ipamd/rpc_handler.go:261","msg":"UnassignPodIPAddress: Failed to find IPAM entry under full key, trying CRI-migrated version"}
{"level":"warn","ts":"2024-07-25T21:17:01.544Z","caller":"ipamd/rpc_handler.go:261","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/eb144583b3a9c07a29e909a932fcf09deb0eb9caaaeb7e5df081551f79cc0433/unknown"}
{"level":"info","ts":"2024-07-25T21:17:01.544Z","caller":"rpc/rpc.pb.go:881","msg":"Send DelNetworkReply: IPv4Addr: , IPv6Addr: , DeviceNumber: 0, err: datastore: unknown pod"}
{"level":"debug","ts":"2024-07-25T21:17:01.868Z","caller":"ipamd/ipamd.go:673","msg":"IP pool is too low: available (0) < ENI target (1) * addrsPerENI (3)"}

According to this document, each node of this instance type should accommodate 8 pods. However, another page uses the formula min((N * (M - 1)), meaning that only 6 IPs could be used (although I believe it should be (N*M)-1, since 1 IP is allocated to the node, and not 1 IP per ENI).
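The figures above can be reproduced with a short calculation. This is a sketch under stated assumptions: N = 2 ENIs and M = 4 IPv4 addresses per ENI are inferred from the "addrsPerENI (3)" log line and the 6 and 8 quoted above, not read from the instance-type tables, and the function names are illustrative. As I understand the EKS max-pods documentation, the 8 comes from adding 2 to N * (M - 1) for pods that use host networking, and each ENI reserves its own primary address, which is why the term is M - 1 per ENI rather than one address per node.

```python
def pod_ip_slots(enis: int, ips_per_eni: int) -> int:
    """Secondary IPs usable by pods: each ENI keeps its primary IP for itself."""
    return enis * (ips_per_eni - 1)


def eks_max_pods(enis: int, ips_per_eni: int) -> int:
    """Pod IP slots plus 2 (the allowance for host-network pods)."""
    return pod_ip_slots(enis, ips_per_eni) + 2


# Assumed values, inferred from "addrsPerENI (3)" and the 6 / 8 figures above.
N, M = 2, 4

print(pod_ip_slots(N, M))   # N * (M - 1)      -> 6
print(eks_max_pods(N, M))   # N * (M - 1) + 2  -> 8
print(N * M - 1)            # the (N*M)-1 alternative -> 7
```

The log line "IP address pool stats: total 3, assigned 3" shows only one ENI's worth of addresses in the pool at that moment, possibly because Security Groups per pod dedicates one ENI as a trunk interface; that would put the usable pool below both formulas.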

This issue started when we migrated the cluster from K8s version 1.28 to 1.30 and bumped the CNI version from 1.14.1 to 1.18.2.

@frankh

frankh commented Jul 31, 2024

we are having the same issue

pods get stuck because all the network interfaces already hold their maximum number of IPs. The CNI plugin should not allow pods to be scheduled on nodes that don't have any IP capacity (k8s 1.30, CNI 1.18.2)

@emcay

emcay commented Aug 2, 2024

Same issue on 1.29 and 1.18.2. Would be nice if the plugin would not try to schedule where there is no or very little IP capacity.

@philipg

philipg commented Aug 6, 2024

seeing the same thing. anyone know what causes it or has a fix?

@frankh

frankh commented Aug 8, 2024

i fixed it by setting ENABLE_PREFIX_DELEGATION=true in the cni addon config (this means it allocates blocks of IPs instead of an individual IP per pod)
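To see why prefix delegation relieves the pressure, here is a rough capacity comparison. It is a sketch, assuming each ENI slot that would otherwise hold one secondary IP instead holds a /28 IPv4 prefix, and assuming an instance with 2 ENIs and 4 addresses per ENI (values taken from the small instances discussed earlier in the thread, not from this comment). The function names are illustrative only.

```python
import ipaddress

PREFIX_LEN = 28  # assumed size of each delegated IPv4 prefix


def secondary_ip_capacity(enis: int, slots_per_eni: int) -> int:
    """Pod IPs when each non-primary slot holds a single address."""
    return enis * (slots_per_eni - 1)


def prefix_mode_capacity(enis: int, slots_per_eni: int) -> int:
    """Pod IPs when each non-primary slot holds a whole prefix."""
    # The base address is arbitrary; only the prefix length matters here.
    per_prefix = ipaddress.ip_network(f"10.0.0.0/{PREFIX_LEN}").num_addresses
    return secondary_ip_capacity(enis, slots_per_eni) * per_prefix


# Assumed instance shape: 2 ENIs, 4 address slots per ENI.
print(secondary_ip_capacity(2, 4))  # 6
print(prefix_mode_capacity(2, 4))   # 96
```

The raw address count is an upper bound; the node's max-pods setting and free space in the subnet still cap how many pods actually run.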

@edblake00

Same issue on 1.29 and 1.18.3.

@orsenthil
Member

The original issue here was

{"level":"error","ts":"2024-07-05T04:31:02.334Z","caller":"datastore/data_store.go:652","msg":"DataStore has no available IP/Prefix addresses"}

It means that not enough IPs were available on the node.

  • ENABLE_PREFIX_DELEGATION=true is a way to resolve this.

For others who are experiencing this, did it occur after an upgrade?
If there is a pattern for reproducing this, could you collect the CNI logs as described in this doc - https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html#troubleshoot-cni and send them to k8s-awscni-triage@amazon.com

@orsenthil
Member

Between VPC CNI 1.14.x and later versions, there have been changes to reduce the number of EC2 API calls (#2640) that sometimes inadvertently interfered with the previous behavior.

Using proper values for WARM_IP_TARGET and MINIMUM_IP_TARGET as per this doc - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/eni-and-ip-target.md can help avoid IP exhaustion and the container creation failures caused by IP unavailability.

@kylos101

> Same issue on 1.29 and 1.18.2. Would be nice if the plugin would not try to schedule where there is no or very little IP capacity.

@emcay I think aws/containers-roadmap#2189 is related to what you are describing.

@bciaraldi

We had this issue come back after upgrading recently. We're now on k8s 1.31, cni 1.19.0 (previously k8s 1.30, cni 1.18.2). The only non-default configuration we run is

        - name: ENABLE_POD_ENI
          value: "true"
        - name: WARM_ENI_TARGET
          value: "0"

> If there is a pattern for reproducing this, could you collect the CNI logs as mentioned in this doc - docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html#troubleshoot-cni and send it to k8s-awscni-triage@amazon.com

We're keeping an eye out and will do so.
