Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataStore has no available IP/Prefix addresses #2669

Closed
doryer opened this issue Nov 16, 2023 · 18 comments
Closed

DataStore has no available IP/Prefix addresses #2669

doryer opened this issue Nov 16, 2023 · 18 comments

Comments

@doryer
Copy link

doryer commented Nov 16, 2023

What happened:
There are error logs that we see also in the old version and new versions of the CNI, indicating that the sandbox DataStore has no available IP/Prefix addresses. From looking on the nodes, seems like those are pretty empty nodes ( 10 Pods for example ) which shouldn't have issue on assigning and also we see it pretty wide across all aws-node pods.

We’re not sure, even after reading the source code, if there’s an impact on the CNI performance, nor if it causes it issues with allocating IPv4 addresses to pods.

Attach logs
ERROR logs:
{"level":"error","ts":"2023-11-16T10:05:03.920Z","caller":"datastore/data_store.go:682","msg":"DataStore has no available IP/Prefix addresses"}
WARN logs that being output near the time of the ERROR logs:
{"level":"warn","ts":"2023-11-16T10:06:18.306Z","caller":"ipamd/rpc_handler.go:230","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/83faffebdfb826372e05d2bc67b6a3e8c40dd4b07dd034fafad9f3838aafa985/unknown"}

Which also after reading the relvant code we're not sure why we see the migration_from_cri and unknown fields.

What you expected to happen:

aws-node pods should be assigning IPv4 addresses properly. without errors.

How to reproduce it (as minimally and precisely as possible):

We're not really sure as well

Environment:

  • Kubernetes version (use kubectl version): v1.25.15
  • CNI Version: v1.15.1
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04.6 LTS
  • Kernel (e.g. uname -a): 5.15.0-1037-aws
@jdn5126
Copy link
Contributor

jdn5126 commented Nov 16, 2023

@doryer those logs are not out of the ordinary if the node is scaling up and attaching new ENIs. Are you able to schedule pods on this node? Does the subnet that this node was deployed in have available IPv4 addresses?

Essentially, the IPAM daemon attaches ENIs to the node as demand and configuration (minimum/warm targets) requires. If no IPs are available then either:

  1. The instance has reached its ENI limit (no more ENIs can be attached) and all IPs assigned to the node are in use.
  2. The subnet is out of available IPv4 addresses, so no new ENIs can be attached or IPs assigned.
  3. There is a configuration error preventing ENIs from being attached.

@doryer
Copy link
Author

doryer commented Nov 17, 2023

@doryer those logs are not out of the ordinary if the node is scaling up and attaching new ENIs. Are you able to schedule pods on this node? Does the subnet that this node was deployed in have available IPv4 addresses?

Essentially, the IPAM daemon attaches ENIs to the node as demand and configuration (minimum/warm targets) requires. If no IPs are available then either:

  1. The instance has reached its ENI limit (no more ENIs can be attached) and all IPs assigned to the node are in use.
  2. The subnet is out of available IPv4 addresses, so no new ENIs can be attached or IPs assigned.
  3. There is a configuration error preventing ENIs from being attached.

Hey @jdn5126 thanks for the quick reply.
So i choose aws-node pod which outputs those errors, this is node with 10 Pods running at him ( including DS pods which is hostNetwork ), the node is from type r5.4xlarge and have 1 ENI attached to the node. also i see since the node joined the cluster he is logging those errors, and pods are still getting scheduled on the node, So:

  1. the instance type have ENI limit of 8, we're at 1 attached.
  2. Subnet have more than 10K available IP addresses, our cluster running in 1 subnet in 1 AZ.
  3. Can you suggest which configuration? i'm pretty lost how to point this error to configuration
    Thanks !

@jdn5126
Copy link
Contributor

jdn5126 commented Nov 17, 2023

Can you collect the node logs from this node and email them to k8s-awscni-triage@amazon.com? I can dig into this further.

To collect the node logs, you can run /opt/cni/bin/aws-cni-support.sh with admin permissions on the node and then email the generated tarball.

@doryer
Copy link
Author

doryer commented Nov 21, 2023

k8s-awscni-triage@amazon.com

No problem, sent it to you

@jdn5126
Copy link
Contributor

jdn5126 commented Nov 21, 2023

@doryer the IP pool cannot be increase on each node because IMDS calls are getting 404s:

2023-11-21T11:00:08.555293408Z stdout F {"level":"warn","ts":"2023-11-21T11:00:08.555Z","caller":"retry/retry.go:70","msg":"Failed to increase pool, error trying to discover attached ENIs on attempt 1/12: get attached ENIs: failed to retrieve ENI metadata for ENI: 0a:f2:4d:c6:fc:4f: EC2MetadataError: failed to make EC2Metadata request\n<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n\t\t \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n <head>\n  <title>404 - Not Found</title>\n </head>\n <body>\n  <h1>404 - Not Found</h1>\n </body>\n</html>\n\n\tstatus code: 404, request id:  "}

Can you confirm that nodes are able to curl IMDS? You can run this from a node (ref: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html):

TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"` \
&& curl -H "X-aws-ec2-metadata-token: $TOKEN" -v http://169.254.169.254/latest/meta-data/

I am not sure why nodes would not be able to reach IMDS unless you are explicitly blocking it with some network policy. I also see certificate errors in the node logs trying to curl the API server.

As a side note, what do you have AWS_VPC_K8S_PLUGIN_LOG_FILE set to? I see that AWS_VPC_K8S_CNI_LOG_FILE is set to stderr, which I would not recommend as then you lose historical logs. The lifetime of the logs is the same as the lifetime of the pod.

@doryer
Copy link
Author

doryer commented Nov 22, 2023

Failed to increase pool

Hey, so about IMDS, i've runned the command on the nodes and it does work, otherwise we would have lots of issues so I'm not sure why he suffers from this issue, also we see it on some specific nodes when the rest of the nodes in the cluster are running fine.

Regarding IMDS, we see issues on other nodes in which we get wrong results from IMDS on the machine, we get network interfaces which do not exist anymore in AWS.

About log files, AWS_VPC_K8S_CNI_LOG_FILE is set to stdout and AWS_VPC_K8S_PLUGIN_LOG_FILE is set stderr because it fits to our shippers.

@jdn5126
Copy link
Contributor

jdn5126 commented Nov 22, 2023

@doryer I have seen IMDS out-of-sync issues before, but the VPC CNI periodically syncs with IMDS, so they should resolve within a short amount of time. If the issue persists, that points to an instance-level issue, and we will need help from the EC2 team to investigate. Can you file a support ticket from your AWS console? I think that's the best next step to get resolution

@doryer
Copy link
Author

doryer commented Nov 29, 2023

@doryer I have seen IMDS out-of-sync issues before, but the VPC CNI periodically syncs with IMDS, so they should resolve within a short amount of time. If the issue persists, that points to an instance-level issue, and we will need help from the EC2 team to investigate. Can you file a support ticket from your AWS console? I think that's the best next step to get resolution

So we had a few nodes where the IMDs were not synced (I don't know how is it possible as well ). we terminated them.
But we keep getting logged with those errors, I've sshed to the machine and ran the IMDs metadata command and got a valid response with the rights ENI attached to the machine, so I'm not sure where is the out-of-sync problem that you see in the logs. Any idea what else to check?

BTW, another thing I see, that i see increase in the awscni_aws_api_error_count metric on the api: DeleteNetworkInterface with the error of: InvalidParameterValue ( labels from the metric ).
But by looking on the logs, i see only warning logs which similliar to the warning logs i've attached in the main message
{"level":"warn","ts":"2023-11-29T12:25:09.528Z","caller":"ipamd/rpc_handler.go:230","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/8d7a0a5acc2dd0c231fa0eba6a9c9530ea4e626350ac4921410842140f911ee4/unknown"}

@jdn5126
Copy link
Contributor

jdn5126 commented Nov 29, 2023

Hmm... nothing immediately comes to mind. Can you open an AWS support case for this issue? That will allow us to dig deeper into the instances themselves. More progress can be made through that channel than on this thread.

Copy link

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Jan 29, 2024
@abstrask
Copy link

/not stale

@abstrask
Copy link

/remove-lifecycle stale

@jdn5126
Copy link
Contributor

jdn5126 commented Jan 29, 2024

Closing this issue, as its resolution will need to go through the support case

@jdn5126 jdn5126 closed this as completed Jan 29, 2024
Copy link

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.

@creinheimer
Copy link

Hi @jdn5126 / @doryer. What was the support case advice?

I am experiencing the same problem after enabling enforcing mode strict.

CNI plugin: amazon-k8s-cni:v1.18.1
EKS: v1.29.4-eks-036c24b

{"level":"error","ts":"2024-07-15T15:40:51.549Z","caller":"datastore/data_store.go:607","msg":"DataStore has no available IP/Prefix addresses"}
{"level":"info","ts":"2024-07-15T15:40:51.549Z","caller":"rpc/rpc.pb.go:863","msg":"Send AddNetworkReply: IPv4Addr: , IPv6Addr: , DeviceNumber: -1, err: AssignPodIPv4Address: no available IP/Prefix addresses"}
{"level":"info","ts":"2024-07-15T15:40:51.579Z","caller":"rpc/rpc.pb.go:881","msg":"Received DelNetwork for Sandbox 5585d49ac44cc8c83a5fcc586950ab0c4c6b428f7d6646c21a6d2c8aad1d8537"}
{"level":"warn","ts":"2024-07-15T15:40:51.579Z","caller":"ipamd/rpc_handler.go:261","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/5585d49ac44cc8c83a5fcc586950ab0c4c6b428f7d6646c21a6d2c8aad1d8537/unknown"}
{"level":"info","ts":"2024-07-15T15:40:51.585Z","caller":"rpc/rpc.pb.go:881","msg":"Received DelNetwork for Sandbox 17e4fa83d673a3218027f5b3f951d4aa65775202468be796c8e5303887aab91f"}
{"level":"warn","ts":"2024-07-15T15:40:51.585Z","caller":"ipamd/rpc_handler.go:261","msg":"UnassignPodIPAddress: Failed to find sandbox _migrated-from-cri/17e4fa83d673a3218027f5b3f951d4aa65775202468be796c8e5303887aab91f/unknown"}

Most pods are stuck in ContainerCreating 😞

@orsenthil
Copy link
Member

@creinheimer , please open a new ticket. The error most likely indicates there are no free ip addresses available for your ENI.

@creinheimer
Copy link

It works well if network policies are disabled. I am experiencing the exact same problem mentioned in this issue. That's why I am curious about the AWS support team's advice.

@doryer
Copy link
Author

doryer commented Dec 24, 2024

It works well if network policies are disabled. I am experiencing the exact same problem mentioned in this issue. That's why I am curious about the AWS support team's advice.

We're running on v1.15.1 with --enable-network-policy=false and still facing this issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants