VPC CNI stuck in crash loop without insights #2945
Comments
@orsenthil thx for the clarification. update: however, I did get your point about the log verbosity setting is affecting the would you recommend setting
I was able to reproduce this on an EKS node that I have SSH access to, and the results have been sent to k8s-awscni-triage@amazon.com after going through as this comment is written, the node has been in this state for 19m (together with many nodes that have been stuck for 40 mins):
if it helps: I guess the "issue" I have is: the
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days
update on this: I don't think this issue is fixable and as suggested, configuring however, it would be helpful if
I think I had the same issue several times in EKS 1.29 with latest vpc cni (and previous versions too), any idea?
Is it certain that ipvs mode can fix this, or is it only a guess?
@dejwsz what's the number of still, I think vpc cni should be more verbose about lock contention to be more self-contained.
Also, regarding vpc-cni not logging that the iptables lock is busy: the core issue is that coreos-iptable uses the wait option by default, so it waits indefinitely or until the timeout. Package Code Link. We can generate error like
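To illustrate the point about waiting on a lock: a caller can put its own deadline around a command that may block and log loudly when the deadline expires, instead of hanging silently. This is a minimal Python sketch, not the vpc-cni or coreos-iptable code; `run_with_deadline`, `CommandTimedOut`, and the logger name are hypothetical names made up for the example.

```python
import logging
import subprocess

log = logging.getLogger("lock_wait_sketch")  # hypothetical logger name


class CommandTimedOut(Exception):
    """Raised when the wrapped command does not finish before the deadline."""


def run_with_deadline(cmd, timeout_s):
    """Run cmd and return the CompletedProcess, or raise CommandTimedOut.

    subprocess.run kills the child process when the timeout expires, so the
    caller never blocks longer than roughly timeout_s seconds.
    """
    try:
        return subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s, check=False
        )
    except subprocess.TimeoutExpired:
        # Surface the stall at ERROR level so it is visible at any log level.
        log.error(
            "command %r still running after %.1fs; possible lock contention",
            cmd,
            timeout_s,
        )
        raise CommandTimedOut(f"{cmd!r} exceeded {timeout_s}s") from None


# Example call (assumes iptables is installed and the caller has privileges):
# run_with_deadline(["iptables", "-w", "5", "-t", "nat", "-S"], timeout_s=10)
```

Here `-w 5` asks iptables itself to stop waiting for the xtables lock after 5 seconds, and the outer 10 second deadline is only a backstop in case the command stalls for another reason.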
What happened:
I'm testing autoscaling for my EKS cluster (1.29), and karpenter is frequently scaling nodes up and down during my test.
At a certain point, all newly launched nodes get stuck in NotReady because the VPC CNI pod is stuck in a crash loop.
The symptom is very similar to hitting the EC2/ENI API rate limit; however, I can't find useful logs or metrics from the client (the VPC CNI pod) to help me confirm or diagnose this, even though AWS_VPC_K8S_CNI_LOGLEVEL is set to DEBUG (AWS_VPC_K8S_PLUGIN_LOG_LEVEL is also DEBUG, if it matters).
The version I'm using is v1.18.1-eksbuild.3 (the EKS optimized addon), and the logs are attached below.
Attach logs
unable to run sudo bash /opt/cni/bin/aws-cni-support.sh: the image 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.18.1-eksbuild.3 seems to be distroless
logs from kubectl logs -f ...:
successfully launched aws-node pod (AWS_VPC_K8S_CNI_LOGLEVEL=DEBUG):
stuck aws-node pod (also AWS_VPC_K8S_CNI_LOGLEVEL=DEBUG):
What you expected to happen:
with AWS_VPC_K8S_CNI_LOGLEVEL=DEBUG, logs should be more verbose, spitting out information about what the process is doing.
whatever error/exception caused the initialization failure should be surfaced to the log stream under pretty much any log level (these entries should be at ERROR log level)
if there is exponential backoff retry for 429 responses, it needs to be surfaced in verbose mode (debug log level)
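The logging behaviour asked for above can be sketched roughly as follows: each throttled attempt is logged at DEBUG together with the upcoming delay, and the final failure is logged at ERROR so it shows up at any log level. This is a Python illustration only, not how the VPC CNI is implemented; `ThrottledError`, `call_with_backoff`, and the default values are hypothetical.

```python
import logging
import time

log = logging.getLogger("retry_sketch")  # hypothetical logger name


class ThrottledError(Exception):
    """Hypothetical stand-in for an HTTP 429 / rate-limit response."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying on ThrottledError with exponential backoff.

    The delay doubles after every throttled attempt: base_delay, 2x, 4x, ...
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ThrottledError as exc:
            if attempt == max_attempts:
                # Final failure: ERROR, so it is visible under any log level.
                log.error("giving up after %d attempts: %s", attempt, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            # Intermediate retries: DEBUG, visible only in verbose mode.
            log.debug(
                "throttled (attempt %d/%d), retrying in %.1fs",
                attempt,
                max_attempts,
                delay,
            )
            sleep(delay)
```

With logging set to DEBUG, a run that is being rate limited would print one line per retry, which is the kind of signal that would make throttling distinguishable from a silent hang.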
How to reproduce it (as minimally and precisely as possible):
EKS@1.29 and install VPC CNI v1.18.1-eksbuild.3 from EKS addons
ready) nodes in batches (batch of 15 instances, every 10mins)
aws-node pod stuck in crash loop
Anything else we need to know?:
Environment:
kubectl version):
v1.18.1-eksbuild.3
cat /etc/os-release):
uname -a):