mesh-gateway fails to restart with peered connections in k8s when replicas > 1 #2509
Comments
hey @christophermichaeljohnston just want to confirm this still needs investigating and that the resolution in your original issue didn't actually resolve it? |
Correct. This is still an issue. I have an environment up that I can gather information from to help with the investigation. |
Okay, I'm going to start investigating on our end to replicate and track down the issue (also going to dig into what you've got here once I've got it replicated). |
Hey @christophermichaeljohnston, I haven't been able to recreate the issue you're seeing. I've been using this setup, which pretty much follows this doc, with mesh replicas set to 2. Can you provide a minimal setup that reproduces the issue? |
Hi @jm96441n. Looked over your setup and mine is very similar. The main difference is that I used an AWS load balancer instead of MetalLB.
I've also tried using the UI to create the peering connection instead of the CRDs and the result was the same. So I wonder if this is caused by some difference between the AWS load balancer and MetalLB. Did the UI in your test look similar to the screenshot in the original post, with the same server name listed multiple times (or was it addresses)? Does that server name return multiple addresses?
|
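For reference, the CRD-based peering flow mentioned above looks roughly like the following (a minimal sketch modeled on the consul-k8s cluster peering docs; the peer names and the token secret name are placeholders): a PeeringAcceptor on the accepting cluster writes a peering token into a Kubernetes secret, that secret is copied to the dialing cluster, and a PeeringDialer there establishes the connection through the mesh gateways.
# Minimal sketch of CRD-based cluster peering (placeholder names throughout).
# Applied on the accepting cluster: generates a peering token into a k8s secret.
apiVersion: consul.hashicorp.com/v1alpha1
kind: PeeringAcceptor
metadata:
  name: dialing-cluster
spec:
  peer:
    secret:
      name: peering-token
      key: data
      backend: kubernetes
---
# Applied on the dialing cluster, after copying the peering-token secret over.
apiVersion: consul.hashicorp.com/v1alpha1
kind: PeeringDialer
metadata:
  name: accepting-cluster
spec:
  peer:
    secret:
      name: peering-token
      key: data
      backend: kubernetes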
So I set up a new cluster without MetalLB on EKS (I was using MetalLB for an attempt at a local recreation using kind) and was not able to replicate. I created a peering connection through the mesh gateway with 2 replicas on the gateway, deleted a mesh pod on the establishing side, and it came right back up. I do see the same as you do (multiple server addresses), but my understanding is that you'll see an address for each instance of the mesh gateway; in this case all the mesh instances are behind the same ELB, which is why you see the same address multiple times. Can you provide a minimal setup that reproduces the issue that I can run? |
This minimal setup reproduces the mesh gateway restart failure. Note that I tried Consul 1.16.0 and consul-k8s 1.2.0 with the same result, and that's what the minimal setup uses. |
This continues to be a problem. Even after reducing replicas to a single mesh gateway, the Consul servers don't consider the peered connection healthy. In the UI the peer state is 'Active', but 'consul.peering.healthy' returns 0. Is there a way to enable debug logs in the mesh gateway to try to get more information on what is happening? |
You can increase the mesh-gateway Envoy log level either by setting logLevel in the Helm chart, or by hitting the Envoy admin interface on the gateway pod:
$ curl -XPOST localhost:19000/logging\?level\=debug
active loggers:
  admin: debug
  alternate_protocols_cache: debug
  aws: debug
  assert: debug
  # ---- component list cut for brevity ----
The above would change all the Envoy component log levels to debug. |
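For the Helm route, the override is roughly the following (a minimal sketch assuming the consul-k8s chart's global logLevel setting is the knob referred to above; adjust to wherever your values file sets it, and drop it back to info once you've captured what you need):
global:
  name: consul
  logLevel: debug   # assumption: this is the Helm logLevel referred to above
meshGateway:
  enabled: true
  replicas: 2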
Thanks. Not sure how I missed logLevel in the helm chart. I've captured logs from when the meshGateway starts to when it stops itself with a
|
Looking through the logs I do see:

2023-09-19T08:42:47-04:00 2023-09-19T12:42:39.723782332Z stderr F 2023-09-19T12:42:39.723Z+00:00 [warning] envoy.config(14) gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster rejected: Error adding/updating cluster(s) server.stage-04-use.peering.88780324-49a4-249e-19e0-82c359abf2f5.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint
2023-09-19T08:42:47-04:00 2023-09-19T12:42:39.723777771Z stderr F 2023-09-19T12:42:39.723Z+00:00 [warning] envoy.config(14) delta config for type.googleapis.com/envoy.config.cluster.v3.Cluster rejected: Error adding/updating cluster(s) server.stage-04-use.peering.88780324-49a4-249e-19e0-82c359abf2f5.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint

I see you did attempt to set
I'm wondering if the AWS load balancer is interfering at all. I have seen issues with cross-zone load balancing in AWS. Have you tried enabling cross-zone load balancing to see if this helps? |
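For reference, enabling cross-zone load balancing on the mesh-gateway service would normally go through a service annotation in the Helm values, roughly as sketched below (the annotation shown is the one used by the in-tree AWS cloud provider for classic ELBs; NLBs managed by the AWS Load Balancer Controller use a different attribute, so treat this as an assumption to adapt to your load balancer type):
meshGateway:
  enabled: true
  replicas: 2
  service:
    type: LoadBalancer
    # assumption: classic ELB / in-tree provider annotation for cross-zone load balancing
    annotations: |
      "service.beta.kubernetes.io/aws-load-balancer-internal": "true"
      "service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled": "true"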
Cross-zone load balancing has no impact; it results in the same crash loop with the same suspicious 'LOGICAL_DNS' log message. And consul_peering_healthy is still 0 (unhealthy) even though the peered connection is active. |
I managed to get a working EKS reproduction up with no issues when deleting a mesh-gateway. It came right back up, no issue. This was tested using
I'd recommend looking into something with regard to your AWS networking or security group permissions.
Reproduction Info
Versions:
consul-k8s overrides:
global:
name: consul
peering:
enabled: true
tls:
enabled: true
httpsOnly: false
enterpriseLicense:
secretName: license
secretKey: key
enableLicenseAutoload: true
enableConsulNamespaces: true
adminPartitions:
enabled: true
name: "default"
acls:
manageSystemACLs: true
connectInject:
enabled: true
default: true
replicas: 2
consulNamespaces:
mirroringK8S: true
k8sAllowNamespaces: ['*']
k8sDenyNamespaces: []
syncCatalog:
enabled: true
k8sAllowNamespaces: ["*"]
consulNamespaces:
mirroringK8S: true
meshGateway:
enabled: true
replicas: 3
service:
type: LoadBalancer
annotations: |
"service.beta.kubernetes.io/aws-load-balancer-internal": "true"
server:
enabled: true
replicas: 3
extraConfig: |
{
"performance": {
"raft_multiplier": 3
},
"telemetry": {
"disable_hostname": true
}
}
ui:
enabled: true
service:
type: LoadBalancer
vpc module terraform code
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "5.1.2"
name = local.name
cidr = var.eks-vpc.vpc
azs = local.azs
private_subnets = [for k, v in local.azs : cidrsubnet(var.eks-vpc.vpc, 4, k)]
public_subnets = [for k, v in local.azs : cidrsubnet(var.eks-vpc.vpc, 8, k + 48)]
enable_nat_gateway = true
single_nat_gateway = true
enable_dns_hostnames = true # required for eks, default is false
reuse_nat_ips = true
external_nat_ip_ids = aws_eip.nat.*.id
map_public_ip_on_launch = true # now required as of 04-2020 for EKS Nodes
public_subnet_tags = {
"kubernetes.io/cluster/${local.name}" = "shared"
"kubernetes.io/role/elb" = 1
}
private_subnet_tags = {
"kubernetes.io/cluster/${local.name}" = "shared"
"kubernetes.io/role/internal-elb" = 1
}
}
eks module terraform code
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "19.5.1"
cluster_name = local.name
cluster_version = var.k8s-version
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.public_subnets
cluster_endpoint_public_access = true
eks_managed_node_group_defaults = {
ami_type = "AL2_x86_64"
}
eks_managed_node_groups = {
consul = {
name = "consul"
instance_types = [var.eks-node-instance-type]
min_size = 1
max_size = 5
desired_size = 3
}
}
node_security_group_additional_rules = {
ingress_self_all = {
description = "Node to node all ports/protocols"
protocol = "-1"
from_port = 0
to_port = 0
type = "ingress"
self = true
}
ingress_cluster_all = {
description = "Cluster to node all ports/protocols"
protocol = "-1"
from_port = 0
to_port = 0
type = "ingress"
source_cluster_security_group = true
}
egress_all = {
description = "Node all egress"
protocol = "-1"
from_port = 0
to_port = 0
type = "egress"
cidr_blocks = ["0.0.0.0/0"]
ipv6_cidr_blocks = ["::/0"]
}
}
} |
I've tried peering clusters in different regions and even within the same region, in the same VPC with the same subnets and wide-open SGs. The issue remains. Once clusters are peered, any mesh-gateway restart on the dialing side results in a crash loop. All I can determine from the logs is that the dataplane never fully initializes, so it terminates itself. And even with only a single mesh-gateway, the Prometheus metrics never indicate that peering is healthy, which is perhaps the cause of the crash loop with multiple mesh-gateways. I can't determine from the logs what part of the dataplane is having issues. |
I did manually create the clusters so perhaps there is something there... I'll try using the provided terraform instead. |
No luck with those additional SG rules. :( I've updated my consul_mesh_test repo with everything I've used. This test used 2 EKS clusters in the same VPC, using the same subnets. I've also captured mesh gateway logs at 'trace'. Logs are similar in that 'Envoy is not fully initialized' seems to trigger 'consul-dataplane.lifecycle: initiating shutdown'. But the trace level does show 'envoy.connection(13) [C0] read error: Resource temporarily unavailable'. Could this be the cause? What resource is unavailable? |
#19268 looks to be a possible solution to this problem |
Closing as hashicorp/consul#19268 does fix the issue. This has been isolated to AWS EKS environments and should go out with the next set of Consul patch releases. |
I am facing this issue in AKS as well, where the LoadBalancer doesn't have hostnames. I tried using both public and private LBs.
P.S. I am evaluating Consul multi-cluster federation using this tutorial: https://developer.hashicorp.com/consul/tutorials/kubernetes/kubernetes-mesh-gateways
Here are my Helm values for the second cluster:
|
Overview of the Issue
consul: 1.15.3
consul-k8s-control-plane: 1.1.2
When running the mesh-gateway in k8s with replicas > 1, it fails to restart (i.e. after a pod failure) on the establishing side of
an active peered connection. This looks to be caused by Envoy rejecting the additional endpoints, which causes the mesh-gateway to eventually terminate, with k8s then attempting to restart it (a restart loop). The only way to restore service is to reduce the number of mesh-gateway replicas to 1 on both sides of the peered connection.
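For context on the rejected endpoints: the cluster named in the Envoy warnings is a LOGICAL_DNS cluster, and Envoy only accepts a single lb_endpoint for that type, roughly as in the illustration below (hand-written with placeholder names and port, not the actual generated xDS config):
# Illustration only: a LOGICAL_DNS cluster resolves one hostname per connection,
# so Envoy rejects it when more than one lb_endpoint is supplied -- which is what
# the "must have a single locality_lb_endpoint and a single lb_endpoint" warning
# in the mesh-gateway logs is complaining about.
clusters:
  - name: server.peer-cluster.peering.example.consul
    type: LOGICAL_DNS
    connect_timeout: 5s
    load_assignment:
      cluster_name: server.peer-cluster.peering.example.consul
      endpoints:
        - lb_endpoints:              # exactly one entry is allowed here
            - endpoint:
                address:
                  socket_address:
                    address: mesh-gateway.example.elb.amazonaws.com
                    port_value: 8443  # placeholder port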
Reproduction Steps
Logs
Logs from mesh-gateway:
Logs from consul server:
Expected behavior
Should be able to run the mesh gateway component with more than 1 replica.
Environment details
Kubernetes 1.26 on AWS
Additional Context
Created this in the consul issue tracker (hashicorp/consul#17557) but will close that as this seems to be the better location for this issue.
Screenshot showing peering status on the establishing side (the dialer side doesn't include the server addresses in the UI):
I wonder if this has anything to do with: envoyproxy/envoy#14848