The following topics are covered in this chapter:
- Common Problems
- Deployment Problems
- ACL Issues
- Full Backup
- Velero
- Prometheus Alerts
This section provides detailed troubleshooting procedures for the Consul Service.
If you face any problem with the Consul Service, refer to the Official Troubleshooting Guide.
This problem can be detected on the Consul monitoring dashboard. If the current cluster size goes down, it most likely means that one of the servers has crashed.
The Consul cluster will be temporarily unable to process requests until a new leader is elected. Leader election is performed automatically by the Consul cluster. More information about leader election in Consul can be found in the Consensus article.
As a solution, restart the failed Consul server pod.
A `Down` status on monitoring indicates that all the Consul servers have failed.
As a solution, restart all Consul server pods, as shown in the example below.
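A minimal sketch of both restart procedures with `kubectl`, assuming the servers run in a namespace named `consul-service`, are managed by the `consul-service-consul-server` StatefulSet mentioned later in this guide, and carry the `app=consul,component=server` labels (adjust all three to your installation):

```bash
# Restart a single failed Consul server pod; the StatefulSet recreates it.
kubectl -n consul-service delete pod consul-service-consul-server-0

# Restart all Consul server pods.
kubectl -n consul-service delete pod -l app=consul,component=server
```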
Consul request processing may be impacted, up to and including Consul server failure, when CPU consumption reaches the resource limit for a particular Consul server.
As a solution, increase the CPU requests and limits for the Consul server.
For more information, see Consul CPU Overload.
Consul request processing may be impacted, up to and including Consul server failure, when memory consumption reaches the resource limit for a particular Consul server.
As a solution, increase the memory requests and limits or scale out the cluster, as shown in the sketch below.
For more information, see Consul Memory Limit.
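For both the CPU and memory cases above, the requests and limits are usually adjusted in the deployment parameters. A values sketch, assuming the chart exposes them under `server.resources` (verify the key against your installation parameters; the numbers are placeholders):

```yaml
server:
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1024Mi
```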
Consul becomes non-operational when disk capacity on a server runs out due to a high volume of KVs.
For more information, see Consul Disk Filled on All Nodes.
The Consul cluster is in an invalid state, you are not able to send requests to Consul, and the logs of the Consul server(s) contain the following errors:
2020/08/13 07:16:58 [ERR] agent: failed to sync remote state: No cluster leader
2020/08/13 07:17:04 [WARN] raft: Election timeout reached, restarting election
2020/08/13 07:17:04 [INFO] raft: Node at 10.130.169.164:8300 [Candidate] entering Candidate state in term 169
2020/08/13 07:17:07 [ERR] http: Request GET /v1/operator/autopilot/health, error: No cluster leader from=127.0.0.1:43164
2020/08/13 07:17:07 [ERR] agent: Coordinate update error: No cluster leader
It means that Consul cannot form a quorum to elect a single node to be the leader. A quorum is a majority of members from the peer set: for a set of size `n`, quorum requires at least `(n+1)/2` members. For example, if there are 3 members in the peer set, we would need 2 nodes to form a quorum.
The cause is that most of the Consul server nodes are unavailable. Possible reasons for a Consul server failure include:
- The disk is out of space.
- The Consul server has left the cluster due to a long absence, for example, because of an OpenShift/Kubernetes node failure.
To solve the problem, look through the logs from all Consul server pods and identify the failed ones (errors other than `No cluster leader` and `Election timeout reached, restarting election`). Fix the identified problems using the other articles of this guide.
If the Consul cluster is unavailable after fixing all problems, restart Consul: edit the StatefulSet `consul-service-consul-server`, set `replicas` to `0`, and wait until all pods are scaled down. Then return `replicas` to the previous value, for example `3`.
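The same restart can be performed with `kubectl`; a sketch assuming the `consul-service` namespace and the `app=consul,component=server` pod labels (adjust to your installation):

```bash
# Scale the server StatefulSet down and wait until all pods are gone.
kubectl -n consul-service scale statefulset consul-service-consul-server --replicas=0
kubectl -n consul-service wait --for=delete pod -l app=consul,component=server --timeout=300s

# Return replicas to the previous value, for example 3.
kubectl -n consul-service scale statefulset consul-service-consul-server --replicas=3
```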
If a Consul Client pod starts with a Node ID already taken by another Client, it shows the following warning:
[WARN] agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "1f0a9dbd-655c-e7d8-c934-6f4d5be42491": Node name node-1 is reserved by node c11fbfa2-3144-01fb-44b1-b4ded071e4a7 with name node-1"
In this case, services cannot register in Consul. This can happen after manual cluster scaling (scaling should be done via the Rolling Update job) or after a failed de-registration.
To resolve the issue, it is recommended to apply the `--disable-host-node-id=true` flag to the Consul Client DaemonSet. It can be specified in the `client.extraConfig` parameter:
extra-from-values.json: '{"disable_update_check":true, "disable_host_node_id":true}'
Alternatively, the stored Node ID can be discarded by removing the `/data/node-id` file manually in the Client pod terminal. In both cases, restart the affected Client pods to apply the changes and confirm that a unique Node ID is applied for every Consul Client.
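A sketch of the manual clean-up, assuming the `consul-service` namespace and a client pod named `consul-service-consul-client-xxxxx` (both names are examples):

```bash
# Discard the stored Node ID inside the affected client pod.
kubectl -n consul-service exec consul-service-consul-client-xxxxx -- rm /data/node-id

# Restart the pod so the client generates a new unique Node ID on startup.
kubectl -n consul-service delete pod consul-service-consul-client-xxxxx
```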
The problem occurs because of the limit on concurrent TCP connections that a single client IP address is allowed to open to the agent's HTTP(S) server. The default value for this parameter is `200`.
For more details, refer to Agents Configuration File Reference.
There are two ways to increase the limit value:
1. In the properties during an upgrade by the Deployer:

   ```yaml
   server:
     extraConfig: '{ "limits": { "http_max_conns_per_client": <new_value> } }'
   ```

   where `<new_value>` is the increased limit value, for example, `500`.

2. In the `consul-server-config` configMap in the Consul namespace, add the `http_max_conns_per_client` parameter to the `extra-from-values.json` key. For example:

   ```yaml
   extra-from-values.json: '{"disable_update_check":true,"limits":{"http_max_conns_per_client":<new_value>}}'
   ```
Do not forget to restart the Consul servers to apply the changes.
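For example, assuming the `consul-service` namespace (the configMap and StatefulSet names are taken from this guide):

```bash
# Add the parameter to the extra-from-values.json key, then restart the servers.
kubectl -n consul-service edit configmap consul-server-config
kubectl -n consul-service rollout restart statefulset consul-service-consul-server
```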
This section provides information about the issues you may face during deployment of Consul Service.
When you deploy the Consul Service, the Consul Server pod may not start. The possible causes are as follows:
- Monitoring
- Consul starts on incorrect nodes
- Consul client pod does not start
- Pod failed with Consul sidecar container
Consul Monitoring is a Telegraf agent deployed as a sidecar container on the Consul Servers. If the Consul Server does not start, the problem can be with Consul Monitoring, and you need to check the logs of its container within the Consul Server pod.
The most common cause is an incorrect value in the `smDbHost` parameter. Its value should be a valid address of a time-series database.
If you use predefined Persistent Volumes and specify affinity to bind Consul's pods to particular nodes, you may face issues with Consul pods starting on incorrect nodes. There are many causes, but most commonly it happens because Consul uses a StatefulSet to deploy pods, which cannot guarantee that pods are assigned to specific nodes; scheduling can only be influenced via preferred affinity. To avoid issues with incorrectly assigned nodes, you must split the deployment of the Consul Server and the other components. For more information, refer to the Predefined Persistent Volumes section in the Consul Service Installation Procedure.
Sometimes the Consul Server starts on incorrect nodes even with correct affinity.
It can depend on the state of the Kubernetes nodes and cluster.
For example, one of Consul's pods could get stuck and skip its turn to bind to a node.
You need to edit the StatefulSet `consul-service-consul-server`, set `replicas` to `0`, and wait until all pods are scaled down. Then return `replicas` to the previous value, for example `3`.
Repeat the above steps until the Consul pods are assigned to the right nodes.
If you deploy Consul with ACL enabled, you cannot change the `replicas` number during installation. You need to re-install Consul until it is assigned to the right PV.
As a workaround, you can create `host-path` folders for Consul's pods on each Kubernetes node.
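A sketch of such a workaround, assuming the PersistentVolumes point to `/mnt/consul/data` (the path and permissions are examples; match them to your PV definitions and the pod's security context):

```bash
# Run on every Kubernetes node that should host a Consul server pod.
sudo mkdir -p /mnt/consul/data
sudo chmod 770 /mnt/consul/data
```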
If you deploy the Consul Service with `client` enabled, you need to make sure that your Kubernetes nodes have port `8502` free. Clients use this port for gRPC connections with client services.
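A quick way to verify this on a node (assuming the `ss` utility is available):

```bash
# No output from grep means port 8502 is free.
ss -tlnp | grep 8502 || echo "port 8502 is free"
```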
Some pods with `connect-inject` enabled cannot start and fail with errors in the sidecar container.
Check whether there is any problem with the Consul Client on the Kubernetes node.
If there is an issue, resolve it and then re-deploy your service.
Also, you need to check the deployment parameters. Verify that `connect-inject` is deployed with the following parameters:
```yaml
connectInject:
  enabled: true
client:
  enabled: true
  grpc: true
```
Not all types of NFS storage are compatible with Consul.
Using NFS can lead to several problems, and it may also cause performance degradation.
If the following problem occurs when Consul pods start, it indicates that the NFS works in "On demand" mode and does not allow a file to be read until the operating system scans it.
==> Error starting agent: Failed to start Consul server: Failed to start Raft: open /consul/data/raft/raft.db: invalid argument
You need to resolve the issue with the NFS configuration or start using local storage instead of NFS.
The following section describes ACL issues and their solutions.
If you encounter issues that are unresolvable, or misplace (lose) the bootstrap token, you can reset the ACL system by updating the index.
You can see this issue as the following error on the ACL init pod:
2020-04-01T19:54:47.778Z [ERROR] ACLs already bootstrapped but the ACL token was not written to a Kubernetes secret. We can't proceed because the bootstrap token is lost. You must reset ACLs.
Find the leader using the `/v1/status/leader` endpoint on any Consul node in the terminal. The ACL reset must be performed on the leader.
$ curl localhost:8500/v1/status/leader
"172.17.0.3:8300"%
In the above example, you can see that the leader is at IP `172.17.0.3`. Run the following commands on that server.
Re-run the bootstrap command to get the index number.
$ consul acl bootstrap
Failed ACL bootstrapping: Unexpected response code: 403 (Permission denied: ACL bootstrap no longer allowed (reset index: 13))
Write the reset index into the bootstrap reset file. For example, here the reset index is 13:
echo 13 >> /consul/data/acl-bootstrap-reset
After resetting the ACL system, you can recreate the bootstrap token or re-install Consul Cluster.
For example, you have already installed Consul in two datacenters and try to perform a full backup, but you see the following error in response to collecting a Consul backup:
{
"is_granular": false,
"db_list": "full backup",
"id": "20201211T152323",
"failed": true,
"locked": false,
"sharded": false,
"ts": 1607700203000,
"exit_code": 1,
"spent_time": "2525ms",
"size": "9566b",
"exception": "Traceback (most recent call last):\n\n File \"/opt/backup/backup-daemon.py\", line 213, in __perform_backup\n raise BackupProcessException(msg)\n\nBackupProcessException: Last 5 lines of logfile: b'[2020-12-11T15:23:26,339][INFO][category=Backup] Snapshot for datacenter \"dc1\" completed successfully.\\n [2020-12-11T15:23:26,357][ERROR][category=Backup] There is problem with getting snapshot from datacenter: dc2, details: ACL not found\\n'\n",
"valid": false,
"evictable": true
}
It means that the Consul datacenter from the error ("dc2") is not configured to back up data.
As a solution, upgrade the problematic Consul with the following parameter:

```yaml
backupDaemon:
  enabled: true
```
After recovery from a Velero backup, Consul clients or other applications can't work with Consul.
To properly restore Consul from a Velero backup, the Consul backup daemon should be installed. If it is not installed, you need to perform the following steps manually to recover Consul:
1. Find the service account secret starting with `<CONSUL_FULLNAME>-auth-method`. Save `ca.crt` and `token` from this secret in any text editor.
2. Transform the found `ca.crt` locally from a multi-line to a one-line string to use it in the next step. For example, the following certificate:

   ```
   -----BEGIN CERTIFICATE-----
   MIICyDCCAbCgAwIBAgIBADANBgkqhkiG9w0BAQsFADAVMRMwEQYDVQQDEwprdWJl
   cm5ldGVzMB4XDTIxMDIwODIyMzEwOVoXDTMxMDIwNjIyMzEwOVowFTETMBEGA1UE
   AxMKa3ViZXJuZXRlczCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAJm4
   iV3cacrBc4OMw55fFVATaXioGHZF65iEFfiQz5rni3xVeeKmCJMLuScUEsqny8io
   HyE7ESxt9/hdXz1smRiNXtMGCVTIYKv2RTtiRP/b1R9DBZUIcdVIMSm5h19lhrL7
   sAIMPx1KLQcWRWwaYmFNJBIJi5ZPJVE+UR84g67W8HGIv9EQUZZOrVSd4C0ybgx5
   k7Rt99FXCzzEPLh8iq/yzvwV95ctag2Hr1gWbELJcCSt6im8P2X7uQ7mc8izF5xf
   aXoJtxWOIgQb1BE5okv5IerKUAWbCmsac0nsawfBLWj5CCjssGWTYQj6GK4hgZIm
   5KuJccaJippJ1x/+5LkCAwEAAaMjMCEwDgYDVR0PAQH/BAQDAgKkMA8GA1UdEwEB
   /wQFMAMBAf8wDQYJKoZIhvcNAQELBQADggEBAEZGx+VFVni0bD1gnbIMdbwX3grC
   4vlTrj/suvzwlZ6+ff2ygEMb3pjmCprLUXeWC7rVzqxNEmVr0xH3hAGCB57qolck
   nzxOA8dtIwUotG1oM/8bRvNSqhnxNlQeptiagHd+Zyrux9vV5ZogM76NwPAbbT48
   OooOMshWjxG7RHWqKNEG5c8mc7cEBYpM+NdGLbzAcDnYzOL7QlQUrH7dqtFeLb0A
   u8EF80PYejvrtBdNYEteJkBZSkNAVC1e3HjYO6eA6enyEW3d/6d5HzcOuZyWx7OE
   Q6SiRG7FfqFgfAmUN9P1+1B1soT7+SxknhebwITr0gkppY2eXyZ7l7Wox8U=
   -----END CERTIFICATE-----
   ```
is to be transformed to the following structure:
-----BEGIN CERTIFICATE-----\nMIICyDCCAbCgAwIBAgIBADANBgkqhkiG9w0BAQsFADAVMRMwEQYDVQQDEwprdWJl\ncm5ldGVzMB4XDTIxMDIwODIyMzEwOVoXDTMxMDIwNjIyMzEwOVowFTETMBEGA1UE\nAxMKa3ViZXJuZXRlczCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAJm4\niV3cacrBc4OMw55fFVATaXioGHZF65iEFfiQz5rni3xVeeKmCJMLuScUEsqny8io\nHyE7ESxt9/hdXz1smRiNXtMGCVTIYKv2RTtiRP/b1R9DBZUIcdVIMSm5h19lhrL7\nsAIMPx1KLQcWRWwaYmFNJBIJi5ZPJVE+UR84g67W8HGIv9EQUZZOrVSd4C0ybgx5\nk7Rt99FXCzzEPLh8iq/yzvwV95ctag2Hr1gWbELJcCSt6im8P2X7uQ7mc8izF5xf\naXoJtxWOIgQb1BE5okv5IerKUAWbCmsac0nsawfBLWj5CCjssGWTYQj6GK4hgZIm\n5KuJccaJippJ1x/+5LkCAwEAAaMjMCEwDgYDVR0PAQH/BAQDAgKkMA8GA1UdEwEB\n/wQFMAMBAf8wDQYJKoZIhvcNAQELBQADggEBAEZGx+VFVni0bD1gnbIMdbwX3grC\n4vlTrj/suvzwlZ6+ff2ygEMb3pjmCprLUXeWC7rVzqxNEmVr0xH3hAGCB57qolck\nnzxOA8dtIwUotG1oM/8bRvNSqhnxNlQeptiagHd+Zyrux9vV5ZogM76NwPAbbT48\nOooOMshWjxG7RHWqKNEG5c8mc7cEBYpM+NdGLbzAcDnYzOL7QlQUrH7dqtFeLb0A\nu8EF80PYejvrtBdNYEteJkBZSkNAVC1e3HjYO6eA6enyEW3d/6d5HzcOuZyWx7OE\nQ6SiRG7FfqFgfAmUN9P1+1B1soT7+SxknhebwITr0gkppY2eXyZ7l7Wox8U=\n-----END CERTIFICATE-----
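One way to produce this structure, assuming the certificate is saved locally as `ca.crt` (the file name is an example):

```bash
# Join the lines, inserting a literal "\n" between them.
awk 'NR > 1 { printf "\\n" } { printf "%s", $0 } END { print "" }' ca.crt
```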
3. Run the following commands to create the necessary auth methods from any Consul server pod:

   ```bash
   curl -XPUT -k -H "X-Consul-Token:<CONSUL_BOOTSTRAP_ACL_TOKEN>" \
     -H "Accept:application/json" -H "Content-Type:application/json" \
     "<CONSUL_URL>/v1/acl/auth-method/<CONSUL_FULLNAME>-k8s-auth-method" -d'
   {
     "Name": "<CONSUL_FULLNAME>-k8s-auth-method",
     "Description": "Kubernetes Auth Method",
     "Type": "kubernetes",
     "Config": {
       "Host": "https://kubernetes.default.svc",
       "CACert": "<SA_SECRET_CA_CRT>",
       "ServiceAccountJWT": "<SA_SECRET_TOKEN>"
     }
   }'

   curl -XPUT -k -H "X-Consul-Token:<CONSUL_BOOTSTRAP_ACL_TOKEN>" \
     -H "Accept:application/json" -H "Content-Type:application/json" \
     "<CONSUL_URL>/v1/acl/auth-method/<CONSUL_FULLNAME>-k8s-component-auth-method" -d'
   {
     "Name": "<CONSUL_FULLNAME>-k8s-component-auth-method",
     "Description": "Kubernetes Auth Method",
     "Type": "kubernetes",
     "Config": {
       "Host": "https://kubernetes.default.svc",
       "CACert": "<SA_SECRET_CA_CRT>",
       "ServiceAccountJWT": "<SA_SECRET_TOKEN>"
     }
   }'
   ```
Where:
- `<CONSUL_FULLNAME>` is the full name of Consul. For example, `consul`.
- `<CONSUL_BOOTSTRAP_ACL_TOKEN>` is the Consul bootstrap ACL token, which can be found in the corresponding secret.
- `<CONSUL_URL>` is the URL for Consul with protocol and port. For example, `http://consul-server:8500`.
- `<SA_SECRET_CA_CRT>` is the `ca.crt` certificate transformed to a one-line string in step 2.
- `<SA_SECRET_TOKEN>` is the `token` from step 1.
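To verify the result, you can list the registered auth methods using the same placeholders (the `/v1/acl/auth-methods` endpoint is part of the standard Consul HTTP API):

```bash
curl -k -H "X-Consul-Token:<CONSUL_BOOTSTRAP_ACL_TOKEN>" "<CONSUL_URL>/v1/acl/auth-methods"
```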
There are no Consul server pods in the namespace.

Possible causes:
- Consul Server pod failures or unavailability.
- Resource constraints impacting Consul Server pod performance.
- The Consul Server stateful set is scaled down to 0, either intentionally or due to an incorrect installation.

Impact:
- Complete unavailability of the Consul cluster.

Actions:
- Check whether the Consul Server pods exist.
- Check whether the Consul Server stateful set exists and has at least one desired pod.
- Verify the resource utilization of the Consul Server pods (CPU, memory).
- Scale up or redeploy the Consul Server pods if the stateful set is in a failed state (see the example commands after this list).
- Investigate and address any resource constraints affecting the Consul Server pod performance.
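A sketch of these checks with `kubectl`, assuming the `consul-service` namespace and the `app=consul,component=server` labels (adjust to your installation); the same commands apply to the other alerts below:

```bash
# Do the server pods and the stateful set exist, and are they healthy?
kubectl -n consul-service get pods -l app=consul,component=server
kubectl -n consul-service get statefulset consul-service-consul-server

# Recent logs and current resource usage of a server pod.
kubectl -n consul-service logs consul-service-consul-server-0 --tail=100
kubectl -n consul-service top pod -l app=consul,component=server
```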
The Consul cluster is degraded; at least one of the nodes has failed, but the cluster is still able to work.
For more information, refer to Leader Server Failure.

Possible causes:
- Consul Server pod failures or unavailability.
- Resource constraints impacting Consul Server pod performance.

Impact:
- Reduced or disrupted functionality of the Consul cluster.
- Potential impact on processes relying on Consul.

Actions:
- Check the status of the Consul Server pods.
- Review the Consul Server pod logs for any errors or issues.
- Verify the resource utilization of the Consul Server pods (CPU, memory).
- Restart or redeploy the Consul Server pods if they are in a failed state.
- Investigate and address any resource constraints affecting the Consul Server pod performance.
The Consul cluster is down, and there are no available pods.
For more information, refer to All Servers Failure.

Possible causes:
- Network issues affecting the Consul Server pod communication.
- The Consul Server's storage is corrupted.
- An internal error blocks the Consul Server cluster operation.

Impact:
- Complete unavailability of the Consul cluster.
- Other processes relying on the Consul cluster will fail.

Actions:
- Check the status of the Consul Server pods.
- Review the Consul Server pod logs for any errors or issues.
- Verify the resource utilization of the Consul Server pods (CPU, memory).
- Restart or redeploy the Consul Server pods if they are in a failed state.
- Investigate and address any resource constraints affecting the Consul Server pod performance.
One of the Consul Server pods uses 95% of the CPU limit.
For more information, refer to CPU Limit.

Possible causes:
- Insufficient CPU resources allocated to the Consul Server pods.
- The service is overloaded.

Impact:
- Increased response time and potential slowdown of Consul requests.
- Degraded performance of services that use Consul.
- Potential Consul server failure when CPU consumption reaches the resource limit for a particular Consul server.

Actions:
- Monitor the CPU usage trends in the Consul Monitoring dashboard.
- Review the Consul Server logs for any performance-related issues.
- Try to increase the CPU request and CPU limit for the Consul Server.
- Scale up the Consul cluster as needed.
One of the Consul Server pods uses 95% of the memory limit.
For more information, refer to Memory Limit.

Possible causes:
- Insufficient memory resources allocated to the Consul Server pods.
- The service is overloaded.

Impact:
- Potential increase of response times or crashes.
- Degraded performance of services that use Consul.

Actions:
- Monitor the memory usage trends in the Consul Monitoring dashboard.
- Review the Consul Server logs for memory-related errors.
- Try to increase the memory request and memory limit for the Consul Server.
- Scale up the Consul cluster as needed.