title | summary | aliases | ||
---|---|---|---|---|
TiDB Troubleshooting Map |
Learn how to troubleshoot common errors in TiDB. |
|
This document summarizes common issues in TiDB and other components. You can use this map to diagnose and solve issues when you encounter related problems.
-
1.1.1 The
Region is Unavailable
error is usually because a Region is not available for a period of time. You might encounterTiKV server is busy
, or the request to TiKV fails due tonot leader
orepoch not match
, or the request to TiKV time out. In such cases, TiDB performs abackoff
retry mechanism. When thebackoff
exceeds a threshold (20s by default), the error will be sent to the client. Within thebackoff
threshold, this error is not visible to the client. -
1.1.2 Multiple TiKV instances are OOM at the same time, which causes no Leader in a Region for a period of time. See case-991 in Chinese.
-
1.1.3 TiKV reports
TiKV server is busy
, and exceeds thebackoff
time. For more details, refer to 4.3.TiKV server is busy
is a result of the internal flow control mechanism and should not be counted in thebackoff
time. This issue will be fixed. -
1.1.4 Multiple TiKV instances failed to start, which causes no Leader in a Region. When multiple TiKV instances are deployed in a physical machine, the failure of the physical machine can cause no Leader in a Region if the label is not properly configured. See case-228 in Chinese.
-
1.1.5 When a Follower apply is lagged in a previous epoch, after the Follower becomes a Leader, it rejects the request with
epoch not match
. See case-958 in Chinese (TiKV needs to optimize its mechanism).
Refer to 5 PD issues.
- 2.1.1 Wrong TiDB execution plan causes latency increase. Refer to 3.3.
- 2.1.2 PD Leader election issue or OOM. Refer to 5.2 and 5.3.
- 2.1.3 A significant number of Leader drops in some TiKV instances. Refer to 4.4.
-
2.2.1 TiKV single thread bottleneck
-
Too many Regions in a TiKV instance causes a single gRPC thread to be the bottleneck (Check the Grafana -> TiKV-details -> Thread CPU/gRPC CPU Per Thread metric). In v3.x or later versions, you can enable
Hibernate Region
to resolve the issue. See case-612 in Chinese. -
For versions earlier than v3.0, when the raftstore thread or the apply thread becomes the bottleneck (Grafana -> TiKV-details -> Thread CPU/raft store CPU and Async apply CPU metrics exceed
80%
), you can scale out TiKV (v2.x) instances or upgrade to v3.x with multi-threading.
-
-
2.2.2 CPU load increases.
-
2.2.3 TiKV slow write. Refer to 4.5.
-
2.2.4 TiDB wrong execution plan. Refer to 3.3.
-
3.1.1 An error
ERROR 1105 (HY000): unsupported modify decimal column precision
is reported when you modify the length of thedecimal
field. TiDB does not support changing the length of thedecimal
field. -
3.1.2 TiDB DDL job hangs or executes slowly (use
admin show ddl jobs
to check DDL progress)-
Cause 1:Network issue with other components (PD/TiKV).
-
Cause 2:Early versions of TiDB (earlier than v3.0.8) have heavy internal load because of a lot of goroutine at high concurrency.
-
Cause 3:In early versions (v2.1.15 & versions < v3.0.0-rc1), PD instances fail to delete TiDB keys, which causes every DDL change to wait for two leases.
-
For other unknown causes, report a bug.
-
Solution:
- For cause 1, check the network connection between TiDB and TiKV/PD.
- For cause 2 and 3, the issues are already fixed in later versions. You can upgrade TiDB to a later version.
- For other causes, you can use the following solution of migrating the DDL owner.
-
DDL owner migration:
- If you can connect to the TiDB server, execute the owner election command again:
curl -X POST http://{TiDBIP}:10080/ddl/owner/resign
- If you cannot connect to the TiDB server, use
tidb-ctl
to delete the DDL owner from the etcd of the PD cluster to trigger re-election:tidb-ctl etcd delowner [LeaseID] [flags] + ownerKey
- If you can connect to the TiDB server, execute the owner election command again:
-
-
3.1.3 TiDB reports
information schema is changed
error in log-
Cause 1: The DML operation touches a table that is under DDL. You can use
admin show ddl job
to check the DDLs that are currently in progress. -
Cause 2: The current DML operation is executed too long. During the time, many DDL operations are executed, which causes
schema version
changes to be more than 1024. The new versionlock table
might also cause schema version changes. -
Cause 3: The TiDB instance that is currently executing DML statements cannot load the new
schema information
(maybe caused by network issues with PD or TiKV). During this time, many DDL statements are executed (includinglock table
), which causesschema version
changes to be more than 1024. -
Solution:The first two causes do not impact the application, as the related DML operations retry after failure. For cause 3, you need to check the network between TiDB and TiKV/PD.
-
Background: The increased number of
schema version
is consistent with the number ofschema state
of each DDL change operation. For example, thecreate table
operation has 1 version change, and theadd column
operation has 4 version changes. Therefore, too many column change operations might causeschema version
to increase fast. For details, refer to online schema change.
-
-
3.1.4 TiDB reports
information schema is out of date
in log-
Cause 1:The TiDB server that is executing the DML statement is stopped by
graceful kill
and prepares to exit. The execution time of the transaction that contains the DML statement exceeds one DDL lease. An error is reported when the transaction is committed. -
Cause 2: The TiDB server cannot connect to PD or TiKV when it is executing the DML statement, which causes the following problems:
- The TiDB server did not load the new schema within one DDL lease (
45s
by default); or - The TiDB server disconnects from PD with the
keep alive
setting.
- The TiDB server did not load the new schema within one DDL lease (
-
Cause 3: TiKV has high load or network timed out. Check the node loads in Grafana -> TiDB and TiKV.
-
Solution:
- For cause 1, retry the DML operation when TiDB is started.
- For cause 2, check the network between the TiDB server and PD/TiKV.
- For cause 3, investigate why TiKV is busy. Refer to 4 TiKV issues.
-
-
3.2.1 Symptom
-
Client: The client reports the error
ERROR 2013 (HY000): Lost connection to MySQL server during query
. -
Check the log
-
Execute
dmesg -T | grep tidb-server
. The result shows the OOM-killer log around the time point when the error occurs. -
Grep the "Welcome to TiDB" log in
tidb.log
around the time point after the error occurs (namely, the time when tidb-server restarts). -
Grep
fatal error: runtime: out of memory
orcannot allocate memory
intidb_stderr.log
. -
In v2.1.8 or earlier versions, you can grep
fatal error: stack overflow
in thetidb_stderr.log
.
-
-
Monitor:The memory usage of tidb-server instances increases sharply in a short period of time.
-
-
3.2.2 Locate the SQL statement that causes OOM. (Currently all versions of TiDB cannot locate SQL accurately. You still need to analyze whether OOM is caused by the SQL statement after you locate one.)
- For versions >= v3.0.0, grep “expensive_query” in
tidb.log
. That log message records SQL queries that timed out or exceed memory quota. - For versions < v3.0.0, grep “memory exceeds quota” in
tidb.log
to locate SQL queries that exceed memory quota.
Note:
The default threshold for a single SQL memory usage is
1GB
(in bytes, scope:SESSION
). You can set this parameter by configuringtidb_mem_quota_query
. You can also modify themem-quota-query
item (in bytes) in the configuration file by hot loading the configuration items. - For versions >= v3.0.0, grep “expensive_query” in
-
3.2.3 Mitigate OOM issues
- By enabling
SWAP
, you can mitigate the OOM issue caused by overuse of memory by large queries. When the memory is insufficient, this method can have impact on the performance of large queries due to the I/O overhead. The degree to which the performance is affected depends on the remaining memory space and the disk I/O speed.
- By enabling
-
3.2.4 Typical reasons for OOM
-
The SQL query has
join
. If you view the SQL statement by usingexplain
, you can find that thejoin
operation selects theHashJoin
algorithm and theinner
table is large. -
The data volume of a single
UPDATE/DELETE
query is too large. See case-882 in Chinese. -
The SQL contains multiple sub-queries connected by
Union
. See case-1828 in Chinese.
-
-
3.3.1 Symptom
-
SQL query execution time is much longer compared with that of previous executions, or the execution plan suddenly changes. If the execution plan is logged in the slow log, you can directly compare the execution plans.
-
SQL query execution time is much longer compared with that of other databases such as MySQL. Compare the execution plan with other databases to see the differences, such as
Join Order
. -
In slow log, the number of SQL execution time
Scan Keys
is large.
-
-
3.3.2 Investigate the execution plan
-
explain analyze {SQL}
. When the execution time is acceptable, comparecount
in the result ofexplain analyze
and the number ofrow
inexecution info
. If a large difference is found in theTableScan/IndexScan
row, it is likely that the statistics is incorrect. If a large difference is found in other rows, the problem might not be in the statistics. -
select count(*)
. When the execution plan contains ajoin
operation,explain analyze
might take a long time. You can check whether the problem is in the statistics by executingselect count(*)
for the conditions onTableScan/IndexScan
and comparing therow count
information in theexplain
result.
-
-
3.3.3 Mitigation
-
For v3.0 and later versions, use the
SQL Bind
feature to bind the execution plan. -
Update the statistics. If you are roughly sure that the problem is caused by the statistics, dump the statistics. If the cause is outdated statistics, such as the
modify count/row count
inshow stats_meta
is greater than a certain value (for example, 0.3), or the table has an index of time column, you can try recovering by usinganalyze table
. Ifauto analyze
is configured, check whether thetidb_auto_analyze_ratio
system variable is too large (for example, greater than 0.3), and whether the current time is betweentidb_auto_analyze_start_time
andtidb_auto_analyze_end_time
. -
For other situations, report a bug.
-
-
3.4.1 The client reports the
ERROR 1265(01000) Data Truncated
error. This is because the way TiDB internally calculates the precision ofDecimal
type is incompatible with that of MySQL. This issue has been fixed in v3.0.10 (#14438).-
Cause:
In MySQL, if two large-precision
Decimal
are divided and the result exceeds the maximum decimal precision (30
), only30
digits are reserved and no error is reported;In TiDB, the calculation result is the same as in MySQL, but inside the data structure that represents
Decimal
, a field for decimal precision still retains the actual precision.Take
(0.1^30) / 10
as an example. The results in TiDB and MySQL are both0
, because the precision is30
at most. However, in TiDB, the field for decimal precision is still31
.After multiple
Decimal
divisions, even though the result is correct, this precision field could grow larger and larger, and eventually exceeds the threshold in TiDB (72
), and theData Truncated
error is reported.The multiplication of
Decimal
does not have this issue, because the out-of-bounds is bypassed, and the precision is set to the maximum precision limit. -
Solution: You can bypass this issue by manually adding
Cast(xx as decimal(a, b))
, in whicha
andb
are the target precisions.
-
-
4.1.1
sync-log = false
. Theunexpected raft log index: last_index X < applied_index Y
error is returned after the machine is powered off.This issue is expected. You can restore the Region using
tikv-ctl
. -
4.1.2 If TiKV is deployed on a virtual machine, when the virtual machine is killed or the physical machine is powered off, the
entries[X, Y] is unavailable from storage
error is reported.This issue is expected. The
fsync
of virtual machines is not reliable, so you need to restore the Region usingtikv-ctl
. -
4.1.3 For other unexpected causes, report a bug.
-
4.2.1 If the
block-cache
configuration is too large, it might cause OOM.To verify the cause of the problem, check the
block cache size
of RocksDB by selecting the corresponding instance in the monitor Grafana -> TiKV-details.Meanwhile, check whether the
[storage.block-cache] capacity = # "1GB"
parameter is set properly. By default, TiKV'sblock-cache
is set to45%
of the total memory of the machine. You need to explicitly specify this parameter when you deploy TiKV in the container, because TiKV obtains the memory of the physical machine, which might exceed the memory limit of the container. -
4.2.2 Coprocessor receives many large queries and returns a large volume of data. gRPC fails to send data as quickly as the coprocessor returns data, which results in OOM.
To verify the cause, you can check whether
response size
exceeds thenetwork outbound
traffic by viewing the monitor Grafana -> TiKV-details -> coprocessor overview. -
4.2.3 Other components occupy too much memory.
This issue is unexpected. You can report a bug.
Check the specific cause for busy by viewing the monitor Grafana -> TiKV -> errors. server is busy
is caused by the flow control mechanism of TiKV, which informs tidb/ti-client
that TiKV is currently under too much pressure and will retry later.
-
4.3.1 TiKV RocksDB encounters
write stall
.A TiKV instance has two RocksDB instances, one in
data/raft
to save the Raft log, another indata/db
to save the real data. You can check the specific cause for stall by runninggrep "Stalling" RocksDB
in the log. The RocksDB log is a file starting withLOG
, andLOG
is the current log.-
Too many
level0 sst
causes stall. You can add the[rocksdb] max-sub-compactions = 2
(or 3) parameter to speed uplevel0 sst
compaction. The compaction task from level0 to level1 is divided into several subtasks (the max number of subtasks is the value ofmax-sub-compactions
) to be executed concurrently. See case-815 in Chinese. -
Too many
pending compaction bytes
causes stall. The disk I/O fails to keep up with the write operations in business peaks. You can mitigate this problem by increasing thesoft-pending-compaction-bytes-limit
andhard-pending-compaction-bytes-limit
of the corresponding CF.-
The default value of
[rocksdb.defaultcf] soft-pending-compaction-bytes-limit
is64GB
. If the pending compaction bytes reaches the threshold, RocksDB slows down the write speed. You can set[rocksdb.defaultcf] soft-pending-compaction-bytes-limit
to128GB
. -
The default value of
hard-pending-compaction-bytes-limit
is256GB
. If the pending compaction bytes reaches the threshold (this is not likely to happen, because RocksDB slows down the write after the pending compaction bytes reachessoft-pending-compaction-bytes-limit
), RocksDB stops the write operation. You can sethard-pending-compaction-bytes-limit
to512GB
. -
If the disk I/O capacity fails to keep up with the write for a long time, it is recommended to scale up your disk. If the disk throughput reaches the upper limit and causes write stall (for example, the SATA SSD is much lower than NVME SSD), while the CPU resources is sufficient, you may apply a compression algorithm of higher compression ratio. This way, the CPU resources is traded for disk resources, and the pressure on the disk is eased.
-
If the default CF compaction sees a high pressure, change the
[rocksdb.defaultcf] compression-per-level
parameter from["no", "no", "lz4", "lz4", "lz4", "zstd", "zstd"]
to["no", "no", "zstd", "zstd", "zstd", "zstd", "zstd"]
.
-
-
Too many memtables causes stall. This usually occurs when the amount of instant writes is large and the memtables flush to the disk slowly. If the disk write speed cannot be improved, and this issue only occurs during business peaks, you can mitigate it by increasing the
max-write-buffer-number
of the corresponding CF.- For example, set
[rocksdb.defaultcf] max-write-buffer-number
to8
(5
by default). Note that this might cause more memory usage in the peak, because more memtables might be in the memory.
- For example, set
-
-
4.3.2
scheduler too busy
-
Serious write conflict.
latch wait duration
is high. You can viewlatch wait duration
in the monitor Grafana -> TiKV-details -> scheduler prewrite/scheduler commit. When the write tasks pile up in the scheduler, the pending write tasks exceed the threshold set in[storage] scheduler-pending-write-threshold
(100MB). You can verify the cause by viewing the metric corresponding toMVCC_CONFLICT_COUNTER
. -
Slow write causes write tasks to pile up. The data being written to TiKV exceeds the threshold set by
[storage] scheduler-pending-write-threshold
(100MB). Refer to 4.5.
-
-
4.3.3
raftstore is busy
. The processing of messages is slower than the receiving of messages. The short-termchannel full
status does not affect the service, but if the error persists for a long time, it might cause Leader switch.append log
encounters stall. Refer to 4.3.1.append log duration
is high, which causes slow processing of messages. You can refer to 4.5 to analyze whyappend log duration
is high.- raftstore receives a large batch of messages in an instant (check in the TiKV Raft messages dashboard), and fails to process them. Usually the short-term
channel full
status does not affect the service.
-
4.3.4 TiKV coprocessor is in a queue. The number of piled up tasks exceeds
coprocessor threads * readpool.coprocessor.max-tasks-per-worker-[normal|low|high]
. Too many large queries leads to the tasks piling up in coprocessor. You need to check whether a execution plan change causes a large number of table scan operations. Refer to 3.3.
-
4.4.1 Re-election because TiKV is restarted
-
After TiKV panics, it is pulled up by systemd and runs normally. You can check whether panic has occurred by viewing the TiKV log. Because this issue is unexpected, report a bug if it happens.
-
TiKV is stopped or killed by a third party and then pulled up by systemd. Check the cause by viewing
dmesg
and the TiKV log. -
TiKV is OOM, which causes restart. Refer to 4.2.
-
TiKV is hung because of dynamically adjusting
THP
(Transparent Hugepage). See case case-500 in Chinese.
-
-
4.4.2 TiKV RocksDB encounters write stall and thus results in re-election. You can check if the monitor Grafana -> TiKV-details -> errors shows
server is busy
. Refer to 4.3.1. -
4.4.3 Re-election because of network isolation.
-
4.5.1 Check whether the TiKV write is low by viewing the
prewrite/commit/raw-put
duration of TiKV gRPC (only for raw KV clusters). Generally, you can locate the slow phase according to the performance-map. Some common situations are listed as follows. -
4.5.2 The scheduler CPU is busy (only for transaction kv).
The
scheduler command duration
of prewrite/commit is longer than the sum ofscheduler latch wait duration
andstorage async write duration
. The scheduler worker has a high CPU demand, such as over 80% ofscheduler-worker-pool-size
* 100%, or the CPU resources of the entire machine are relatively limited. If the write workload is large, check if[storage] scheduler-worker-pool-size
is set too small.For other situations, report a bug.
-
4.5.3 Append log is slow.
The Raft IO/
append log duration
in TiKV Grafana is high, usually because the disk write operation is slow. You can verify the cause by checking theWAL Sync Duration max
value of RocksDB - raft.For other situations, report a bug.
-
4.5.4 The raftstore thread is busy.
The Raft Propose/
propose wait duration
is significantly larger than the append log duration in TiKV Grafana. Take the following methods:- Check whether the
[raftstore] store-pool-size
configuration value is too small. It is recommended to set the value between1
and5
and not too large. - Check whether the CPU resources on the machine are insufficient.
- Check whether the
-
4.5.5 Apply is slow.
The Raft IO/
apply log duration
in TiKV Grafana is high, which usually comes with a high Raft Propose/apply wait duration
. The possible causes are as follows:-
[raftstore] apply-pool-size
is too small (it is recommended to set the value between1
and5
and not too large), and the Thread CPU/apply CPU
is large. -
The CPU resources on the machine are insufficient.
-
Region write hot spot. A single apply thread has high CPU usage. Currently, we cannot properly address the hot spot problem on a single Region, which is being improved. To view the CPU usage of each thread, modify the Grafana expression and add
by (instance, name)
. -
RocksDB write is slow. RocksDB kv/
max write duration
is high. A single Raft log might contain multiple KVs. When writing into RocksDB, 128 KVs are written into RocksDB in a write batch. Therefore, an apply log might be associated with multiple writes in RocksDB. -
For other situations, report a bug.
-
-
4.5.6 Raft commit log is slow.
The Raft IO/
commit log duration
in TiKV Grafana is high (this metric is only supported in Grafana after v4.x). Every Region corresponds to an independent Raft group. Raft has a flow control mechanism, similar to the sliding window mechanism of TCP. You can control the size of the sliding window by configuring the[raftstore] raft-max-inflight-msgs = 256
parameter. If there is a write hot spot and thecommit log duration
is high, you can adjust the parameter, such as increasing it to1024
. -
4.5.7 For other situations, refer to the write path on performance-map and analyze the cause.
-
5.1.1 Merge
-
Empty Regions across tables cannot be merged. You need to modify the
[coprocessor] split-region-on-table
parameter in TiKV, which is set tofalse
in v4.x by default. See case-896 in Chinese. -
Region merge is slow. You can check whether the merged operator is generated by accessing the monitor dashboard in Grafana -> PD -> operator. To accelerate the merge, increase the value of
merge-schedule-limit
.
-
-
5.1.2 Add replicas or take replicas online/offline
-
The TiKV disk uses 80% of the capacity, and PD does not add replicas. In this situation, the number of miss peers increases, so TiKV needs to be scaled out. See case-801 in Chinese.
-
When a TiKV node is taken offline, some Region cannot be migrated to other nodes. This issue has been fixed in v3.0.4 (#5526). See case-870 in Chinese.
-
-
5.1.3 Balance
- The Leader/Region count is not evenly distributed. See case-394 and case-759 in Chinese. The major cause is that the balance performs scheduling based on the size of Region/Leader, so this might result in the uneven distribution of the count. In TiDB 4.0, the
[leader-schedule-policy]
parameter is introduced, which enables you to set the scheduling policy of Leader to becount
-based orsize
-based.
- The Leader/Region count is not evenly distributed. See case-394 and case-759 in Chinese. The major cause is that the balance performs scheduling based on the size of Region/Leader, so this might result in the uneven distribution of the count. In TiDB 4.0, the
-
5.2.1 PD switches Leader.
-
Cause 1: Disk. The disk where the PD node is located has full I/O load. Investigate whether PD is deployed with other components with high I/O demand and the health of the disk. You can verify the cause by viewing the monitor metrics in Grafana -> disk performance -> latency/load. You can also use the FIO tool to run a check on the disk if necessary. See case-292 in Chinese.
-
Cause 2: Network. The PD log shows
lost the TCP streaming connection
. You need to check whether there is a problem with the network between PD nodes and verify the cause by viewinground trip
in the monitor Grafana -> PD -> etcd. See case-177 in Chinese. -
Cause 3: High system load. The log shows
server is likely overloaded
. See case-214 in Chinese.
-
-
5.2.2 PD cannot elect a Leader or the election is slow.
-
PD cannot elect a Leader: The PD log shows
lease is not expired
. This issue has been fixed in v3.0.x and v2.1.19. See case-875 in Chinese. -
The election is slow: The Region loading duration is long. You can check this issue by running
grep "regions cost"
in the PD log. If the result is in seconds, such asload 460927 regions cost 11.77099s
, it means the Region loading is slow. You can enable theregion storage
feature in v3.0 by settinguse-region-storage
totrue
, which significantly reduce the Region loading duration. See case-429 in Chinese.
-
-
5.2.3 PD timed out when TiDB executes SQL statements.
-
PD doesn't have a Leader or switches Leader. Refer to 5.2.1 and 5.2.2.
-
Network issue. Check whether the network from TiDB to PD Leader is running normally by accessing the monitor Grafana -> blackbox_exporter -> ping latency.
-
PD panics. Report a bug.
-
PD is OOM. Refer to 5.3.
-
If the issue has other causes, get goroutine by running
curl http://127.0.0.1:2379/debug/pprof/goroutine?debug=2
and report a bug.
-
-
5.2.4 Other issues
-
PD reports the
FATAL
error, and the log showsrange failed to find revision pair
. This issue has been fixed in v3.0.8 (#2040). For details, see case-947 in Chinese. -
For other situations, report a bug.
-
-
5.3.1 When the
/api/v1/regions
interface is used, too many Regions might cause PD OOM. This issue has been fixed in v3.0.8 (#1986). -
5.3.2 PD OOM during the rolling upgrade. The size of gRPC messages is not limited, and the monitor shows that TCP InSegs is relatively large. This issue has been fixed in v3.0.6 (#1952).
- 5.4.1 The monitor in Grafana -> PD -> cluster -> role displays follower. The Grafana expression issue has been fixed in v3.0.8 (#1065). For details, see case-1022.
-
6.1.1 TiDB Binlog is a tool that collects changes from TiDB and provides backup and replication to downstream TiDB or MySQL platforms. For details, see TiDB Binlog on GitHub.
-
6.1.2 The
Update Time
in Pump/Drainer Status is updated normally, and no anomaly shows in the log, but no data is written to the downstream.- Binlog is not enabled in the TiDB configuration. Modify the
[binlog]
configuration in TiDB.
- Binlog is not enabled in the TiDB configuration. Modify the
-
6.1.3
sarama
in Drainer reports theEOF
error.- The Kafka client version in Drainer is inconsistent with the version of Kafka. You need to modify the
[syncer.to] kafka-version
configuration.
- The Kafka client version in Drainer is inconsistent with the version of Kafka. You need to modify the
-
6.1.4 Drainer fails to write to Kafka and panics, and Kafka reports the
Message was too large
error.-
The binlog data is too large, so the single message written to Kafka is too large. You need to modify the following configuration of Kafka:
message.max.bytes=1073741824 replica.fetch.max.bytes=1073741824 fetch.message.max.bytes=1073741824
For details, see case-789 in Chinese.
-
-
6.1.5 Inconsistent data in upstream and downstream
-
Some TiDB nodes do not enable binlog. For v3.0.6 or later versions, you can check the binlog status of all the nodes by accessing the http://127.0.0.1:10080/info/all interface. For versions earlier than v3.0.6, you can check the binlog status by viewing the configuration file.
-
Some TiDB nodes go into the
ignore binlog
status. For v3.0.6 or later versions, you can check the binlog status of all the nodes by accessing the http://127.0.0.1:10080/info/all interface. For versions earlier than v3.0.6, check the TiDB log to see whether it contains theignore binlog
keyword. -
The value of the timestamp column is inconsistent in upstream and downstream.
-
This is caused by different time zones. You need to ensure that Drainer is in the same time zone as the upstream and downstream databases. Drainer obtains its time zone from
/etc/localtime
and does not support theTZ
environment variable. See case-826 in Chinese. -
In TiDB, the default value of timestamp is
null
, but the same default value in MySQL 5.7 (not including MySQL 8) is the current time. Therefore, when the timestamp in upstream TiDB isnull
and the downstream is MySQL 5.7, the data in the timestamp column is inconsistent. You need to runset @@global.explicit_defaults_for_timestamp=on;
in the upstream before enabling binlog.
-
-
For other situations, report a bug.
-
-
6.1.6 Slow replication
-
The downstream is TiDB/MySQL, and the upstream performs frequent DDL operations. See case-1023 in Chinese.
-
The downstream is TiDB/MySQL, and the table to be replicated has no primary key and no unique index, which causes reduced performance in binlog. It is recommended to add the primary key or unique index.
-
If the downstream outputs to files, check whether the output disk or network disk is slow.
-
For other situations, report a bug.
-
-
6.1.7 Pump cannot write binlog and reports the
no space left on device
error.- The local disk space is insufficient for Pump to write binlog data normally. You need to clean up the disk space and then restart Pump.
-
6.1.8 Pump reports the
fail to notify all living drainer
error when it is started.-
Cause: When Pump is started, it notifies all Drainer nodes that are in the
online
state. If it fails to notify Drainer, this error log is printed. -
Solution: Use the binlogctl tool to check whether each Drainer node is normal or not. This is to ensure that all Drainer nodes in the
online
state are working normally. If the state of a Drainer node is not consistent with its actual working status, use the binlogctl tool to change its state and then restart Pump. See the case fail-to-notify-all-living-drainer.
-
-
6.1.9 Drainer reports the
gen update sqls failed: table xxx: row data is corruption []
error.- Trigger: The upstream performs DML operations on this table while performing
DROP COLUMN
DDL. This issue has been fixed in v3.0.6. See case-820 in Chinese.
- Trigger: The upstream performs DML operations on this table while performing
-
6.1.10 Drainer replication is hung. The process remains active but the checkpoint is not updated.
- This issues has been fixed in v3.0.4. See case-741 in Chinese.
-
6.1.11 Any component panics.
-
6.2.1 TiDB Data Migration (DM) is a migration tool that supports data migration from MySQL/MariaDB into TiDB. For details, see DM on GitHub.
-
6.2.2
Access denied for user 'root'@'172.31.43.27' (using password: YES)
shows when you runquery status
or check the log.- The database related passwords in all the DM configuration files should be encrypted by
dmctl
. If a database password is empty, it is unnecessary to encrypt the password. Cleartext passwords can be used since v1.0.6. - During DM operation, the user of the upstream and downstream databases must have the corresponding read and write privileges. Data Migration also prechecks the corresponding privileges automatically while starting the data replication task.
- To deploy different versions of DM-worker/DM-master/dmctl in a DM cluster, see the case study on AskTUG in Chinese.
- The database related passwords in all the DM configuration files should be encrypted by
-
6.2.3 A replication task is interrupted with the
driver: bad connection
error returned.-
The
driver: bad connection
error indicates that an anomaly has occurred in the connection between DM and the downstream TiDB database (such as network failure, TiDB restart and so on), and that the data of the current request has not yet been sent to TiDB.- For versions earlier than DM 1.0.0 GA, stop the task by running
stop-task
and then restart the task by runningstart-task
. - For DM 1.0.0 GA or later versions, an automatic retry mechanism for this type of error is added. See #265.
- For versions earlier than DM 1.0.0 GA, stop the task by running
-
-
6.2.4 A replication task is interrupted with the
invalid connection
error.-
The
invalid connection
error indicates that an anomaly has occurred in the connection between DM and the downstream TiDB database (such as network failure, TiDB restart, TiKV busy and so on), and that a part of the data for the current request has been sent to TiDB. Because DM has the feature of concurrently replicating data to the downstream in replication tasks, several errors might occur when a task is interrupted. You can check these errors by runningquery-status
orquery-error
.- If only the
invalid connection
error occurs during the incremental replication process, DM retries the task automatically. - If DM does not retry or fails to retry automatically because of version problems (automatic retry is introduced in v1.0.0-rc.1), use
stop-task
to stop the task and then usestart-task
to restart the task.
- If only the
-
-
6.2.5 The relay unit reports the error
event from * in * diff from passed-in event *
, or a replication task is interrupted with an error that fails to get or parse binlog, such asget binlog error ERROR 1236 (HY000) and binlog checksum mismatch, data may be corrupted returned
-
During the process that DM pulls relay log or the incremental replication, this two errors might occur if the size of the upstream binlog file exceeds 4 GB.
-
Cause: When writing relay logs, DM needs to perform event verification based on binlog positions and the binlog file size, and store the replicated binlog positions as checkpoints. However, the official MySQL uses uint32 to store binlog positions, which means the binlog position for a binlog file over 4 GB overflows, and then the errors above occur.
-
Solution:
- For relay processing units, manually recover replication.
- For binlog replication processing units, manually recover replication.
-
-
6.2.6 The DM replication is interrupted, and the log returns
ERROR 1236 (HY000) The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.
-
Check whether the master binlog is purged.
-
Check the position information recorded in
relay.meta
.-
relay.meta
has recorded the empty GTID information. DM-worker saves the GTID information in memory torelay.meta
when it exits or in every 30s. When DM-worker does not obtain the upstream GTID information, it saves the empty GTID information torelay.meta
. See case-772 in Chinese. -
The binlog event recorded in
relay.meta
triggers the incomplete recover process and records the wrong GTID information. This issue is fixed in v1.0.2, and might occur in earlier versions.
-
-
-
6.2.7 The DM replication process returns an error
Error 1366: incorrect utf8 value eda0bdedb29d(\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd)
.- This value cannot be successfully written into MySQL 8.0 or TiDB, but can be written into MySQL 5.7. You can skip the data format check by enabling the
tidb_skip_utf8_check
parameter.
- This value cannot be successfully written into MySQL 8.0 or TiDB, but can be written into MySQL 5.7. You can skip the data format check by enabling the
-
6.3.1 TiDB Lightning is a tool for fast full import of large amounts of data into a TiDB cluster. See TiDB Lightning on GitHub.
-
6.3.2 Import speed is too slow.
-
region-concurrency
is set too high, which causes thread contention and reduces performance. Three ways to troubleshoot:- The setting can be found from the start of the log by searching
region-concurrency
. - If TiDB Lightning shares a server with other services (for example, Importer), you must manually set
region-concurrency
to 75% of the total number of CPU cores on that server. - If there is a quota on CPU (for example, limited by Kubernetes settings), TiDB Lightning might not be able to read this out. In this case,
region-concurrency
must also be manually reduced.
- The setting can be found from the start of the log by searching
-
Every additional index introduces a new KV pair for each row. If there are N indices, the actual size to be imported would be approximately (N+1) times the size of the Mydumper output. If the indices are negligible, you may first remove them from the schema, and add them back via
CREATE INDEX
after the import is complete. -
The version of TiDB Lightning is old. Try the latest version, which might improve the import speed.
-
-
6.3.3
checksum failed: checksum mismatched remote vs local
.-
Cause 1: The table might already have data. These old data can affect the final checksum.
-
Cause 2: If the checksum of the target database is 0, which means nothing is imported, it is possible that the cluster is too hot and fails to take in any data.
-
Cause 3: If the data source is generated by the machine and not backed up by Mydumper, ensure it respects the constrains of the table. For example:
AUTO_INCREMENT
columns need to be positive, and do not contain the value “0”.- UNIQUE and PRIMARY KEYs must not have duplicate entries.
-
Solution: See Troubleshooting Solution.
-
-
6.3.4
Checkpoint for … has invalid status:(error code)
-
Cause: Checkpoint is enabled, and Lightning/Importer has previously abnormally exited. To prevent accidental data corruption, TiDB Lightning will not start until the error is addressed. The error code is an integer less than 25, with possible values as
0, 3, 6, 9, 12, 14, 15, 17, 18, 20 and 21
. The integer indicates the step where the unexpected exit occurs in the import process. The larger the integer is, the later the exit occurs. -
Solution: See Troubleshooting Solution.
-
-
6.3.5
ResourceTemporarilyUnavailable("Too many open engines …: 8")
- Cause: The number of concurrent engine files exceeds the limit specified by tikv-importer. This could be caused by misconfiguration. In addition, even when the configuration is correct, if tidb-lightning has exited abnormally before, an engine file might be left at a dangling open state, which could cause this error as well.
- Solution: See Troubleshooting Solution.
-
6.3.6
cannot guess encoding for input file, please convert to UTF-8 manually
-
Cause: TiDB Lightning only supports the UTF-8 and GB-18030 encodings. This error means the file is not in any of these encodings. It is also possible that the file has mixed encoding, such as containing a string in UTF-8 and another string in GB-18030, due to historical ALTER TABLE executions.
-
Solution: See Troubleshooting Solution.
-
-
6.3.7
[sql2kv] sql encode error = [types:1292]invalid time format: '{1970 1 1 0 45 0 0}'
-
Cause: A timestamp type entry has a time value that does not exist. This is either because of DST changes or because the time value has exceeded the supported range (from Jan 1, 1970 to Jan 19, 2038).
-
Solution: See Troubleshooting Solution.
-
-
7.1.1
GC life time is shorter than transaction duration
.The transaction duration exceeds the GC lifetime (10 minutes by default).
You can increase the GC lifetime by modifying the
tidb_gc_life_time
system variable. Generally, it is not recommended to modify this parameter, because changing it might cause many old versions to pile up if this transaction has a large number ofUPDATE
andDELETE
statements. -
7.1.2
txn takes too much time
.This error is returned when you commit a transaction that has not been committed for a long time (over 590 seconds).
If your application needs to execute a transaction of such a long time, you can increase the
[tikv-client] max-txn-time-use = 590
parameter and the GC lifetime to avoid this issue. It is recommended to check whether your application needs such a long transaction time. -
7.1.3
coprocessor.go
reportsrequest outdated
.This error is returned when the coprocessor request sent to TiKV waits in a queue at TiKV for over 60 seconds.
You need to investigate why the TiKV coprocessor is in a long queue.
-
7.1.4
region_cache.go
reports a large number ofswitch region peer to next due to send request fail
, and the error message iscontext deadline exceeded
.The request for TiKV timed out and triggers the region cache to switch the request to other nodes. You can continue to run the
grep "<addr> cancelled
command on theaddr
field in the log and take the following steps according to thegrep
results:-
send request is cancelled
: The request timed out during the sending phase. You can investigate the monitoring Grafana -> TiDB -> Batch Client/Pending Request Count by TiKV
and see whether the Pending Request Count is greater than 128:- If the value is greater than 128, the sending goes beyond the processing capacity of KV, so the sending piles up.
- If the value is not greater than 128, check the log to see if the report is caused by the operation and maintenance changes of the corresponding KV; otherwise, this error is unexpected, and you need to report a bug.
-
wait response is cancelled
: The request timed out after it is sent to TiKV. You need to check the response time of the corresponding TiKV address and the Region logs in PD and KV at that time.
-
-
7.1.5
distsql.go
reportsinconsistent index
.The data index seems to be inconsistent. Run the
admin check table <TableName>
command on the table where the reported index is. If the check fails, disable garbage collection by running the following command, and report a bug:SET GLOBAL tidb_gc_enable = 0;
-
7.2.1
key is locked
.The read and write have conflict. The read request encounters data that has not been committed and needs to wait until the data is committed.
A small number of this error has no impact on the business, but a large number of this error indicates that the read-write conflict is severe in your business.
-
7.2.2
write conflict
.This is the write-write conflict in optimistic transactions. If multiple transactions modify the same key, only one transaction succeed and other transactions automatically obtain the timestamp again and retry the operation, with no impact on the business.
If the conflict is severe, it might cause transaction failure after multiple retries. In this case, it is recommended to use the pessimistic lock.
-
7.2.3
TxnLockNotFound
.This transaction commit is too slow, which is rolled back by other transactions after TTL (3 seconds for a small transaction by default). This transaction will automatically retry, so the business is usually not affected.
-
7.2.4
PessimisticLockNotFound
.Similar to
TxnLockNotFound
. The pessimistic transaction commit is too slow and thus rolled back by other transactions. -
7.2.5
stale_epoch
.The request epoch is outdated, so TiDB re-sends the request after updating the routing. The business is not affected. Epoch changes when Region has a split/merge operation or a replica is migrated.
-
7.2.6
peer is not leader
.The request is sent to a replica that is not Leader. If the error response indicates which replica is the latest Leader, TiDB updates the local routing according the error and sends a new request to the latest Leader. Usually, the business is not affected.
In v3.0 and later versions, TiDB tries other peers if the request to the previous Leader fails, which might lead to frequent
peer is not leader
in TiKV log. You can check theswitch region peer to next due to send request fail
log of the corresponding Region in TiDB to determine the root cause of the sending failure. For details, refer to 7.1.4.This error might also be returned if a Region has no Leader due to other reasons. For details, see 4.4.