引言
MySQL 出现元数据锁等待属于高危现象,为降低风险,DBA维护了定时任务用于扫描并自动 kill 发生元数据锁等待的 SQL,但由于扫描 SQL 的过滤条件不够完善可能导致误伤。
本文介绍一个案例,定时任务 kill 掉的是从库回放时等待元数据锁的 DDL,因此导致复制中断。
分析过程中还发现了之前没有注意到的细节,比如 performance_schema.threads 表与 information_schema.processlist 表中的 user 不同。
现象
时间:230421 02:43:17
现象:从库复制中断
分析
复制报错
Last_SQL_Errno: 1317
mysql> SHOW SLAVE STATUS G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: x.x.x.x
Master_User: replicater
Master_Port: 3358
Connect_Retry: 60
Master_Log_File: mysql-bin.099648
Read_Master_Log_Pos: 477602352
Relay_Log_File: relay-log.034716
Relay_Log_Pos: 649610541
Relay_Master_Log_File: mysql-bin.099635
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB: test
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1317
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'ab48115d-54f1-11ed-b1ad-98be943df1b2:1503740030' at master log mysql-bin.099635, end_log_pos 649610642. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Skip_Counter: 0
Exec_Master_Log_Pos: 649610408
Relay_Log_Space: 15510713895
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1317
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'ab48115d-54f1-11ed-b1ad-98be943df1b2:1503740030' at master log mysql-bin.099635, end_log_pos 649610642. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1277236
Master_UUID: ab48115d-54f1-11ed-b1ad-98be943df1b2
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 230421 02:43:17
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: ab48115d-54f1-11ed-b1ad-98be943df1b2:939017276-1505660384
Executed_Gtid_Set: 220a3b42-1b98-11e8-9faf-246e965a9b15:1-5457129831,
a8264a5f-1bdb-11e8-a610-246e965a9b89:1-359192:453572-4801979790,
ab48115d-54f1-11ed-b1ad-98be943df1b2:1-1503740029
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.02 sec)
mysql> select * from performance_schema.replication_applier_status_by_worker limit 1 G
*************************** 1. row ***************************
CHANNEL_NAME:
WORKER_ID: 1
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: ab48115d-54f1-11ed-b1ad-98be943df1b2:1503740030
LAST_ERROR_NUMBER: 1317
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction 'ab48115d-54f1-11ed-b1ad-98be943df1b2:1503740030' at master log mysql-bin.099635, end_log_pos 649610642; Error 'Query execution was interrupted' on query. Default database: 'report'. Query: 'alter table ob_check_worker modify column operate_value varchar(600) comment "执行信息"'
LAST_ERROR_TIMESTAMP: 2023-04-21 02:43:17
1 row in set (0.03 sec)
显示从库复制线程执行 DDL 被 kill,表明存在两个操作:
-
DML,持有元数据锁导致 DDL 被阻塞; -
kill 操作,导致复制中断。
因此分析该案例就是要找到这两个操作的发起方。
SQL
主库,分析报错 GTID 对应 SQL
[root@DB data]# /export/servers/mysql/bin/mysqlbinlog -uadmin -p -h --no-defaults -vv --base64-output=decode-rows --include-gtids='ab48115d-54f1-11ed-b1ad-98be943df1b2:1503740030' /export/data/mysql/data/mysql-bin.099635 > zk.log
查看 SQL
#230421 2:00:22 server id 1277236 end_log_pos 649610473 CRC32 0xaf63dd2f GTID last_committed=39254 sequence_number=39255 rbr_only=no
SET @@SESSION.GTID_NEXT= 'ab48115d-54f1-11ed-b1ad-98be943df1b2:1503740030'/*!*/;
# at 649610473
#230421 2:00:22 server id 1277236 end_log_pos 649610642 CRC32 0xe328c43d Query thread_id=3865104 exec_time=1074 error_code=0
use `report`/*!*/;
SET TIMESTAMP=1682013622/*!*/;
SET @@session.pseudo_thread_id=3865104/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1411383296/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=33/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
alter table ob_check_worker modify column operate_value varchar(600) comment "执行信息"
从库,查看错误日志
2023-04-21T02:43:17.918277+08:00 7 [ERROR] Slave SQL for channel '': Worker 1 failed executing transaction 'ab48115d-54f1-11ed-b1ad-98be943df1b2:1503740030' at master log mysql-bin.099635, end_log_pos 649610642; Error 'Query execution was interrupted' on query. Default database: 'report'. Query: 'alter table ob_check_worker modify column operate_value varchar(600) comment "执行信息"', Error_code: 1317
SQL
alter table ob_check_worker modify column operate_value varchar(600) comment "执行信息";
从库执行报错
Error 'Query execution was interrupted' on query
可见,从库报错事务与主库写入的事务相同,都是这条 DDL。
可是,什么操作会 kill 从库的 DDL,又是什么操作会阻塞 DDL 呢?
由于复制中断的时间为 2023-04-21 02:43:17,因此怀疑与备份有关。
查看备份时间
[root@localhost dumps]# cat 2023-04-21/metadata
Started dump at: 2023-04-21 03:01:05
SHOW MASTER STATUS:
Log: mysql-bin.099617
Pos: 985607115
GTID:220a3b42-1b98-11e8-9faf-246e965a9b15:1-5457129831,
a8264a5f-1bdb-11e8-a610-246e965a9b89:1-359192:453572-4801979790,
ab48115d-54f1-11ed-b1ad-98be943df1b2:1-1503740029
SHOW SLAVE STATUS:
Host: x.x.x.x
Log: mysql-bin.099635
Pos: 649610408
GTID:220a3b42-1b98-11e8-9faf-246e965a9b15:1-5457129831,
a8264a5f-1bdb-11e8-a610-246e965a9b89:1-359192:453572-4801979790,
ab48115d-54f1-11ed-b1ad-98be943df1b2:1-1503740029
Finished dump at: 2023-04-21 03:49:29
由于这里仅展示成功的备份,因此再次检查全部备份任务。
备份
备份任务显示 2023-04-21 有两次备份,其中第一个备份失败,开始时间与结束时间分别为 2023-04-21 02:12:07 与 2023-04-21 02:50:08。
查看失败任务的备份日志。
======================================
[2023-04-21 02:12:09] start backup
[2023-04-21 02:12:09] backup method:mysqldump
[2023-04-21 02:12:09] backup keep days:0
[2023-04-21 02:12:09] this is slave server
[2023-04-21 02:12:10] check mydumper
[2023-04-21 02:12:12] mydumper has been installed
[2023-04-21 02:12:12] start backup with mydumper
[2023-04-21 02:43:17] mydumper finish with failed
[2023-04-21 02:43:17] MySQL databases backup failed
[2023-04-21 02:43:17] error in log
[2023-04-21 02:43:17] mydumper finish with failed
MySQL databases backup failed
其中显示 2023-04-21 02:43:17 备份失败。
检查备份命令,使用逻辑备份 mydumper。
echo_info "start backup with mydumper" >> ${mysql_backup_log}
mydumper
--host=127.0.0.1
--port=3358
--user="${user_name}"
--password="${user_password}"
--compress
--threads=6
--build-empty-files
--trx-consistency-only --rows=10000
--long-query-guard=120
--statement-size=4000000
--outputdir="${current_backup_dir}" 1>${mysql_dump_Log} 2>${mysql_dump_Err}
if [[ $? -eq 0 ]]
then
echo_info "mydumper finish with success" >> ${mysql_backup_log}
else
echo_info "mydumper finish with failed" >> ${mysql_backup_log}
echo_info "${mysql_backup_failed}" >> ${mysql_backup_log}
echo_info "mydumper finish with failed" >> ${mysql_dump_Err}
fi
其中:
-
long-query-guard=120,表示如果有长查询阻塞备份,120s 超时后备份退出; -
kill-long-queries,未指定该参数,否则如果有长查询阻塞备份,120s 超时后 kill 长查询,备份不退出。
表明备份不会 kill 长查询,因此 kill 操作并非备份发起,接下来查看定时任务。
定时任务 – kill 元数据锁
定时任务显示有两条自动 kill 操作。
第一个 kill 掉了该 DDL,因此导致复制中断。
第二个 kill 掉了备份,因此导致备份失败。
梳理时间线:
-
2023-04-21 02:43,发送邮件,只读库元数据锁,已kill -
2023-04-21 02:43:17,备份失败 -
2023-04-21 02:43:17,复制停止
怀疑是备份导致 DDL 元数据锁等待,但是理论上只需要 kill 备份,但是也 kill 掉了 DDL。
因此仍存在以下三个问题:
-
DDL 被阻塞,可是为什么先 kill DDL,而且 DDL 的 user 显示 admin,不应该是 replicater 或 system user 吗? -
为什么先 kill DDL 后 kill select,而且备份的执行时间显示 2023-04-21 02:12:12,执行耗时显示 0? -
为什么显示的备份 SQL 与 DDL 不是同一张表?
定时任务判断是否发生元数据锁等待的 SQL 如下所示。
select count(*) as lock_num from information_schema.processlist where STATE like '%metadata lock%';
发现元数据锁等待时,扫描需要 kill 的 SQL 如下所示,注意其中没有指定排序规则。
select
t3.trx_mysql_thread_id,
t2.SQL_TEXT,
t1.PROCESSLIST_DB,
t1.PROCESSLIST_USER,
t1.PROCESSLIST_HOST,
t1.PROCESSLIST_COMMAND,
t3.trx_started,
t4.TIME
from
performance_schema.threads t1
join performance_schema.events_statements_current t2 on t1.THREAD_ID = t2.THREAD_ID
join information_schema.innodb_trx t3 on t3.trx_mysql_thread_id = t1.PROCESSLIST_ID
join information_schema.processlist t4 on t4.ID = t3.trx_mysql_thread_id
where
t1.PROCESSLIST_USER not in("bi_cx", "magpie")
and t4.Command not like "%Binlog Dump%"
其中:
-
过滤条件,bi_cx、magpie 是抽数账号; -
过滤条件,Binlog Dump 表明是主库 dump 线程。
处理
将复制停止后重新启动恢复正常。
mysql> stop slave;
Query OK, 0 rows affected (0.03 sec)
mysql> start slave;
Query OK, 0 rows affected (0.07 sec)
mysql> SHOW SLAVE STATUS G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: x.x.x.x
Master_User: replicater
Master_Port: 3358
Connect_Retry: 60
Master_Log_File: mysql-bin.099649
Read_Master_Log_Pos: 284946684
Relay_Log_File: relay-log.034716
Relay_Log_Pos: 649610541
Relay_Master_Log_File: mysql-bin.099635
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB: test
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 649610408
Relay_Log_Space: 16391802250
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 33267
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1277236
Master_UUID: ab48115d-54f1-11ed-b1ad-98be943df1b2
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Waiting for dependent transaction to commit
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: ab48115d-54f1-11ed-b1ad-98be943df1b2:939017276-1505791337
Executed_Gtid_Set: 220a3b42-1b98-11e8-9faf-246e965a9b15:1-5457129831,
a8264a5f-1bdb-11e8-a610-246e965a9b89:1-359192:453572-4801979790,
ab48115d-54f1-11ed-b1ad-98be943df1b2:1-1503740029
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.02 sec)
接下来分析 DDL 被 kill 的原因。
复现
select + ddl
如下所示,如果从库先执行 select,然后主库执行 ddl,innodb_trx 表中只能查到 select。
slave1 | master | slave2 |
---|---|---|
select id, sleep(60) from t3_bak limit 1; | ||
alter table t3_bak engine=innodb; | ||
select * from information_schema.innodb_trx select id, sleep(60) from t3_bak limit 1; |
执行show processlist
命令显示发生元数据锁等待。
mysql> SHOW PROCESSLIST;
+----+-------------+-----------------+---------+---------+------+--------------------------------------------------------+------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-------------+-----------------+---------+---------+------+--------------------------------------------------------+------------------------------------------+
| 2 | admin | 127.0.0.1:34468 | NULL | Query | 0 | starting | SHOW PROCESSLIST |
| 3 | system user | | NULL | Connect | 1839 | Waiting for master to send event | NULL |
| 4 | system user | | NULL | Connect | 10 | Slave has read all relay log; waiting for more updates | NULL |
| 5 | system user | | test_zk | Connect | 45 | Waiting for table metadata lock | alter table t3_bak engine=innodb |
| 6 | system user | | NULL | Connect | 1839 | Waiting for an event from Coordinator | NULL |
| 7 | system user | | NULL | Connect | 1839 | Waiting for an event from Coordinator | NULL |
| 8 | system user | | NULL | Connect | 1839 | Waiting for an event from Coordinator | NULL |
| 9 | system user | | NULL | Connect | 1839 | Waiting for an event from Coordinator | NULL |
| 10 | system user | | NULL | Connect | 1839 | Waiting for an event from Coordinator | NULL |
| 11 | system user | | NULL | Connect | 1839 | Waiting for an event from Coordinator | NULL |
| 12 | system user | | NULL | Connect | 1839 | Waiting for an event from Coordinator | NULL |
| 13 | admin | 127.0.0.1:42606 | test_zk | Query | 48 | User sleep | select id, sleep(60) from t3_bak limit 1 |
+----+-------------+-----------------+---------+---------+------+--------------------------------------------------------+------------------------------------------+
12 rows in set (0.00 sec)
查看元数据锁等待 SQL,显示查到了一条记录,原因是事务表 innodb_trx 中还没有该 DDL。
mysql> select t3.trx_mysql_thread_id, t2.SQL_TEXT, t1.PROCESSLIST_DB, t1.PROCESSLIST_USER, t1.PROCESSLIST_HOST, t1.PROCESSLIST_COMMAND, t3.trx_started, t4.TIME from performance_schema.threads t1 join performance_schema.events_statements_current t2 on t1.THREAD_ID = t2.THREAD_ID join information_schema.innodb_trx t3 on t3.trx_mysql_thread_id = t1.PROCESSLIST_ID join information_schema.processlist t4 on t4.ID = t3.trx_mysql_thread_id where t1.PROCESSLIST_USER not in("bi_cx", "magpie") and t4.Command not like "%Binlog Dump%" G
*************************** 1. row ***************************
trx_mysql_thread_id: 13
SQL_TEXT: select id, sleep(60) from t3_bak limit 1
PROCESSLIST_DB: test_zk
PROCESSLIST_USER: admin
PROCESSLIST_HOST: 127.0.0.1
PROCESSLIST_COMMAND: Query
trx_started: 2023-04-25 17:46:40
TIME: 59
1 row in set (0.00 sec)
mysql> select * from information_schema.innodb_trxG
*************************** 1. row ***************************
trx_id: 8017
trx_state: RUNNING
trx_started: 2023-04-25 17:47:40
trx_requested_lock_id: NULL
trx_wait_started: NULL
trx_weight: 0
trx_mysql_thread_id: 5
trx_query: alter table t3_bak engine=innodb
trx_operation_state: reading clustered index
trx_tables_in_use: 1
trx_tables_locked: 0
trx_lock_structs: 0
trx_lock_memory_bytes: 1136
trx_rows_locked: 0
trx_rows_modified: 0
trx_concurrency_tickets: 0
trx_isolation_level: REPEATABLE READ
trx_unique_checks: 1
trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
trx_adaptive_hash_latched: 0
trx_adaptive_hash_timeout: 0
trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.00 sec)
从库执行 DDL 的线程如下所示。
mysql> select * from information_schema.processlist where info='alter table t3_bak engine=innodb' G
*************************** 1. row ***************************
ID: 5
USER: system user
HOST:
DB: test_zk
COMMAND: Connect
TIME: 50
STATE: Waiting for table metadata lock
INFO: alter table t3_bak engine=innodb
1 row in set (0.00 sec)
其中:
-
USER: system user -
COMMAND: Connect -
STATE: Waiting for table metadata lock
ddl + select
如下所示,如果主库执行 ddl,然后从库先执行 select,innodb_trx 表中可以同时查到 select 与 DDL。
master | slave 1 | slave 2 |
---|---|---|
alter table t3_bak engine=innodb; | ||
select id, sleep(60) from t3_bak limit 1; | ||
select * from information_schema.innodb_trx select id, sleep(60) from t3_bak limit 1; alter table t3_bak engine=innodb; |
执行show processlist
命令显示发生元数据锁等待。
mysql> SHOW PROCESSLIST;
+----+-------------+-----------------+---------+---------+------+--------------------------------------------------------+------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-------------+-----------------+---------+---------+------+--------------------------------------------------------+------------------------------------------+
| 2 | admin | 127.0.0.1:53079 | NULL | Query | 0 | starting | SHOW PROCESSLIST |
| 3 | system user | | NULL | Connect | 220 | Waiting for master to send event | NULL |
| 4 | system user | | NULL | Connect | 43 | Slave has read all relay log; waiting for more updates | NULL |
| 5 | system user | | test_zk | Connect | 76 | Waiting for table metadata lock | alter table t3_bak engine=innodb |
| 6 | system user | | NULL | Connect | 220 | Waiting for an event from Coordinator | NULL |
| 7 | system user | | NULL | Connect | 220 | Waiting for an event from Coordinator | NULL |
| 8 | system user | | NULL | Connect | 220 | Waiting for an event from Coordinator | NULL |
| 9 | system user | | NULL | Connect | 220 | Waiting for an event from Coordinator | NULL |
| 10 | system user | | NULL | Connect | 220 | Waiting for an event from Coordinator | NULL |
| 11 | system user | | NULL | Connect | 220 | Waiting for an event from Coordinator | NULL |
| 12 | system user | | NULL | Connect | 220 | Waiting for an event from Coordinator | NULL |
| 13 | admin | 127.0.0.1:60238 | test_zk | Query | 41 | User sleep | select id, sleep(60) from t3_bak limit 1 |
+----+-------------+-----------------+---------+---------+------+--------------------------------------------------------+------------------------------------------+
12 rows in set (0.00 sec)
查看元数据锁等待 SQL,显示查到了两条记录,原因是事务表 innodb_trx 中有该 DDL。
mysql> select t3.trx_mysql_thread_id, t2.SQL_TEXT, t1.PROCESSLIST_DB, t1.PROCESSLIST_USER, t1.PROCESSLIST_HOST, t1.PROCESSLIST_COMMAND, t3.trx_started, t4.TIME from performance_schema.threads t1 join performance_schema.events_statements_current t2 on t1.THREAD_ID = t2.THREAD_ID join information_schema.innodb_trx t3 on t3.trx_mysql_thread_id = t1.PROCESSLIST_ID join information_schema.processlist t4 on t4.ID = t3.trx_mysql_thread_id where t1.PROCESSLIST_USER not in("bi_cx", "magpie") and t4.Command not like "%Binlog Dump%" G
*************************** 1. row ***************************
trx_mysql_thread_id: 5
SQL_TEXT: alter table t3_bak engine=innodb
PROCESSLIST_DB: NULL
PROCESSLIST_USER: admin
PROCESSLIST_HOST: 127.0.0.1
PROCESSLIST_COMMAND: Connect
trx_started: 2023-04-25 18:09:41
TIME: 83
*************************** 2. row ***************************
trx_mysql_thread_id: 13
SQL_TEXT: select id, sleep(60) from t3_bak limit 1
PROCESSLIST_DB: test_zk
PROCESSLIST_USER: admin
PROCESSLIST_HOST: 127.0.0.1
PROCESSLIST_COMMAND: Query
trx_started: 2023-04-25 18:09:43
TIME: 48
2 rows in set (0.00 sec)
mysql> select * from information_schema.innodb_trxG
*************************** 1. row ***************************
trx_id: 8598
trx_state: RUNNING
trx_started: 2023-04-25 18:09:41
trx_requested_lock_id: NULL
trx_wait_started: NULL
trx_weight: 0
trx_mysql_thread_id: 5
trx_query: alter table t3_bak engine=innodb
trx_operation_state: NULL
trx_tables_in_use: 1
trx_tables_locked: 0
trx_lock_structs: 0
trx_lock_memory_bytes: 1136
trx_rows_locked: 0
trx_rows_modified: 0
trx_concurrency_tickets: 0
trx_isolation_level: REPEATABLE READ
trx_unique_checks: 1
trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
trx_adaptive_hash_latched: 0
trx_adaptive_hash_timeout: 0
trx_is_read_only: 0
trx_autocommit_non_locking: 0
*************************** 2. row ***************************
trx_id: 421783441636992
trx_state: RUNNING
trx_started: 2023-04-25 18:09:43
trx_requested_lock_id: NULL
trx_wait_started: NULL
trx_weight: 0
trx_mysql_thread_id: 13
trx_query: select id, sleep(60) from t3_bak limit 1
trx_operation_state: NULL
trx_tables_in_use: 1
trx_tables_locked: 0
trx_lock_structs: 0
trx_lock_memory_bytes: 1136
trx_rows_locked: 0
trx_rows_modified: 0
trx_concurrency_tickets: 0
trx_isolation_level: REPEATABLE READ
trx_unique_checks: 1
trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
trx_adaptive_hash_latched: 0
trx_adaptive_hash_timeout: 0
trx_is_read_only: 1
trx_autocommit_non_locking: 1
2 rows in set (0.00 sec)
从库执行 DDL 的线程如下所示。
mysql> select * from information_schema.processlist where info='alter table t3_bak engine=innodb' G
*************************** 1. row ***************************
ID: 5
USER: system user
HOST:
DB: test_zk
COMMAND: Connect
TIME: 74
STATE: Waiting for table metadata lock
INFO: alter table t3_bak engine=innodb
1 row in set (0.00 sec)
显示 performance_schema.threads 表与 information_schema.processlist 表中的 user 不同。
# performance_schema.threads
PROCESSLIST_USER: admin
PROCESSLIST_COMMAND: Connect
# information_schema.processlist
USER: system user
COMMAND: Connect
因此,kill 前按照该 SQL 扫描时可能误伤等锁的 DDL。
为修复该问题,修改扫描 SQL,添加过滤条件,不 kill PROCESSLIST_COMMAND=”Connect” 的连接。
AND t1.PROCESSLIST_COMMAND!="Connect"
那么,为什么两张表中的 user 不同呢?
知识点
user
根据过滤 SQL,kill 元数据锁的任务中从 performance_schema.threads 表中查询 USER 与 COMMAND。
select
t1.PROCESSLIST_DB,
t1.PROCESSLIST_USER,
t1.PROCESSLIST_HOST,
t1.PROCESSLIST_COMMAND
from
performance_schema.threads t1
查看 performance_schema.threads 表,显示有两种 TYPE,21 种 NAME。
mysql> select distinct TYPE from performance_schema.threads;
+------------+
| TYPE |
+------------+
| BACKGROUND |
| FOREGROUND |
+------------+
2 rows in set (0.01 sec)
mysql> select distinct NAME from performance_schema.threads;
+----------------------------------------+
| NAME |
+----------------------------------------+
| thread/sql/main |
| thread/sql/thread_timer_notifier |
| thread/innodb/io_log_thread |
| thread/innodb/io_ibuf_thread |
| thread/innodb/io_read_thread |
| thread/innodb/io_write_thread |
| thread/innodb/page_cleaner_thread |
| thread/innodb/srv_error_monitor_thread |
| thread/innodb/srv_lock_timeout_thread |
| thread/innodb/srv_monitor_thread |
| thread/innodb/srv_master_thread |
| thread/innodb/srv_worker_thread |
| thread/innodb/srv_purge_thread |
| thread/innodb/buf_dump_thread |
| thread/innodb/dict_stats_thread |
| thread/sql/signal_handler |
| thread/sql/compress_gtid_table |
| thread/sql/slave_io |
| thread/sql/slave_sql |
| thread/sql/slave_worker |
| thread/sql/one_connection |
+----------------------------------------+
21 rows in set (0.00 sec)
查看与从库相关的线程。
mysql> select * from performance_schema.threads where NAME like '%slave%' G
*************************** 1. row ***************************
THREAD_ID: 1291939
NAME: thread/sql/slave_io
TYPE: FOREGROUND
PROCESSLIST_ID: 1291913
PROCESSLIST_USER: admin
PROCESSLIST_HOST: x.x.x.x
PROCESSLIST_DB: NULL
PROCESSLIST_COMMAND: Connect
PROCESSLIST_TIME: 280742
PROCESSLIST_STATE: Waiting for master to send event
PROCESSLIST_INFO: NULL
PARENT_THREAD_ID: 1291738
ROLE: NULL
INSTRUMENTED: YES
HISTORY: YES
CONNECTION_TYPE: NULL
THREAD_OS_ID: 14219
*************************** 2. row ***************************
THREAD_ID: 1291940
NAME: thread/sql/slave_sql
TYPE: FOREGROUND
PROCESSLIST_ID: 1291914
PROCESSLIST_USER: admin
PROCESSLIST_HOST: x.x.x.x
PROCESSLIST_DB: NULL
PROCESSLIST_COMMAND: Connect
PROCESSLIST_TIME: 0
PROCESSLIST_STATE: Slave has read all relay log; waiting for more updates
PROCESSLIST_INFO: NULL
PARENT_THREAD_ID: 1291738
ROLE: NULL
INSTRUMENTED: YES
HISTORY: YES
CONNECTION_TYPE: NULL
THREAD_OS_ID: 14220
*************************** 3. row ***************************
THREAD_ID: 1291941
NAME: thread/sql/slave_worker
TYPE: FOREGROUND
PROCESSLIST_ID: 1291915
PROCESSLIST_USER: admin
PROCESSLIST_HOST: x.x.x.x
PROCESSLIST_DB: report
PROCESSLIST_COMMAND: Connect
PROCESSLIST_TIME: 0
PROCESSLIST_STATE: Waiting for an event from Coordinator
PROCESSLIST_INFO: NULL
PARENT_THREAD_ID: 1291940
ROLE: NULL
INSTRUMENTED: YES
HISTORY: YES
CONNECTION_TYPE: NULL
THREAD_OS_ID: 14221
*************************** 4. row ***************************
THREAD_ID: 1291942
NAME: thread/sql/slave_worker
TYPE: FOREGROUND
PROCESSLIST_ID: 1291916
PROCESSLIST_USER: admin
PROCESSLIST_HOST: x.x.x.x
PROCESSLIST_DB: report
PROCESSLIST_COMMAND: Connect
PROCESSLIST_TIME: 0
PROCESSLIST_STATE: Waiting for an event from Coordinator
PROCESSLIST_INFO: NULL
PARENT_THREAD_ID: 1291940
ROLE: NULL
INSTRUMENTED: YES
HISTORY: YES
CONNECTION_TYPE: NULL
THREAD_OS_ID: 14222
*************************** 5. row ***************************
THREAD_ID: 1291943
NAME: thread/sql/slave_worker
TYPE: FOREGROUND
PROCESSLIST_ID: 1291917
PROCESSLIST_USER: admin
PROCESSLIST_HOST: x.x.x.x
PROCESSLIST_DB: NULL
PROCESSLIST_COMMAND: Connect
PROCESSLIST_TIME: 0
PROCESSLIST_STATE: Waiting for an event from Coordinator
PROCESSLIST_INFO: NULL
PARENT_THREAD_ID: 1291940
ROLE: NULL
INSTRUMENTED: YES
HISTORY: YES
CONNECTION_TYPE: NULL
THREAD_OS_ID: 14223
*************************** 6. row ***************************
THREAD_ID: 1291944
NAME: thread/sql/slave_worker
TYPE: FOREGROUND
PROCESSLIST_ID: 1291918
PROCESSLIST_USER: admin
PROCESSLIST_HOST: x.x.x.x
PROCESSLIST_DB: NULL
PROCESSLIST_COMMAND: Connect
PROCESSLIST_TIME: 0
PROCESSLIST_STATE: Waiting for an event from Coordinator
PROCESSLIST_INFO: NULL
PARENT_THREAD_ID: 1291940
ROLE: NULL
INSTRUMENTED: YES
HISTORY: YES
CONNECTION_TYPE: NULL
THREAD_OS_ID: 14224
*************************** 7. row ***************************
THREAD_ID: 1291945
NAME: thread/sql/slave_worker
TYPE: FOREGROUND
PROCESSLIST_ID: 1291919
PROCESSLIST_USER: admin
PROCESSLIST_HOST: x.x.x.x
PROCESSLIST_DB: NULL
PROCESSLIST_COMMAND: Connect
PROCESSLIST_TIME: 0
PROCESSLIST_STATE: Waiting for an event from Coordinator
PROCESSLIST_INFO: NULL
PARENT_THREAD_ID: 1291940
ROLE: NULL
INSTRUMENTED: YES
HISTORY: YES
CONNECTION_TYPE: NULL
THREAD_OS_ID: 14225
*************************** 8. row ***************************
THREAD_ID: 1291946
NAME: thread/sql/slave_worker
TYPE: FOREGROUND
PROCESSLIST_ID: 1291920
PROCESSLIST_USER: admin
PROCESSLIST_HOST: x.x.x.x
PROCESSLIST_DB: NULL
PROCESSLIST_COMMAND: Connect
PROCESSLIST_TIME: 0
PROCESSLIST_STATE: Waiting for an event from Coordinator
PROCESSLIST_INFO: NULL
PARENT_THREAD_ID: 1291940
ROLE: NULL
INSTRUMENTED: YES
HISTORY: YES
CONNECTION_TYPE: NULL
THREAD_OS_ID: 14226
*************************** 9. row ***************************
THREAD_ID: 1291947
NAME: thread/sql/slave_worker
TYPE: FOREGROUND
PROCESSLIST_ID: 1291921
PROCESSLIST_USER: admin
PROCESSLIST_HOST: x.x.x.x
PROCESSLIST_DB: NULL
PROCESSLIST_COMMAND: Connect
PROCESSLIST_TIME: 8
PROCESSLIST_STATE: Waiting for an event from Coordinator
PROCESSLIST_INFO: NULL
PARENT_THREAD_ID: 1291940
ROLE: NULL
INSTRUMENTED: YES
HISTORY: YES
CONNECTION_TYPE: NULL
THREAD_OS_ID: 14227
*************************** 10. row ***************************
THREAD_ID: 1291948
NAME: thread/sql/slave_worker
TYPE: FOREGROUND
PROCESSLIST_ID: 1291922
PROCESSLIST_USER: admin
PROCESSLIST_HOST: x.x.x.x
PROCESSLIST_DB: NULL
PROCESSLIST_COMMAND: Connect
PROCESSLIST_TIME: 8
PROCESSLIST_STATE: Waiting for an event from Coordinator
PROCESSLIST_INFO: NULL
PARENT_THREAD_ID: 1291940
ROLE: NULL
INSTRUMENTED: YES
HISTORY: YES
CONNECTION_TYPE: NULL
THREAD_OS_ID: 14228
10 rows in set (0.00 sec)
显示复制线程的 USER 与 COMMAND 如下所示。
PROCESSLIST_USER: admin
PROCESSLIST_COMMAND: Connect
之前有人提 bug #77293 Wrong processlist_user in threads table for replication threads,performance_schema.threads 表中显示的 user 为启动复制的登录用户名,存在歧义,建议将该内部线程的 user 调整为 system user,但是官方并未采纳,经验证,高版本 8.0.31 中依然存在该问题。
官方文档中显示 PROCESSLIST_USER 是与复制线程相关联的用户,而实际上该用户就是执行start slave
命令启动复制的连接使用的登陆账号。
PROCESSLIST_USER
The user associated with a foreground thread,
NULL
for a background thread.
那么,执行show slave status
命令时返回的 user 保存在哪里呢?
具体是保存在 performance_schema.replication_connection_configuration 数据表中,在执行change master to
命令搭建复制后持久化保存。
mysql> select * from performance_schema.replication_connection_configuration G
*************************** 1. row ***************************
CHANNEL_NAME:
HOST: 127.0.0.1
PORT: 3341
USER: replicater
NETWORK_INTERFACE:
AUTO_POSITION: 0
SSL_ALLOWED: NO
SSL_CA_FILE:
SSL_CA_PATH:
SSL_CERTIFICATE:
SSL_CIPHER:
SSL_KEY:
SSL_VERIFY_SERVER_CERTIFICATE: NO
SSL_CRL_FILE:
SSL_CRL_PATH:
CONNECTION_RETRY_INTERVAL: 60
CONNECTION_RETRY_COUNT: 86400
HEARTBEAT_INTERVAL: 30.000
TLS_VERSION:
1 row in set (0.00 sec)
到这里回答了上面提到的以下三个问题中的第一个问题,下面分析后两个问题。
-
DDL 被阻塞,可是为什么先 kill DDL,而且 DDL 的 user 显示 admin,不应该是 replicater 或 system user 吗? -
为什么先 kill DDL 后 kill select,而且备份的执行时间显示 2023-04-21 02:12:12,执行耗时显示 0? -
为什么显示的备份 SQL 与 DDL 不是同一张表?
time
根据扫描 SQL,分别从 information_schema.innodb_trx 与 information_schema.processlist 中获取事务的开始时间与执行用时。
select
t3.trx_started,
t4.TIME
from
information_schema.innodb_trx t3 on t3.trx_mysql_thread_id = t1.PROCESSLIST_ID
join information_schema.processlist t4 on t4.ID = t3.trx_mysql_thread_id;
如下所示,一个事务中发起多次查询操作,分别查看 processlist 与 innodb_trx 表中的时间字段。
session 1 | session 2 |
---|---|
begin; select * from t3_bak limit 1; |
|
select time from information_schema.processlist; 24 select trx_started from information_schema.innodb_trx 2023-04-25 19:29:41 |
|
select * from t3_bak limit 2; | |
select time from information_schema.processlist; 3 select trx_started from information_schema.innodb_trx 2023-04-25 19:29:41 |
显示:
-
processlist.time,同一个事务中多次查询从头开始计时; -
innodb_trx.trx_started,同一个事务中多次查询保持不变。
因此 innodb_trx.trx_started 有参考性,processlist.time 无参考性。
以下两个线程被 kill 时相关信息如下所示。
# DDL
执行时间:2023-04-21 02:18:16
执行耗时:2575
kill SQL:alter table ob_check_worker
# 备份
执行时间:2023-04-21 02:12:12
执行耗时:0
kill SQL:SELECT FROM ob_shipment_backdata
其中:
-
备份的执行时间更早,原因是备份事务先开启; -
备份的执行耗时更短,原因是备份中同一个事务执行多次查询; -
备份 SQL 与 DDL 不是同一张表,原因是扫描 SQL 执行时在备份其他表,DDL 操作的表已备份,锁未释放。
然后分析最后一个问题,备份为什么会阻塞 DDL,备份会全程阻塞 DDL 吗?
mydumper
公司使用的备份工具是 mydumper,相比于物理备份,缺点是慢,优点是节省空间。
mydumper 命令示例如下所示。
mydumper
--host=127.0.0.1
--port=3341
--user=admin
--password=3341
--compress
--threads=6
--build-empty-files
--trx-consistency-only --rows=10000
--long-query-guard=120
--statement-size=4000000
--outputdir=/export/
备份最重要的是保证数据一致性,命令中指定 trx-consistency-only,因此可以将备份理解为一个大事务,因此会阻塞备份期间执行的 DDL。
下面测试验证。
session1 | mydumper | session 2 |
---|---|---|
start dump | ||
alter table t1 drop column c; | ||
Waiting for table metadata lock | ||
finish dump | ||
Query OK, 0 rows affected (10.39 sec) |
开启 general log
mysql> show variables like '%gene%';
+------------------+----------------------------------------------------+
| Variable_name | Value |
+------------------+----------------------------------------------------+
| general_log | OFF |
| general_log_file | /export/zhangkai321/mysql/3341/data/test.log |
+------------------+----------------------------------------------------+
2 rows in set (0.01 sec)
mysql> set global general_log=1;
Query OK, 0 rows affected (0.00 sec)
准备一张小表,用于 DDL
mysql> show table status like 't1' G
*************************** 1. row ***************************
Name: t1
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 3
Avg_row_length: 5461
Data_length: 16384
Max_data_length: 0
Index_length: 16384
Data_free: 0
Auto_increment: NULL
Create_time: 2023-04-24 16:08:58
Update_time: NULL
Check_time: NULL
Collation: utf8_general_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)
备份,其中指定单线程备份
[root@test export]# mydumper --host=127.0.0.1 --port=3341 --user=admin --password=3341 --compress --threads=1 --build-empty-files --trx-consistency-only --rows=10000 --long-query-guard=120 --statement-size=4000000
** (mydumper:30489): WARNING **: Using trx_consistency_only, binlog coordinates will not be accurate if you are writing to non transactional tables.
查看 general log
开始备份
开始 DDL
DDL 执行期间阻塞,执行show processlist
命令显示元数据锁等待,因此复现备份导致锁表。
mysql> SHOW PROCESSLIST;
+----+-------+-----------------+---------+---------+------+---------------------------------+------------------------------------------------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-------+-----------------+---------+---------+------+---------------------------------+------------------------------------------------------------------------------------------------------+
| 40 | admin | 127.0.0.1:35374 | test_zk | Query | 10 | Waiting for table metadata lock | alter table t1 drop column c |
| 41 | admin | 127.0.0.1:37941 | NULL | Query | 0 | starting | SHOW PROCESSLIST |
| 43 | admin | 127.0.0.1:38641 | NULL | Query | 0 | Sending data | SELECT /*!40001 SQL_NO_CACHE */ * FROM `test_zk`.`t3_bak` WHERE (`id` >= 8975507 AND `id` < 8985753) |
+----+-------+-----------------+---------+---------+------+---------------------------------+------------------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)
备份完成
查看备份元数据文件。
[root@test export]# cat export-20230424-160845/metadata
Started dump at: 2023-04-24 16:08:45
SHOW MASTER STATUS:
Log: mysql-bin.000003
Pos: 18976
GTID:d16716c7-c94e-11ed-a186-fa163e19c3b7:1-70
Finished dump at: 2023-04-24 16:08:58
梳理下时间线:
-
16:08:45,start dump -
16:08:45.881375,FLUSH TABLES WITH READ LOCK -
16:08:45.884198,START TRANSACTION /*!40108 WITH CONSISTENT SNAPSHOT */ -
16:08:45.885019,START TRANSACTION /*!40108 WITH CONSISTENT SNAPSHOT */ -
16:08:45.885167,UNLOCK TABLES /* trx-only */ -
16:08:47.694331,SELECT /*!40001 SQL_NO_CACHE */ * FROM test_zk
.t1
-
16:08:47.772229,alter table t1 drop column c,start ddl -
16:08:58.113131,Quit,finish dump -
16:08:58.17,finish ddl
这里有两个问题:
-
general log 中记录的 DDL 的时间是开始时间吗? -
DDL 是在备份完成后执行成功的吗?
经验证,DDL 开始后收到请求 general log 中立即打印,然后发生元数据锁等待。备份完成后,事务提交写入 binlog。
执行mysqlbinlog --base64-output=decode-row -vv mysql-bin.000003
命令分析 binlog。
#230424 16:08:47 server id 3341 end_log_pos 19149 CRC32 0x034e63fe Query thread_id=40 exec_time=11 error_code=0
SET TIMESTAMP=1682323727/*!*/;
alter table t1 drop column c
/*!*/;
其中:
-
DDL 开始时间,230424 16:08:47 -
DDL 执行用时,exec_time=11
因此:
-
binlog 中记录的并不是文件的更新时间,而是事务的开始时间,并记录执行用时,计算可得事务的结束时间; -
备份可以理解为一个大事务,因此会阻塞备份期间执行的 DDL。
结论
告警显示从库复制线程执行 DDL 被 kill,表明存在两个操作:
-
DML,持有元数据锁导致 DDL 被阻塞; -
kill 操作,导致复制中断。
实际上是备份阻塞了 DDL,随候自动 kill 元数据锁的任务 kill 了从库的两个线程,包括复制线程的 DDL 以及备份任务。
发生元数据锁等待时理论上只需要 kill 备份,但是也 kill 掉了 DDL。
因此存在以下四个问题:
-
DDL 被阻塞,可是为什么先 kill DDL,而且 DDL 的 user 显示 admin,不应该是 replicater 或 system user 吗? -
为什么先 kill DDL 后 kill select,而且备份的执行时间显示 2023-04-21 02:12:12,执行耗时显示 0? -
为什么显示的备份 SQL 与 DDL 不是同一张表? -
为什么备份阻塞 DDL?
依次分析原因:
-
被 kill 的复制线程的 user 显示 admin 的原因是使用 admin 创建连接后启动复制,而 performance_schema.threads 表与 information_schema.processlist 表中的 user 不同; -
DDL 被备份阻塞,但是显示的执行耗时比备份早的原因是 information_schema.innodb_trx 与 information_schema.processlist 中记录的时间不同,前者记录事务的开始时间,同一个事务中多次查询保持不变,后者记录单词查询的开始时间,同一个事务中多次查询从头开始计时; -
备份 SQL 与 DDL 不是同一张表的原因是备份事务中先备份 DDL 操作的表,然后开始 DDL,DDL 结束之前备份到了其他表,但是由于未释放锁因此阻塞 DDL; -
备份阻塞 DDL 的原因是 mydumper 的备份过程可以理解为一个大事务,DDL 开始后 general log 就有记录,但是直到事务提交才写入 binlog,其中记录事务的开始时间与执行用时。
待办
-
mydumper 原理 -
mysql 元数据锁的作用
参考教程
-
MySQL 中的-Error_code:1317-爱可生
-
Bug #77293 Wrong processlist_user in threads table for replication threads
原文始发于微信公众号(丹柿小院):MySQL 从库备份阻塞 DDL 后复制线程被 kill 导致复制中断
版权声明:本文内容由互联网用户自发贡献,该文观点仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 举报,一经查实,本站将立刻删除。
文章由极客之音整理,本文链接:https://www.bmabk.com/index.php/post/178536.html