一次MySQL Hang死故障分析过程
之前的一篇文章中介绍了如何去看信号量等待日志,这篇结合一个实际例子去看下
背景介绍
一台线上MySQL实例运行中Hang死情况,导致数据库不可用, 当时对应的 show engine innodb statusG日志如下:
020-11-25T07:45:35.437594+08:00 2887925 [Note] Aborted connection 2887925 to db: 'xxx' user: 'xxx' host: 'xxxx' (Got an error reading communication packets)
2020-11-25T09:33:15.302725+08:00 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140477271320320 has waited at ibuf0ibuf.cc line 3439 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x7fc5a5faa520 created in file buf0buf.cc line 1460
a writer (thread id 140477635114752) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: fffffffff0000000
Last time read locked in file ibuf0ibuf.cc line 4552
Last time write locked in file /dba/app/mysql_home/5.7.19/storage/innobase/ibuf/ibuf0ibuf.cc line 406
2020-11-25T09:33:15.302783+08:00 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140477265372928 has waited at btr0cur.cc line 976 for 241.00 seconds the semaphore:
SX-lock on RW-latch at 0x7fc3189b57b0 created in file dict0dict.cc line 2687
a writer (thread id 140477291595520) has reserved it in mode SX
number of readers 0, waiters flag 1, lock_word: 10000000
Last time read locked in file btr0cur.cc line 1008
Last time write locked in file /dba/app/mysql_home/5.7.19/storage/innobase/btr/btr0cur.cc line 976
2020-11-25T09:33:15.302798+08:00 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140477883373312 has waited at ibuf0ibuf.cc line 2518 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x7fc5a5faa520 created in file buf0buf.cc line 1460
a writer (thread id 140477635114752) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: fffffffff0000000
Last time read locked in file ibuf0ibuf.cc line 4552
Last time write locked in file /dba/app/mysql_home/5.7.19/storage/innobase/ibuf/ibuf0ibuf.cc line 406
数据库中报大量的信号量等待,时间均在240s以上
故障分析:
从日志最开始报信号量等待看起:
--Thread 140477271320320 has waited at ibuf0ibuf.cc line 3439 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x7fc5a5faa520 created in file buf0buf.cc line 1460
a writer (thread id 140477635114752) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: fffffffff0000000
Last time read locked in file ibuf0ibuf.cc line 4552
Last time write locked in file /dba/app/mysql_home/5.7.19/storage/innobase/ibuf/ibuf0ibuf.cc line 406
从上述信息可以得出以下内容:
线程140477271320320 在:ibuf0ibuf.cc line 3439
添加S锁时 发生了等待
锁创建在:buf0buf.cc line 1460
这里可以再看下之前《MySQL信号量等待日志阅读》这篇
线程140477635114752以独占方式,占有了这个锁
lock_word: fffffffff0000000 = - 268435456
此问题中 lock_word: fffffffff0000000 = - 268435456
即- X_LOCK_HALF_DECR,代表添加了X和SX锁,线程
140477271320320再次申请S锁时就发生了等待。
已知线程140477271320320等待线程140477635114752,查询线程140477635114752持有锁和锁等待情况:
--Thread 140477635114752 has waited at btr0sea.ic line 128 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0xffbbc08 created in file btr0sea.cc line 195
a writer (thread id 140458712454912) has reserved it in mode wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffffffffffff
Last time read locked in file btr0sea.ic line 146
Last time write locked in file /dba/app/mysql_home/5.7.19/storage/innobase/btr/btr0sea.cc line 1300
lock_word: ffffffffffffffff = -1
,说明存在S类型的锁且有写者在等待,等待的线程是140458712454912
继续查看140458712454912持有锁情况
--Thread 140458712454912 has waited at btr0cur.cc line 3874 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0xffbbc08 created in file btr0sea.cc line 195
a writer (thread id 140458712454912) has reserved it in mode wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffffffffffff
Last time read locked in file btr0sea.ic line 146
Last time write locked in file /dba/app/mysql_home/5.7.19/storage/innobase/btr/btr0sea.cc line 1300
140458712454912线程要申请latch的x-lock发生了等待
--Thread 140458712454912 has waited at btr0cur.cc line 3874 for 272.00 seconds the semaphore:
rw_lock_x_lock(btr_get_search_latch(index));
锁创建在:
X-lock (wait_ex) on RW-latch at 0xffbbc08 created in file btr0sea.cc line 195
a writer (thread id 140458712454912) has reserved it in mode wait exclusive
对应代码如下:
rw_lock_create(btr_search_latch_key,
btr_search_latches[i], SYNC_SEARCH_SYS);
/** Latches protecting access to adaptive hash index. */
extern rw_lock_t** btr_search_latches;
根据信息知道是一个AHI的latch,是在数据库启动时候创建的
这个锁最后一次读取和写入分别是在:
Last time read locked in file btr0sea.ic line 146
对应的函数是btr_search_s_lock_all
Last time write locked in file /dba/app/mysql_home/5.7.19/storage/innobase/btr/btr0sea.cc line 1300
对应的函数是btr_search_drop_page_hash_index
但不是140458712454912读取和写入这个锁,因为 lock_word: ffffffffffffffff = -1
,说明存在S类型的锁且写者在等待,等待的线程就是140458712454912。
查看错误日志发现,有一个truncate操作,与客户工程师确认,truncate的表100G,而且是在truncate操作执行之后,数据库就开始处于hang住状态
原因分析
通过gdb跟踪truncate 堆栈信息,发现truncate操作时会扫描page并删除对应的自适应hash索引:
#0 btr_search_drop_page_hash_index (block=0x7f186bb88760) at /mydata/Project/mysql-server/storage/innobase/btr/btr0sea.cc:1175
#1 0x0000000001b30adc in btr_search_drop_page_hash_when_freed (page_id=..., page_size=...)
at /mydata/Project/mysql-server/storage/innobase/btr/btr0sea.cc:1387
#2 0x0000000001be8b2c in fseg_free_extent (seg_inode=0x7f1863d400f2 "", space=37, page_size=..., page=3072, ahi=true,
mtr=0x7f192027c3b0) at /mydata/Project/mysql-server/storage/innobase/fsp/fsp0fsp.cc:3795
#3 0x0000000001be90e2 in fseg_free_step (header=0x7f18713d004a "", ahi=true, mtr=0x7f192027c3b0)
at /mydata/Project/mysql-server/storage/innobase/fsp/fsp0fsp.cc:3887
#4 0x0000000001afdfe8 in btr_free_but_not_root (block=0x7f186bf7c0a0, log_mode=MTR_LOG_NO_REDO)
at /mydata/Project/mysql-server/storage/innobase/btr/btr0btr.cc:1152
#5 0x0000000001afe18a in btr_free_if_exists (page_id=..., page_size=..., index_id=55, mtr=0x7f192027cae0)
at /mydata/Project/mysql-server/storage/innobase/btr/btr0btr.cc:1200
#6 0x0000000001b82112 in dict_drop_index_tree (rec=0x7f190ba68cfd "", pcur=0x7f192027cff0, mtr=0x7f192027cae0)
at /mydata/Project/mysql-server/storage/innobase/dict/dict0crea.cc:1160
#7 0x0000000001a4f926 in DropIndex::operator() (this=0x7f192027d150, mtr=0x7f192027cae0, pcur=0x7f192027cff0)
at /mydata/Project/mysql-server/storage/innobase/row/row0trunc.cc:947
#8 0x0000000001a57bd5 in IndexIterator::for_each<DropIndex> (this=0x7f192027cae0, callback=...)
at /mydata/Project/mysql-server/storage/innobase/row/row0trunc.cc:102
#9 0x0000000001a5604d in SysIndexIterator::for_each<DropIndex> (this=0x7f192027d304, callback=...)
at /mydata/Project/mysql-server/storage/innobase/row/row0trunc.cc:160
#10 0x0000000001a51b3c in row_truncate_table_for_mysql (table=0x7f17b0947c00, trx=0x7f1921d6e8c0)
at /mydata/Project/mysql-server/storage/innobase/row/row0trunc.cc:2007
#11 0x00000000018be415 in ha_innobase::truncate (this=0x7f17b0de3a30)
at /mydata/Project/mysql-server/storage/innobase/handler/ha_innodb.cc:12382
#12 0x0000000000f2d0e3 in handler::ha_truncate (this=0x7f17b0de3a30) at /mydata/Project/mysql-server/sql/handler.cc:4675
#13 0x000000000176c94d in Sql_cmd_truncate_table::handler_truncate (this=0x7f17b0006680, thd=0x7f17b0000b70,
table_ref=0x7f17b00060f0, is_tmp_table=false) at /mydata/Project/mysql-server/sql/sql_truncate.cc:244
#14 0x000000000176d1ae in Sql_cmd_truncate_table::truncate_table (this=0x7f17b0006680, thd=0x7f17b0000b70,
table_ref=0x7f17b00060f0) at /mydata/Project/mysql-server/sql/sql_truncate.cc:504
#15 0x000000000176d315 in Sql_cmd_truncate_table::execute (this=0x7f17b0006680, thd=0x7f17b0000b70)
at /mydata/Project/mysql-server/sql/sql_truncate.cc:560
#16 0x0000000001533b80 in mysql_execute_command (thd=0x7f17b0000b70, first_level=true)
at /mydata/Project/mysql-server/sql/sql_parse.cc:4856
#17 0x00000000015358ce in mysql_parse (thd=0x7f17b0000b70, parser_state=0x7f192027e690)
同时在删除表上对应的buffer_pool中的page时,需要对所有的AHI latch添加S类型锁。
/* Lock the search latch in shared mode to prevent user
from disabling AHI during the scan */
btr_search_s_lock_all();
DEBUG_SYNC_C("simulate_buffer_pool_scan");
buf_LRU_flush_or_remove_pages(id, BUF_REMOVE_ALL_NO_WRITE, 0);
btr_search_s_unlock_all();
删除的表比较大,所以会频繁的在AHI latch上添加x-lock和s-lock,140458712454912线程同样在latch上添加x-lock时,就
会进入自旋状态,通过错误日志中内容,可以看到数据库自旋状态非常严重:
OS WAIT ARRAY INFO: signal count 81353664
4191: RW-shared spins 0, rounds 192090016, OS waits 65971716
4192: RW-excl spins 0, rounds 294354293, OS waits 4971351
4193: RW-sx spins 2999653, rounds 34961484, OS waits 846918
140458712454912线程通过日志查看,是update操作,gdb调试看到,当buffpool不够用时,会LRU链表中释放page,这时
就需要删除AHI自适应hash索引
0 btr_search_drop_page_hash_index (block=0x7f6303580678) at /mydata/Project/mysql-server/storage/innobase/btr/btr0sea.cc:1300
#1 0x0000000001b6fa18 in buf_LRU_free_page (bpage=0x7f6303580678, zip=true)
at /mydata/Project/mysql-server/storage/innobase/buf/buf0lru.cc:2062
#2 0x0000000001b6d038 in buf_LRU_free_from_common_LRU_list (buf_pool=0x44af3d8, scan_all=false)
at /mydata/Project/mysql-server/storage/innobase/buf/buf0lru.cc:1078
#3 0x0000000001b6d2a7 in buf_LRU_scan_and_free_block (buf_pool=0x44af3d8, scan_all=false)
at /mydata/Project/mysql-server/storage/innobase/buf/buf0lru.cc:1120
#4 0x0000000001b6dd59 in buf_LRU_get_free_block (buf_pool=0x44af3d8)
at /mydata/Project/mysql-server/storage/innobase/buf/buf0lru.cc:1353
#5 0x0000000001b4d843 in buf_page_init_for_read (err=0x7f62f838c1b4, mode=132, page_id=..., page_size=..., unzip=0)
at /mydata/Project/mysql-server/storage/innobase/buf/buf0buf.cc:5145
#6 0x0000000001b74299 in buf_read_page_low (err=0x7f62f838c1b4, sync=false, type=64, mode=132, page_id=..., page_size=...,
unzip=false) at /mydata/Project/mysql-server/storage/innobase/buf/buf0rea.cc:149
#7 0x0000000001b755bc in buf_read_ahead_linear (page_id=..., page_size=..., inside_ibuf=0)
at /mydata/Project/mysql-server/storage/innobase/buf/buf0rea.cc:723
#8 0x0000000001b4c077 in buf_page_get_gen (page_id=..., page_size=..., rw_latch=1, guess=0x0, mode=10,
file=0x2185b88 "/mydata/Project/mysql-server/storage/innobase/btr/btr0pcur.cc", line=448, mtr=0x7f62f838cdb0,
dirty_with_no_latch=false) at /mydata/Project/mysql-server/storage/innobase/buf/buf0buf.cc:4651
#9 0x0000000001b261c7 in btr_block_get_func (page_id=..., page_size=..., mode=1,
file=0x2185b88 "/mydata/Project/mysql-server/storage/innobase/btr/btr0pcur.cc", line=448, index=0x7f629c019d10,
mtr=0x7f62f838cdb0) at /mydata/Project/mysql-server/storage/innobase/include/btr0btr.ic:63
#10 0x0000000001b27f55 in btr_pcur_move_to_next_page (cursor=0x7f627003c628, mtr=0x7f62f838cdb0)
at /mydata/Project/mysql-server/storage/innobase/btr/btr0pcur.cc:448
#11 0x0000000001a3be3a in btr_pcur_move_to_next (cursor=0x7f627003c628, mtr=0x7f62f838cdb0)
at /mydata/Project/mysql-server/storage/innobase/include/btr0pcur.ic:360
#12 0x0000000001a48b1c in row_search_mvcc (buf=0x7f627003bb00 "37736225106", mode=PAGE_CUR_G, prebuilt=0x7f627003c410,
match_mode=0, direction=1) at /mydata/Project/mysql-server/storage/innobase/row/row0sel.cc:6079
#13 0x00000000018b8d5a in ha_innobase::general_fetch (this=0x7f627003b6f0, buf=0x7f627003bb00 "37736225106", direction=1,
match_mode=0) at /mydata/Project/mysql-server/storage/innobase/handler/ha_innodb.cc:8996
#14 0x00000000018b8f83 in ha_innobase::index_next (this=0x7f627003b6f0, buf=0x7f627003bb00 "37736225106")
at /mydata/Project/mysql-server/storage/innobase/handler/ha_innodb.cc:9065
#15 0x0000000000f29627 in handler::ha_index_next (this=0x7f627003b6f0, buf=0x7f627003bb00 "37736225106")
#16 0x0000000001456e37 in rr_index (info=0x7f62f838de00) at /mydata/Project/mysql-server/sql/records.cc:475
#17 0x00000000015e2a20 in mysql_update (thd=0x7f6270000b70, fields=..., values=..., limit=18446744073709114893,
handle_duplicates=DUP_ERROR, found_return=0x7f62f838e458, updated_return=0x7f62f838e450)
at /mydata/Project/mysql-server/sql/sql_update.cc:689
#18 0x00000000015e9428 in Sql_cmd_update::try_single_table_update (this=0x7f6270006410, thd=0x7f6270000b70,
switch_to_multitable=0x7f62f838e4ff) at /mydata/Project/mysql-server/sql/sql_update.cc:2883
#19 0x00000000015e9959 in Sql_cmd_update::execute (this=0x7f6270006410, thd=0x7f6270000b70)
at /mydata/Project/mysql-server/sql/sql_update.cc:3010
#20 0x000000000153018e in mysql_execute_command (thd=0x7f6270000b70, first_level=true)
at /mydata/Project/mysql-server/sql/sql_parse.cc:3579
#21 0x00000000015358ce in mysql_parse (thd=0x7f6270000b70, parser_state=0x7f62f838f690)
at /mydata/Project/mysql-server/sql/sql_parse.cc:5574
#22 0x000000000152b1a0 in dispatch_command (thd=0x7f6270000b70, com_data=0x7f62f838fdf0, command=COM_QUERY)
at /mydata/Project/mysql-server/sql/sql_parse.cc:1461
#23 0x000000000152a092 in do_command (thd=0x7f6270000b70) at /mydata/Project/mysql-server/sql/sql_parse.cc:999
#24 0x0000000001659c80 in handle_connection (arg=0x5180df0)
at /mydata/Project/mysql-server/sql/conn_handler/connection_handler_per_thread.cc:300
#25 0x0000000001cddf8b in pfs_spawn_thread (arg=0x5ae6b80) at /mydata/Project/mysql-server/storage/perfschema/pfs.cc:2188
#26 0x00007f6317173dd5 in start_thread (arg=0x7f62f8390700) at pthread_create.c:307
#27 0x00007f6315e37ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:11
从现象来看,是应为truncate 时候,释放AHI需要频繁添加x-lock,同时其它线程也需要添加x-lock,这时发生了阻塞现象。
原文始发于微信公众号(懒人的记录):一次MySQL Hang死故障分析过程
版权声明:本文内容由互联网用户自发贡献,该文观点仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 举报,一经查实,本站将立刻删除。
文章由极客之音整理,本文链接:https://www.bmabk.com/index.php/post/125565.html