引言

本文介绍了一个大量信号量等待导致数据库 hang 进而连接打满的案例，错误日志显示主线程 self lock，并且状态为 state: enforcing dict cache limit。

结合源码分析后怀疑是主线程刷脏持锁 RW-latch，定时任务清理表定义缓存获取 dict_operation_lock 的 X-lock 时阻塞。

其中关于信号量与缓存池的原理待后续详细分析，能力有限，欢迎批评指正。

现象

时间：2023-05-17T08:56

现象：数据库连接打满，错误日志显示刷脏慢，并且存在大量信号量等待，怀疑数据库 hang。

监控

连接打满

锁等待时间飙高，并产生大量慢sql

慢sql

连接

查看当前连接，update、insert 执行用时超过1000s，并且 STATE 为 updating。

数据库最大连接数为 32000。

mysql> show variables like '%connections%';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| max_connections      | 32000 |
| max_user_connections | 32000 |
+----------------------+-------+
2 rows in set (0.00 sec)

错误日志

以下为部分错误日志，显示刷脏慢与信号量等待。

[2023-05-17]
2023-05-17T08:51:47.712549+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 6738ms. The settings might not be optimal. (flushed=36000 and evicted=0, during the time.)
2023-05-17T08:55:39.987489+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 208272ms. The settings might not be optimal. (flushed=34870 and evicted=0, during the time.)

2023-05-17T08:56:12.624302+08:00 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140366569264896 has waited at srv0srv.cc line 1989 for 241  seconds the semaphore:
 # rw_lock_x_lock(dict_operation_lock);
X-lock (wait_ex) on RW-latch at 0x7facfe7cb478 created in file dict0dict.cc line 1228
 # rw_lock_create(dict_operation_lock_key, dict_operation_lock, SYNC_DICT_OPERATION);
a writer (thread id 140366569264896) has reserved it in mode  wait exclusive


--Thread 140356247877376 has waited at row0undo.cc line 330 for 75  seconds the semaphore:
 # row_mysql_freeze_data_dictionary(trx);
S-lock on RW-latch at 0x7facfe7cb478 created in file dict0dict.cc line 1228
 # rw_lock_create(dict_operation_lock_key, dict_operation_lock, SYNC_DICT_OPERATION);
a writer (thread id 140366569264896) has reserved it in mode  wait exclusive

2023-05-17T09:03:35.247257+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 472259ms. The settings might not be optimal. (flushed=15090 and evicted=0, during the time.)
2023-05-17T09:04:18.891009+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 41643ms. The settings might not be optimal. (flushed=6594 and evicted=0, during the time.)
2023-05-17T09:04:45.085207+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 24194ms. The settings might not be optimal. (flushed=2566 and evicted=0, during the time.)

下面详细分析错误日志。

分析

page cleaner 超时异常

显示刷脏慢，从 flush_list 刷新 36000 个脏页用时超过 6s。

2023-05-17T08:51:47.712549+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 6738ms. The settings might not be optimal. (flushed=36000 and evicted=0, during the time.)

8:51出现页清理线程超时的告警，此为每秒的任务，超过 4 秒会触发告警，本次用时 6.7 秒。

2023-05-17T08:55:39.987489+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 208272ms. The settings might not be optimal. (flushed=34870 and evicted=0, during the time.)

8:55:39 再次出现页清理线程超时的告警，本次用时 208秒（3分28秒），即 8:52:11发起的页清理任务耗时 208 秒后才执行完成，一般为磁盘 IO 性能问题导致。

之后也多次出现刷脏慢，原本 1s 内可以完成的刷新甚至用时 7 分钟。

2023-05-17T09:03:35.247257+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 472259ms. The settings might not be optimal. (flushed=15090 and evicted=0, during the time.)
2023-05-17T09:04:18.891009+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 41643ms. The settings might not be optimal. (flushed=6594 and evicted=0, during the time.)
2023-05-17T09:04:45.085207+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 24194ms. The settings might not be optimal. (flushed=2566 and evicted=0, during the time.)

线程阻塞

显示信号量等待 InnoDB: A long semaphore wait。

2023-05-17T08:56:12.624302+08:00 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140366569264896 has waited at srv0srv.cc line 1989 for 241  seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7facfe7cb478 created in file dict0dict.cc line 1228
a writer (thread id 140366569264896) has reserved it in mode  wait exclusive
number of readers 4, waiters flag 1, lock_word: fffffffffffffffc
Last time read locked in file row0undo.cc line 330
Last time write locked in file /export/wyl/mysql-5.7.33/storage/innobase/srv/srv0srv.cc line 1989
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0
2023-05-17T08:56:13.972060+08:00 624285 [Note] Got an error reading communication packets

8:56:12 出现信号量等待告警，Thread 140366569264896 在 srv0srv.cc:1989 行等待信号 241 秒（4分1秒—即8:52:11），该信号是在 dict0dict.cc:1228 行创建 RW-latch 排它锁，具体是给字典缓存加排他锁。

但是被另一个等待加排他锁的线程阻塞（a writer (thread id 140366569264896），表明该线程持有 RW-latch 排它锁，但是并未记录持锁的具体方法。可见持锁的线程也是等锁的线程，表明该线程 self lock。

由于开始等待时间与 page cleaner 超时告警的开始时间基本一致，因此怀疑 a writer 被 IO 抖动导致异常或 a writer 就是 page cleaner。

等锁的方法为：

1989 rw_lock_x_lock(dict_operation_lock);

1228 rw_lock_create(dict_operation_lock_key,dict_operation_lock, SYNC_DICT_OPERATION);

根据源码，1989 rw_lock_x_lock() 方法在 srv_master_evict_from_table_cache() 内部调用，并在这里等待了 241s。

/********************************************************************//**
Make room in the table cache by evicting an unused table.
@return number of tables evicted. */
static
ulint
srv_master_evict_from_table_cache(
/*==============================*/
 ulint pct_check) /*!< in: max percent to check */
{
 ulint n_tables_evicted = 0;

1989 rw_lock_x_lock(dict_operation_lock);

 dict_mutex_enter_for_mysql();

 n_tables_evicted = dict_make_room_in_cache(
  innobase_get_table_cache_size(), pct_check);

 dict_mutex_exit_for_mysql();

 rw_lock_x_unlock(dict_operation_lock);

 return(n_tables_evicted);
}

dict_operation_lock 是在 dict0dict.cc:dict_init() 方法中创建，其中调用 1228 rw_lock_create() 方法，具体是在这里发生锁等待。

/**********************************************************************//**
Inits the data dictionary module. */
void
dict_init(void)
/*===========*/
{
 dict_operation_lock = static_cast<rw_lock_t*>(
  ut_zalloc_nokey(sizeof(*dict_operation_lock)));

 dict_sys = static_cast<dict_sys_t*>(ut_zalloc_nokey(sizeof(*dict_sys)));

 UT_LIST_INIT(dict_sys->table_LRU, &dict_table_t::table_LRU);
 UT_LIST_INIT(dict_sys->table_non_LRU, &dict_table_t::table_LRU);

 mutex_create(LATCH_ID_DICT_SYS, &dict_sys->mutex);

 dict_sys->table_hash = hash_create(
  buf_pool_get_curr_size()
  / (DICT_POOL_PER_TABLE_HASH * UNIV_WORD_SIZE));

 dict_sys->table_id_hash = hash_create(
  buf_pool_get_curr_size()
  / (DICT_POOL_PER_TABLE_HASH * UNIV_WORD_SIZE));

1228 rw_lock_create(dict_operation_lock_key,
         dict_operation_lock, SYNC_DICT_OPERATION);

 if (!srv_read_only_mode) {
  dict_foreign_err_file = os_file_create_tmpfile(NULL);
  ut_a(dict_foreign_err_file);
 }

 mutex_create(LATCH_ID_DICT_FOREIGN_ERR, &dict_foreign_err_mutex);

 dict_sys->autoinc_map = new autoinc_map_t();
}

dict_operation_lock 的注释显示，create table、drop table 等操作需要持有 X 锁，而一些其他后台进程，比如Main Thread 检查 dict cache 时，也需要获取 dict_operation_lock 的 X 锁。

/** @brief the data dictionary rw-latch protecting dict_sys
table create, drop, etc. reserve this in X-mode; 
implicit or backround operations purge, rollback, foreign key checks reserve this in S-mode;
we cannot trust that MySQL protects implicit or background operations a table drop since MySQL does not know of them;
therefore we need this; NOTE: a transaction which reserves this must keep book on the mode in trx_t::dict_operation_lock_mode */

查看其它信号量等待日志，显示大量线程等待 140366569264896 线程持有的锁。

2023-05-17T09:05:00.240603+08:00 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140365745075968 has waited at row0undo.cc line 330 for 763  seconds the semaphore:
S-lock on RW-latch at 0x7facfe7cb478 created in file dict0dict.cc line 1228
a writer (thread id 140366569264896) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffffffffffff
Last time read locked in file row0undo.cc line 330
Last time write locked in file /export/wyl/mysql-5.7.33/storage/innobase/srv/srv0srv.cc line 1989
2023-05-17T09:05:00.240622+08:00 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140358711588608 has waited at row0undo.cc line 330 for 603  seconds the semaphore:
S-lock on RW-latch at 0x7facfe7cb478 created in file dict0dict.cc line 1228
a writer (thread id 140366569264896) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffffffffffff
Last time read locked in file row0undo.cc line 330
Last time write locked in file /export/wyl/mysql-5.7.33/storage/innobase/srv/srv0srv.cc line 1989

主线程异常

Thread 140366569264896 为 MySQL 的 Main Thread，显示 state: enforcing dict cache limit，表明它处于检查字典缓冲区内存使用情况及清理状态。

上面提到，140366569264896 线程 self lock，因此是主线程在等待自己的互斥锁。

--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
719 read views open inside InnoDB
Process ID=9607, Main thread ID=140366569264896, state: enforcing dict cache limit
Number of rows inserted 181533902, updated 67507278, deleted 128946319, read 14847284319
4022.10 inserts/s, 9.60 updates/s, 0.00 deletes/s, 72051.29 reads/s
----------------------------

根据源码，srv0srv.cc:srv_master_do_active_tasks() 方法中修改主线程状态为 state: enforcing dict cache limit，然后调用 srv_master_evict_from_table_cache(50) 清理 InnoDB 层表定义缓存（字典缓存 table cache）中不需要的表。table cache 是一个 LRU 链表，用于缓存最近访问的 table 的元信息，清理时调用 rw_lock_create() 方法发生阻塞。

/*********************************************************************//**
Perform the tasks that the master thread is supposed to do when the
server is active. There are two types of tasks. The first category is
of such tasks which are performed at each inovcation of this function.
We assume that this function is called roughly every second when the
server is active. The second category is of such tasks which are
performed at some interval e.g.: purge, dict_LRU cleanup etc. */
static
void
srv_master_do_active_tasks(void)
/*============================*/
{
    ...
    if (cur_time % SRV_MASTER_DICT_LRU_INTERVAL == 0) {
  srv_main_thread_op_info = "enforcing dict cache limit";
  ulint n_evicted = srv_master_evict_from_table_cache(50);
  if (n_evicted != 0) {
   MONITOR_INC_VALUE(
    MONITOR_SRV_DICT_LRU_EVICT_COUNT, n_evicted);
  }
  MONITOR_INC_TIME_IN_MICRO_SECS(
   MONITOR_SRV_DICT_LRU_MICROSECOND, counter_time);
 }
    ...
}

调用链如下所示。

// state: enforcing dict cache limit, cur_time % SRV_MASTER_DICT_LRU_INTERVAL == 0
- srv_master_do_active_tasks

// Make room in the table cache by evicting an unused table
-- srv_master_evict_from_table_cache

--- rw_lock_x_lock(dict_operation_lock)

// Inits the data dictionary module
---- dict_init:: rw_lock_create(dict_operation_lock_key,
         dict_operation_lock, SYNC_DICT_OPERATION);

每隔 SRV_MASTER_DICT_LRU_INTERVAL 时间即 47s 调用一次该后台进程，用于检查并淘汰 LRU 链表中不需要的表。

/* Interval in seconds at which various tasks are performed by the
master thread when server is active. In order to balance the workload,
we should try to keep intervals such that they are not multiple of
each other. For example, if we have intervals for various tasks
defined as 5, 10, 15, 60 then all tasks will be performed when
current_time % 60 == 0 and no tasks will be performed when
current_time % 5 != 0. */

# define SRV_MASTER_CHECKPOINT_INTERVAL  (7)
# define SRV_MASTER_PURGE_INTERVAL      (10)
# define SRV_MASTER_DICT_LRU_INTERVAL  (47)

因此，Main Thread 检查 dict cache 获取 dict_operation_lock 的 X-lock 时阻塞。

那么，阻塞的原因是什么呢？

参考 state: enforcing dict cache limit 导致数据库 hang 的 bug report，Bug #84424 InnoDB main thread blocks all writes with a state: enforcing dict cache limit，官方不认为这是 bug，并提出以下几点建议：

increase the additional memory pool
increase total number of file handles available to MySQL
increase number of file handles for InnoDB
improve performance of the I/O on your operating system

结合 page cleaner 超时异常，怀疑是 IO 异常导致刷脏慢持锁未释放，进行阻塞主线程清理表定义缓存。

io await

云侧排查宿主机底层磁盘，发现该时间段 IO 异常，表现为宿主机 io await 高，最高达到 2ms。

可是，io await 2ms 很慢吗？

实际上，await 是单次 IO 的平均耗时，理论上是微秒量级。假设一条查询语句需要 1000 次 IO，平均每次 IO 用时 2ms，总用时 2s，因此 2ms 已经很高了。

因此，怀疑是 IO 异常导致刷脏慢持有 RW-latch 锁未释放，进而阻塞了主线程清理字典缓存的定时任务，主线程 self lock 导致锁等待与连接数增加。

Mutex LOCK_SYS_WAIT

错误日志中还可以看到大量 Mutex LOCK_SYS_WAIT。

----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 105755475
--Thread 140357802583808 has waited at lock0wait.cc line 220 for 3  seconds the semaphore:
Mutex at 0x7fb1d97a6928, Mutex LOCK_SYS_WAIT created lock0lock.cc:456, lock var 1

--Thread 140346036541184 has waited at lock0wait.cc line 220 for 3  seconds the semaphore:
Mutex at 0x7fb1d97a6928, Mutex LOCK_SYS_WAIT created lock0lock.cc:456, lock var 1

--Thread 140346971903744 has waited at lock0wait.cc line 220 for 3  seconds the semaphore:
Mutex at 0x7fb1d97a6928, Mutex LOCK_SYS_WAIT created lock0lock.cc:456, lock var 1

--Thread 140346103715584 has waited at lock0wait.cc line 220 for 3  seconds the semaphore:
Mutex at 0x7fb1d97a6928, Mutex LOCK_SYS_WAIT created lock0lock.cc:456, lock var 1

--Thread 140350321760000 has waited at lock0wait.cc line 220 for 3  seconds the semaphore:
Mutex at 0x7fb1d97a6928, Mutex LOCK_SYS_WAIT created lock0lock.cc:456, lock var 1

--Thread 140347785836288 has waited at lock0wait.cc line 220 for 3  seconds the semaphore:
Mutex at 0x7fb1d97a6928, Mutex LOCK_SYS_WAIT created lock0lock.cc:456, lock var 1

--Thread 140345126926080 has waited at lock0wait.cc line 220 for 3  seconds the semaphore:
Mutex at 0x7fb1d97a6928, Mutex LOCK_SYS_WAIT created lock0lock.cc:456, lock var 1

--Thread 140347615676160 has waited at lock0wait.cc line 220 for 3  seconds the semaphore:
Mutex at 0x7fb1d97a6928, Mutex LOCK_SYS_WAIT created lock0lock.cc:456, lock var 1

其中大量出现以下锁等待信息：

等锁，lock0wait.cc line 220
持锁，lock0lock.cc line 456

lock0wait.cc line 220 调用 lock_wait_mutex_enter() 方法，在 lock_wait_suspend_thread() 方法内部调用，用于挂起线程等待行锁。因此当热点行更新时，同时需要对挂起的线程进行锁超时检测，当挂起的线程较多时，这个检测会出现雪崩效应，导致热点行更新“更慢“。

/***************************************************************//**
Puts a user OS thread to wait for a lock to be released. If an error
occurs during the wait trx->error_state associated with thr is
!= DB_SUCCESS when we return. DB_LOCK_WAIT_TIMEOUT and DB_DEADLOCK
are possible errors. DB_DEADLOCK is returned if selective deadlock
resolution chose this transaction as a victim. */
void
lock_wait_suspend_thread(
/*=====================*/
 que_thr_t* thr) /*!< in: query thread associated with the
    user OS thread */
{
    ...
    /* InnoDB system transactions (such as the purge, and
 incomplete transactions that are being rolled back after crash
 recovery) will use the global value of
 innodb_lock_wait_timeout, because trx->mysql_thd == NULL. */
 lock_wait_timeout = trx_lock_wait_timeout_get(trx);

220 lock_wait_mutex_enter();

 trx_mutex_enter(trx);
    ...
        
    /* Release the slot for others to use */

 lock_wait_table_release_slot(slot);
    ....
}

lock0lock.cc line 456 调用 mutex_create() 方法加锁。

/*********************************************************************//**
Creates the lock system at database start. */
void
lock_sys_create(
/*============*/
 ulint n_cells) /*!< in: number of slots in lock hash table */
{
    ...
 lock_sys->waiting_threads = static_cast<srv_slot_t*>(ptr);

 lock_sys->last_slot = lock_sys->waiting_threads;

 mutex_create(LATCH_ID_LOCK_SYS, &lock_sys->mutex);

456 mutex_create(LATCH_ID_LOCK_SYS_WAIT, &lock_sys->wait_mutex);

 lock_sys->timeout_event = os_event_create(0);
    ...
}

mutex 是用于保护对象的锁，其中lock_sys->mutex与lock_sys->wait_mutex用于保护事务。

InnoDB 在每个事务中，需要扫描当前已经打开的事务列表 trx_list，并拷贝没有提交的事务 ID。在扫描事务列表 trx_list 时，会使用 kernel_mutex 加锁，这也是性能的最大瓶颈之处。

coredump 文件中显示有 54w 个线程在等待此 latch。

[root@test zhangkai321]# grep "has waited at lock0wait.cc line 220" mysql-error-20230516_2114.log | wc -l
540212

原理

latch

概念

首先需要区分 lock 与 latch，尽管两者都可以被称为”锁“，但是区别很大。

lock 锁的对象是事务，用于锁定数据库中的对象，如表、页、行；
latch 锁的对象是线程，用于锁定内存中的临界资源（共享数据），保证并发线程操作临界资源的正确性，如 dict_operation_lock、用于维护 Adaptive hash index（AHI）的 btr_search_latches、checkpoint_lock。

两者的对比见下表。

类型	lock	latch
对象	事务	线程
保护	数据库内容	内存数据结构
持续时间	整个事务过程	临界资源
模式	行锁、表锁、意向锁	读写锁、互斥锁
死锁	通过 waits-for graph、time out 等机制进行死锁检测与处理	无死锁检测与处理机制，仅通过应用程序集按所的顺序（lock leveling）保证无死锁发生
存在于	Lock Manager 的哈希表中	每个数据结构的对象中

注意 latch 是轻量级的锁，因为其要求锁定的时间必须非常短。若持续的时间长，则应用的性能会非常差。

由于 latch 没有死锁检测机制，因此出现问题时，需要从现场 core 文件排查持锁与等锁的关系。

在 InnoDB 存储引擎中，latch 又可以分为以下两种：

mutex，互斥量
RW-Lock，读写锁

参考官方文档，mutex 是一种同步机制，用于保证同一时间只有一个线程可以访问同一份数据。如果多个线程同时申请访问同一份数据，第一个线程持有 mutex，其他线程等待其释放锁。

A mutex is a synchronization mechanism used in the code to enforce that only one thread at a given time can have access to a common resource. When two or more threads executing in the server need to access the same resource, the threads compete against each other. The first thread to obtain a lock on the mutex causes the other threads to wait until the lock is released.

通过命令show engine innodb mutex可以查看 InnoDB 中 latch 的详细信息。

如下所示，列 Type 始终为 InnoDB，列 Name 显示 latch 的信息以及所在源码的位置（行数），列 Status 显示 os_waits。

mysql> show engine innodb mutex;
+--------+-----------------------------+------------+
| Type   | Name                        | Status     |
+--------+-----------------------------+------------+
| InnoDB | rwlock: dict0dict.cc:2737   | waits=7    |
| InnoDB | rwlock: fil0fil.cc:1381     | waits=6    |
| InnoDB | rwlock: dict0dict.cc:2737   | waits=68   |
| InnoDB | rwlock: dict0dict.cc:1183   | waits=960  |
| InnoDB | rwlock: log0log.cc:838      | waits=1216 |
| InnoDB | rwlock: btr0sea.cc:195      | waits=1    |
| InnoDB | rwlock: hash0hash.cc:353    | waits=2    |
| InnoDB | sum rwlock: buf0buf.cc:1460 | waits=1124 |
+--------+-----------------------------+------------+
8 rows in set (0.00 sec)

Debug 版本下，列 Status 的信息更多，具体见下表。

名称	说明
count	mutex 被请求的次数
spin_waits	spin lock（自旋锁）的次数，InnoDB 存储引擎 latch 在不能获得锁时首先自旋，若自旋后还不能获得锁，则进入等待状态
spin_rounds	自旋内部循环的总次数，每次自旋的内部循环是一个随机数。spin_rounds/spin_waits 表示平均每次自旋所需要的内部循环次数
os_waits	表示操作系统等待的次数。当 spin lock 通过自旋还不能活得 latch 时，则会进入操作系统等待状态，等待被唤醒
os_yields	进行 os_thread_yield 唤醒操作的次数
os_wait_times	操作系统等待的时间，单位为 ms

latch 争用

上面提到，一个线程想获得一个锁，但是该锁已被另一线程持有，该线程首先自旋 spin，自旋还不能获取 latch 时，进入操作系统等待状态，退出 CPU，等待被唤醒，通常表现为 latch 争用。

OS waits 表示操作系统等待的次数，因此如果 OS waits 值比较高，表明 latch 争用严重，如下所示。

[root@test zhangkai321]# cat mysql-error-20230516_2114.log |grep "OS waits"|more
RW-shared spins 0, rounds 243635055, OS waits 31916294
RW-excl spins 0, rounds 1002553336, OS waits 11537067
RW-sx spins 39373015, rounds 872584318, OS waits 14370402
RW-shared spins 0, rounds 243642818, OS waits 31916460
RW-excl spins 0, rounds 1002560958, OS waits 11537179
RW-sx spins 39373015, rounds 872584318, OS waits 14370402
RW-shared spins 0, rounds 243646973, OS waits 31916553
RW-excl spins 0, rounds 1002564853, OS waits 11537285
RW-sx spins 39373017, rounds 872584378, OS waits 14370404
RW-shared spins 0, rounds 243647098, OS waits 31916612
RW-excl spins 0, rounds 1002565006, OS waits 11537290
RW-sx spins 39373032, rounds 872584828, OS waits 14370419
RW-shared spins 0, rounds 243647105, OS waits 31916614
RW-excl spins 0, rounds 1002565038, OS waits 11537291
RW-sx spins 39373033, rounds 872584858, OS waits 14370420
RW-shared spins 0, rounds 243647143, OS waits 31916630
RW-excl spins 0, rounds 1002565332, OS waits 11537324
RW-sx spins 39373033, rounds 872584858, OS waits 14370420
RW-shared spins 0, rounds 243647143, OS waits 31916630
RW-excl spins 0, rounds 1002565422, OS waits 11537327
RW-sx spins 39373033, rounds 872584858, OS waits 14370420
RW-shared spins 0, rounds 243647153, OS waits 31916635
RW-excl spins 0, rounds 1002565623, OS waits 11537336
RW-sx spins 39373033, rounds 872584858, OS waits 14370420
RW-shared spins 0, rounds 243648940, OS waits 31917108
RW-excl spins 0, rounds 1002575838, OS waits 11537613

MySQL 中缓存的数据保存在 buffer pool 中，如果多线程并发访问或修改数据（块 page）时通过 latch 实现并发控制。

latch 争用可能的原因包括：

异常 SQL，大量物理读需要修改链表，或大量内存读频繁访问链表；
链表过长，链上有一个链的保护机制 latch，如果链表上挂的块太多，被持有的概率太大。

因此，降低 latch 争用的常见方法包括：

优化 SQL，降低内存读的数量；
增加 innodb_buffer_pool_instances 数量，每个 buffer pool 都有自己独立的 latch，通过将需要缓存的数据 hash 到不同的缓冲池中，可以并行的内存读写，最大限度减少并发操作中内存结构的争用。

实际上，只有一个 buffer pool 时，如果存在大量数据页争用，可以通过增加 innodb_buffer_pool_instances 数量，将数据页打散在不同 buffer pool 来实现。

但是，极端条件下如果单个数据页或单个 buffer pool 异常热，增加 innodb_buffer_pool_instances 无效，可以通过减小数据页的大小或降低数据页中可用于保存数据的比例大小的方式来缓解。

%iowait、await、%util

以下是常见的衡量硬盘负载的三个指标：

%iowait，单位是百分比，表示 CPU 被 IO 约束导致空闲；
await，单位是 ms，表示每个 IO 的平均耗时，包括在内核io队列内的时间和在存储设备上执行此io 的时间；
%util，单位是百分比，表示设备有 I/O（即非空闲）的时间比率，不考虑 I/O 有多少，只考虑有没有。

但是上述三个指标单纯的高，并不一定表示磁盘达到瓶颈，需要结合场景具体分析。

下面介绍两种常见的误解。

1）高 %iowait 表明系统中有许多进程在等待磁盘 IO，低 %iowait 表明 IO 没有达到瓶颈？

实际上不是，如果等待磁盘 IO 期间其他进程占用 CPU，该时间将作为用户态/系统态时间计入到它们的 CPU 使用率，而不会被计算到 %iowait，即使是 IO 完全卡死。

根本原因是 %iowait 只是反映 “CPU空闲并且有未完成的IO ” 这种状态时间的占比，两个条件缺少一个都不会计入 %iowait 的时间。

2）高 %util 说明产生的 IO 请求太多，表明 IO 达到瓶颈？

实际上也不是，由于现代硬盘设备都有并行处理多个 I/O 请求的能力，所以 %util 即使达到 100% 也不意味着设备饱和了。

举个简化的例子：某硬盘处理单个 I/O 需要0.1秒，有能力同时处理 10 个I/O 请求，那么当 10 个 I/O 请求依次顺序提交的时候，需要 1 秒才能全部完成，在 1 秒的采样周期里 %util 达到 100%；而如果 10 个 I/O 请求一次性提交的话，0.1 秒就全部完成，在 1 秒的采样周期里 %util 只有 10% 。可见，即使 %util 高达 100%，硬盘也仍然有可能还有余力处理更多的 I/O 请求，即没有达到饱和状态。很可惜，iostat 并没有用来判断饱和度的指标。

热点行更新慢

引用一篇网络教程，其中很好地解释了更新越多越慢的现象的原因。

具体是腾讯云内部系统不定期出现数据库访问行更新慢，数据库用户线程大量堆积的现象。

慢日志显示偶发大量 update 执行用时超过10s，经分析存在热点行更新，具体是短时间内多次更新同一行数据，如果更新超时，还将重试 10 次。极端场景下，叠加重试请求，会有 2000+ 线程同时更新数据库，造成大量连接等待现象。

并发更新会给行数据加锁，行锁在事务提交时释放，并唤醒其他线程继续更新。但是正常情况下热点行更新会降低数据库吞吐但不会产生数十秒的事务等待，因此怀疑加锁、释放锁、唤醒其他线程的某些环节有问题，导致大并发的极端情况下数据库性能严重下降。

复现问题后结合 pt-pmp、perf 等工具分析数据库时发现两组线程等待异常，具体涉及到两把锁：

用户线程调用的lock_wait_suspend_thread
后台线程：lock_wait_timeout_thread

lock_wait_suspend_thread 函数让所有调用线程进入 suspend 状态，挂起。当热点行更新时，只有一个线程更新其他所有线程都挂起等待行锁，因此在热点行更新时，这个函数是热点。

lock_wait_timeout_thread 是锁超时监测线程，监测是否有线程等锁超时，该线程会扫描每一个由 lock_wait_suspend_thread 进入等待的线程，判断其是否超时，如果有超时，唤醒。触发机制有 2 种：

每秒定时触发
lock_wait_suspend_thread 通知触发，用于加速检测锁超时等待，这个就是热点行更新慢的关键！

lock_wait_suspend_thread 与 lock_wait_timeout_thread 关系如下图所示：

每增加 n 个进入等待进程，其中的每一个进程都会触发一次 lock_wait_timeout_thread，而每一次 lock_wait_timeout_thread 调用都会对 m 个已经挂起线程持锁扫描，这样多出 n * m 次持锁扫描。由于持锁扫描会进一步加剧用户线程等待在 lock_wait_suspend_thread 入口锁 lock_wait_mutex_enter() 的等待，造成更多触发 lock_wait_timeout_thread 线程的机会，单次运行 lock_wait_timeout_thread 时间增加，导致进入恶性循环，所有线程都在争夺 lock_wait_mutex_enter() 锁。

因此当热点行更新时，同时需要对挂起的线程进行锁超时检测，当挂起的线程较多时，这个检测会出现雪崩效应，导致热点行更新“更慢“。

注意 lock_wait_mutex_enter 是实例级别的全局唯一锁，因此遇到热点行时会成为瓶颈。

解决热点行的常见方法包括：

减小数据页的大小
降低数据页中可用于保存数据的比例大小
增加 innodb_buffer_pool_instances 数量

InnoDB 层表定义缓存

MySQL 中将访问的数据缓存在 buffer pool 的 LRU list 中，而每次执行 SQL 时都需要访问表定义信息，也会缓存下来吗？

MySQL 的表包含表名，表空间、索引、列、约束等信息，这些表的元数据可以暂且称为表定义信息。

对于 InnoDB 来说，MySQL 在 server 层和 engine 层都有表定义信息。server 层的表定义记录在 rm文件中，而 InnoDB 层的表定义信息存储在 InnoDB 系统表中。

如果每次执行 SQL 访问表定义信息都需要从从 frm 文件或系统表中获取，效率会较低。因此 MySQL 在 server 层和 InnoDB 层都有表定义的缓存。以 MySQL 5.6 为例，参数 table_definition_cache 控制了表定义缓存中表的个数，server 层和 InnoDB 层的表定义缓存共用此参数。

server 层表定义缓存与 InnoDB 层表定义缓存都由两部分组成，包括哈希表与链表，链表基于 LRU 算法实现。

其中：

server 层表定义缓存的哈希表以表名为 key 缓存 TABLE_SHARE 对象，TABLE_SHARE 对象有引用计数和版本信息。获取 TABLE_SHARE（get_table_share）先从 HASH 查找，找不到再读取 frm 文件加载表定义信息。同时递增引用计数；
InnoDB 层表定义缓存的链表由主线程控制每47（SRV_MASTER_DICT_LRU_INTERVAL）秒检查一次 LRU 链表，清理控制缓存表个数不能超过 table_definition_cache，可以淘汰的表要求表的引用计数为 0，表的索引被自适应哈希引用计数也为 0。

结论

监控显示数据库连接打满，错误日志显示刷脏慢，并且存在大量信号量等待，怀疑数据库 hang。

经分析，怀疑是 IO 异常导致刷脏慢持有 RW-latch 锁未释放，进而阻塞了主线程清理字典缓存的定时任务，主线程 self lock 并导致锁等待与连接数增加。

其中：

宿主机 io await 高，最高达到 2ms；
主线程中拆分出的 Page Cleaner Thread 从 flush_list 中刷新脏页，IO 抖动导致长时间持锁；
Main Thread Thread state: enforcing dict cache limit，主线程定时调用 srv_master_evict_from_table_cache 清理 InnoDB 层表定义缓存，申请 RW-latch 时导致主线程 self lock；
热点行更新会导致越多越慢，原因是需要对挂起的线程进行锁超时检测，当挂起的线程较多时，这个检测会出现雪崩效应，导致出现大量 Mutex LOCK_SYS_WAIT 。解决热点行的常见方法包括：