引言

MHA 是一款成熟且开源的 MySQL 高可用程序，本文介绍一个子进程绑定 VIP 返回异常导致 MHA 切换失败的案例。

介绍

2022-10-21 23:23:23 告警显示一台数据库无响应，而主从切换程序显示切换失败，但是并不确定切换进度。

处理过程中发现，主库可以 ping 通，ssh 也通，但是登录数据库失败。从库复制关系正常。

大约三分钟以后主库登录数据库成功，但是业务反馈写入失败，报错只读，检查后才发现是主 VIP 已切换到从库，而从库未关闭只读，手动将主 VIP 绑定到主库后恢复正常。

实际上，在第一时间发现问题时尽管对于 MHA 切换流程不熟悉，但如果严格按照切换逻辑检查，完全可以及时处理进而缩短对业务的影响时间。原因三分钟就可以解决的问题最终用了半小时才解决。

在检查切换进度时首先需要关注 vip 的绑定关系是否正确，原因是 MHA 切换一开始就会将 vip 摘掉。

通过排查日志与代码，分析并复现出该问题，子进程绑定 VIP 返回异常导致 MHA 切换失败。

现场处理

监控

2022-10-21 23:23:23 zabbix 告警显示一台数据库无响应，由于当时没有截图保存，因此查看告警记录。

mysql> select * from sms where host like '%x.x.x.78%' and date_format(time,"%Y-%m-%d")='2022-10-21' limit 1 G
*************************** 1. row ***************************
             id: 2598165
           time: 2022-10-21 23:23:23
           host: x.x.x.78
   TRIGGER_NAME: MySQL已5分钟无响应
         DETAIL: 详细:MySQL is alive:1
         STATUS: PROBLEM
           type: 0
           zone: DB
      send_type: 1
sms_send_status: 1
     alarm_type: 0
1 row in set (2.02 sec)

登录服务器显示 IO 打满，IO 繁忙度超过 90%。

[2022/10/21 23:31:20] [root@DB ~]# iostat -x 1 10 
[2022/10/21 23:31:20] Linux 2.6.32-431.el6.x86_64 (DB10110078)  10/21/2022  _x86_64_ (32 CPU)
[2022/10/21 23:31:20] 
[2022/10/21 23:31:20] avg-cpu:  %user   %nice %system %iowait  %steal   %idle
[2022/10/21 23:31:20]            3.66    0.00    0.47    0.64    0.00   95.22
[2022/10/21 23:31:20] 
[2022/10/21 23:31:20] Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
[2022/10/21 23:31:20] sda               3.46  1116.02  660.05 1354.14 21223.36 30133.13    25.50     0.20    0.10   0.10  20.32
[2022/10/21 23:31:20] 
[2022/10/21 23:31:21] avg-cpu:  %user   %nice %system %iowait  %steal   %idle
[2022/10/21 23:31:21]            0.47    0.00    3.19    2.56    0.00   93.78
[2022/10/21 23:31:21] 
[2022/10/21 23:31:21] Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
[2022/10/21 23:31:21] sda               0.00     0.00 2222.00    0.00 71088.00     0.00    31.99     0.87    0.39   0.39  87.10
[2022/10/21 23:31:21] 
[2022/10/21 23:31:22] avg-cpu:  %user   %nice %system %iowait  %steal   %idle
[2022/10/21 23:31:22]            0.41    0.00    3.22    2.66    0.00   93.72
[2022/10/21 23:31:22] 
[2022/10/21 23:31:22] Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
[2022/10/21 23:31:22] sda               0.00     0.00 2320.00    0.00 74240.00     0.00    32.00     0.91    0.39   0.39  90.60
[2022/10/21 23:31:22] 
[2022/10/21 23:31:23] avg-cpu:  %user   %nice %system %iowait  %steal   %idle
[2022/10/21 23:31:23]            0.37    0.00    3.19    2.75    0.00   93.69
[2022/10/21 23:31:23] 
[2022/10/21 23:31:23] Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
[2022/10/21 23:31:23] sda               0.00     0.00 2291.00    0.00 73312.00     0.00    32.00     0.92    0.40   0.40  92.20
[2022/10/21 23:31:23] 
[2022/10/21 23:31:24] avg-cpu:  %user   %nice %system %iowait  %steal   %idle
[2022/10/21 23:31:24]            0.44    0.00    3.19    2.63    0.00   93.75
[2022/10/21 23:31:24] 
[2022/10/21 23:31:24] Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
[2022/10/21 23:31:24] sda               0.00     2.00 2193.00    4.00 70176.00    48.00    31.96     0.90    0.41   0.41  89.60
[2022/10/21 23:31:24] 
[2022/10/21 23:31:25] avg-cpu:  %user   %nice %system %iowait  %steal   %idle
[2022/10/21 23:31:25]            0.59    0.00    3.19    2.53    0.00   93.69
[2022/10/21 23:31:25] 
[2022/10/21 23:31:25] Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
[2022/10/21 23:31:25] sda               0.00     0.00 2113.00    0.00 67616.00     0.00    32.00     0.84    0.40   0.40  84.40
[2022/10/21 23:31:25] 
[2022/10/21 23:31:26] avg-cpu:  %user   %nice %system %iowait  %steal   %idle
[2022/10/21 23:31:26]            0.44    0.00    3.41    2.69    0.00   93.47
[2022/10/21 23:31:26] 
[2022/10/21 23:31:26] Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
[2022/10/21 23:31:26] sda               0.00     3.00 2281.00    3.00 72992.00    48.00    31.98     0.92    0.40   0.40  91.50
[2022/10/21 23:31:26] 
[2022/10/21 23:31:27] avg-cpu:  %user   %nice %system %iowait  %steal   %idle
[2022/10/21 23:31:27]            0.25    0.00    3.19    2.78    0.00   93.78
[2022/10/21 23:31:27] 
[2022/10/21 23:31:27] Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
[2022/10/21 23:31:27] sda               0.00     0.00 2381.00    0.00 76192.00     0.00    32.00     0.96    0.40   0.40  95.80
[2022/10/21 23:31:27] 
[2022/10/21 23:31:28] avg-cpu:  %user   %nice %system %iowait  %steal   %idle
[2022/10/21 23:31:28]            0.44    0.00    3.19    2.60    0.00   93.78
[2022/10/21 23:31:28] 
[2022/10/21 23:31:28] Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
[2022/10/21 23:31:28] sda               0.00     3.00 2181.00    7.00 69792.00    80.00    31.93     0.87    0.40   0.40  87.10
[2022/10/21 23:31:28] 
[2022/10/21 23:31:29] avg-cpu:  %user   %nice %system %iowait  %steal   %idle
[2022/10/21 23:31:29]            0.56    0.00    3.16    2.50    0.00   93.78
[2022/10/21 23:31:29] 
[2022/10/21 23:31:29] Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
[2022/10/21 23:31:29] sda               0.00     0.00 2147.00    0.00 68704.00     0.00    32.00     0.85    0.39   0.39  84.60

登录数据库

登录数据库失败

[2022/10/21 23:28:19] root@DB # /export/servers/mysql/bin/mysql -uadmin -p -hx.x.x.78
[2022/10/21 23:28:21] Enter password: 
[2022/10/21 23:28:21] 
[2022/10/21 23:28:51] ^C

登录服务器

服务器可以 ping 通

[2022/10/21 23:30:55] root@DB # ping x.x.x.78
[2022/10/21 23:30:55] PING x.x.x.78 (x.x.x.78) 56(84) bytes of data.
[2022/10/21 23:30:55] 64 bytes from x.x.x.78: icmp_seq=1 ttl=54 time=36.4 ms
[2022/10/21 23:30:56] 64 bytes from x.x.x.78: icmp_seq=2 ttl=54 time=36.2 ms
[2022/10/21 23:30:56] ^C
[2022/10/21 23:30:56] --- x.x.x.78 ping statistics ---
[2022/10/21 23:30:56] 2 packets transmitted, 2 received, 0% packet loss, time 1508ms
[2022/10/21 23:30:56] rtt min/avg/max/mdev = 36.200/36.313/36.426/0.113 ms

自动切换异常

主从切换程序显示切换失败。

MySQL 子进程绑定 VIP 返回异常导致 MHA 切换失败 — image-20230108204102202

切换进度

由于不确定切换进度，因此查看了下从库的复制状态，结果显示主从复制关系还在，其中 IO 线程与 SQL 线程中断。

[2022/10/21 23:28:54] root@DB # /export/servers/mysql/bin/mysql -uadmin -p -hx.x.x.24
[2022/10/21 23:28:56] mysql> 
[2022/10/21 23:28:57] mysql> SHOW SLAVE STATUS G
[2022/10/21 23:28:57] *************************** 1. row ***************************
[2022/10/21 23:28:57]                Slave_IO_State: 
[2022/10/21 23:28:57]                   Master_Host: x.x.x.78
[2022/10/21 23:28:57]                   Master_User: replicater
[2022/10/21 23:28:57]                   Master_Port: 3358
[2022/10/21 23:28:57]                 Connect_Retry: 60
[2022/10/21 23:28:57]               Master_Log_File: mysql-bin.196738
[2022/10/21 23:28:57]           Read_Master_Log_Pos: 360334210
[2022/10/21 23:28:57]                Relay_Log_File: relay-log.001882
[2022/10/21 23:28:57]                 Relay_Log_Pos: 360334263
[2022/10/21 23:28:57]         Relay_Master_Log_File: mysql-bin.196738
[2022/10/21 23:28:57]              Slave_IO_Running: No
[2022/10/21 23:28:57]             Slave_SQL_Running: No
[2022/10/21 23:28:57]           Exec_Master_Log_Pos: 360334210
[2022/10/21 23:28:57]         Seconds_Behind_Master: NULL

数据库可以登录，有大量 kill.sh 脚本执行中。

[2022/10/21 23:30:58] root@DB# ssh x.x.x.78
[2022/10/21 23:30:58] Last login: Fri Oct 21 23:29:18 2022 from x.x.x.71
[2022/10/21 23:30:59] [root@DB ~]# 
[2022/10/21 23:31:04] [root@DB ~]# ps -ef|grep mysql
[2022/10/21 23:31:04] root      45784  45780  0 23:21 ?        00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root      45785  45784  0 23:21 ?        00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root      45797  45785  0 23:21 ?        00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow  databases;
[2022/10/21 23:31:04] root      45803  45786  0 23:21 ?        00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where  info is not null and time>600;
[2022/10/21 23:31:04] root      45808  45806  0 23:22 ?        00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root      45811  45808  0 23:22 ?        00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root      45817  45811  0 23:22 ?        00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow  databases;
[2022/10/21 23:31:04] root      45829  45814  0 23:22 ?        00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where  info is not null and time>600;
[2022/10/21 23:31:04] root      45860  45857  0 23:23 ?        00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root      45863  45860  0 23:23 ?        00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root      45874  45863  0 23:23 ?        00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow  databases;
[2022/10/21 23:31:04] root      45880  45864  0 23:23 ?        00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where  info is not null and time>600;
[2022/10/21 23:31:04] root      46010  46007  0 23:24 ?        00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root      46013  46010  0 23:24 ?        00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root      46024  46013  0 23:24 ?        00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow  databases;
[2022/10/21 23:31:04] root      46030  46014  0 23:24 ?        00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where  info is not null and time>600;
[2022/10/21 23:31:04] root      46037  46033  0 23:25 ?        00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root      46041  46037  0 23:25 ?        00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root      46061  46041  0 23:25 ?        00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow  databases;
[2022/10/21 23:31:04] root      46067  46043  0 23:25 ?        00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where  info is not null and time>600;
[2022/10/21 23:31:04] root      46077  46074  0 23:26 ?        00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root      46080  46077  0 23:26 ?        00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root      46095  46080  0 23:26 ?        00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow  databases;
[2022/10/21 23:31:04] root      46099  46081  0 23:26 ?        00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where  info is not null and time>600;
[2022/10/21 23:31:04] root      46192  46190  0 23:27 ?        00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root      46196  46192  0 23:27 ?        00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root      46207  46196  0 23:27 ?        00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow  databases;
[2022/10/21 23:31:04] root      46213  46195  0 23:27 ?        00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where  info is not null and time>600;
[2022/10/21 23:31:04] root      46219  46215  0 23:28 ?        00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root      46220  46219  0 23:28 ?        00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root      46225  46220  0 23:28 ?        00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow  databases;
[2022/10/21 23:31:04] root      46238  46223  0 23:28 ?        00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where  info is not null and time>600;
[2022/10/21 23:31:04] root      46244  46240  0 23:29 ?        00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root      46247  46244  0 23:29 ?        00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root      46257  46247  0 23:29 ?        00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow  databases;
[2022/10/21 23:31:04] root      46263  46246  0 23:29 ?        00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where  info is not null and time>600;
[2022/10/21 23:31:04] root      46398  46392  0 23:30 ?        00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root      46399  46398  0 23:30 ?        00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root      46409  46399  0 23:30 ?        00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow  databases;
[2022/10/21 23:31:04] root      46424  46404  0 23:30 ?        00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where  info is not null and time>600;
[2022/10/21 23:31:04] root      46459  46427  0 23:30 pts/1    00:00:00 grep mysql
[2022/10/21 23:31:04] root      87541      1  0 Jul12 ?        00:00:00 /bin/sh /export/servers/mysql/bin/mysqld_safe --defaults-file=/export/servers/mysql/etc/my.cnf
[2022/10/21 23:31:04] mysql     88618  87541 99 Jul12 ?        188-00:01:25 /export/servers/mysql/bin/mysqld --defaults-file=/export/servers/mysql/etc/my.cnf --basedir=/export/servers/mysql --datadir=/export/data/mysql/data --plugin-dir=/export/servers/mysql/lib/plugin --user=mysql --log-error=/export/data/mysql/log/error.log --open-files-limit=65535 --pid-file=/export/data/mysql/mysql.pid --socket=/export/data/mysql/tmp/mysql.sock --port=3358

数据库运行中，并且可以正常登录。

[2022/10/21 23:32:30] [root@DB ~]# ps -ef|grep mysql
[2022/10/21 23:32:30] root      46698  46427  0 23:32 pts/1    00:00:00 grep mysql
[2022/10/21 23:32:30] root      87541      1  0 Jul12 ?        00:00:00 /bin/sh /export/servers/mysql/bin/mysqld_safe --defaults-file=/export/servers/mysql/etc/my.cnf
[2022/10/21 23:32:30] mysql     88618  87541 99 Jul12 ?        188-00:02:14 /export/servers/mysql/bin/mysqld --defaults-file=/export/servers/mysql/etc/my.cnf --basedir=/export/servers/mysql --datadir=/export/data/mysql/data --plugin-dir=/export/servers/mysql/lib/plugin --user=mysql --log-error=/export/data/mysql/log/error.log --open-files-limit=65535 --pid-file=/export/data/mysql/mysql.pid --socket=/export/data/mysql/tmp/mysql.sock --port=3358

23:28 数据库无法登录，23:31 数据库可以登录，不过位点不变。

[2022/10/21 23:36:51] mysql> SHOW Master STATUS; 
[2022/10/21 23:36:51] +------------------+-----------+--------------+------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[2022/10/21 23:36:51] | File             | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                                                                                                                                                                                                                                                                                                                              |
[2022/10/21 23:36:51] +------------------+-----------+--------------+------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[2022/10/21 23:36:51] | mysql-bin.196738 | 360491782 |              |                  | 0d1d233b-01f1-11ed-b58d-246e965a0842:1-4344447171,
[2022/10/21 23:36:51] 15c89410-d0f5-11ea-a06f-246e96774111:1-8557185849,
[2022/10/21 23:36:51] 216fc3f0-d38f-11e8-8449-a0369fe0e240:1-51625372,
[2022/10/21 23:36:51] 40e74288-d38f-11e8-8d6e-a0369fef393a:1-1492087776:1492136015-1498353734,
[2022/10/21 23:36:51] 86fc2719-59ac-11ea-9508-38eaa7178ee4:1,
[2022/10/21 23:36:51] aa4b4067-d16c-11ea-97c4-246e9675d891:33-19035308026,
[2022/10/21 23:36:51] c043a384-589d-11ea-b94b-38eaa7178cc1:1-2284366:2341679-1390335892 |
[2022/10/21 23:36:51] +------------------+-----------+--------------+------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[2022/10/21 23:36:51] 1 row in set (0.00 sec)
[2022/10/21 23:36:51] 
[2022/10/21 23:36:51] mysql> SHOW Master STATUS; 
[2022/10/21 23:36:51] +------------------+-----------+--------------+------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[2022/10/21 23:36:51] | File             | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                                                                                                                                                                                                                                                                                                                              |
[2022/10/21 23:36:51] +------------------+-----------+--------------+------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[2022/10/21 23:36:51] | mysql-bin.196738 | 360491782 |              |                  | 0d1d233b-01f1-11ed-b58d-246e965a0842:1-4344447171,
[2022/10/21 23:36:51] 15c89410-d0f5-11ea-a06f-246e96774111:1-8557185849,
[2022/10/21 23:36:51] 216fc3f0-d38f-11e8-8449-a0369fe0e240:1-51625372,
[2022/10/21 23:36:51] 40e74288-d38f-11e8-8d6e-a0369fef393a:1-1492087776:1492136015-1498353734,
[2022/10/21 23:36:51] 86fc2719-59ac-11ea-9508-38eaa7178ee4:1,
[2022/10/21 23:36:51] aa4b4067-d16c-11ea-97c4-246e9675d891:33-19035308026,
[2022/10/21 23:36:51] c043a384-589d-11ea-b94b-38eaa7178cc1:1-2284366:2341679-1390335892 |
[2022/10/21 23:36:51] +------------------+-----------+--------------+------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[2022/10/21 23:36:51] 1 row in set (0.00 sec)

该数据库是报表库，通过蜂巢系统从生产库自动同步至报表库。

WMS 系统运营反馈报表不更新，首先怀疑蜂巢需要重启，联系蜂巢后反馈写入报错只读。

java.sql.SQLException: The MySQL server is running with the --read-only option so it cannot execute this statement

然后查看 VIP 绑定关系，发现主从 VIP 已经都绑定到了从库，而从库只读未关闭，因此需要手动处理。

主库，主 VIP 已摘掉。

[2022/10/21 23:39:50] [root@DB ~]# ip a
[2022/10/21 23:39:50] 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
[2022/10/21 23:39:50]     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
[2022/10/21 23:39:50]     inet 127.0.0.1/8 scope host lo
[2022/10/21 23:39:50]     inet6 ::1/128 scope host 
[2022/10/21 23:39:50]        valid_lft forever preferred_lft forever
[2022/10/21 23:39:50] 2: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:39:50]     link/ether 24:6e:96:5a:08:40 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:39:50] 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
[2022/10/21 23:39:50]     link/ether 24:6e:96:5a:08:42 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:39:50]     inet x.x.x.78/24 brd x.x.x.255 scope global eth1
[2022/10/21 23:39:50]     inet6 fe80::266e:96ff:fe5a:842/64 scope link 
[2022/10/21 23:39:50]        valid_lft forever preferred_lft forever
[2022/10/21 23:39:50] 4: p4p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:39:50]     link/ether a0:36:9f:df:3c:6c brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:39:50] 5: p4p2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:39:50]     link/ether a0:36:9f:df:3c:6e brd ff:ff:ff:ff:ff:ff

从库，主从 VIP 已绑定。

[2022/10/21 23:55:13] [root@DB ~]# ip a
[2022/10/21 23:55:13] 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
[2022/10/21 23:55:13]     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
[2022/10/21 23:55:13]     inet 127.0.0.1/8 scope host lo
[2022/10/21 23:55:13]     inet6 ::1/128 scope host 
[2022/10/21 23:55:13]        valid_lft forever preferred_lft forever
[2022/10/21 23:55:13] 2: eth4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:55:13]     link/ether 24:6e:96:77:41:10 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:55:13] 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
[2022/10/21 23:55:13]     link/ether 24:6e:96:77:41:11 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:55:13]     inet x.x.x.24/24 brd x.x.x.255 scope global eth1
[2022/10/21 23:55:13]     inet x.x.x.20/32 scope global eth1
[2022/10/21 23:55:13]     inet x.x.x.21/32 scope global eth1
[2022/10/21 23:55:13]     inet6 fe80::266e:96ff:fe77:4111/64 scope link 
[2022/10/21 23:55:13]        valid_lft forever preferred_lft forever
[2022/10/21 23:55:13] 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:55:13]     link/ether 24:6e:96:77:41:12 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:55:13] 5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:55:13]     link/ether 24:6e:96:77:41:13 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:55:13] [root@DB ~]# 
[2022/10/21 23:56:11] [root@DB ~]# ps -ef|grep vrrp
[2022/10/21 23:56:11] root      38256      1  0 23:26 ?        00:00:00 /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 20 -p 150 x.x.x.20
[2022/10/21 23:56:11] root      38279      1  0 23:26 ?        00:00:00 /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 21 -p 150 x.x.x.21
[2022/10/21 23:56:11] root      41028  40942  0 23:55 pts/1    00:00:00 grep vrrp

手动处理

由于主库已恢复，从库复制关系正常，因此将从库绑定的主 VIP 摘掉后绑定到主库，相当于回滚主从切换。

从库，kill 主 VIP，发现 kill 无效。

[2022/10/21 23:57:33] [root@DB ~]# kill -9 38256
[2022/10/21 23:57:33] [root@DB ~]# 
[2022/10/21 23:57:37] [root@DB ~]# ip a
[2022/10/21 23:57:37] 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
[2022/10/21 23:57:37]     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
[2022/10/21 23:57:37]     inet 127.0.0.1/8 scope host lo
[2022/10/21 23:57:37]     inet6 ::1/128 scope host 
[2022/10/21 23:57:37]        valid_lft forever preferred_lft forever
[2022/10/21 23:57:37] 2: eth4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:57:37]     link/ether 24:6e:96:77:41:10 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:57:37] 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
[2022/10/21 23:57:37]     link/ether 24:6e:96:77:41:11 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:57:37]     inet x.x.x.24/24 brd x.x.x.255 scope global eth1
[2022/10/21 23:57:37]     inet x.x.x.20/32 scope global eth1
[2022/10/21 23:57:37]     inet x.x.x.21/32 scope global eth1
[2022/10/21 23:57:37]     inet6 fe80::266e:96ff:fe77:4111/64 scope link 
[2022/10/21 23:57:37]        valid_lft forever preferred_lft forever
[2022/10/21 23:57:37] 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:57:37]     link/ether 24:6e:96:77:41:12 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:57:37] 5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:57:37]     link/ether 24:6e:96:77:41:13 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:57:37] [root@DB ~]# 
[2022/10/21 23:58:07] [root@DB ~]# ps -ef|grep vrrp
[2022/10/21 23:58:07] root      38279      1  0 23:26 ?        00:00:00 /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 21 -p 150 x.x.x.21
[2022/10/21 23:58:07] root      41216  40942  0 23:57 pts/1    00:00:00 grep vrrp
[2022/10/21 23:58:23] [root@DB ~]# 
[2022/10/21 23:58:27] [root@DB ~]# ip a
[2022/10/21 23:58:27] 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
[2022/10/21 23:58:27]     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
[2022/10/21 23:58:27]     inet 127.0.0.1/8 scope host lo
[2022/10/21 23:58:27]     inet6 ::1/128 scope host 
[2022/10/21 23:58:27]        valid_lft forever preferred_lft forever
[2022/10/21 23:58:27] 2: eth4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:58:27]     link/ether 24:6e:96:77:41:10 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:58:27] 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
[2022/10/21 23:58:27]     link/ether 24:6e:96:77:41:11 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:58:27]     inet x.x.x.24/24 brd x.x.x.255 scope global eth1
[2022/10/21 23:58:27]     inet x.x.x.20/32 scope global eth1
[2022/10/21 23:58:27]     inet x.x.x.21/32 scope global eth1
[2022/10/21 23:58:27]     inet6 fe80::266e:96ff:fe77:4111/64 scope link 
[2022/10/21 23:58:27]        valid_lft forever preferred_lft forever
[2022/10/21 23:58:27] 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:58:27]     link/ether 24:6e:96:77:41:12 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:58:27] 5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:58:27]     link/ether 24:6e:96:77:41:13 brd ff:ff:ff:ff:ff:ff

从库，使用ip addr del命令成功摘掉主 VIP。

[2022/10/21 23:58:59] [root@DB ~]# ip addr del x.x.x.20 dev eth1
[2022/10/21 23:58:59] Warning: Executing wildcard deletion to stay compatible with old scripts.
[2022/10/21 23:58:59]          Explicitly specify the prefix length (x.x.x.20/32) to avoid this warning.
[2022/10/21 23:58:59]          This special behaviour is likely to disappear in further releases,
[2022/10/21 23:58:59]          fix your scripts!
[2022/10/21 23:58:59] [root@DB ~]# 
[2022/10/21 23:59:11] [root@DB ~]# ip a
[2022/10/21 23:59:11] 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
[2022/10/21 23:59:11]     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
[2022/10/21 23:59:11]     inet 127.0.0.1/8 scope host lo
[2022/10/21 23:59:11]     inet6 ::1/128 scope host 
[2022/10/21 23:59:11]        valid_lft forever preferred_lft forever
[2022/10/21 23:59:11] 2: eth4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:59:11]     link/ether 24:6e:96:77:41:10 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:59:11] 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
[2022/10/21 23:59:11]     link/ether 24:6e:96:77:41:11 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:59:11]     inet x.x.x.24/24 brd x.x.x.255 scope global eth1
[2022/10/21 23:59:11]     inet6 fe80::266e:96ff:fe77:4111/64 scope link 
[2022/10/21 23:59:11]        valid_lft forever preferred_lft forever
[2022/10/21 23:59:11] 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:59:11]     link/ether 24:6e:96:77:41:12 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:59:11] 5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:59:11]     link/ether 24:6e:96:77:41:13 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:59:11] [root@DB ~]# 
[2022/10/21 23:59:17] [root@DB ~]# ps -ef|grep vrrp
[2022/10/21 23:59:17] root      41296  40942  0 23:58 pts/1    00:00:00 grep vrrp

主库，手动挂载主从 VIP。

[2022/10/21 23:58:16] [root@DB ~]# /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 20 -p 150 x.x.x.20
[2022/10/21 23:59:21] [root@DB ~]# /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 20 -p 150 x.x.x.21
[2022/10/21 23:59:26] [root@DB ~]# 
[2022/10/21 23:59:35] [root@DB ~]# ip a
[2022/10/21 23:59:35] 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
[2022/10/21 23:59:35]     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
[2022/10/21 23:59:35]     inet 127.0.0.1/8 scope host lo
[2022/10/21 23:59:35]     inet6 ::1/128 scope host 
[2022/10/21 23:59:35]        valid_lft forever preferred_lft forever
[2022/10/21 23:59:35] 2: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:59:35]     link/ether 24:6e:96:5a:08:40 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:59:35] 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
[2022/10/21 23:59:35]     link/ether 24:6e:96:5a:08:42 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:59:35]     inet x.x.x.78/24 brd x.x.x.255 scope global eth1
[2022/10/21 23:59:35]     inet x.x.x.20/32 scope global eth1
[2022/10/21 23:59:35]     inet x.x.x.21/32 scope global eth1
[2022/10/21 23:59:35]     inet6 fe80::266e:96ff:fe5a:842/64 scope link 
[2022/10/21 23:59:35]        valid_lft forever preferred_lft forever
[2022/10/21 23:59:35] 4: p4p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:59:35]     link/ether a0:36:9f:df:3c:6c brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:59:35] 5: p4p2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:59:35]     link/ether a0:36:9f:df:3c:6e brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:59:35] [root@DB ~]# 
[2022/10/21 23:59:38] [root@DB ~]# ps -ef|grep vrrp
[2022/10/21 23:59:38] root      49980      1  0 23:57 ?        00:00:00 /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 20 -p 150 x.x.x.20
[2022/10/21 23:59:38] root      50168      1  0 23:58 ?        00:00:00 /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 21 -p 150 x.x.x.21
[2022/10/21 23:59:38] root      50214  49983  0 23:59 pts/2    00:00:00 grep vrrp

随后业务恢复正常，接下来排查 MHA 主从切换失败的原因。

问题排查

首先排查日志，然后根据日志中的报错排查代码。

日志

mha_manager.log

日志文件：/export/servers/masterha/jd_mysql_ha/mha_manager.log

根据日志，梳理时间线：

2022-10-21 23:24:02，收到告警
2022-10-21 23:26:03，开始切换
2022-10-21 23:26:12，切换失败

详细日志：

2022-10-21 23:24:02,931 - apscheduler.executors.default - WARNING - found alert server DB/x.x.x.78 - MySQL已5分钟无响应
2022-10-21 23:24:02,931 - apscheduler.executors.default - WARNING - debug get_cluster_list host_list: [(u'x.x.x.78', u'DB/x.x.x.78')]
2022-10-21 23:24:02,944 - apscheduler.executors.default - WARNING - debug get_cluster_list cluster-cluster_id:1441
2022-10-21 23:24:02,944 - apscheduler.executors.default - WARNING - debug get_cluster_list cluster-cluster_source_ip:x.x.x.78
2022-10-21 23:24:02,944 - apscheduler.executors.default - WARNING - debug get_cluster_list cluster-cluster_other_ip:[u'x.x.x.24']
2022-10-21 23:24:02,944 - apscheduler.executors.default - WARNING - debug get_cluster_list cluster-cluster_status:1
2022-10-21 23:24:02,944 - apscheduler.executors.default - WARNING - debug get_cluster_list cluster-cluster_type:vip
2022-10-21 23:24:02,944 - apscheduler.executors.default - WARNING - debug get_cluster_list cluster-master:{'instance_ip': u'x.x.x.78', 'role': 'master', 'vips': [u'x.x.x.20', u'x.x.x.21']}
2022-10-21 23:24:02,945 - apscheduler.executors.default - WARNING - debug get_cluster_list cluster-slave_list:[{'role': 'slave', 'instance_ip': u'x.x.x.24', 'vips': []}]
2022-10-21 23:24:03,138 - apscheduler.executors.default - WARNING - cluster 1441 中x.x.x.24存活检查:True!
2022-10-21 23:24:03,138 - apscheduler.executors.default - WARNING - host info : {'instance_ip': u'x.x.x.78', 'role': 'master', 'vips': [u'x.x.x.20', u'x.x.x.21']}  type: <type 'dict'> !
2022-10-21 23:24:03,138 - apscheduler.executors.default - WARNING - Found alert cluster 1441
2022-10-21 23:24:03,138 - apscheduler.executors.default - WARNING - begin add master switch task
2022-10-21 23:24:03,139 - apscheduler.executors.default - WARNING - end add master switch task
2022-10-21 23:24:13,176 - check_alive - ERROR - (2013, 'Lost connection to MySQL server during query (timed out)')
2022-10-21 23:24:33,216 - check_alive - ERROR - (2013, 'Lost connection to MySQL server during query (timed out)')
2022-10-21 23:24:53,255 - check_alive - ERROR - (2013, 'Lost connection to MySQL server during query (timed out)')
2022-10-21 23:25:13,294 - check_alive - ERROR - (2013, 'Lost connection to MySQL server during query (timed out)')
2022-10-21 23:26:03,373 - apscheduler.executors.default - WARNING - Start mha failover, see the cluster 1441 log
2022-10-21 23:26:03,373 - apscheduler.executors.default - WARNING - *****主库宕机，切换开始*****
2022-10-21 23:26:12,092 - apscheduler.executors.default - WARNING - *****主库宕机，切换结束*****
2022-10-21 23:26:12,093 - apscheduler.executors.default - WARNING - master failover failed!

日志显示探活报错 Lost connection to MySQL server during query (timed out)，表明数据库无法连接。

online_switch.log

日志文件：/var/log/masterha/app1/switch/online_switch.log

2022-10-21 23:26:03,373 - <module> - INFO - lock_key: lock_1441_master_x.x.x.78_err ,unique value: 13033-139684285337264
2022-10-21 23:26:03,373 - apscheduler.executors.default - WARNING - Start mha failover, see the cluster 1441 log
2022-10-21 23:26:03,373 - apscheduler.executors.default - WARNING - *****主库宕机，切换开始***** 
2022-10-21 23:26:03,378 - root - INFO - [server default]
manager_log=/export/Logs/jd_mysql_ha/cluster_1441-202210212326.log
manager_workdir=/export/Logs/jd_mysql_ha
master_binlog_dir=/export/data/mysql/data
master_ip_failover_script=/export/servers/masterha/jd_mysql_ha/failover_vip.py
master_ip_online_change_script=/export/servers/masterha/jd_mysql_ha/online_change_vip.py
report_script=/export/servers/masterha/jd_mysql_ha/send_report.py
password=xxx
ping_interval=1
remote_workdir=/tmp
repl_password=xxx
repl_user=replicater
shutdown_script=""
ssh_user=root
user=mha_admin
use_gtid_auto_pos=0


[server_x.x.x.78]
hostname=x.x.x.78
port=3358


[server_x.x.x.24]
hostname=x.x.x.24
port=3358


2022-10-21 23:26:12,092 - <module> - INFO - --dead_master_ip=<dead_master_ip> is not set. Using x.x.x.78.

2022-10-21 23:26:12,092 - <module> - INFO - Fri Oct 21 23:26:03 2022 - [info] Reading default configuration from /etc/masterha_default.cnf..
Fri Oct 21 23:26:03 2022 - [info] Reading application default configuration from /export/servers/masterha/etc/cluster_1441.conf..
Fri Oct 21 23:26:03 2022 - [info] Reading server configuration from /export/servers/masterha/etc/cluster_1441.conf..

2022-10-21 23:26:12,092 - apscheduler.executors.default - WARNING - *****主库宕机，切换结束*****
2022-10-21 23:26:12,093 - <module> - INFO - delete -- lock_key: lock_1441_master_x.x.x.78_err ,old value: 13033-139684285337264 ,unique value: 13033-139684285337264
2022-10-21 23:26:12,093 - apscheduler.executors.default - WARNING - master failover failed!

cluster_1441.log

集群切换日志文件：/export/Logs/jd_mysql_ha/cluster_1441-202210212326.log

显示 ssh 通，但是数据库无法连接。

Fri Oct 21 23:26:04 2022 - [info] Checking master reachability via MySQL(double check)...
Fri Oct 21 23:26:05 2022 - [info]  ok.
...
Fri Oct 21 23:26:05 2022 - [info] * Phase 2: Dead Master Shutdown Phase..
Fri Oct 21 23:26:05 2022 - [info]
Fri Oct 21 23:26:06 2022 - [info] HealthCheck: SSH to x.x.x.78 is reachable.
Fri Oct 21 23:26:07 2022 - [info] Forcing shutdown so that applications never connect to the current master..
Fri Oct 21 23:26:07 2022 - [info] Executing master IP deactivation script:
Fri Oct 21 23:26:07 2022 - [info]   /export/servers/masterha/jd_mysql_ha/failover_vip.py --orig_master_host=x.x.x.78 --orig_master_ip=x.x.x.78 --orig_master_port=3358 --command=stopssh --ssh_user=root
Fri Oct 21 23:26:08 2022 - [info]  done.
Fri Oct 21 23:26:08 2022 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Oct 21 23:26:08 2022 - [info] * Phase 2: Dead Master Shutdown Phase completed.

MHA 切换阶段二 Dead Master Shutdown Phase 用于防止老主写入，首先调用 failover_vip.py 脚本摘掉老主上的 VIP，如果配置了关闭数据库脚本，也会关闭数据库，线上没有配置，因此不会关闭数据库。

到这里查到了切换报错的位置，日志显示执行 failover_vip.py 脚本时报错，MasterFailover.pm 返回码错误。

Fri Oct 21 23:26:09 2022 - [info] Executing master IP activate script:
Fri Oct 21 23:26:09 2022 - [info]   /export/servers/masterha/jd_mysql_ha/failover_vip.py --command=start --ssh_user=root --orig_master_host=x.x.x.78 --orig_master_ip=x.x.x.78 --orig_master_port=3358 --new_master_host=x.x.x.24 --new_master_ip=x.x.x.24 --new_master_port=3358 --new_master_user='mha_admin' --new_master_password=''  
(u'x.x.x.20', u'x.x.x.15')
ls /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd &&         /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1         -n -v 20 -p 150 x.x.x.20 > /dev/null 2>/export/Logs/vrrpd.log
(u'x.x.x.21', u'x.x.x.16')
ls /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd &&         /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1         -n -v 21 -p 150 x.x.x.21 > /dev/null 2>/export/Logs/vrrpd.log
Fri Oct 21 23:26:11 2022 - [error][/usr/local/share/perl5/MHA/MasterFailover.pm, ln1628]  Failed to activate master IP address for x.x.x.24(x.x.x.24:3358) with return code 1:0
Fri Oct 21 23:26:11 2022 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/local/bin/masterha_master_switch line 53 
Fri Oct 21 23:26:11 2022 - [info] 

----- Failover Report -----

cluster_1441: MySQL Master failover x.x.x.78(x.x.x.78:3358)

Master x.x.x.78(x.x.x.78:3358) is down!

那么，failover_vip.py 脚本给新主绑定 vip 时报错，但实际上是否绑定成功呢？

从新主系统日志（/var/log/messages）中可以看到，从 23:26:08 时刻开始，主从 VIP 已挂载到新主。

因此，VIP 挂载已完成，但是返回给 MHA 错误。

代码

定时任务

后台运行脚本执行 MHA 故障切换（死切）。

[root@MHA ~]# ps -ef|grep mha|grep -v grep
root     22300     1  0  2022 ?        00:08:41 python27 /export/servers/masterha/jd_mysql_ha/mha_manager.py

脚本中基于 apscheduler 创建定时任务，每 90 秒运行一次 main 方法。

import sys
from apscheduler.schedulers.blocking import BlockingScheduler

from libs.sys_lib import getlogger


if __name__ == "__main__":
    reload(sys)
    sys.setdefaultencoding('utf8')
    log = getlogger(Config.log_file, 'apscheduler.executors.default')
    sched = BlockingScheduler()
    sched.add_job(main, 'interval', seconds=90, max_instances=10)
    sched.start()

注意：

定时任务的执行周期可以分为两种类型：

crontab，表示每隔多少时间执行；

interval，表示固定在某一时刻执行。

创建切换任务

main 方法的主要逻辑：

get_alert_host_list，根据监控数据获取到服务器故障或 mysql 挂的数据库；
get_cluster_list，从数据库中查询集群信息，如 ip 与 vip；
当主库宕机时，异步执行 switch_master 方法。

def main():
    # 服务器故障或mysql挂的切换
    host_list = get_alert_host_list()
    if host_list is not None:
        cluster_list = get_cluster_list(host_list,1)
        if cluster_list == []:
            return
        tasks = []
        for cluster in cluster_list:
            if cluster.get_role(cluster.source_ip) == 'slave':
                log.warning('begin add slave switch task')
                tasks.append(gevent.spawn(switch_slave, cluster))
                log.warning('end add slave switch task')
            else:
                log.warning('begin add master switch task')
                tasks.append(gevent.spawn(switch_master, cluster))
                log.warning('end add master switch task')
        gevent.joinall(tasks)
    ...

switch_master 方法的主要逻辑：

check_alive，主库探活；
基于 redis 创建分布式锁，防止同一个任务被多个线程执行；
调用 mha.run_switch 方法进行切换；
释放分布式锁。

from libs.redis_lib import RedisLock

Redislk = RedisLock()


def switch_master(cluster):
    try:
        # 探活！
        Config.mha_conn['host'] = cluster.source_ip
        host_status = check_alive(Config.mha_conn, Config.log_file)
        if host_status is True:
            log.warning("The alert server %s seems ok." % cluster.source_ip)
            return False
        else:
            # 分布式锁，防止同一个任务被多个线程执行！
            switch_key = "%s_master_%s_err" % (cluster.cluster_id, cluster.master['instance_ip'])
            switch_value = Redislk.get(switch_key)
            if not switch_value:
                log.warning('Other thread is doing master switch!')
                return False
            log.warning(
                "Start mha failover, see the cluster %s log" % cluster.cluster_id
            )

            log.warning('*****主库宕机，切换开始*****')
            mha = Mha(cluster)
            mha.run_switch()
            log.warning('*****主库宕机，切换结束*****')
            with open(mha.manage_log, 'r') as f:
                message_to_db = f.read()
            mha.remove_conf()
            log_status = mha.mha_log_status()
            # 释放分布式锁！
            Redislk.delete(switch_key, switch_value)

注意在切换前后分别获取与释放分布式锁避免重复切换。

mha.run_switch方法中调用/usr/local/bin/masterha_master_switch切换工具。

class Mha():
 def run_switch(self):
        try:
            if self.candidate_host is not None:
                proc = subprocess.Popen(
                    [
                        'masterha_master_switch', ('--conf=%s' % self.conf_file),
                        '--master_state=dead', '--interactive=0',
                        ('--dead_master_host=%s' % self.master_host),
                        '--dead_master_port=3358',
                        ('--new_master_host=%s' % self.candidate_host),
                        '--new_master_port=3358'
                    ], shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE
                )
                out, err = proc.communicate()
            else:
                exec_cmd = '/usr/local/bin/masterha_master_switch' + ' --conf=%s ' % self.conf_file + '--master_state=dead --interactive=0' + ' --dead_master_host=%s' % self.master_host + ' --dead_master_port=3358'
                proc = subprocess.Popen(exec_cmd, shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE,stderr=subprocess.PIPE)
                out, err = proc.communicate()
                log.info(str(out))
                log.info(str(err))
                return out, err
        except Exception as ex:
            log.info('run_switch ERROR:%s' % str(ex))

其中：

**调用subprocess.Popen创建子进程，用于调用masterha_master_switch**；
调用subprocess.Popen.communicate获取子进程的返回值。

下面进入 MHA 程序。

MHA 切换

cluster_1441.log 日志中显示连通性校验通过，可是这里的通过是指可以连接还是不可以连接呢？

Fri Oct 21 23:26:04 2022 - [info] Checking master reachability via MySQL(double check)...
Fri Oct 21 23:26:05 2022 - [info]  ok.
...
Fri Oct 21 23:26:05 2022 - [info] * Phase 2: Dead Master Shutdown Phase..

根据Checking master reachability关键字查询/usr/local/share/perl5/MHA/MasterFailover.pm源码，显示切换前会检测主库的连通性。

# quick check that the dead server is really dead
# not double check when ping_type is insert,
# because check_connection_fast_util can rerurn true if insert-check detects I/O failure.
  if ( $servers_config[0]->{ping_type} ne $MHA::ManagerConst::PING_TYPE_INSERT )
  {
    $log->info("Checking master reachability via MySQL(double check)...");
    if ( 
      my $rc = MHA::DBHelper::check_connection_fast_util(
        $dead_master->{hostname}, $dead_master->{port},
        $dead_master->{user},     $dead_master->{password}
      )    
      )    
    {    
      $log->error(
        sprintf(
          "The master %s is reachable via MySQL (error=%s) ! Stop failover.",
          $dead_master->get_hostinfo(), $rc
        )    
      );   
      croak;
    }    
    $log->info(" ok.");
  }

如果数据库可以连接，中止切换，因此如果 ok 之前没有打印报错，表明数据库无法连接。因此切换前的二次检查通过，表明数据库无法连接。

随后 cluster_1441.log 日志打印报错。

Fri Oct 21 23:26:11 2022 - [error][/usr/local/share/perl5/MHA/MasterFailover.pm, ln1628]  Failed to activate master IP address for x.x.x.24(x.x.x.24:3358) with return code 1:0
Fri Oct 21 23:26:11 2022 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/local/bin/masterha_master_switch line 53

查看报错行的代码，MasterFailover.pm第 1628 行，显示调用master_ip_failover_script脚本时报错。

  if ( $new_master->{master_ip_failover_script} ) {
    my $command =
"$new_master->{master_ip_failover_script} --command=start --ssh_user=$new_master->{ssh_user} --orig_master_host=$dead_master->{hostname} --orig_master_ip=$dead_master->{ip} --orig_master_port=$dead_master->{port} --new_master_host=$new_master->{hostname} --new_master_ip=$new_master->{ip} --new_master_port=$new_master->{port} --new_master_user=$new_master->{escaped_user} --new_master_password=$new_master->{escaped_password}";
 ...
    my ( $high, $low ) = MHA::ManagerUtil::exec_system( $command, $g_logfile );
    if ( $high == 0 && $low == 0 ) {
      $log->info(" OK.");
      $mail_body .=
        $new_master->get_hostinfo() . ": OK: Activated master IP address.n";
    }    
    else {
      my $message =
          "Failed to activate master IP address for "
        . $new_master->get_hostinfo()
        . " with return code $high:$low";
      $log->error( " " . $message ); # ln1628

查看配置文件/export/servers/masterha/etc/cluster_1441.conf，显示master_ip_failover_script 脚本对应failover_vip.py脚本。

master_ip_failover_script=/export/servers/masterha/jd_mysql_ha/failover_vip.py

切换脚本

failover_vip.py脚本中根据传入的命令行参数来摘掉或绑定 VIP。

上文 MHA 的command中指定--command=start，因此表明绑定 VIP 报错。

if __name__ == "__main__":
    parser = optparse.OptionParser()
    ...
    orig_master_host = options.orig_master_host
    new_master_host = options.new_master_host
    vips = get_vip(orig_master_host)
    if command == 'stop' or command == 'stopssh':
        stop_vips(orig_master_host)
    elif command == 'start':
        # 获取网卡
        interface = get_interface(new_master_host)
        # 将vip绑定到ip的指定网卡
        out, err = start_vip(new_master_host, vips, interface[0])
        if err is None:
            sys.exit(0)
        else:
            sys.exit(1)
    elif command == 'status':
        print 'OK'

start_vip方法中同样使用subprocess.Popen创建子进程绑定 VIP。

def start_vip(host, vips, interface):
    out_all = ''
    err_all = ''
    for vip in vips:
        print vip
        ##ip_tail = vip[1].split('.')[-1]
        ip_tail = vip[0].split('.')[-1]
        cmd = '/export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i %s 
        -n -v %s -p 150 %s >> /export/Logs/vrrpd.log 2>&1' % (
            interface, ip_tail, vip[0]
        )
        print cmd
        proc = subprocess.Popen(
            ['ssh', host, cmd], shell=False,
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
        out, err = proc.communicate()
        if err is None:
            err = ''
        out_all = out_all + out
        err_all = err_all + err
    return out_all, err_all

其中，返回异常时没有打印日志。因此为排查问题，在 debug 过程中新增日志。

        out, err = proc.communicate()
        if err is None:
            err = ''
        else:
            print 'err:' + err  # 新增

优化

添加日志

failover_vip.py脚本中start_vip方法也会返回捕获的异常，但是当前日志中未打印。

        out, err = start_vip(new_master_host, vips, interface[0])
        if err is None:
            sys.exit(0)
        else:
            sys.exit(1)

因此，打印start_vip方法的返回值中的报错。

        interface = get_interface(new_master_host)
        out, err = start_vip(new_master_host, vips, interface[0])
        if err is None:
            sys.exit(0)
        else:
            print 'err:' + err  # 新增
            print 'sys.exit:' + '1'  # 新增
            sys.exit(1)

测试结果如下所示，成功复现该问题，子进程绑定 VIP 返回异常导致 MHA 切换失败。

更换调用方式

同事反馈子进程的方式坑比较多，因此将绑定 vip 时使用子进程的方式替换为远程调用的方式。

如下所示，start_vip方法中调用cmd_remote方法绑定 vip。

def start_vip(host, vips, interface):
    out_all = ''
    err_all = None
 ...
    
    for vip in vips:
        # 挂载vip
        cmd = "nohup /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i %s -n -v %s -p 150 %s" % (
            interface, ip_tail, ip_vip)
        res_cmd = cmd_remote(host, cmd)
        if not res_cmd['success']:
            err_all = res_cmd['message']
            return out_all, err_all

cmd_remote方法中基于paramiko模块通过指定 key 的方式登陆服务器执行 shell 命令。

import paramiko


def cmd_remote(ip, cmd):
    success = False
    message = ''
    data = {}
    status = None
    try:
        pkey = paramiko.RSAKey.from_private_key_file('/root/.ssh/id_rsa')
        ssh = paramiko.SSHClient()
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        ssh.connect(hostname=ip,
                    port=22,
                    username='root',
                    pkey=pkey)
        if cmd.startswith('nohup'):
            invoke = ssh.invoke_shell()
            invoke.send(cmd + ' & n ')
            # 等待命令执行完成
            time.sleep(3)
        else:
            # 结果放到stdout中，如果有错误将放到stderr中
            stdin, stdout, stderr = ssh.exec_command(cmd, timeout=10)
            data = {'stdout': [item.strip() for item in stdout.readlines()],
                    'stderr': [item.strip() for item in stderr.readlines()],
                    'status': stdout.channel.recv_exit_status()
                    }
        success = True
    except Exception as ex:
        message = "func cmd_remote error: %s" % (str(ex))
        log.error(message)
        success = False
    finally:
        ssh.close()
        return {'success': success, 'message': message, 'data': data}

kill.sh

上面提到，登录服务器后发现大量 kill.sh 脚本执行中。

该脚本用于 kill 报表库查询超过 10 分钟的 select 语句。

[root@DB ~]# cat /export/mysql_script/mysql_tools/kill.sh 
#!/bin/bash

today=`date -d today +"%Y%m%d"`
DB_list_SQL=`/export/servers/mysql/bin/mysql -uadmin -p -e"show  databases;"`
DB_slave_status_SQL=`/export/servers/mysql/bin/mysql -uadmin -p -e"show slave statusG;"  |grep Master_Host  |awk -F ': ' '{print $2}'`
DB_name="report"
result=$(echo $DB_list_SQL |grep "$DB_name")
function kill_exec_sql_greater_than10min()
{
        if [ "$DB_slave_status_SQL" != "" -a "$DB_slave_status_SQL" != "1.1.1.1" -a "$result"!="" ] ;then
        echo 'this server is report slave server'
      `/export/servers/mysql/bin/mysql -uadmin -p -e "select *,now() from information_schema.processlist  where info is not null and user in(report_ro','rp_rw') and COMMAND <> 'Killed' and time>600 ;" >> /export/data/mysql/log/killsql.log.$today`
      echo "recover"
      `/export/servers/mysql/bin/mysql -uadmin -p -N -e "select concat('KILL ',id,';') from information_schema.processlist where info is not null and user in (report_ro','rp_rw') and COMMAND <> 'Killed' and time>600 ;" > /tmp/processlist_kill.txt`
      `/export/servers/mysql/bin/mysql -uadmin -p -e "source /tmp/processlist_kill.txt;"`

        else
        echo 'this server is  master server or report master'
        fi
}

kill_exec_sql_greater_than10min

查看 crontab 定时任务，显示该任务每分钟执行一次。

[root@DB ~]# crontab -l|grep kill
*/1 * * * * sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1

实际上，分散在服务器上的 crontab 任务不易管理，因此已将该任务迁移至 celery 集中管理，并将执行频率从每 1 分钟一次调整为每 10 分钟一次，而该服务器漏掉了。

知识点

MHA

MHA 架构

MHA（Master High Availability）是由日本人开发的一款成熟且开源的 MySQL 高可用程序，基于 Perl 语言开发。

MHA 可以在 0-30秒内平滑的完成数据库故障的自动切换，并支持数据补录和复制关系重建，最大程度上保证数据一致性。

MHA 搭建的条件最少是一主两从，且必须是独立的服务器，不能单机多实例进行搭建。

MHA 整体架构设计如下图所示，该软件由两部分组成，MHA Manager（管理节点）和 MHA Node（数据节点）。两种节点都由多个工具包组成，因此 MHA 本质是就是一个工具包。

Manager 节点和 Node 节点从架构设计上是服务端-客户端模式。其中 Manager 可以单独部署在一台独立的机器上管理多个 master-slave 集群，也可以部署在一台 slave 节点上。Node 运行在每台 MySQL 服务器上。

需要注意的是 MHA 仅负责数据库层面的高可用，所以通常需要和其他方案组合起来，常见的组合方式有：

MHA + VIP
MHA + KeepAlive
MHA + Zookeeper

其中 MHA + VIP 是一种相对成熟和经典的方案，架构简单，容易落地。应用统一使用 VIP 访问数据库，VIP 可以根据节点的数据状态在不同节点之间漂移，实现无缝切换的高可用。

本文中采用的就是这种方案实现高可用。

VIP

VIP （Vistual IP）值虚拟 ip，是临时绑定在物理网卡上的别名，一块网卡上可以绑定多个别名。

VIP 主要作为可漂移资源用于服务器的主从切换，可以在业务不改变连接字符串的前提下，完成业务连接目标的切换。

业务应用访问数据库时通常通过一个类似于 URL 的连接字符串来定位具体服务的网络地址。

如jdbc:mysql://192.168.1.1:3306/mydb。

其中最重要的是定位手段是“地址+端口”部分，这个地址可以是 VIP，也可以是 DNS 域名。

因此在业务不改变连接字符串的前提下，可以通过高可用切换时漂移 IP 或域名的方式完成业务连接目标的切换。

可漂移资源主要是网络资源，并不是数据库服务固有的资源。因此该方案多用于自身缺乏高可用保障机制，需要外部框架来完成监控 + 切换流程的数据库服务，是一种较原始的连接切换方案。

其中域名切换比 VIP 切换更可靠。原因是，VIP 需要保证全局唯一，切换操作包括两个步骤，首先在故障节点删除 VIP，然后才可以在切换目标节点添加 VIP。而故障节点删除 VIP 无法保证成功率，如果 VIP 所在故障节点出现复杂问题，有可能出现数据库被判为不可用，但是 VIP 未离线，导致删除 VIP 操作失败。

域名作为漂移资源是由 DNS 服务集中管理控制，切换本身不依赖也不需要对故障节点进行任何操作，而 DNS 服务的高可用和冗余机制是另外一套较为完善的体系，因此基于 DNS 的域名切换是一种更可靠，成功率明显更高的漂移资源切换方案。

但是域名切换也存在痛点，如域名的应用端缓存问题。

本文中 VIP 通过 VRRP 协议进行管理。

VRRP

主机之间的通信都是通过静态路由或默认网关完成，这种通信模式中，路由器成为单点瓶颈，因此当主机之间的路由器出现故障时，通信就会失效，因此引入 VRRP 协议。

VRRP（Virtual Router Redundancy Protocol，虚拟路由冗余协议）是一种为避免路由器出现单点故障的主备模式的容错协议。

具体是将多个物理路由器设备虚拟成一个虚拟路由，这个虚拟路由通过 VIP（一个或多个）对外提供服务。组成虚拟路由器的一组路由器可以分为两种角色：

Master 角色，同一时间只有一台物理路由器对外提供各种网络功能，如 ARP 请求，ICMP 数据转发等，这台物理路由设备称为主路由器；
Backup 角色，其他物理路由器均作为备路由器，不拥有对外的虚拟IP，也不提供对外网络功能，仅仅接收MASTER 的 VRRP 状态通告信息。

可见，VRRP 协议中多个物理路由器设备存在选主环节。实际上，VRRP 根据优先级来确定虚拟路由器中每台设备的角色，Master 设备或 Backup 设备。优先级越高，则越有可能成为 Master 设备。

如下所示，可以通过命令绑定与摘除 VIP。

/export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 21 -p 150 x.x.x.21
ip addr del x.x.x.21 dev eth1

其中：

-p 参数支持指定物理路由设备的优先级，虚拟路由器根据优先级选举出 Master 和 Backup。因此如果多个物理路由设备绑定同一个 VIP 时，优先级最高的设备作为 Master 对外提供服务。

如下所示，MHA 故障切换过程中新主的系统日志中显示该物理路由器设备由 Backup 提升为 Master。

Oct 21 23:26:08 HOSTNAME vrrpd: VRRP ID 20 on eth1: we are now a backup router.
Oct 21 23:26:08 HOSTNAME vrrpd: VRRP ID 21 on eth1: we are now a backup router.
...
Oct 21 23:26:10 HOSTNAME vrrpd: VRRP ID 20 on eth1: we are now the master router.
Oct 21 23:26:11 HOSTNAME vrrpd: VRRP ID 21 on eth1: we are now the master router.