Introduction
MHA is a mature, open-source MySQL high-availability program. This article walks through a case in which the child process used to bind the VIP returned an error, causing an MHA failover to fail.
Overview
At 2022-10-21 23:23:23 an alert reported that a database instance was unresponsive, and the master-slave switchover program reported that the switchover had failed, but it was unclear how far the switchover had progressed.
During handling we found that the master responded to ping and SSH, yet logging in to MySQL failed, while replication on the slave looked normal.
About three minutes later MySQL logins on the master succeeded, but the application reported write failures with a read-only error. Only then did we discover that the primary VIP had already been moved to the slave, which still had read_only enabled; manually binding the primary VIP back to the master restored service.
In fact, even without being familiar with the MHA switchover flow, strictly checking against the switchover logic as soon as the problem was spotted would have allowed a timely fix and shortened the impact on the business. As it was, a problem that could have been resolved in three minutes took half an hour.
When checking switchover progress, the first thing to verify is whether the VIP bindings are correct, because MHA removes the VIP from the old master at the very start of a switchover.
By going through the logs and the code, we analyzed and reproduced the problem: the child process that binds the VIP returned an error, causing the MHA failover to fail.
On-site handling
Monitoring
At 2022-10-21 23:23:23 a Zabbix alert reported that a database instance was unresponsive. No screenshot was saved at the time, so the alert record was queried afterwards.
mysql> select * from sms where host like '%x.x.x.78%' and date_format(time,"%Y-%m-%d")='2022-10-21' limit 1\G
*************************** 1. row ***************************
id: 2598165
time: 2022-10-21 23:23:23
host: x.x.x.78
TRIGGER_NAME: MySQL已5分钟无响应
DETAIL: 详细:MySQL is alive:1
STATUS: PROBLEM
type: 0
zone: DB
send_type: 1
sms_send_status: 1
alarm_type: 0
1 row in set (2.02 sec)
Logging in to the server showed the disk saturated, with I/O utilization above 90%.
[2022/10/21 23:31:20] [root@DB ~]# iostat -x 1 10
[2022/10/21 23:31:20] Linux 2.6.32-431.el6.x86_64 (DB10110078) 10/21/2022 _x86_64_ (32 CPU)
[2022/10/21 23:31:20]
[2022/10/21 23:31:20] avg-cpu: %user %nice %system %iowait %steal %idle
[2022/10/21 23:31:20] 3.66 0.00 0.47 0.64 0.00 95.22
[2022/10/21 23:31:20]
[2022/10/21 23:31:20] Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
[2022/10/21 23:31:20] sda 3.46 1116.02 660.05 1354.14 21223.36 30133.13 25.50 0.20 0.10 0.10 20.32
[2022/10/21 23:31:20]
[2022/10/21 23:31:21] avg-cpu: %user %nice %system %iowait %steal %idle
[2022/10/21 23:31:21] 0.47 0.00 3.19 2.56 0.00 93.78
[2022/10/21 23:31:21]
[2022/10/21 23:31:21] Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
[2022/10/21 23:31:21] sda 0.00 0.00 2222.00 0.00 71088.00 0.00 31.99 0.87 0.39 0.39 87.10
[2022/10/21 23:31:21]
[2022/10/21 23:31:22] avg-cpu: %user %nice %system %iowait %steal %idle
[2022/10/21 23:31:22] 0.41 0.00 3.22 2.66 0.00 93.72
[2022/10/21 23:31:22]
[2022/10/21 23:31:22] Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
[2022/10/21 23:31:22] sda 0.00 0.00 2320.00 0.00 74240.00 0.00 32.00 0.91 0.39 0.39 90.60
[2022/10/21 23:31:22]
[2022/10/21 23:31:23] avg-cpu: %user %nice %system %iowait %steal %idle
[2022/10/21 23:31:23] 0.37 0.00 3.19 2.75 0.00 93.69
[2022/10/21 23:31:23]
[2022/10/21 23:31:23] Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
[2022/10/21 23:31:23] sda 0.00 0.00 2291.00 0.00 73312.00 0.00 32.00 0.92 0.40 0.40 92.20
[2022/10/21 23:31:23]
[2022/10/21 23:31:24] avg-cpu: %user %nice %system %iowait %steal %idle
[2022/10/21 23:31:24] 0.44 0.00 3.19 2.63 0.00 93.75
[2022/10/21 23:31:24]
[2022/10/21 23:31:24] Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
[2022/10/21 23:31:24] sda 0.00 2.00 2193.00 4.00 70176.00 48.00 31.96 0.90 0.41 0.41 89.60
[2022/10/21 23:31:24]
[2022/10/21 23:31:25] avg-cpu: %user %nice %system %iowait %steal %idle
[2022/10/21 23:31:25] 0.59 0.00 3.19 2.53 0.00 93.69
[2022/10/21 23:31:25]
[2022/10/21 23:31:25] Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
[2022/10/21 23:31:25] sda 0.00 0.00 2113.00 0.00 67616.00 0.00 32.00 0.84 0.40 0.40 84.40
[2022/10/21 23:31:25]
[2022/10/21 23:31:26] avg-cpu: %user %nice %system %iowait %steal %idle
[2022/10/21 23:31:26] 0.44 0.00 3.41 2.69 0.00 93.47
[2022/10/21 23:31:26]
[2022/10/21 23:31:26] Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
[2022/10/21 23:31:26] sda 0.00 3.00 2281.00 3.00 72992.00 48.00 31.98 0.92 0.40 0.40 91.50
[2022/10/21 23:31:26]
[2022/10/21 23:31:27] avg-cpu: %user %nice %system %iowait %steal %idle
[2022/10/21 23:31:27] 0.25 0.00 3.19 2.78 0.00 93.78
[2022/10/21 23:31:27]
[2022/10/21 23:31:27] Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
[2022/10/21 23:31:27] sda 0.00 0.00 2381.00 0.00 76192.00 0.00 32.00 0.96 0.40 0.40 95.80
[2022/10/21 23:31:27]
[2022/10/21 23:31:28] avg-cpu: %user %nice %system %iowait %steal %idle
[2022/10/21 23:31:28] 0.44 0.00 3.19 2.60 0.00 93.78
[2022/10/21 23:31:28]
[2022/10/21 23:31:28] Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
[2022/10/21 23:31:28] sda 0.00 3.00 2181.00 7.00 69792.00 80.00 31.93 0.87 0.40 0.40 87.10
[2022/10/21 23:31:28]
[2022/10/21 23:31:29] avg-cpu: %user %nice %system %iowait %steal %idle
[2022/10/21 23:31:29] 0.56 0.00 3.16 2.50 0.00 93.78
[2022/10/21 23:31:29]
[2022/10/21 23:31:29] Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
[2022/10/21 23:31:29] sda 0.00 0.00 2147.00 0.00 68704.00 0.00 32.00 0.85 0.39 0.39 84.60
Logging in to MySQL
Logging in to MySQL failed.
[2022/10/21 23:28:19] root@DB # /export/servers/mysql/bin/mysql -uadmin -p -hx.x.x.78
[2022/10/21 23:28:21] Enter password:
[2022/10/21 23:28:21]
[2022/10/21 23:28:51] ^C
Logging in to the server
The server responded to ping.
[2022/10/21 23:30:55] root@DB # ping x.x.x.78
[2022/10/21 23:30:55] PING x.x.x.78 (x.x.x.78) 56(84) bytes of data.
[2022/10/21 23:30:55] 64 bytes from x.x.x.78: icmp_seq=1 ttl=54 time=36.4 ms
[2022/10/21 23:30:56] 64 bytes from x.x.x.78: icmp_seq=2 ttl=54 time=36.2 ms
[2022/10/21 23:30:56] ^C
[2022/10/21 23:30:56] --- x.x.x.78 ping statistics ---
[2022/10/21 23:30:56] 2 packets transmitted, 2 received, 0% packet loss, time 1508ms
[2022/10/21 23:30:56] rtt min/avg/max/mdev = 36.200/36.313/36.426/0.113 ms
Automatic switchover failure
The master-slave switchover program reported that the switchover had failed.
Switchover progress
Since the switchover progress was unclear, we checked the slave's replication status: the replication configuration was still in place, but both the IO thread and the SQL thread had stopped.
[2022/10/21 23:28:54] root@DB # /export/servers/mysql/bin/mysql -uadmin -p -hx.x.x.24
[2022/10/21 23:28:56] mysql>
[2022/10/21 23:28:57] mysql> SHOW SLAVE STATUS\G
[2022/10/21 23:28:57] *************************** 1. row ***************************
[2022/10/21 23:28:57] Slave_IO_State:
[2022/10/21 23:28:57] Master_Host: x.x.x.78
[2022/10/21 23:28:57] Master_User: replicater
[2022/10/21 23:28:57] Master_Port: 3358
[2022/10/21 23:28:57] Connect_Retry: 60
[2022/10/21 23:28:57] Master_Log_File: mysql-bin.196738
[2022/10/21 23:28:57] Read_Master_Log_Pos: 360334210
[2022/10/21 23:28:57] Relay_Log_File: relay-log.001882
[2022/10/21 23:28:57] Relay_Log_Pos: 360334263
[2022/10/21 23:28:57] Relay_Master_Log_File: mysql-bin.196738
[2022/10/21 23:28:57] Slave_IO_Running: No
[2022/10/21 23:28:57] Slave_SQL_Running: No
[2022/10/21 23:28:57] Exec_Master_Log_Pos: 360334210
[2022/10/21 23:28:57] Seconds_Behind_Master: NULL
By this point the database could be logged into again, and a large number of kill.sh scripts were running.
[2022/10/21 23:30:58] root@DB# ssh x.x.x.78
[2022/10/21 23:30:58] Last login: Fri Oct 21 23:29:18 2022 from x.x.x.71
[2022/10/21 23:30:59] [root@DB ~]#
[2022/10/21 23:31:04] [root@DB ~]# ps -ef|grep mysql
[2022/10/21 23:31:04] root 45784 45780 0 23:21 ? 00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root 45785 45784 0 23:21 ? 00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root 45797 45785 0 23:21 ? 00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow databases;
[2022/10/21 23:31:04] root 45803 45786 0 23:21 ? 00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where info is not null and time>600;
[2022/10/21 23:31:04] root 45808 45806 0 23:22 ? 00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root 45811 45808 0 23:22 ? 00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root 45817 45811 0 23:22 ? 00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow databases;
[2022/10/21 23:31:04] root 45829 45814 0 23:22 ? 00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where info is not null and time>600;
[2022/10/21 23:31:04] root 45860 45857 0 23:23 ? 00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root 45863 45860 0 23:23 ? 00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root 45874 45863 0 23:23 ? 00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow databases;
[2022/10/21 23:31:04] root 45880 45864 0 23:23 ? 00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where info is not null and time>600;
[2022/10/21 23:31:04] root 46010 46007 0 23:24 ? 00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root 46013 46010 0 23:24 ? 00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root 46024 46013 0 23:24 ? 00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow databases;
[2022/10/21 23:31:04] root 46030 46014 0 23:24 ? 00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where info is not null and time>600;
[2022/10/21 23:31:04] root 46037 46033 0 23:25 ? 00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root 46041 46037 0 23:25 ? 00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root 46061 46041 0 23:25 ? 00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow databases;
[2022/10/21 23:31:04] root 46067 46043 0 23:25 ? 00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where info is not null and time>600;
[2022/10/21 23:31:04] root 46077 46074 0 23:26 ? 00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root 46080 46077 0 23:26 ? 00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root 46095 46080 0 23:26 ? 00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow databases;
[2022/10/21 23:31:04] root 46099 46081 0 23:26 ? 00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where info is not null and time>600;
[2022/10/21 23:31:04] root 46192 46190 0 23:27 ? 00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root 46196 46192 0 23:27 ? 00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root 46207 46196 0 23:27 ? 00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow databases;
[2022/10/21 23:31:04] root 46213 46195 0 23:27 ? 00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where info is not null and time>600;
[2022/10/21 23:31:04] root 46219 46215 0 23:28 ? 00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root 46220 46219 0 23:28 ? 00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root 46225 46220 0 23:28 ? 00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow databases;
[2022/10/21 23:31:04] root 46238 46223 0 23:28 ? 00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where info is not null and time>600;
[2022/10/21 23:31:04] root 46244 46240 0 23:29 ? 00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root 46247 46244 0 23:29 ? 00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root 46257 46247 0 23:29 ? 00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow databases;
[2022/10/21 23:31:04] root 46263 46246 0 23:29 ? 00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where info is not null and time>600;
[2022/10/21 23:31:04] root 46398 46392 0 23:30 ? 00:00:00 /bin/sh -c sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
[2022/10/21 23:31:04] root 46399 46398 0 23:30 ? 00:00:00 sh /export/mysql_script/mysql_tools/kill.sh
[2022/10/21 23:31:04] root 46409 46399 0 23:30 ? 00:00:00 /export/servers/mysql/bin/mysql -uadmin -px xxxxxxxxxxxxxxx -eshow databases;
[2022/10/21 23:31:04] root 46424 46404 0 23:30 ? 00:00:00 /export/servers/mysql/bin/mysql -uzabbix -px xxxx -Ne select count(*) from information_schema.processlist where info is not null and time>600;
[2022/10/21 23:31:04] root 46459 46427 0 23:30 pts/1 00:00:00 grep mysql
[2022/10/21 23:31:04] root 87541 1 0 Jul12 ? 00:00:00 /bin/sh /export/servers/mysql/bin/mysqld_safe --defaults-file=/export/servers/mysql/etc/my.cnf
[2022/10/21 23:31:04] mysql 88618 87541 99 Jul12 ? 188-00:01:25 /export/servers/mysql/bin/mysqld --defaults-file=/export/servers/mysql/etc/my.cnf --basedir=/export/servers/mysql --datadir=/export/data/mysql/data --plugin-dir=/export/servers/mysql/lib/plugin --user=mysql --log-error=/export/data/mysql/log/error.log --open-files-limit=65535 --pid-file=/export/data/mysql/mysql.pid --socket=/export/data/mysql/tmp/mysql.sock --port=3358
MySQL was running and could be logged into normally.
[2022/10/21 23:32:30] [root@DB ~]# ps -ef|grep mysql
[2022/10/21 23:32:30] root 46698 46427 0 23:32 pts/1 00:00:00 grep mysql
[2022/10/21 23:32:30] root 87541 1 0 Jul12 ? 00:00:00 /bin/sh /export/servers/mysql/bin/mysqld_safe --defaults-file=/export/servers/mysql/etc/my.cnf
[2022/10/21 23:32:30] mysql 88618 87541 99 Jul12 ? 188-00:02:14 /export/servers/mysql/bin/mysqld --defaults-file=/export/servers/mysql/etc/my.cnf --basedir=/export/servers/mysql --datadir=/export/data/mysql/data --plugin-dir=/export/servers/mysql/lib/plugin --user=mysql --log-error=/export/data/mysql/log/error.log --open-files-limit=65535 --pid-file=/export/data/mysql/mysql.pid --socket=/export/data/mysql/tmp/mysql.sock --port=3358
At 23:28 the database could not be logged into; by 23:31 it could, although the binlog position was not advancing.
[2022/10/21 23:36:51] mysql> SHOW Master STATUS;
[2022/10/21 23:36:51] +------------------+-----------+--------------+------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[2022/10/21 23:36:51] | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
[2022/10/21 23:36:51] +------------------+-----------+--------------+------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[2022/10/21 23:36:51] | mysql-bin.196738 | 360491782 | | | 0d1d233b-01f1-11ed-b58d-246e965a0842:1-4344447171,
[2022/10/21 23:36:51] 15c89410-d0f5-11ea-a06f-246e96774111:1-8557185849,
[2022/10/21 23:36:51] 216fc3f0-d38f-11e8-8449-a0369fe0e240:1-51625372,
[2022/10/21 23:36:51] 40e74288-d38f-11e8-8d6e-a0369fef393a:1-1492087776:1492136015-1498353734,
[2022/10/21 23:36:51] 86fc2719-59ac-11ea-9508-38eaa7178ee4:1,
[2022/10/21 23:36:51] aa4b4067-d16c-11ea-97c4-246e9675d891:33-19035308026,
[2022/10/21 23:36:51] c043a384-589d-11ea-b94b-38eaa7178cc1:1-2284366:2341679-1390335892 |
[2022/10/21 23:36:51] +------------------+-----------+--------------+------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[2022/10/21 23:36:51] 1 row in set (0.00 sec)
[2022/10/21 23:36:51]
[2022/10/21 23:36:51] mysql> SHOW Master STATUS;
[2022/10/21 23:36:51] +------------------+-----------+--------------+------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[2022/10/21 23:36:51] | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
[2022/10/21 23:36:51] +------------------+-----------+--------------+------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[2022/10/21 23:36:51] | mysql-bin.196738 | 360491782 | | | 0d1d233b-01f1-11ed-b58d-246e965a0842:1-4344447171,
[2022/10/21 23:36:51] 15c89410-d0f5-11ea-a06f-246e96774111:1-8557185849,
[2022/10/21 23:36:51] 216fc3f0-d38f-11e8-8449-a0369fe0e240:1-51625372,
[2022/10/21 23:36:51] 40e74288-d38f-11e8-8d6e-a0369fef393a:1-1492087776:1492136015-1498353734,
[2022/10/21 23:36:51] 86fc2719-59ac-11ea-9508-38eaa7178ee4:1,
[2022/10/21 23:36:51] aa4b4067-d16c-11ea-97c4-246e9675d891:33-19035308026,
[2022/10/21 23:36:51] c043a384-589d-11ea-b94b-38eaa7178cc1:1-2284366:2341679-1390335892 |
[2022/10/21 23:36:51] +------------------+-----------+--------------+------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[2022/10/21 23:36:51] 1 row in set (0.00 sec)
This database is a reporting replica: data is synced automatically from the production database to the reporting database through the 蜂巢 system.
Operations for the WMS system reported that reports were no longer updating. The first suspicion was that 蜂巢 needed a restart, but after we contacted the 蜂巢 team they reported that writes were failing with a read-only error.
java.sql.SQLException: The MySQL server is running with the --read-only option so it cannot execute this statement
We then checked the VIP bindings and found that both the primary and secondary VIPs were bound to the slave, while read_only had not been disabled on the slave, so manual intervention was required.
On the master, the primary VIP had been removed.
[2022/10/21 23:39:50] [root@DB ~]# ip a
[2022/10/21 23:39:50] 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
[2022/10/21 23:39:50] link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
[2022/10/21 23:39:50] inet 127.0.0.1/8 scope host lo
[2022/10/21 23:39:50] inet6 ::1/128 scope host
[2022/10/21 23:39:50] valid_lft forever preferred_lft forever
[2022/10/21 23:39:50] 2: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:39:50] link/ether 24:6e:96:5a:08:40 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:39:50] 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
[2022/10/21 23:39:50] link/ether 24:6e:96:5a:08:42 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:39:50] inet x.x.x.78/24 brd x.x.x.255 scope global eth1
[2022/10/21 23:39:50] inet6 fe80::266e:96ff:fe5a:842/64 scope link
[2022/10/21 23:39:50] valid_lft forever preferred_lft forever
[2022/10/21 23:39:50] 4: p4p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:39:50] link/ether a0:36:9f:df:3c:6c brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:39:50] 5: p4p2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:39:50] link/ether a0:36:9f:df:3c:6e brd ff:ff:ff:ff:ff:ff
On the slave, both the primary and secondary VIPs were bound.
[2022/10/21 23:55:13] [root@DB ~]# ip a
[2022/10/21 23:55:13] 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
[2022/10/21 23:55:13] link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
[2022/10/21 23:55:13] inet 127.0.0.1/8 scope host lo
[2022/10/21 23:55:13] inet6 ::1/128 scope host
[2022/10/21 23:55:13] valid_lft forever preferred_lft forever
[2022/10/21 23:55:13] 2: eth4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:55:13] link/ether 24:6e:96:77:41:10 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:55:13] 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
[2022/10/21 23:55:13] link/ether 24:6e:96:77:41:11 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:55:13] inet x.x.x.24/24 brd x.x.x.255 scope global eth1
[2022/10/21 23:55:13] inet x.x.x.20/32 scope global eth1
[2022/10/21 23:55:13] inet x.x.x.21/32 scope global eth1
[2022/10/21 23:55:13] inet6 fe80::266e:96ff:fe77:4111/64 scope link
[2022/10/21 23:55:13] valid_lft forever preferred_lft forever
[2022/10/21 23:55:13] 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:55:13] link/ether 24:6e:96:77:41:12 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:55:13] 5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:55:13] link/ether 24:6e:96:77:41:13 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:55:13] [root@DB ~]#
[2022/10/21 23:56:11] [root@DB ~]# ps -ef|grep vrrp
[2022/10/21 23:56:11] root 38256 1 0 23:26 ? 00:00:00 /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 20 -p 150 x.x.x.20
[2022/10/21 23:56:11] root 38279 1 0 23:26 ? 00:00:00 /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 21 -p 150 x.x.x.21
[2022/10/21 23:56:11] root 41028 40942 0 23:55 pts/1 00:00:00 grep vrrp
Manual handling
Since the master had recovered and replication on the slave was intact, we removed the primary VIP from the slave and bound it back to the master, effectively rolling back the switchover.
On the slave, we killed the vrrpd process for the primary VIP, but the VIP stayed bound.
[2022/10/21 23:57:33] [root@DB ~]# kill -9 38256
[2022/10/21 23:57:33] [root@DB ~]#
[2022/10/21 23:57:37] [root@DB ~]# ip a
[2022/10/21 23:57:37] 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
[2022/10/21 23:57:37] link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
[2022/10/21 23:57:37] inet 127.0.0.1/8 scope host lo
[2022/10/21 23:57:37] inet6 ::1/128 scope host
[2022/10/21 23:57:37] valid_lft forever preferred_lft forever
[2022/10/21 23:57:37] 2: eth4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:57:37] link/ether 24:6e:96:77:41:10 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:57:37] 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
[2022/10/21 23:57:37] link/ether 24:6e:96:77:41:11 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:57:37] inet x.x.x.24/24 brd x.x.x.255 scope global eth1
[2022/10/21 23:57:37] inet x.x.x.20/32 scope global eth1
[2022/10/21 23:57:37] inet x.x.x.21/32 scope global eth1
[2022/10/21 23:57:37] inet6 fe80::266e:96ff:fe77:4111/64 scope link
[2022/10/21 23:57:37] valid_lft forever preferred_lft forever
[2022/10/21 23:57:37] 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:57:37] link/ether 24:6e:96:77:41:12 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:57:37] 5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:57:37] link/ether 24:6e:96:77:41:13 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:57:37] [root@DB ~]#
[2022/10/21 23:58:07] [root@DB ~]# ps -ef|grep vrrp
[2022/10/21 23:58:07] root 38279 1 0 23:26 ? 00:00:00 /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 21 -p 150 x.x.x.21
[2022/10/21 23:58:07] root 41216 40942 0 23:57 pts/1 00:00:00 grep vrrp
[2022/10/21 23:58:23] [root@DB ~]#
[2022/10/21 23:58:27] [root@DB ~]# ip a
[2022/10/21 23:58:27] 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
[2022/10/21 23:58:27] link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
[2022/10/21 23:58:27] inet 127.0.0.1/8 scope host lo
[2022/10/21 23:58:27] inet6 ::1/128 scope host
[2022/10/21 23:58:27] valid_lft forever preferred_lft forever
[2022/10/21 23:58:27] 2: eth4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:58:27] link/ether 24:6e:96:77:41:10 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:58:27] 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
[2022/10/21 23:58:27] link/ether 24:6e:96:77:41:11 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:58:27] inet x.x.x.24/24 brd x.x.x.255 scope global eth1
[2022/10/21 23:58:27] inet x.x.x.20/32 scope global eth1
[2022/10/21 23:58:27] inet x.x.x.21/32 scope global eth1
[2022/10/21 23:58:27] inet6 fe80::266e:96ff:fe77:4111/64 scope link
[2022/10/21 23:58:27] valid_lft forever preferred_lft forever
[2022/10/21 23:58:27] 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:58:27] link/ether 24:6e:96:77:41:12 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:58:27] 5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:58:27] link/ether 24:6e:96:77:41:13 brd ff:ff:ff:ff:ff:ff
On the slave, the primary VIP was then removed successfully with the ip addr del command.
[2022/10/21 23:58:59] [root@DB ~]# ip addr del x.x.x.20 dev eth1
[2022/10/21 23:58:59] Warning: Executing wildcard deletion to stay compatible with old scripts.
[2022/10/21 23:58:59] Explicitly specify the prefix length (x.x.x.20/32) to avoid this warning.
[2022/10/21 23:58:59] This special behaviour is likely to disappear in further releases,
[2022/10/21 23:58:59] fix your scripts!
[2022/10/21 23:58:59] [root@DB ~]#
[2022/10/21 23:59:11] [root@DB ~]# ip a
[2022/10/21 23:59:11] 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
[2022/10/21 23:59:11] link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
[2022/10/21 23:59:11] inet 127.0.0.1/8 scope host lo
[2022/10/21 23:59:11] inet6 ::1/128 scope host
[2022/10/21 23:59:11] valid_lft forever preferred_lft forever
[2022/10/21 23:59:11] 2: eth4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:59:11] link/ether 24:6e:96:77:41:10 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:59:11] 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
[2022/10/21 23:59:11] link/ether 24:6e:96:77:41:11 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:59:11] inet x.x.x.24/24 brd x.x.x.255 scope global eth1
[2022/10/21 23:59:11] inet6 fe80::266e:96ff:fe77:4111/64 scope link
[2022/10/21 23:59:11] valid_lft forever preferred_lft forever
[2022/10/21 23:59:11] 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:59:11] link/ether 24:6e:96:77:41:12 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:59:11] 5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:59:11] link/ether 24:6e:96:77:41:13 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:59:11] [root@DB ~]#
[2022/10/21 23:59:17] [root@DB ~]# ps -ef|grep vrrp
[2022/10/21 23:59:17] root 41296 40942 0 23:58 pts/1 00:00:00 grep vrrp
On the master, the primary and secondary VIPs were bound back manually.
[2022/10/21 23:58:16] [root@DB ~]# /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 20 -p 150 x.x.x.20
[2022/10/21 23:59:21] [root@DB ~]# /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 20 -p 150 x.x.x.21
[2022/10/21 23:59:26] [root@DB ~]#
[2022/10/21 23:59:35] [root@DB ~]# ip a
[2022/10/21 23:59:35] 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
[2022/10/21 23:59:35] link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
[2022/10/21 23:59:35] inet 127.0.0.1/8 scope host lo
[2022/10/21 23:59:35] inet6 ::1/128 scope host
[2022/10/21 23:59:35] valid_lft forever preferred_lft forever
[2022/10/21 23:59:35] 2: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:59:35] link/ether 24:6e:96:5a:08:40 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:59:35] 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
[2022/10/21 23:59:35] link/ether 24:6e:96:5a:08:42 brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:59:35] inet x.x.x.78/24 brd x.x.x.255 scope global eth1
[2022/10/21 23:59:35] inet x.x.x.20/32 scope global eth1
[2022/10/21 23:59:35] inet x.x.x.21/32 scope global eth1
[2022/10/21 23:59:35] inet6 fe80::266e:96ff:fe5a:842/64 scope link
[2022/10/21 23:59:35] valid_lft forever preferred_lft forever
[2022/10/21 23:59:35] 4: p4p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:59:35] link/ether a0:36:9f:df:3c:6c brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:59:35] 5: p4p2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
[2022/10/21 23:59:35] link/ether a0:36:9f:df:3c:6e brd ff:ff:ff:ff:ff:ff
[2022/10/21 23:59:35] [root@DB ~]#
[2022/10/21 23:59:38] [root@DB ~]# ps -ef|grep vrrp
[2022/10/21 23:59:38] root 49980 1 0 23:57 ? 00:00:00 /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 20 -p 150 x.x.x.20
[2022/10/21 23:59:38] root 50168 1 0 23:58 ? 00:00:00 /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 21 -p 150 x.x.x.21
[2022/10/21 23:59:38] root 50214 49983 0 23:59 pts/2 00:00:00 grep vrrp
Service then returned to normal. Next we investigated why the MHA failover had failed.
Investigation
We first went through the logs, then traced the errors in the logs back to the code.
Logs
mha_manager.log
Log file: /export/servers/masterha/jd_mysql_ha/mha_manager.log
Timeline reconstructed from the log:
- 2022-10-21 23:24:02, alert received
- 2022-10-21 23:26:03, switchover started
- 2022-10-21 23:26:12, switchover failed
Detailed log:
2022-10-21 23:24:02,931 - apscheduler.executors.default - WARNING - found alert server DB/x.x.x.78 - MySQL已5分钟无响应
2022-10-21 23:24:02,931 - apscheduler.executors.default - WARNING - debug get_cluster_list host_list: [(u'x.x.x.78', u'DB/x.x.x.78')]
2022-10-21 23:24:02,944 - apscheduler.executors.default - WARNING - debug get_cluster_list cluster-cluster_id:1441
2022-10-21 23:24:02,944 - apscheduler.executors.default - WARNING - debug get_cluster_list cluster-cluster_source_ip:x.x.x.78
2022-10-21 23:24:02,944 - apscheduler.executors.default - WARNING - debug get_cluster_list cluster-cluster_other_ip:[u'x.x.x.24']
2022-10-21 23:24:02,944 - apscheduler.executors.default - WARNING - debug get_cluster_list cluster-cluster_status:1
2022-10-21 23:24:02,944 - apscheduler.executors.default - WARNING - debug get_cluster_list cluster-cluster_type:vip
2022-10-21 23:24:02,944 - apscheduler.executors.default - WARNING - debug get_cluster_list cluster-master:{'instance_ip': u'x.x.x.78', 'role': 'master', 'vips': [u'x.x.x.20', u'x.x.x.21']}
2022-10-21 23:24:02,945 - apscheduler.executors.default - WARNING - debug get_cluster_list cluster-slave_list:[{'role': 'slave', 'instance_ip': u'x.x.x.24', 'vips': []}]
2022-10-21 23:24:03,138 - apscheduler.executors.default - WARNING - cluster 1441 中x.x.x.24存活检查:True!
2022-10-21 23:24:03,138 - apscheduler.executors.default - WARNING - host info : {'instance_ip': u'x.x.x.78', 'role': 'master', 'vips': [u'x.x.x.20', u'x.x.x.21']} type: <type 'dict'> !
2022-10-21 23:24:03,138 - apscheduler.executors.default - WARNING - Found alert cluster 1441
2022-10-21 23:24:03,138 - apscheduler.executors.default - WARNING - begin add master switch task
2022-10-21 23:24:03,139 - apscheduler.executors.default - WARNING - end add master switch task
2022-10-21 23:24:13,176 - check_alive - ERROR - (2013, 'Lost connection to MySQL server during query (timed out)')
2022-10-21 23:24:33,216 - check_alive - ERROR - (2013, 'Lost connection to MySQL server during query (timed out)')
2022-10-21 23:24:53,255 - check_alive - ERROR - (2013, 'Lost connection to MySQL server during query (timed out)')
2022-10-21 23:25:13,294 - check_alive - ERROR - (2013, 'Lost connection to MySQL server during query (timed out)')
2022-10-21 23:26:03,373 - apscheduler.executors.default - WARNING - Start mha failover, see the cluster 1441 log
2022-10-21 23:26:03,373 - apscheduler.executors.default - WARNING - *****主库宕机,切换开始*****
2022-10-21 23:26:12,092 - apscheduler.executors.default - WARNING - *****主库宕机,切换结束*****
2022-10-21 23:26:12,093 - apscheduler.executors.default - WARNING - master failover failed!
The log shows the liveness check failing with "Lost connection to MySQL server during query (timed out)", i.e. the database could not be connected to.
online_switch.log
Log file: /var/log/masterha/app1/switch/online_switch.log
2022-10-21 23:26:03,373 - <module> - INFO - lock_key: lock_1441_master_x.x.x.78_err ,unique value: 13033-139684285337264
2022-10-21 23:26:03,373 - apscheduler.executors.default - WARNING - Start mha failover, see the cluster 1441 log
2022-10-21 23:26:03,373 - apscheduler.executors.default - WARNING - *****主库宕机,切换开始*****
2022-10-21 23:26:03,378 - root - INFO - [server default]
manager_log=/export/Logs/jd_mysql_ha/cluster_1441-202210212326.log
manager_workdir=/export/Logs/jd_mysql_ha
master_binlog_dir=/export/data/mysql/data
master_ip_failover_script=/export/servers/masterha/jd_mysql_ha/failover_vip.py
master_ip_online_change_script=/export/servers/masterha/jd_mysql_ha/online_change_vip.py
report_script=/export/servers/masterha/jd_mysql_ha/send_report.py
password=xxx
ping_interval=1
remote_workdir=/tmp
repl_password=xxx
repl_user=replicater
shutdown_script=""
ssh_user=root
user=mha_admin
use_gtid_auto_pos=0
[server_x.x.x.78]
hostname=x.x.x.78
port=3358
[server_x.x.x.24]
hostname=x.x.x.24
port=3358
2022-10-21 23:26:12,092 - <module> - INFO - --dead_master_ip=<dead_master_ip> is not set. Using x.x.x.78.
2022-10-21 23:26:12,092 - <module> - INFO - Fri Oct 21 23:26:03 2022 - [info] Reading default configuration from /etc/masterha_default.cnf..
Fri Oct 21 23:26:03 2022 - [info] Reading application default configuration from /export/servers/masterha/etc/cluster_1441.conf..
Fri Oct 21 23:26:03 2022 - [info] Reading server configuration from /export/servers/masterha/etc/cluster_1441.conf..
2022-10-21 23:26:12,092 - apscheduler.executors.default - WARNING - *****主库宕机,切换结束*****
2022-10-21 23:26:12,093 - <module> - INFO - delete -- lock_key: lock_1441_master_x.x.x.78_err ,old value: 13033-139684285337264 ,unique value: 13033-139684285337264
2022-10-21 23:26:12,093 - apscheduler.executors.default - WARNING - master failover failed!
cluster_1441.log
Cluster switchover log file: /export/Logs/jd_mysql_ha/cluster_1441-202210212326.log
It shows that SSH was reachable but MySQL was not.
Fri Oct 21 23:26:04 2022 - [info] Checking master reachability via MySQL(double check)...
Fri Oct 21 23:26:05 2022 - [info] ok.
...
Fri Oct 21 23:26:05 2022 - [info] * Phase 2: Dead Master Shutdown Phase..
Fri Oct 21 23:26:05 2022 - [info]
Fri Oct 21 23:26:06 2022 - [info] HealthCheck: SSH to x.x.x.78 is reachable.
Fri Oct 21 23:26:07 2022 - [info] Forcing shutdown so that applications never connect to the current master..
Fri Oct 21 23:26:07 2022 - [info] Executing master IP deactivation script:
Fri Oct 21 23:26:07 2022 - [info] /export/servers/masterha/jd_mysql_ha/failover_vip.py --orig_master_host=x.x.x.78 --orig_master_ip=x.x.x.78 --orig_master_port=3358 --command=stopssh --ssh_user=root
Fri Oct 21 23:26:08 2022 - [info] done.
Fri Oct 21 23:26:08 2022 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Oct 21 23:26:08 2022 - [info] * Phase 2: Dead Master Shutdown Phase completed.
MHA's Phase 2, the Dead Master Shutdown Phase, exists to stop the old master from taking writes: it first calls the failover_vip.py script to remove the VIPs from the old master, and, if a shutdown script is configured, also shuts the database down. No shutdown script is configured here, so the database is left running.
Just after this we find where the switchover failed: the log shows an error while executing the failover_vip.py script, and MasterFailover.pm reports a bad return code.
Fri Oct 21 23:26:09 2022 - [info] Executing master IP activate script:
Fri Oct 21 23:26:09 2022 - [info] /export/servers/masterha/jd_mysql_ha/failover_vip.py --command=start --ssh_user=root --orig_master_host=x.x.x.78 --orig_master_ip=x.x.x.78 --orig_master_port=3358 --new_master_host=x.x.x.24 --new_master_ip=x.x.x.24 --new_master_port=3358 --new_master_user='mha_admin' --new_master_password=''
(u'x.x.x.20', u'x.x.x.15')
ls /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd && /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 20 -p 150 x.x.x.20 > /dev/null 2>/export/Logs/vrrpd.log
(u'x.x.x.21', u'x.x.x.16')
ls /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd && /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 21 -p 150 x.x.x.21 > /dev/null 2>/export/Logs/vrrpd.log
Fri Oct 21 23:26:11 2022 - [error][/usr/local/share/perl5/MHA/MasterFailover.pm, ln1628] Failed to activate master IP address for x.x.x.24(x.x.x.24:3358) with return code 1:0
Fri Oct 21 23:26:11 2022 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/local/bin/masterha_master_switch line 53
Fri Oct 21 23:26:11 2022 - [info]
----- Failover Report -----
cluster_1441: MySQL Master failover x.x.x.78(x.x.x.78:3358)
Master x.x.x.78(x.x.x.78:3358) is down!
So the failover_vip.py script reported an error while binding the VIPs to the new master. But did the binding actually happen?
The new master's system log (/var/log/messages) shows that from 23:26:08 the primary and secondary VIPs were being brought up on the new master.
So the VIP binding did complete, yet an error was returned to MHA.
Code
Scheduled task
A script running in the background performs the MHA failover (dead-master switchover).
[root@MHA ~]# ps -ef|grep mha|grep -v grep
root 22300 1 0 2022 ? 00:08:41 python27 /export/servers/masterha/jd_mysql_ha/mha_manager.py
The script uses apscheduler to create a scheduled job that runs the main function every 90 seconds.
import sys
from apscheduler.schedulers.blocking import BlockingScheduler
from libs.sys_lib import getlogger
# Config and main() are defined elsewhere in mha_manager.py (omitted here).
if __name__ == "__main__":
    reload(sys)
    sys.setdefaultencoding('utf8')
    log = getlogger(Config.log_file, 'apscheduler.executors.default')
    sched = BlockingScheduler()
    # Run main() every 90 seconds, allowing up to 10 overlapping instances.
    sched.add_job(main, 'interval', seconds=90, max_instances=10)
    sched.start()
Note:
A job's schedule can be one of two types (see the sketch below):
interval, which runs the job every fixed amount of time; and cron (crontab-style), which runs it at specific clock times.
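For reference, a minimal sketch of the two apscheduler trigger types (the job functions here are placeholders, not part of mha_manager.py):
from apscheduler.schedulers.blocking import BlockingScheduler

def health_check():
    print('runs every 90 seconds')      # placeholder job body

def nightly_report():
    print('runs at 02:30 every day')    # placeholder job body

sched = BlockingScheduler()
# 'interval' trigger: run every fixed amount of time (what mha_manager.py uses).
sched.add_job(health_check, 'interval', seconds=90, max_instances=10)
# 'cron' trigger: run at specific clock times, like a crontab entry.
sched.add_job(nightly_report, 'cron', hour=2, minute=30)
sched.start()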
Creating the switchover task
The main logic of the main function:
- get_alert_host_list: from the monitoring data, find hosts with a server failure or a dead mysqld;
- get_cluster_list: query cluster information, such as IPs and VIPs, from the database;
- when the master is down, run switch_master asynchronously.
def main():
    # Switchover for a server failure or a dead mysqld
    host_list = get_alert_host_list()
    if host_list is not None:
        cluster_list = get_cluster_list(host_list, 1)
        if cluster_list == []:
            return
        tasks = []
        for cluster in cluster_list:
            if cluster.get_role(cluster.source_ip) == 'slave':
                log.warning('begin add slave switch task')
                tasks.append(gevent.spawn(switch_slave, cluster))
                log.warning('end add slave switch task')
            else:
                log.warning('begin add master switch task')
                tasks.append(gevent.spawn(switch_master, cluster))
                log.warning('end add master switch task')
        gevent.joinall(tasks)
...
The main logic of switch_master:
- check_alive: liveness check on the master;
- acquire a Redis-based distributed lock so the same task is not run by multiple threads;
- call mha.run_switch to perform the switchover;
- release the distributed lock.
from libs.redis_lib import RedisLock
Redislk = RedisLock()

def switch_master(cluster):
    try:
        # Liveness check on the master
        Config.mha_conn['host'] = cluster.source_ip
        host_status = check_alive(Config.mha_conn, Config.log_file)
        if host_status is True:
            log.warning("The alert server %s seems ok." % cluster.source_ip)
            return False
        else:
            # Distributed lock, so the same task is not run by multiple threads
            switch_key = "%s_master_%s_err" % (cluster.cluster_id, cluster.master['instance_ip'])
            switch_value = Redislk.get(switch_key)
            if not switch_value:
                log.warning('Other thread is doing master switch!')
                return False
            log.warning(
                "Start mha failover, see the cluster %s log" % cluster.cluster_id
            )
            log.warning('*****主库宕机,切换开始*****')
            mha = Mha(cluster)
            mha.run_switch()
            log.warning('*****主库宕机,切换结束*****')
            with open(mha.manage_log, 'r') as f:
                message_to_db = f.read()
            mha.remove_conf()
            log_status = mha.mha_log_status()
            # Release the distributed lock
            Redislk.delete(switch_key, switch_value)
    # (exception handling omitted in this excerpt)
Note that the distributed lock is acquired before and released after the switchover to avoid duplicate switchovers (a minimal sketch of this locking pattern follows).
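The RedisLock implementation itself is not shown in the article; judging from the lock values in the log ("13033-139684285337264", i.e. pid plus thread id), it likely follows the common SET-NX-plus-owner-value pattern. A minimal sketch of that pattern with redis-py (the class name, connection parameters and TTL below are illustrative assumptions, not the actual RedisLock):
import os
import threading
import redis

class SimpleRedisLock(object):
    """Minimal distributed-lock sketch: SET NX EX plus compare-before-delete."""

    def __init__(self, host='127.0.0.1', port=6379, ttl=600):
        self.conn = redis.StrictRedis(host=host, port=port)
        self.ttl = ttl

    def get(self, key):
        # Unique owner value, e.g. "13033-139684285337264" (pid-thread id).
        value = '%s-%s' % (os.getpid(), threading.current_thread().ident)
        # SET key value NX EX ttl: only succeeds if the key does not exist yet.
        if self.conn.set(key, value, nx=True, ex=self.ttl):
            return value
        return None                 # another worker already holds the lock

    def delete(self, key, value):
        # Release only if we still own the lock; this check-then-delete is not
        # atomic -- a small Lua script would make the release atomic.
        if self.conn.get(key) == value:
            self.conn.delete(key)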
The mha.run_switch method invokes the /usr/local/bin/masterha_master_switch tool to perform the switchover.
class Mha():
    def run_switch(self):
        try:
            if self.candidate_host is not None:
                proc = subprocess.Popen(
                    [
                        'masterha_master_switch', ('--conf=%s' % self.conf_file),
                        '--master_state=dead', '--interactive=0',
                        ('--dead_master_host=%s' % self.master_host),
                        '--dead_master_port=3358',
                        ('--new_master_host=%s' % self.candidate_host),
                        '--new_master_port=3358'
                    ], shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE
                )
                out, err = proc.communicate()
            else:
                exec_cmd = '/usr/local/bin/masterha_master_switch' + ' --conf=%s ' % self.conf_file + '--master_state=dead --interactive=0' + ' --dead_master_host=%s' % self.master_host + ' --dead_master_port=3358'
                proc = subprocess.Popen(exec_cmd, shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
                out, err = proc.communicate()
            log.info(str(out))
            log.info(str(err))
            return out, err
        except Exception as ex:
            log.info('run_switch ERROR:%s' % str(ex))
Notably:
- **subprocess.Popen is used to create a child process that runs masterha_master_switch**;
- subprocess.Popen.communicate is used to collect the child's output (see the note and sketch below).
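A side note that matters later: communicate() only returns the text the child wrote to stdout and stderr; the child's exit status is separate and is available as proc.returncode once communicate() has returned. A minimal, self-contained illustration (plain sh command, Python 2 style like the scripts above):
import subprocess

# A child that writes to stderr but still exits successfully.
proc = subprocess.Popen(
    ['sh', '-c', 'echo warning-on-stderr 1>&2; exit 0'],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE
)
out, err = proc.communicate()
print 'stdout=%r stderr=%r returncode=%r' % (out, err, proc.returncode)
# -> stdout='' stderr='warning-on-stderr\n' returncode=0
# Non-empty stderr does not by itself mean the command failed; the authoritative
# signal is the exit status (proc.returncode).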
Now into the MHA program itself.
MHA switchover
cluster_1441.log shows the connectivity double check passing, but does "passing" here mean the master can be connected to, or that it cannot?
Fri Oct 21 23:26:04 2022 - [info] Checking master reachability via MySQL(double check)...
Fri Oct 21 23:26:05 2022 - [info] ok.
...
Fri Oct 21 23:26:05 2022 - [info] * Phase 2: Dead Master Shutdown Phase..
Searching the /usr/local/share/perl5/MHA/MasterFailover.pm source for the "Checking master reachability" message shows that MHA re-checks the master's connectivity before the failover.
# quick check that the dead server is really dead
# not double check when ping_type is insert,
# because check_connection_fast_util can rerurn true if insert-check detects I/O failure.
if ( $servers_config[0]->{ping_type} ne $MHA::ManagerConst::PING_TYPE_INSERT )
{
$log->info("Checking master reachability via MySQL(double check)...");
if (
my $rc = MHA::DBHelper::check_connection_fast_util(
$dead_master->{hostname}, $dead_master->{port},
$dead_master->{user}, $dead_master->{password}
)
)
{
$log->error(
sprintf(
"The master %s is reachable via MySQL (error=%s) ! Stop failover.",
$dead_master->get_hostinfo(), $rc
)
);
croak;
}
$log->info(" ok.");
}
If the database can still be connected to, the failover is aborted. So the absence of an error before the "ok." line means the master could not be connected to: the pre-switchover double check passing confirms the master was unreachable.
cluster_1441.log then logs the following error.
Fri Oct 21 23:26:11 2022 - [error][/usr/local/share/perl5/MHA/MasterFailover.pm, ln1628] Failed to activate master IP address for x.x.x.24(x.x.x.24:3358) with return code 1:0
Fri Oct 21 23:26:11 2022 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/local/bin/masterha_master_switch line 53
The code at the reported location, MasterFailover.pm line 1628, shows the error comes from running the master_ip_failover_script. (MHA splits the script's wait status into a high and a low byte, roughly the exit code and the signal, so "return code 1:0" means the script exited with status 1.)
if ( $new_master->{master_ip_failover_script} ) {
my $command =
"$new_master->{master_ip_failover_script} --command=start --ssh_user=$new_master->{ssh_user} --orig_master_host=$dead_master->{hostname} --orig_master_ip=$dead_master->{ip} --orig_master_port=$dead_master->{port} --new_master_host=$new_master->{hostname} --new_master_ip=$new_master->{ip} --new_master_port=$new_master->{port} --new_master_user=$new_master->{escaped_user} --new_master_password=$new_master->{escaped_password}";
...
my ( $high, $low ) = MHA::ManagerUtil::exec_system( $command, $g_logfile );
if ( $high == 0 && $low == 0 ) {
$log->info(" OK.");
$mail_body .=
$new_master->get_hostinfo() . ": OK: Activated master IP address.\n";
}
else {
my $message =
"Failed to activate master IP address for "
. $new_master->get_hostinfo()
. " with return code $high:$low";
$log->error( " " . $message ); # ln1628
The configuration file /export/servers/masterha/etc/cluster_1441.conf maps master_ip_failover_script to the failover_vip.py script.
master_ip_failover_script=/export/servers/masterha/jd_mysql_ha/failover_vip.py
Switchover script
The failover_vip.py script removes or binds VIPs depending on the command-line arguments it receives.
The MHA command above passes --command=start, so the error happened while binding the VIPs.
if __name__ == "__main__":
    parser = optparse.OptionParser()
    ...
    orig_master_host = options.orig_master_host
    new_master_host = options.new_master_host
    vips = get_vip(orig_master_host)
    if command == 'stop' or command == 'stopssh':
        stop_vips(orig_master_host)
    elif command == 'start':
        # Get the network interface on the new master
        interface = get_interface(new_master_host)
        # Bind the VIPs to that interface on the new master
        out, err = start_vip(new_master_host, vips, interface[0])
        if err is None:
            sys.exit(0)
        else:
            sys.exit(1)
    elif command == 'status':
        print 'OK'
The start_vip method likewise uses subprocess.Popen to create a child process (ssh) that binds the VIP.
def start_vip(host, vips, interface):
    out_all = ''
    err_all = ''
    for vip in vips:
        print vip
        ##ip_tail = vip[1].split('.')[-1]
        ip_tail = vip[0].split('.')[-1]
        cmd = '/export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i %s -n -v %s -p 150 %s >> /export/Logs/vrrpd.log 2>&1' % (
            interface, ip_tail, vip[0]
        )
        print cmd
        proc = subprocess.Popen(
            ['ssh', host, cmd], shell=False,
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
        out, err = proc.communicate()
        if err is None:
            err = ''
        out_all = out_all + out
        err_all = err_all + err
    return out_all, err_all
Note that nothing is logged when an error comes back, so extra logging was added while debugging (see also the minimal reproduction sketch after the snippet below).
out, err = proc.communicate()
if err is None:
    err = ''
else:
    print 'err:' + err  # added for debugging
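Based on the two code paths above, a plausible minimal reproduction (an inference drawn from the code shown, not a captured production trace): start_vip() initializes err_all to an empty string and appends whatever the ssh child wrote to stderr, so the value handed back to the caller is always a str, never None; the caller's "if err is None" check therefore never takes the success branch, and any noise ssh itself prints to stderr lands in err even when vrrpd started fine on the remote host. Python 2, mirroring the original script:
import subprocess

def start_vip_like(cmd_argv):
    # Mirrors the shape of the original start_vip(): stderr is accumulated into a str.
    err_all = ''                                     # starts as '', never None
    proc = subprocess.Popen(cmd_argv, shell=False,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    err_all = err_all + (err or '')                  # ssh warnings would land here too
    return out, err_all

# Simulate a remote command that succeeds but emits a warning on stderr,
# the way 'ssh host vrrpd ...' can when ssh prints a host-key or banner notice.
out, err = start_vip_like(['sh', '-c', 'echo vip bound; echo some-warning 1>&2; exit 0'])

# Caller logic in failover_vip.py: sys.exit(0) if err is None, else sys.exit(1).
# err is a str here ('' at best, warning text at worst), so "err is None" is False,
# the script exits 1, and MHA logs "return code 1:0" although the VIP was bound.
print 'err=%r  err is None: %s' % (err, err is None)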
Improvements
Adding logging
In failover_vip.py, the start_vip method returns whatever error it captured, but the current code never prints it.
out, err = start_vip(new_master_host, vips, interface[0])
if err is None:
    sys.exit(0)
else:
    sys.exit(1)
So we print the error carried in start_vip's return value.
interface = get_interface(new_master_host)
out, err = start_vip(new_master_host, vips, interface[0])
if err is None:
    sys.exit(0)
else:
    print 'err:' + err        # added
    print 'sys.exit:' + '1'   # added
    sys.exit(1)
With this logging in place, testing reproduced the problem: the child process that binds the VIP returned an error, which made the MHA failover fail.
Changing the invocation method
Colleagues pointed out that the subprocess approach has plenty of pitfalls, so the VIP binding was switched from a local subprocess to a remote call.
As shown below, the start_vip method now calls cmd_remote to bind the VIP.
def start_vip(host, vips, interface):
    out_all = ''
    err_all = None
    ...
    for vip in vips:
        # Bind the VIP on the remote host (ip_tail and ip_vip are derived in the omitted code above)
        cmd = "nohup /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i %s -n -v %s -p 150 %s" % (
            interface, ip_tail, ip_vip)
        res_cmd = cmd_remote(host, cmd)
        if not res_cmd['success']:
            err_all = res_cmd['message']
    return out_all, err_all
The cmd_remote method uses the paramiko module to log in to the server with an SSH key and run the shell command.
import time
import paramiko
# log is the module-level logger defined elsewhere in the script.

def cmd_remote(ip, cmd):
    success = False
    message = ''
    data = {}
    status = None
    try:
        pkey = paramiko.RSAKey.from_private_key_file('/root/.ssh/id_rsa')
        ssh = paramiko.SSHClient()
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        ssh.connect(hostname=ip,
                    port=22,
                    username='root',
                    pkey=pkey)
        if cmd.startswith('nohup'):
            invoke = ssh.invoke_shell()
            invoke.send(cmd + ' & \n ')
            # Wait for the command to finish
            time.sleep(3)
        else:
            # The result goes to stdout; errors, if any, go to stderr
            stdin, stdout, stderr = ssh.exec_command(cmd, timeout=10)
            data = {'stdout': [item.strip() for item in stdout.readlines()],
                    'stderr': [item.strip() for item in stderr.readlines()],
                    'status': stdout.channel.recv_exit_status()
                    }
        success = True
    except Exception as ex:
        message = "func cmd_remote error: %s" % (str(ex))
        log.error(message)
        success = False
    finally:
        ssh.close()
    return {'success': success, 'message': message, 'data': data}
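A usage sketch, mirroring how the new start_vip() calls cmd_remote(); the host and vrrpd arguments are placeholders:
res = cmd_remote('x.x.x.24',
                 'nohup /export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 20 -p 150 x.x.x.20')
if not res['success']:
    # Only a failure to run the remote command (SSH/auth error, timeout) counts as an error now;
    # stray stderr output no longer flips the whole failover to "failed".
    print 'bind vip failed: %s' % res['message']
else:
    # For nohup commands the invoke_shell() path is used and res['data'] stays {};
    # for other commands it holds the remote stdout/stderr and exit status.
    print 'bind vip ok'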
kill.sh
As mentioned above, after logging in to the server we found a large number of kill.sh scripts running.
The script kills SELECT statements on the reporting database that have been running for more than 10 minutes.
[root@DB ~]# cat /export/mysql_script/mysql_tools/kill.sh
#!/bin/bash
today=`date -d today +"%Y%m%d"`
DB_list_SQL=`/export/servers/mysql/bin/mysql -uadmin -p -e"show databases;"`
DB_slave_status_SQL=`/export/servers/mysql/bin/mysql -uadmin -p -e"show slave status\G;" |grep Master_Host |awk -F ': ' '{print $2}'`
DB_name="report"
result=$(echo $DB_list_SQL |grep "$DB_name")
function kill_exec_sql_greater_than10min()
{
    if [ "$DB_slave_status_SQL" != "" -a "$DB_slave_status_SQL" != "1.1.1.1" -a "$result" != "" ] ;then
        echo 'this server is report slave server'
        `/export/servers/mysql/bin/mysql -uadmin -p -e "select *,now() from information_schema.processlist where info is not null and user in ('report_ro','rp_rw') and COMMAND <> 'Killed' and time>600 ;" >> /export/data/mysql/log/killsql.log.$today`
        echo "recover"
        `/export/servers/mysql/bin/mysql -uadmin -p -N -e "select concat('KILL ',id,';') from information_schema.processlist where info is not null and user in ('report_ro','rp_rw') and COMMAND <> 'Killed' and time>600 ;" > /tmp/processlist_kill.txt`
        `/export/servers/mysql/bin/mysql -uadmin -p -e "source /tmp/processlist_kill.txt;"`
    else
        echo 'this server is master server or report master'
    fi
}
kill_exec_sql_greater_than10min
The crontab shows the job runs once a minute.
[root@DB ~]# crontab -l|grep kill
*/1 * * * * sh /export/mysql_script/mysql_tools/kill.sh > /dev/null 2>&1
Crontab jobs scattered across servers are hard to manage, so this job had already been migrated to Celery for centralized scheduling, with its frequency reduced from once a minute to once every 10 minutes; this server had simply been missed.
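A minimal sketch of what centralizing such a job with Celery beat might look like (module name, broker URL and task body are illustrative assumptions, not the actual production configuration; Celery 4-style settings):
# tasks.py -- illustrative only
from celery import Celery
from celery.schedules import crontab
import subprocess

app = Celery('mysql_tools', broker='redis://127.0.0.1:6379/0')

# Run the kill script every 10 minutes instead of every minute.
app.conf.beat_schedule = {
    'kill-long-report-queries': {
        'task': 'tasks.kill_long_queries',
        'schedule': crontab(minute='*/10'),
    },
}

@app.task
def kill_long_queries():
    # Same effect as the old crontab entry, now scheduled centrally by celery beat.
    subprocess.call(['sh', '/export/mysql_script/mysql_tools/kill.sh'])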
Background
MHA
MHA architecture
MHA (Master High Availability) is a mature, open-source MySQL high-availability program written in Perl and originally developed in Japan.
MHA can complete an automatic failover smoothly within roughly 0-30 seconds, and supports filling in missing data and rebuilding replication, maximizing data consistency.
An MHA setup requires at least one master and two slaves, each on its own server; it cannot be built with multiple instances on a single machine.
MHA consists of two parts, MHA Manager (the management node) and MHA Node (the data node). Both kinds of node are made up of multiple utility scripts, so MHA is essentially a toolkit.
Architecturally, Manager and Node follow a server-client design: the Manager can be deployed on a dedicated machine to manage multiple master-slave clusters, or on one of the slave nodes, while a Node runs on every MySQL server.
Note that MHA only handles high availability at the database layer, so it is usually combined with other components; common combinations are:
- MHA + VIP
- MHA + Keepalived
- MHA + ZooKeeper
Among these, MHA + VIP is a relatively mature and classic combination: the architecture is simple and easy to roll out. Applications access the database through a VIP, which can float between nodes according to their state, giving seamless high availability.
This is the scheme used here.
VIP
A VIP (Virtual IP) is a virtual IP address: an alias temporarily bound to a physical NIC, and one NIC can carry several such aliases.
The VIP serves as the floatable resource for master-slave switchover, letting the connection target change without the application changing its connection string.
Applications usually locate the database service through a URL-like connection string, such as jdbc:mysql://192.168.1.1:3306/mydb.
The essential part is the "address + port"; the address can be a VIP or a DNS name.
So, without touching the connection string, the connection target can be switched during failover by floating the IP or the domain name.
The floatable resource is a network resource rather than something intrinsic to the database service, so this approach is mostly used for databases that lack a built-in HA mechanism and rely on an external framework for monitoring and switching; it is a fairly primitive connection-switching scheme.
Domain-name switching is more reliable than VIP switching. A VIP must remain globally unique, so a switchover takes two steps: first remove the VIP from the failed node, and only then add it on the target node. Removing the VIP from a failed node cannot be guaranteed; if the failed node is in a bad enough state, the database may be judged dead while the VIP is still online, and the removal fails.
A domain name, as the floatable resource, is managed centrally by the DNS service: the switch neither depends on nor requires any operation against the failed node, and DNS has its own mature HA and redundancy mechanisms, so DNS-based switching is more reliable and has a noticeably higher success rate.
Domain-name switching has its own pain points, though, such as client-side caching of DNS records.
Here the VIPs are managed via the VRRP protocol.
VRRP
Hosts normally communicate through static routes or a default gateway; in that model the router becomes a single point of failure, and when it fails, communication breaks. VRRP was introduced to address this.
VRRP (Virtual Router Redundancy Protocol) is an active-standby fault-tolerance protocol designed to avoid a single point of failure at the router.
It virtualizes several physical routers into one virtual router, which serves traffic through one or more VIPs. The routers that make up the virtual router play one of two roles:
- Master: at any given time only one physical router provides network functions such as answering ARP requests and forwarding ICMP and data traffic; this device is the master router;
- Backup: the remaining routers do not own the external VIP and provide no external network functions; they only receive the Master's VRRP advertisements.
So VRRP involves an election among the physical routers. Each device's role, Master or Backup, is decided by priority: the higher the priority, the more likely it becomes the Master.
As shown below, a VIP can be bound and removed from the command line.
/export/servers/vrrp_scripts/vrrpd-1.0/vrrpd -i eth1 -n -v 21 -p 150 x.x.x.21
ip addr del x.x.x.21 dev eth1
Here, the -p option sets the physical device's priority; the virtual router elects Master and Backup by priority, so when several devices bind the same VIP, the one with the highest priority serves it as Master.
As shown below, during the MHA failover the new master's system log records the device being promoted from Backup to Master.
Oct 21 23:26:08 HOSTNAME vrrpd: VRRP ID 20 on eth1: we are now a backup router.
Oct 21 23:26:08 HOSTNAME vrrpd: VRRP ID 21 on eth1: we are now a backup router.
...
Oct 21 23:26:10 HOSTNAME vrrpd: VRRP ID 20 on eth1: we are now the master router.
Oct 21 23:26:11 HOSTNAME vrrpd: VRRP ID 21 on eth1: we are now the master router.
Conclusion
An alert reported a database unresponsive, and the switchover program reported that the switchover had failed, but it was unclear how far the switchover had progressed.
On the surface, the master could be pinged and reached over SSH, but MySQL logins failed, while replication on the slave was normal.
About three minutes later MySQL logins on the master succeeded, but by then the primary VIP had already been moved to the slave, which still had read_only enabled; manually binding the primary VIP back to the master restored service.
Going through the logs and the code, we analyzed and reproduced the problem: the child process that binds the VIP returned an error, causing the MHA failover to fail.
Subprocesses have plenty of pitfalls, so the fix replaced the subprocess-based VIP binding with a remote call.
Even without being familiar with the MHA switchover flow, strictly checking against the switchover logic at the first sign of trouble would have allowed a timely fix and a shorter business impact; as it was, a problem solvable in three minutes took half an hour.
When checking switchover progress, verify the VIP bindings first, because MHA removes the VIP at the very start of a switchover.
MHA consists of two parts, MHA Manager (the management node) and MHA Node (the data node); MHA is essentially a toolkit.
We use MHA + VIP for high availability, with the VIPs managed via VRRP, which elects the master by configurable priority.
To do
- Redis distributed locks
- MHA deployment testing and internals
- MHA: from one master / two slaves to one master / one slave
- subprocess pitfalls
References
- 《MySQL DBA工作笔记》
- MySQL高可用MHA原理及测试
- 技术分享 | MHA-MasterFailover 分析
- yoshinorim / mha4mysql-manager
- 理解VRRP协议