本帖最后由 郑全 于 2026-4-25 14:50 编辑
如何检查并确认是什么原因导致发生了切换
MySQL InnoDB Cluster 切换原因排查
1. 查看 Group Replication 状态(第一时间)
-- 查看当前集群成员状态
SELECT * FROM performance_schema.replication_group_members;
-- 查看当前 Primary
SELECT * FROM performance_schema.replication_group_members
WHERE member_role = 'PRIMARY';
-- 查看成员详细信息
SELECT member_id, member_host, member_port, member_state, member_role, member_version
FROM performance_schema.replication_group_members;
关键状态字段:member_state = ONLINE / RECOVERING / UNREACHABLE / OFFLINE
2. 查看 Group Replication 消息(判断是否为网络/超时)
-- 查看成员连接/通信统计
SELECT * FROM performance_schema.replication_group_member_stats;
-- 重点关注这些列:
-- count_transactions_checked: 已检查的事务数
-- count_transactions_in_proposal: 待提议的事务数
-- last_conflict_free_transaction: 最后无冲突事务
-- count_disconnects: 断开次数(关键!)
3. 查看错误日志(最重要)
# MySQL 日志路径(根据安装方式)
cat /var/log/mysql/error.log#
或cat /opt/mysql/8.0/data/error.log#
关键搜索词:grep
grep -i "group replication" error.log
grep -i "evict" error.log # 驱逐相关
grep -i "primary" error.log
grep -i "timeout" error.log
grep -i "unreachable" error.log
grep -i "quorum" error.log
grep -i "member" error.log
典型日志标志:
4. MySQL Shell 检查
# 连接管理接口
mysqlsh --uri clusteradmin@localhost:3306
# 查看集群状态
var cluster = dba.getCluster()
cluster.status()
# 查看详细信息(包含各节点延迟、状态)
cluster.status({"extended":1})
# 查看集群拓扑cluster.listRouters()
5. 查看切换前后时间点(定位时间线)
-- 查看事务时间线(需要开启 performance_schema)
SELECT member_id, COUNT_TRANSACTIONS_ROWS_VALIDATING, LAST_CONFLICT_FREE_TRANSACTION
FROM performance_schema.replication_group_member_stats;
-- 查看选举信息(MySQL 8.0+)
SELECT event_name, timer_wait, processlist_state
FROM performance_schema.events_transactions_history_long
WHERE thread_id IN
( SELECT thread_id FROM performance_schema.replication_group_member_stats)
LIMIT 20;
6. 系统层面排查(如果不是 MySQL 自身问题)
# 查看 CPU/内存/OOM
top -b1 | head -20
dmesg | grep -i "oom"
dmesg | grep -i "kill"
# 查看网络丢包/断开
netstat -iip -s link
cat /proc/net/dev
# 查看磁盘 I/O 是否饱和
iostat -x 1 5
dmesg | grep -i "io error"
# 查看 MySQL 进程是否被 OOM Kill
journalctl -u mysql | grep -i "killed"
7. 常见切换原因对照 快速排查路径
① 查 error.log 关键词
(evict/timeout/quorum/elected)
↓② 查 performance_schema.replication_group_members
(当时各节点状态)
↓③ 查 performance_schema.replication_group_member_stats(断开次数)
↓④ 查系统层面(dmesg/top/netstat)
排除 OOM/网络
↓⑤ 综合时间线定位根因
|