gdb may cause Oracle 12.2 RAC nodes to reboot automatically

郑全 · Posted 2017-6-23 10:54:19

Last edited by 郑全 on 2017-6-23 12:33

Unexpected clusterware crashes after the installation or upgrade to 12.2 Grid Infrastructure (Doc ID 2251437.1)

In this Document

Symptoms
Changes
Cause
Solution
References

Applies to:
Oracle Database - Enterprise Edition - Version 12.2.0.1 to 12.2.0.1 [Release 12.2]
Information in this document applies to any platform.
Symptoms
Customers running Grid Infrastructure 12.2 may see clusterware crashes because critical clusterware processes such as "ocssd.bin" get blocked at the OS level. The clusterware CSSD agent/monitor trace files indicate that CSSD is unresponsive to the local heartbeat, which eventually results in the removal of the node from the cluster.

ohasd_cssdagent_root.trc:
2017-04-20 11:34:30.980 : USRTHRD:4255610624: (:CLSN00121:)clsnproc_reboot: Impending reboot at 50% of limit 57900; disk timeout 57900, network timeout 57840, last heartbeat from CSSD at epoch seconds 1492709642.020, 28960 milliseconds ago based on invariant clock 86481264; now polling at 100 ms
2017-04-20 11:34:54.203 : USRTHRD:4255610624: (:CLSN00121:)clsnproc_reboot: Impending reboot at 90% of limit 57900; disk timeout 57900, network timeout 57840, last heartbeat from CSSD at epoch seconds 1492709642.020, 52180 milliseconds ago based on invariant clock 86481264; now polling at 100 ms
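A minimal sketch for pulling these warnings out of a trace file, assuming the message layout shown in the two lines above. The function name `reboot_warnings` and the output format are made up for illustration. The limit appears to be in milliseconds, since 28960 and 52180 milliseconds are roughly 50% and 90% of 57900.

```shell
# Hypothetical helper: summarize "Impending reboot" warnings from a CSSD
# agent/monitor trace. Reads the named files, or stdin if none are given.
# Assumes the message layout seen in the excerpt above.
reboot_warnings() {
    sed -n 's/^\([0-9-]* [0-9:.]*\) .*Impending reboot at \([0-9]*\)% of limit \([0-9]*\);.* \([0-9]*\) milliseconds ago.*/\1 \2% limit=\3ms since_heartbeat=\4ms/p' "$@"
}

# Usage (the trace file location varies by installation):
#   reboot_warnings ohasd_cssdagent_root.trc
```

Each matching line is reduced to its timestamp, the percentage of the limit reached, the limit, and the time since the last CSSD heartbeat; all other lines are dropped.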

From the ExaWatcher/OS Watcher data, we can see that the CSSD process is in the "t" state.
zzz <04/20/2017 11:34:06> subcount: 165
4 t oracle 30987 1 0 0 139 - - 259312 869777 ptrace_stop Apr19 ? 00:10:52 /u01/app/12.2.0.1/grid/bin/ocssd.bin
Status "t" indicates that the process is stopped, either by a job control signal or because it is being traced.
From the process tree, we can see that a pstack command is running against ocssd.bin and that its parent process is the <GRID HOME>/bin/diagsnap.pl script.
4 S root 47516 44696 0 1 19 0 - 20464 57931 hrtimer_nanosleep Apr19 ? 00:00:57 /u01/app/12.2.0.1/grid/perl/bin/perl /u01/app/12.2.0.1/grid/bin/diagsnap.pl start
4 S root 223835 47516 0 6 19 0 - 2516 26588 wait 11:34 ? 00:00:00 sh -c for i in {1..3}; do printf "zzz "; date; /usr/bin/pstack 30987; sleep 5; done >> "/u01/app/oracle/diagsnap/edlc0003va.xcelenergy.com/evt_1_20170420-113401/pstack_30987_ocssd_bin.trc" 2>&1
0 S root 223850 223835 0 31 19 0 - 2908 26587 wait 11:34 ? 00:00:00 /bin/sh /usr/bin/pstack 30987 <===
4 S root 223889 223850 4 33 19 0 - 60884 57611 wait 11:34 ? 00:00:00 /usr/bin/gdb --quiet --readnever -nx /proc/30987/exe 30987
0 S root 223890 223850 0 30 19 0 - 1200 26331 pipe_wait 11:34 ? 00:00:00 /bin/sed -n -e s/^$(gdb) $*// -e /^#/p -e /^Thread/p
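The same check can be sketched on a live Linux node. This is only an illustration: the function names `find_traced` and `tracer_of` are hypothetical, and the sketch assumes procps `ps` and the `TracerPid` field of /proc/<pid>/status.

```shell
# Hypothetical helper: keep only rows whose state column (2nd field)
# starts with "t", i.e. processes stopped while being traced.
# Expects input such as `ps -eo pid,stat,wchan:20,args` on stdin.
find_traced() {
    awk '$2 ~ /^t/ { print }'
}

# Hypothetical helper: print the PID of the process tracing PID $1
# (0 means the process is not being traced).
tracer_of() {
    awk '/^TracerPid:/ { print $2 }' "/proc/$1/status"
}

# Usage:
#   ps -eo pid,stat,wchan:20,args | find_traced
#   tracer_of 30987
```

If ocssd.bin shows up in the first listing, the second call should point at the gdb process that has it stopped.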

Another symptom of this issue is several gdb sessions running at the OS level and consuming high CPU. These gdb sessions are attached to the clusterware processes, as the /proc/<pid>/cmdline output below confirms.

21578 root 20 0 357996 127036 25884 R 27.1 1.1 0:00.82 /usr/bin/gdb --quiet -nx /proc/3830/exe 3830
21570 root 20 0 316032 122672 44692 R 26.4 1.0 0:00.80 /usr/bin/gdb --quiet -nx /proc/5352/exe 5352
21587 root 20 0 310676 117236 42404 R 25.7 1.0 0:00.78 /usr/bin/gdb --quiet -nx /proc/4935/exe 4935
21598 root 20 0 357820 127604 26632 R 25.7 1.1 0:00.78 /usr/bin/gdb --quiet -nx /proc/5894/exe 5894
# cat /proc/3830/cmdline
/oracle/app/product/grid/12.2.0.1/bin/ohasd.binreboot
# cat /proc/5352/cmdline
/oracle/app/product/grid/12.2.0.1/bin/ocssd.bin
# cat /proc/4935/cmdline
/oracle/app/product/grid/12.2.0.1/bin/gipcd.bin
# cat /proc/5894/cmdline
/oracle/app/product/grid/12.2.0.1/bin/crsd.bin reboot
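Arguments in /proc/<pid>/cmdline are separated by NUL bytes, which is why plain `cat` can run them together (as in "ohasd.binreboot" above). A small sketch that prints them space-separated instead; the function name `print_cmdline` is hypothetical.

```shell
# Hypothetical helper: print a NUL-separated cmdline file with spaces
# between the arguments. $1 is the path, e.g. /proc/3830/cmdline.
print_cmdline() {
    tr '\0' ' ' < "$1" | sed 's/ *$//'
    echo
}

# Usage:
#   print_cmdline /proc/3830/cmdline
```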

Changes
12.2 Grid Infrastructure installed/upgraded.
Cause
Starting with version 12.2.0.1, the Cluster Health Monitor (CHM) framework by default continuously executes the script "<GRID HOME>/bin/diagsnap.pl". Under certain conditions, this script executes the "pstack" command against critical clusterware processes.
The output of "pstack" can be useful for diagnosing clusterware issues, but running "pstack" (which, as the process tree above shows, attaches gdb to the target process) and the associated locking can cause these key clusterware processes, especially ocssd.bin, to hang, which can trigger clusterware crashes.

Solution
As part of the fix for bug 25717212, diagsnap.pl skips pstack collection. This fix is included in the upcoming 12.2.0.2 patchset.
The current workaround is to disable the diagsnap collection.
# <GRID_HOME>/bin/oclumon manage -disable diagsnap
Exadata customers, please follow the instructions given in the Post-upgrade Steps section of the 12.2 Grid Infrastructure upgrade guide Doc ID 2111010.1 and disable the diagsnap collection for new 12.2 deployments or upgrades.
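The workaround command can be wrapped in a small guard, sketched here under the assumption that the Grid home path is passed in explicitly. The function name `disable_diagsnap` is hypothetical; the only oclumon invocation used is the one given above.

```shell
# Hypothetical wrapper: run the documented workaround command only if
# oclumon exists under the given Grid home. $1 is the Grid home path.
disable_diagsnap() {
    dd_oclumon="$1/bin/oclumon"
    if [ ! -x "$dd_oclumon" ]; then
        echo "oclumon not found under $1/bin" >&2
        return 1
    fi
    "$dd_oclumon" manage -disable diagsnap
}

# Usage:
#   disable_diagsnap /u01/app/12.2.0.1/grid
```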

References
BUG:25810099 - CRS SHUTDOWN OF ONE NODE TRIGGERS NODE REBOOT OF THE OTHER NODE

BUG:25947224 - EXADATA X6-2 CLUSTERS: IB SWITCH RESTART CAUSES EXADATA COMPUTE NODE REBOOT

郑全 · Posted 2017-6-23 10:55:56

So once again, we have to wait for the 12.2.0.2 PSU.

郑全 · Posted 2017-6-23 12:11:09

Result of running the command:

[grid@rac1 ~]$ oclumon manage -disable diagsnap
Diagsnap option is successfully disabled on rac1
Diagsnap option is successfully disabled on rac2
Successfully Disabled diagsnap

郑全 · Posted 2017-6-23 12:33:54

The corresponding bug numbers are:

24900613
25785073
25810099
