重庆思庄 Oracle and Redhat Certification Study Forum


gdb may cause Oracle 12.2 RAC nodes to reboot automatically

Posted on 2017-6-23 10:54:19
Last edited by 郑全 on 2017-6-23 12:33

Unexpected clusterware crashes after the installation or upgrade to 12.2 Grid Infrastructure (Doc ID 2251437.1)

In this Document

Symptoms
Changes
Cause
Solution
References


Applies to:
Oracle Database - Enterprise Edition - Version 12.2.0.1 to 12.2.0.1 [Release 12.2]
Information in this document applies to any platform.
Symptoms
Customers running Grid Infrastructure 12.2 may see clusterware crashes because critical clusterware processes such as "ocssd.bin" are blocked at the OS level. The clusterware CSSD agent/monitor trace files indicate that CSSD is unresponsive to the local heartbeat, which eventually results in the removal of the node from the cluster.

ohasd_cssdagent_root.trc:
2017-04-20 11:34:30.980 : USRTHRD:4255610624: (:CLSN00121:)clsnproc_reboot: Impending reboot at 50% of limit 57900; disk timeout 57900, network timeout 57840, last heartbeat from CSSD at epoch seconds 1492709642.020, 28960 milliseconds ago based on invariant clock 86481264; now polling at 100 ms
2017-04-20 11:34:54.203 : USRTHRD:4255610624: (:CLSN00121:)clsnproc_reboot: Impending reboot at 90% of limit 57900; disk timeout 57900, network timeout 57840, last heartbeat from CSSD at epoch seconds 1492709642.020, 52180 milliseconds ago based on invariant clock 86481264; now polling at 100 ms
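
The two warnings above carry the numbers needed to judge how close the node is to eviction. The following is a minimal sketch, assuming the message layout shown in these two lines and assuming "limit" is expressed in milliseconds (which matches 28960 ms being reported as 50% of 57900); the function name and the returned keys are illustrative and not part of any Oracle tool.

```python
import re

# Matches the "Impending reboot" lines as they appear in the trace excerpt.
REBOOT_RE = re.compile(
    r"Impending reboot at (?P<pct>\d+)% of limit (?P<limit>\d+); "
    r"disk timeout (?P<disk>\d+), network timeout (?P<net>\d+), "
    r".*?, (?P<ago>\d+) milliseconds ago"
)


def parse_reboot_warning(line):
    """Return the timing fields of one warning line, or None if it does not match."""
    m = REBOOT_RE.search(line)
    if m is None:
        return None
    limit = int(m.group("limit"))
    ago = int(m.group("ago"))
    return {
        "percent_reported": int(m.group("pct")),
        "limit_ms": limit,             # assumed to be milliseconds
        "heartbeat_age_ms": ago,
        "remaining_ms": limit - ago,   # time left before the limit is reached
    }
```

Applied to the two lines above, the remaining time drops from 28940 ms to 5720 ms, which shows that CSSD sent no heartbeat at all between the two warnings.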

From the ExaWatcher/OS Watcher data, we can see that the CSSD process is in the "t" state.
zzz <04/20/2017 11:34:06> subcount: 165
4 t oracle 30987 1 0 0 139 - - 259312 869777 ptrace_stop Apr19 ? 00:10:52 /u01/app/12.2.0.1/grid/bin/ocssd.bin
The status "t" indicates that the process is stopped, either by a job control signal or because it is being traced.
From the process tree, we can see that a pstack command is running against ocssd.bin and that its parent process is the <GRID HOME>/bin/diagsnap.pl script.
4 S root 47516 44696 0 1 19 0 - 20464 57931 hrtimer_nanosleep Apr19 ? 00:00:57 /u01/app/12.2.0.1/grid/perl/bin/perl /u01/app/12.2.0.1/grid/bin/diagsnap.pl start
4 S root 223835 47516 0 6 19 0 - 2516 26588 wait 11:34 ? 00:00:00 sh -c for i in {1..3}; do printf "zzz "; date; /usr/bin/pstack 30987; sleep 5; done >> "/u01/app/oracle/diagsnap/edlc0003va.xcelenergy.com/evt_1_20170420-113401/pstack_30987_ocssd_bin.trc" 2>&1
0 S root 223850 223835 0 31 19 0 - 2908 26587 wait 11:34 ? 00:00:00 /bin/sh /usr/bin/pstack 30987 <===
4 S root 223889 223850 4 33 19 0 - 60884 57611 wait 11:34 ? 00:00:00 /usr/bin/gdb --quiet --readnever -nx /proc/30987/exe 30987
0 S root 223890 223850 0 30 19 0 - 1200 26331 pipe_wait 11:34 ? 00:00:00 /bin/sed -n -e s/^\((gdb) \)*// -e /^#/p -e /^Thread/p
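
On Linux, the same check can be made directly against /proc/<pid>/status, whose "State" and "TracerPid" fields show whether a process is stopped and which process is tracing it. The following is a minimal sketch; the helper names are hypothetical, and the sample text in the usage note is constructed from the PIDs in the listing above rather than captured from a real system.

```python
def parse_proc_status(text):
    """Return (state_letter, tracer_pid) from the text of /proc/<pid>/status.

    tracer_pid is 0 when no process is tracing the target.
    """
    state, tracer = None, 0
    for line in text.splitlines():
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "State" and value:
            state = value.split()[0]      # e.g. "t" from "t (tracing stop)"
        elif key == "TracerPid" and value:
            tracer = int(value)
    return state, tracer


def read_tracer(pid):
    """Read /proc/<pid>/status for a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_proc_status(fh.read())
```

For example, a status text containing "State: t (tracing stop)" and "TracerPid: 223889" would yield ("t", 223889), pointing at the gdb process in the tree above as the one holding ocssd.bin stopped.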

Another symptom of this issue is several gdb sessions running at the OS level and consuming high CPU. These gdb sessions are attached to the clusterware processes:

21578 root 20 0 357996 127036 25884 R 27.1 1.1 0:00.82 /usr/bin/gdb --quiet -nx /proc/3830/exe 3830
21570 root 20 0 316032 122672 44692 R 26.4 1.0 0:00.80 /usr/bin/gdb --quiet -nx /proc/5352/exe 5352
21587 root 20 0 310676 117236 42404 R 25.7 1.0 0:00.78 /usr/bin/gdb --quiet -nx /proc/4935/exe 4935
21598 root 20 0 357820 127604 26632 R 25.7 1.1 0:00.78 /usr/bin/gdb --quiet -nx /proc/5894/exe 5894
# cat /proc/3830/cmdline
/oracle/app/product/grid/12.2.0.1/bin/ohasd.binreboot
# cat /proc/5352/cmdline
/oracle/app/product/grid/12.2.0.1/bin/ocssd.bin
# cat /proc/4935/cmdline
/oracle/app/product/grid/12.2.0.1/bin/gipcd.bin
# cat /proc/5894/cmdline
/oracle/app/product/grid/12.2.0.1/bin/crsd.bin reboot
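
The mapping from each gdb session to the clusterware process it is attached to can be scripted. The following is a minimal sketch, assuming gdb was started in the form shown above ("/proc/<pid>/exe <pid>"); both function names are hypothetical. Note that /proc/<pid>/cmdline separates arguments with NUL bytes, which is probably why "ohasd.bin" and "reboot" appear run together in the cat output above.

```python
def gdb_target_pid(cmdline):
    """Extract the traced PID from a gdb command line such as
    '/usr/bin/gdb --quiet -nx /proc/3830/exe 3830'. Returns None otherwise."""
    parts = cmdline.split()
    if not parts or not parts[0].endswith("gdb"):
        return None
    # Look for the '/proc/<pid>/exe <pid>' pair used in the listing above.
    for arg, nxt in zip(parts, parts[1:]):
        if arg.startswith("/proc/") and arg.endswith("/exe") and nxt.isdigit():
            return int(nxt)
    return None


def split_proc_cmdline(raw):
    """Split the raw bytes of /proc/<pid>/cmdline into a list of arguments."""
    return [a.decode(errors="replace") for a in raw.split(b"\0") if a]
```

On a live node, the raw bytes would come from opening /proc/<pid>/cmdline in binary mode for each PID returned by gdb_target_pid.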


Changes
12.2 Grid Infrastructure installed/upgraded.
Cause
Starting with version 12.2.0.1, the Cluster Health Monitor (CHM) framework by default continuously executes the script "<GRID HOME>/bin/diagsnap.pl". Under certain conditions, this script executes the "pstack" command against critical clusterware processes.
The output of "pstack" can be useful for diagnosing clusterware issues, but the "pstack" command execution and locking can cause these key clusterware processes (especially ocssd.bin) to hang, which can trigger clusterware crashes.

Solution
The fix for bug 25717212, which is included in the upcoming 12.2.0.2 patchset, makes diagsnap.pl skip the pstack collection.
The current workaround is to disable the diagsnap collection.
# <GRID_HOME>/bin/oclumon manage -disable diagsnap
Exadata customers, please follow the instructions given in the Post-upgrade Steps section of the 12.2 Grid Infrastructure upgrade guide Doc ID 2111010.1 and disable the diagsnap collection for new 12.2 deployments or upgrades.



References
BUG:25810099 - CRS SHUTDOWN OF ONE NODE TRIGGERS NODE REBOOT OF THE OTHER NODE

BUG:25947224 - EXADATA X6-2 CLUSTERS: IB SWITCH RESTART CAUSES EXADATA COMPUTE NODE REBOOT


Reply #1
Original poster | Posted on 2017-6-23 10:55:56
So once again we have to wait for the 12.2.0.2 PSU.
Reply #2
Original poster | Posted on 2017-6-23 12:11:09
Result of running the command:

[grid@rac1 ~]$ oclumon manage -disable diagsnap
Diagsnap option is successfully disabled on rac1
Diagsnap option is successfully disabled on rac2
Successfully Disabled diagsnap
Reply #3
Original poster | Posted on 2017-6-23 12:33:54
The corresponding bug numbers are:

24900613
25785073
25810099