重庆思庄Oracle、Redhat认证学习论坛

 找回密码
 注册

QQ登录

只需一步,快速开始

搜索
查看: 1117|回复: 0
打印 上一主题 下一主题

[Oracle] 11gR2 GI Node May not Join the Cluster After Private Network is Functional

[复制链接]
跳转到指定楼层
楼主
发表于 2021-9-7 15:20:31 | 只看该作者 回帖奖励 |倒序浏览 |阅读模式
11gR2 GI Node May not Join the Cluster After Private Network is Functional After Eviction due to Private Network Problem (文档 ID 1479380.1)


In this Document

Symptoms

Cause

Solution

        #1: Due to bug 13653178 - Node cannot join the cluster after reboot or "interconnect restored"

        #2: Due to bug 13899736 - Node cannot join the cluster after reboot or "interconnect restored"

References

APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.2 to 11.2.0.3 [Release 11.2]

Oracle Database - Enterprise Edition - Version 11.2.0.4 to 11.2.0.4 [Release 11.2]

Oracle Database - Enterprise Edition - Version 12.1.0.2 to 12.1.0.2 [Release 12.1]

Oracle Database Cloud Schema Service - Version N/A and later

Oracle Database Exadata Cloud Machine - Version N/A and later

Information in this document applies to any platform.

SYMPTOMS

11gR2 Grid Infrastructure, cssd fails to join the cluster (GI fails to start as a result) after recovered from private network failure caused by pulling cluster interconnect cables etc


OS /var/log/messages - node1

Jul 16 16:12:54 <0.6> racnode1 kernel: e1000e: xxxx NIC Link is Down

Jul 16 16:12:55 <0.6> racnode1 kernel: bonding: bond0: link status definitely down for interface xxxx, disabling it

Jul 16 16:12:55 <0.6> racnode1 kernel: bonding: bond0: now running without any active interface ! ##>> private network failed

..

Jul 16 16:15:00 <0.6> racnode1 kernel: bnx2 0000:02:00.1: xx: NIC Copper Link is Up, 1000 Mbps full duplex ##>> private network recovered

Jul 16 16:15:01 <0.6> racnode1 kernel: bonding: bond0: link status definitely up for interface xx.

Jul 16 16:15:01 <0.6> racnode1 kernel: bonding: bond0: making interface xx the new active one.

Jul 16 16:15:01 <0.6> racnode1 kernel: bonding: bond0: first active interface up!

Jul 16 16:15:15 <0.6> racnode1 kernel: e1000e: xx NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

Jul 16 16:15:16 <0.6> racnode1 kernel: bonding: bond0: link status definitely up for interface xx.

..



<GI_HOME>/log/<node>/alert<node>.log - node1

2012-07-16 16:13:09.846

[cssd(11589)]CRS-1612:Network communication with node <hostname> (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.190 seconds

2012-07-16 16:13:16.860

[cssd(11589)]CRS-1611:Network communication with node <hostname> (2) missing for 75% of timeout interval. Removal of this node from cluster in 7.180 seconds

2012-07-16 16:13:21.870

[cssd(11589)]CRS-1610:Network communication with node <hostname> (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.170 seconds

2012-07-16 16:13:24.038

[cssd(11589)]CRS-1607:Node <hostname> is being evicted in cluster incarnation 237506449; details at (:CSSNM00007:) in <GI_HOME>/log/<hostname>/cssd/ocssd.log.

2012-07-16 16:13:25.020

[ohasd(11154)]CRS-8011:reboot advisory message from host: <hostname>, component: cssmonit, with time stamp: L-2012-07-16-16:13:25.020

[ohasd(11154)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS

2012-07-16 16:13:25.020

[ohasd(11154)]CRS-8011:reboot advisory message from host: <hostname>, component: cssagent, with time stamp: L-2012-07-16-16:13:25.020

[ohasd(11154)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS

2012-07-16 16:13:55.048

[cssd(11589)]CRS-1601:CSSD Reconfiguration complete. Active nodes are <hostname> .

2012-07-16 16:13:55.056

[crsd(13213)]CRS-5504:Node down event reported for node '<hostname>'.

2012-07-16 16:13:57.538

[crsd(13213)]CRS-2773:Server '<hostname>' has been removed from pool 'Generic'.

..



<GI_HOME>/log/<node>/alert<node>.log - node2

2012-07-16 16:13:10.052

[cssd(22171)]CRS-1612:Network communication with node racnode1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.180 seconds

2012-07-16 16:13:17.066

[cssd(22171)]CRS-1611:Network communication with node racnode1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 7.170 seconds

2012-07-16 16:13:22.076

[cssd(22171)]CRS-1610:Network communication with node racnode1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.160 seconds

2012-07-16 16:13:24.243

[cssd(22171)]CRS-1608:This node was evicted by node 1, racnode1; details at (:CSSNM00005:) in /ocw/grid/log/<hostname>/cssd/ocssd.log.

2012-07-16 16:13:24.243

[cssd(22171)]CRS-1608:This node was evicted by node 1, racnode1; details at (:CSSNM00005:) in /ocw/grid/log/<hostname>/cssd/ocssd.log.

2012-07-16 16:13:24.243

[cssd(22171)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /ocw/grid/log/<hostname>/cssd/ocssd.log

2012-07-16 16:13:24.243

[cssd(22171)]CRS-1652:Starting clean up of CRSD resources.

2012-07-16 16:13:24.243

[cssd(22171)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /ocw/grid/log/<hostname>/cssd/ocssd.log.

..

2012-07-16 16:13:24.962

[ohasd(17683)]CRS-2765:Resource 'ora.evmd' has failed on server 'racnode2'.

2012-07-16 16:13:24.969

[ohasd(17683)]CRS-2765:Resource 'ora.ctssd' has failed on server 'racnode2'. ##>> node being evicted after private network down

2012-07-16 16:19:33.962

[ohasd(17652)]CRS-2112:The OLR service started on node racnode2. ##>> GI auto starting upon node restart

2012-07-16 16:19:34.069

[ohasd(17652)]CRS-1301:Oracle High Availability Service started on node racnode2.

2012-07-16 16:19:34.146

[ohasd(17652)]CRS-8011:reboot advisory message from host: racnode2, component: cssmonit, with time stamp: L-2012-07-16-16:13:25.020

[ohasd(17652)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS

2012-07-16 16:19:34.194

[ohasd(17652)]CRS-8011:reboot advisory message from host: racnode2, component: cssagent, with time stamp: L-2012-07-16-16:13:25.020

[ohasd(17652)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS

2012-07-16 16:19:34.196

[ohasd(17652)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 2 were announced and 0 errors occurred

2012-07-16 16:19:36.442

[/ocw/grid/bin/oraagent.bin(21497)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/ocw/grid/log/racnode2/agent/ohasd/oraagent_oracle/oraagent_oracle.log"

2012-07-16 16:19:41.385

[ohasd(17652)]CRS-2302:Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).

2012-07-16 16:19:41.406

[gpnpd(21770)]CRS-2328:GPNPD started on node racnode2.

2012-07-16 16:19:43.738

[cssd(22106)]CRS-1713:CSSD daemon is started in clustered mode

2012-07-16 16:19:45.x

[ohasd(17652)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE

2012-07-16 16:20:04.732

[cssd(22106)]CRS-1707:Lease acquisition for node racnode2 number 2 completed

..

2012-07-16 16:20:06.041

[cssd(22106)]CRS-1605:CSSD voting file is online: /dev/iscsi/iqn.2006-01.com.oxxxxr:blnhp1102-vm31.asm-crs.disk1.fg1.box3; details in /ocw/grid/log/<hostname>/cssd/ocssd.log.

2012-07-16 16:20:06.078

[cssd(22106)]CRS-1605:CSSD voting file is online: /dev/mpath/360060160d8402xxxx17ecbe111; details in /ocw/grid/log/<hostname>/cssd/ocssd.log.

2012-07-16 16:20:06.132

[cssd(22106)]CRS-1605:CSSD voting file is online: /dev/mpath/360060160exxxxc1fa21d7ecbe111; details in /ocw/grid/log/<hostname>/cssd/ocssd.log.

2012-07-16 16:29:42.783

[/ocw/grid/bin/cssdagent(22084)]CRS-5818:Aborted command 'start' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:0:2} in /ocw/grid/log/<hostname>/agent/ohasd/oracssdagent_root/oracssdagent_root.log. ##>> cssd failed after 10 minutes as it can't join the cluster

2012-07-16 16:29:42.783

[cssd(22106)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /ocw/grid/log/<hostname>/cssd/ocssd.log

2012-07-16 16:29:42.783

[cssd(22106)]CRS-1603:CSSD on node racnode2 shutdown by user.

2012-07-16 16:29:48.176

[ohasd(17652)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'racnode2'.

2012-07-16 16:29:49.593

[cssd(12197)]CRS-1713:CSSD daemon is started in clustered mode ##>> cssd auto restarting after first failure

2012-07-16 16:29:51.370

[ohasd(17652)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE

2012-07-16 16:30:12.581

[cssd(12197)]CRS-1707:Lease acquisition for node racnode2 number 2 completed

2012-07-16 16:30:13.901

[cssd(12197)]CRS-1605:CSSD voting file is online: /dev/iscsi/iqn.2006-01.com.oxxxer:blnhp1102-vm31.asm-crs.disk1.fg1.box3; details in /ocw/grid/log/<hostname>/cssd/ocssd.log.

2012-07-16 16:30:13.922

[cssd(12197)]CRS-1605:CSSD voting file is online: /dev/mpath/360060160d84xxxxx6217ecbe111; details in /ocw/grid/log/<hostname>/cssd/ocssd.log.

2012-07-16 16:30:13.926

[cssd(12197)]CRS-1605:CSSD voting file is online: /dev/mpath/360060160ee40280xxxxx21d7ecbe111; details in /ocw/grid/log/<hostname>/cssd/ocssd.log.



<GI_HOME>/log/<node>/cssd/ocssd.log - node1

2012-07-16 16:12:56.236: [GIPCHGEN][1091303744] gipchaInterfaceFail: marking interface failing 0xc058600 { host '<hostname>', haName 'CSS_racnode1_X45', local 0xbd4bd50, ip '192.xx.xx.10:21857', subnet '192.xx.xx.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x6 }

2012-07-16 16:12:56.238: [GIPCHGEN][1091303744] gipchaInterfaceDisable: disabling interface 0xbd4bd50 { host '', haName 'CSS_racnode1_X45', local (nil), ip '192.xx.xx.5:35675', subnet '192.xx.xx.0', mask '255.255.255.0', mac '00-26-55-52-75-32', ifname 'bond0', numRef 0, numFail 2, idxBoot 0, flags 0x19cd }

2012-07-16 16:12:56.238: [GIPCHGEN][1091303744] gipchaInterfaceDisable: disabling interface 0x2aaab4081d60 { host '<hostname>', haName 'CSS_racnode1_X45', local 0xbd4bd50, ip '192.xx.xx.5:35675', subnet '192.xx.xx.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 3, flags 0x18c6 }

..

2012-07-16 16:12:56.238: [GIPCHALO][1091303744] gipchaLowerCleanInterfaces: performing cleanup of disabled interface 0xc058600 { host '<hostname>', haName 'CSS_racnode1_X45', local 0xbd4bd50, ip '192.xx.xx.10:21857', subnet '192.xx.xx.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0xa6 }

2012-07-16 16:12:56.918: [GIPCHGEN][1091303744] gipchaInterfaceDisable: disabling interface 0x2aaab4081d60 { host '<hostname>', haName 'CSS_racnode1_X45', local 0xbd4bd50, ip '192.xx.xx.5:35675', subnet '192.xx.xx.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 3, flags 0x18e6 }

2012-07-16 16:12:56.918: [GIPCHGEN][1091303744] gipchaNodeAddInterface: adding interface information for inf 0xbd54c10 { host '<hostname>', haName 'CSS_racnode1_X45', local (nil), ip '192.xx.xx.5:35675', subnet '192.xx.xx.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x1842 }

2012-07-16 16:12:56.918: [GIPCHTHR][1091303744] gipchaWorkerCreateInterface: created remote bootstrap multicast interface for node '<hostname>', haName 'CSS_racnode1_X45', inf 'mcast://224.0.0.251:42424/192.xx.xx.5:35675'

2012-07-16 16:12:56.918: [GIPCHTHR][1091303744] gipchaWorkerCreateInterface: created remote bootstrap multicast interface for node '<hostname>', haName 'CSS_racnode1_X45', inf 'mcast://230.0.1.0:42424/192.xx.xx.5:35675'

2012-07-16 16:12:56.918: [GIPCHTHR][1091303744] gipchaWorkerCreateInterface: created remote bootstrap broadcast interface for node '<hostname>', haName 'CSS_racnode1_X45', inf 'udp://192.xx.xx.255:35675'

..

2012-07-16 16:13:21.870: [ CSSD][1117145408]clssnmPollingThread: node <hostname> (2) at 90% heartbeat fatal, removal in 2.170 seconds, seedhbimpd 1

2012-07-16 16:13:21.870: [ CSSD][1109260608]clssnmvDHBValidateNCopy: node 2, <hostname>, has a disk HB, but no network HB, DHB has rcfg 237506449, wrtcnt, 1203540, LATS 416476174, lastSeqNo 1203534, uniqueness 1342447749, timestamp 1342448001/4294500870

2012-07-16 16:13:22.294: [GIPCHGEN][1091303744] gipchaInterfaceDisable: disabling interface 0x2aaab4081d60 { host '<hostname>', haName 'CSS_racnode1_X45', local 0xbd4bd50, ip '192.xx.xx.5:35675', subnet '192.xx.xx.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 3, flags 0x18e6 }

2012-07-16 16:13:22.873: [ CSSD][1097062720]clssnmvDHBValidateNCopy: node 2, <hostname>, has a disk HB, but no network HB, DHB has rcfg 237506449, wrtcnt, 1203545, LATS 416477184, lastSeqNo 1202791, uniqueness 1342447749, timestamp 1342448002/4294501710

2012-07-16 16:13:22.873: [ CSSD][1109260608]clssnmvDHBValidateNCopy: node 2, <hostname>, has a disk HB, but no network HB, DHB has rcfg 237506449, wrtcnt, 1203546, LATS 416477184, lastSeqNo 1203540, uniqueness 1342447749, timestamp 1342448002/4294501870

..

2012-07-16 16:13:24.038: [ CSSD][1120299328](:CSSNM00007:)clssnmrRemoveNode: Evicting node 2, <hostname>, from the cluster in incarnation 237506449, node birth incarnation 237506448, death incarnation 237506449, stateflags 0x224000 uniqueness value 1342447749

..

2012-07-16 16:13:24.038: [ CSSD][1097062720]clssnmvDiskEvict: Kill block write, file /dev/mpath/360060160ee4028006c1fa21d7ecbe111 flags 0x00010004, kill block unique 1342447749, stamp 416478344/416478344

..

2012-07-16 16:13:24.345: [ CSSD][1120299328]clssnmWaitOnEvictions: node 2, undead 1, EXADATA fence handle 0 kill reqest id 0, last DHB (1342448003, 4294503090, 234), seedhbimpd TRUE

2012-07-16 16:13:24.400: [ CSSD][1099016512]clssnmvDiskKillCheck: not evicted, file /dev/iscsi/iqn.2006-01.com.openfiler:blnhp1102-vm31.asm-crs.disk1.fg1.box3 flags 0x00000000, kill block unique 0, my unique 1342439461

..

2012-07-16 16:13:48.019: [GIPCHGEN][1091303744] gipchaInterfaceDisable: disabling interface 0x2aaab4081d60 { host '<hostname>', haName 'CSS_racnode1_X45', local 0xbd4bd50, ip '192.xx.xx.5:35675', subnet '192.xx.xx.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 3, flags 0x18e6 }

2012-07-16 16:13:48.022: [ CSSD][1120299328]clssnmWaitOnEvictions: node 2, undead 1, EXADATA fence handle 0 kill reqest id 0, last DHB (1342448004, 4294503340, 235), seedhbimpd TRUE

2012-07-16 16:13:48.022: [ CSSD][1120299328]clssnmWaitOnEviction: node(2) exceeded graceful shutdown period, IPMI-kill allowed if configured

2012-07-16 16:13:48.022: [ CSSD][1120299328]clssnmWaitOnEviction: Node kill could not beperformed. Admin or connection validation failed

2012-07-16 16:13:48.050: [GIPCHGEN][1091303744] gipchaInterfaceDisable: disabling interface 0x2aaab4081d60 { host '<hostname>', haName 'CSS_racnode1_X45', local 0xbd4bd50, ip '192.xx.xx.5:35675', subnet '192.xx.xx.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 3, flags 0x18e6 }

2012-07-16 16:13:48.085: [ CSSD][1120299328]clssnmWaitOnEvictions: node 2, undead 1, EXADATA fence handle 0 kill reqest id 0, last DHB (1342448004, 4294503340, 235), seedhbimpd TRUE

..

2012-07-16 16:13:55.041: [ CSSD][1120299328]clssnmWaitOnEvictions: node 2, undead 1, EXADATA fence handle 0 kill reqest id 0, last DHB (1342448004, 4294503340, 235), seedhbimpd TRUE

2012-07-16 16:13:55.041: [ CSSD][1120299328]clssnmCheckKillStatus: Node 2, <hostname>, down, LATS(416479344),timeout(30000)

..

##>> node2 evicted as expected after network failure

2012-07-16 16:22:51.288: [GIPCHALO][1091303744] gipchaLowerProcessNode: no valid interfaces found to node for 595960 ms, node 0x2aaaac7da520 { host '<hostname>', haName 'CSS_racnode1_X45', srcLuid bc1c4e13-f18e423e, dstLuid 18dd3893-1159b293 numInf 1, contigSeq 678, lastAck 648, lastValidAck 678, sendSeq [1018 : 1018], createTime 416247444, sentRegister 1, localMonitor 1, flags 0x2408 }

..

2012-07-16 16:22:55.290: [GIPCHALO][1091303744] gipchaLowerProcessNode: bootstrap node considered dead because of idle connection time 600960 ms, node 0x2aaaac7da520 { host '<hostname>', haName 'CSS_racnode1_X45', srcLuid bc1c4e13-f18e423e, dstLuid 18dd3893-1159b293 numInf 1, contigSeq 678, lastAck 648, lastValidAck 678, sendSeq [1020 : 1020], createTime 416247444, sentRegister 1, localMonitor 1, flags 0x2408 }

##>> even network is restored more than 5 minutes earlier, gipc layer didn't recover it properly



<GI_HOME>/log/<node>/gipcd/gipcd.log - node1

2012-07-16 16:12:41.233: [ CLSINET][1106753856] Returning NETDATA: 1 interfaces

2012-07-16 16:12:41.233: [ CLSINET][1106753856] # 0 Interface 'bond0',ip='192.xx.xx.5',mac='00-26-55-52-75-32',mask='255.255.255.0',net='192.xx.xx.0',use='cluster_interconnect'

2012-07-16 16:12:41.243: [GIPCDMON][1106753856] gipcdMonitorSaveInfMetrics: inf[ 0] bond0 - rank 99, avgms 0.561798 [ 351 / 356 / 356 ]

..

2012-07-16 16:12:56.233: [ CLSINET][1106753856] Returning NETDATA: 0 interfaces

2012-07-16 16:12:56.233: [GIPCDMON][1106753856] gipcdMonitorUpdate: interface went down - [ ip 192.xx.xx.5, subnet 192.xx.xx.0, mask 255.255.255.0, mac 00-26-55-52-75-32, ifname bond0 ]

2012-07-16 16:12:56.233: [GIPCDMON][1106753856] gipcdMonitorUpdate: msg sent to client thread (([update(ip: 192.xx.xx.5, mask: 255.255.255.0, subnet: 192.xx.xx.0, mac: 00-26-55-52-75-32, ifname: bond0), state(gipcdadapterstateDown)]))

2012-07-16 16:12:56.233: [GIPCDNDE][1106753856] gipcdEnqueueMsgForClient: Enqueuing a msg for client : req:[hostname(), id (0000000000000000, 0000000000000000), len(632), req cookie(0), type(gipcdReqTypeMonitorUpdate)]

2012-07-16 16:12:56.233: [GIPCDCLT][1086585152] gipcdProcessNodeRequest: Dequeued req from host , type: gipcdReqTypeMonitorUpdate, id: 0000000000000000, 0000000000000000

2012-07-16 16:12:56.233: [GIPCDCLT][1086585152] gipcdRawInterfaceUpdates: ([update(ip: 192.xx.xx.5, mask: 255.255.255.0, subnet: 192.xx.xx.0, mac: 00-26-55-52-75-32, ifname: bond0), state(gipcdadapterstateDown)])

2012-07-16 16:12:56.233: [GIPCHGEN][1108855104] gipchaInterfaceFail: marking interface failing 0x2aaaac3210f0 { host '', haName 'gipcd_ha_name', local (nil), ip '192.xx.xx.5:25216', subnet '192.xx.xx.0', mask '255.255.255.0', mac '00-26-55-52-75-32', ifname 'bond0', numRef 1, numFail 0, idxBoot 0, flags 0x184d }

2012-07-16 16:12:56.233: [GIPCDCLT][1086585152] gipcdClientThread: Sent req to local client on endp 0000000000000366:(type(gipcdmsgtypeInterfaceUpdate), endp(0000000000000366), len(1032), buf(0x2aaab001e270):[hostname(racnode1), req cookie(0), haname(),update(ip: 192.xx.xx.5, mask: 255.255.255.0, subnet: 192.xx.xx.0, mac: 00-26-55-52-75-32, ifname: bond0), state(gipcdadapterstateDown), retStatus(gipcretSuccess)])

..

2012-07-16 16:13:43.204: [GIPCHALO][1088686400] gipchaLowerProcessNode: no valid interfaces found to node for 47210 ms, node 0x2aaaac2a3e60 { host 'racnode2', haName 'gipcd_ha_name', srcLuid c2d0fd44-d5ea5339, dstLuid a23f2c9c-f4f6d438 numInf 0, contigSeq 23, lastAck 24, lastValidAck 23, sendSeq [25 : 28], createTime 416251644, sentRegister 1, localMonitor 1, flags 0x2008 }

2012-07-16 16:13:46.248: [ CLSINET][1106753856] Returning NETDATA: 0 interfaces

..

2012-07-16 16:15:01.279: [ CLSINET][1106753856] Returning NETDATA: 1 interfaces

2012-07-16 16:15:01.279: [ CLSINET][1106753856] # 0 Interface 'bond0',ip='192.xx.xx.5',mac='00-26-55-52-75-32',mask='255.255.255.0',net='192.xx.xx.0',use='cluster_interconnect'

2012-07-16 16:15:01.279: [GIPCDMON][1106753856] gipcdMonitorUpdate: New Interface found - [ ip 192.xx.xx.5, subnet 192.xx.xx.0, mask 255.255.255.0, mac 00-26-55-52-75-32, ifname bond0]

2012-07-16 16:15:01.279: [GIPCDMON][1106753856] gipcdMonitorUpdate: msg sent to client thread (([update(ip: 192.xx.xx.5, mask: 255.255.255.0, subnet: 192.xx.xx.0, mac: 00-26-55-52-75-32, ifname: bond0), state(gipcdadapterstateUp)]))

2012-07-16 16:15:01.279: [GIPCDNDE][1106753856] gipcdEnqueueMsgForClient: Enqueuing a msg for client : req:[hostname(), id (0000000000000000, 0000000000000000), len(632), req cookie(0), type(gipcdReqTypeMonitorUpdate)]

2012-07-16 16:15:01.279: [GIPCDCLT][1086585152] gipcdProcessNodeRequest: Dequeued req from host , type: gipcdReqTypeMonitorUpdate, id: 0000000000000000, 0000000000000000

2012-07-16 16:15:01.279: [GIPCDCLT][1086585152] gipcdRawInterfaceUpdates: ([update(ip: 192.xx.xx.5, mask: 255.255.255.0, subnet: 192.xx.xx.0, mac: 00-26-55-52-75-32, ifname: bond0), state(gipcdadapterstateUp)])

2012-07-16 16:15:01.280: [GIPCHGEN][1108855104] gipchaNodeAddInterface: adding interface information for inf 0x2aaaac220350 { host '', haName 'gipcd_ha_name', local (nil), ip '192.xx.xx.5', subnet '192.xx.xx.0', mask '255.255.255.0', mac '00-26-55-52-75-32', ifname 'bond0', numRef 0, numFail 0, idxBoot 0, flags 0x1841 }

2012-07-16 16:15:01.280: [GIPCDCLT][1086585152] gipcdClientThread: Sent req to local client on endp 0000000000000366:(type(gipcdmsgtypeInterfaceUpdate), endp(0000000000000366), len(1032), buf(0x2aaab0000e60):[hostname(racnode1), req cookie(0), haname(),update(ip: 192.xx.xx.5, mask: 255.255.255.0, subnet: 192.xx.xx.0, mac: 00-26-55-52-75-32, ifname: bond0), state(gipcdadapterstateUp), retStatus(gipcretSuccess)])

..

2012-07-16 16:15:01.290: [GIPCDCLT][1086585152] gipcdClientThread: req from local client of type gipcdmsgtypeInterfacePublish, endp 0000000000000902

2012-07-16 16:15:01.290: [GIPCDCLT][1086585152] gipcdClientInterfacePublish: Received type(gipcdmsgtypeInterfacePublish), endp(0000000000000902), len(1032), buf(0x2aaab01efd28):[hostname(racnode1), req cookie(2891985472), haname(a523-7ff1-62c0-1781),update(ip: 192.xx.xx.5:21796, mask: 255.255.255.0, subnet: 192.xx.xx.0, mac: , ifname: ), state(gipcdadapterstateDown), retStatus(gipcretSuccess)]

2012-07-16 16:15:01.290: [GIPCDCLT][1086585152] gipcdClientInterfacePublish: Interface added ip: 192.xx.xx.5:21796, mask: 255.255.255.0, subnet: 192.xx.xx.0, mac: , ifname:

2012-07-16 16:15:01.290: [GIPCDCLT][1086585152] gipcdClientThread: Sent req to local client on endp 0000000000000902:(type(gipcdmsgtypeInterfacePublishAck), endp(0000000000000902), len(1032), buf(0x2aaab01eb690):[hostname(), req cookie(2891985472), haname(),update(), state(gipcdadapterstateDown), retStatus(gipcretSuccess)])

..

2012-07-16 16:15:02.262: [GIPCHTHR][1088686400] gipchaWorkerCreateInterface: created local interface for node 'racnode1', haName 'gipcd_ha_name', inf 'udp://192.xx.xx.5:32823'

2012-07-16 16:15:02.262: [GIPCHTHR][1088686400] gipchaWorkerCreateInterface: created local bootstrap multicast interface for node 'racnode1', haName 'gipcd_ha_name', inf 'mcast://224.0.0.251:42424/192.xx.xx.5'

2012-07-16 16:15:02.263: [GIPCHTHR][1088686400] gipchaWorkerCreateInterface: created local bootstrap multicast interface for node 'racnode1', haName 'gipcd_ha_name', inf 'mcast://230.0.1.0:42424/192.xx.xx.5'

2012-07-16 16:15:02.263: [GIPCHTHR][1088686400] gipchaWorkerCreateInterface: created local bootstrap broadcast interface for node 'racnode1', haName 'gipcd_ha_name', inf 'udp://192.xx.xx.255:42424'

2012-07-16 16:15:02.263: [GIPCHGEN][1088686400] gipchaNodeAddInterface: adding interface information for inf 0x2aaaac3210f0 { host 'racnode2', haName 'gipcd_ha_name', local (nil), ip '192.xx.xx.5:32823', subnet '192.xx.xx.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x1842 }

2012-07-16 16:15:02.263: [GIPCHTHR][1088686400] gipchaWorkerCreateInterface: created remote bootstrap multicast interface for node 'racnode2', haName 'gipcd_ha_name', inf 'mcast://224.0.0.251:42424/192.xx.xx.5:32823'

2012-07-16 16:15:02.263: [GIPCHTHR][1088686400] gipchaWorkerCreateInterface: created remote bootstrap multicast interface for node 'racnode2', haName 'gipcd_ha_name', inf 'mcast://230.0.1.0:42424/192.xx.xx.5:32823'

2012-07-16 16:15:02.263: [GIPCHTHR][1088686400] gipchaWorkerCreateInterface: created remote bootstrap broadcast interface for node 'racnode2', haName 'gipcd_ha_name', inf 'udp://192.xx.xx.255:32823'

2012-07-16 16:15:02.263: [GIPCHGEN][1088686400] gipchaWorkerAttachInterface: Interface attached inf 0x2aaaac3210f0 { host 'racnode2', haName 'gipcd_ha_name', local 0x2aaaac220350, ip '192.xx.xx.5:32823', subnet '192.xx.xx.0', mask '255.255.255.0', mac '', ifname '', numRef 0, numFail 0, idxBoot 0, flags 0x1846 }

..

2012-07-16 16:15:06.284: [ CLSINET][1106753856] Returning NETDATA: 1 interfaces

2012-07-16 16:15:06.284: [ CLSINET][1106753856] # 0 Interface 'bond0',ip='192.xx.xx.5',mac='00-26-55-52-75-32',mask='255.255.255.0',net='192.xx.xx.0',use='cluster_interconnect'

..

2012-07-16 16:15:11.267: [GIPCDMON][1106753856] gipcdMonitorSaveInfMetrics: inf[ 0] bond0 - rank 0, avgms 30000000000.000000 [ 4 / 0 / 0 ] ##>> rank stayed 0 after network is restored

..

2012-07-16 16:23:11.363: [GIPCDMON][1106753856] gipcdMonitorSaveInfMetrics: inf[ 0] bond0 - rank 0, avgms 30000000000.000000 [ 19 / 0 / 0 ]

2012-07-16 16:23:11.463: [ CLSINET][1106753856] Returning NETDATA: 1 interfaces

2012-07-16 16:23:11.463: [ CLSINET][1106753856] # 0 Interface 'bond0',ip='192.xx.xx.5',mac='00-26-55-52-75-32',mask='255.255.255.0',net='192.xx.xx.0',use='cluster_interconnect'

..

2012-07-16 16:23:41.378: [GIPCDMON][1106753856] gipcdMonitorSaveInfMetrics: inf[ 0] bond0 - rank -1, avgms 30000000000.000000 [ 0 / 0 / 0 ] ##>> rank changed to "-1" 10 minutes after the failure although it was restored a few minutes earlier

2012-07-16 16:23:41.378: [GIPCDMON][1106753856] gipcdMonitorSaveInfMetrics: saving:

2012-07-16 16:23:41.474: [ CLSINET][1106753856] Returning NETDATA: 1 interfaces

2012-07-16 16:23:41.474: [ CLSINET][1106753856] # 0 Interface 'bond0',ip='192.xx.xx.5',mac='00-26-55-52-75-32',mask='255.255.255.0',net='192.xx.xx.0',use='cluster_interconnect'



<GI_HOME>/log/<node>/cssd/ocssd.log - node2

2012-07-16 16:13:10.052: [ CSSD][1111189824]clssnmPollingThread: node racnode1 (1) at 50% heartbeat fatal, removal in 14.180 seconds

2012-07-16 16:13:10.052: [ CSSD][1111189824]clssnmPollingThread: node racnode1 (1) is impending reconfig, flag 2491406, misstime 15820

2012-07-16 16:13:10.052: [ CSSD][1111189824]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)

2012-07-16 16:13:10.052: [ CSSD][1090488640]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 237506449, wrtcnt, 1214131, LATS 4294489160, lastSeqNo 0, uniqueness 1342439461, timestamp 1342447989/416463944

..

2012-07-16 16:13:24.243: [ CSSD][1087187264](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node racnode1, number 1, sync 237506449, stamp 416478344 ##>> node got evicted after network failure

2012-07-16 16:13:24.243: [ CSSD][1115920704]clssnmQueueClientEvent: Node[1] state = 5, birth = 237506444, unique = 1342439461

2012-07-16 16:13:24.243: [ CSSD][1087187264]###################################

2012-07-16 16:13:24.243: [ CSSD][1115920704]clssnmQueueClientEvent: Node[2] state = 3, birth = 237506448, unique = 1342447749

2012-07-16 16:13:24.243: [ CSSD][1102309696](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node racnode1, number 1, sync 237506449, stamp 416478344

2012-07-16 16:13:24.243: [ CSSD][1087187264]clssscExit: CSSD aborting from thread clssnmvKillBlockThread

..

2012-07-16 16:19:43.738: [ CSSD][1806091136]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1342448383 ##>> cssd auto start upon OS reboot

..

2012-07-16 16:19:48.708: [ CSSD][1077565760]clssnmCompleteVFDiscovery: Committed configuration for CIN 0:1342040761:0

2012-07-16 16:19:48.708: [ CSSD][1077565760] misscount 30 reboot latency 3

2012-07-16 16:19:48.708: [ CSSD][1077565760] long I/O timeout 200 short I/O timeout 27

2012-07-16 16:19:48.708: [ CSSD][1077565760] diagnostic wait 0 active version 11.2.0.3.0

2012-07-16 16:19:48.708: [ CSSD][1077565760] Listing unique IDs for 3 voting files:

2012-07-16 16:19:48.708: [ CSSD][1077565760] voting file 1: 476c9740-71d74f55-bfc6991c-cae65a5b

2012-07-16 16:19:48.708: [ CSSD][1077565760] voting file 2: 0716c636-607e4f6a-bfcf0d1b-8ef14814

2012-07-16 16:19:48.708: [ CSSD][1077565760] voting file 3: 9fc2c40b-41674fdf-bf3295f9-ef04d8d9

2012-07-16 16:19:48.708: [ CSSD][1806091136]clssnmOpenGIPCEndp: opening cluster listener on gipc://racnode2:nm_racnode1_X45

..

2012-07-16 16:19:49.217: [ CLSINET][1092094272] Returning NETDATA: 1 interfaces

2012-07-16 16:19:49.217: [ CLSINET][1092094272] # 0 Interface 'bond0',ip='192.xx.xx.10',mac='18-a9-05-4c-6f-1a',mask='255.255.255.0',net='192.xx.xx.0',use='cluster_interconnect'

2012-07-16 16:19:49.217: [GIPCHGEN][1092094272] gipchaNodeAddInterface: adding interface information for inf 0x1d6e58f0 { host '', haName 'CSS_racnode1_X45', local (nil), ip '192.xx.xx.10', subnet '192.xx.xx.0', mask '255.255.255.0', mac '18-a9-05-4c-6f-1a', ifname 'bond0', numRef 0, numFail 0, idxBoot 0, flags 0x1841 }

2012-07-16 16:19:49.217: [GIPCHTHR][1090517312] gipchaWorkerCreateInterface: created local interface for node 'racnode2', haName 'CSS_racnode1_X45', inf 'udp://192.xx.xx.10:37726'

2012-07-16 16:19:49.217: [GIPCHTHR][1090517312] gipchaWorkerCreateInterface: created local bootstrap multicast interface for node 'racnode2', haName 'CSS_racnode1_X45', inf 'mcast://224.0.0.251:42424/192.xx.xx.10'

2012-07-16 16:19:49.218: [GIPCHTHR][1090517312] gipchaWorkerCreateInterface: created local bootstrap multicast interface for node 'racnode2', haName 'CSS_racnode1_X45', inf 'mcast://230.0.1.0:42424/192.xx.xx.10'

2012-07-16 16:19:49.218: [GIPCHTHR][1090517312] gipchaWorkerCreateInterface: created local bootstrap broadcast interface for node 'racnode2', haName 'CSS_racnode1_X45', inf 'udp://192.xx.xx.255:42424'

2012-07-16 16:19:49.218: [ CSSD][1806091136]clssnmOpenGIPCEndp: listening on gipcha://racnode2:nm2_racnode1_X45

..

2012-07-16 16:29:37.835: [ CSSD][1108764992]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 237506450, wrtcnt, 1217177, LATS 4294841470, lastSeqNo 1217167, uniqueness 1342439461, timestamp 1342448976/417450424

2012-07-16 16:29:38.039: [ CSSD][1103366464]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 237506450, wrtcnt, 1217178, LATS 4294841680, lastSeqNo 1217176, uniqueness 1342439461, timestamp 1342448977/417451294

2012-07-16 16:29:38.194: [ CSSD][1075505472]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 237506450, wrtcnt, 1217179, LATS 4294841830, lastSeqNo 1217169, uniqueness 1342439461, timestamp 1342448977/417451314 ##>> message repeated for about 10 minutes

..

2012-07-16 16:29:41.806: [ CSSD][1116649792]clssnmRcfgMgrThread: Local Join

2012-07-16 16:29:41.806: [ CSSD][1116649792]clssnmLocalJoinEvent: begin on node(2), waittime 193000

2012-07-16 16:29:41.806: [ CSSD][1116649792]clssnmLocalJoinEvent: set curtime (4294845440) for my node

2012-07-16 16:29:41.806: [ CSSD][1116649792]clssnmLocalJoinEvent: scanning 32 nodes

2012-07-16 16:29:41.806: [ CSSD][1116649792]clssnmLocalJoinEvent: Node racnode1, number 1, is in an existing cluster with disk state 3

2012-07-16 16:29:41.806: [ CSSD][1116649792]clssnmLocalJoinEvent: takeover aborted due to cluster member node found on disk ##>> message repeated for about 10 minutes

..

2012-07-16 16:29:42.783: [ CSSD][1085380928]clssgmExecuteClientRequest: MAINT recvd from proc 2 (0x1d4527f0)

2012-07-16 16:29:42.783: [ CSSD][1085380928]clssgmShutDown: Received abortive shutdown request from client.

2012-07-16 16:29:42.783: [ CSSD][1085380928]###################################

2012-07-16 16:29:42.783: [ CSSD][1085380928]clssscExit: CSSD aborting from thread GMClientListener

2012-07-16 16:29:42.783: [ CSSD][1085380928]###################################

2012-07-16 16:29:42.783: [ CSSD][1085380928](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally

..

2012-07-16 16:29:49.592: [ CSSD][2254918528]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1342448989



<GI_HOME>/log/<node>/gipcd/gipcd.log - node2

..

2012-07-16 16:19:53.621: [ CLSINET][1108175168] Returning NETDATA: 1 interfaces

2012-07-16 16:19:53.621: [ CLSINET][1108175168] # 0 Interface 'bond0',ip='192.xx.xx.10',mac='18-a9-05-4c-6f-1a',mask='255.255.255.0',net='192.xx.xx.0',use='cluster_interconnect'

2012-07-16 16:19:53.621: [GIPCDMON][1108175168] gipcdMonitorUpdate: New Interface found - [ ip 192.xx.xx.10, subnet 192.xx.xx.0, mask 255.255.255.0, mac 18-a9-05-4c-6f-1a, ifname bond0]

2012-07-16 16:19:53.621: [GIPCDMON][1108175168] gipcdMonitorUpdate: msg sent to client thread (([update(ip: 192.xx.xx.10, mask: 255.255.255.0, subnet: 192.xx.xx.0, mac: 18-a9-05-4c-6f-1a, ifname: bond0), state(gipcdadapterstateUp)]))

2012-07-16 16:19:53.621: [GIPCDNDE][1108175168] gipcdEnqueueMsgForClient: Enqueuing a msg for client : req:[hostname(), id (0000000000000000, 0000000000000000), len(632), req cookie(0), type(gipcdReqTypeMonitorUpdate)]

2012-07-16 16:19:53.621: [GIPCDMON][1108175168] gipcdMonitorSaveInfMetrics: inf[ 0] bond0 - rank -1, avgms 30000000000.000000 [ 0 / 0 / 0 ]

2012-07-16 16:19:53.621: [GIPCDMON][1108175168] gipcdMonitorSaveInfMetrics: saving:

..

2012-07-16 16:20:04.600: [GIPCDMON][1108175168] gipcdMonitorCssInit: Successfully connected to CSS

2012-07-16 16:20:08.593: [ CLSINET][1108175168] Returning NETDATA: 1 interfaces

2012-07-16 16:20:08.593: [ CLSINET][1108175168] # 0 Interface 'bond0',ip='192.xx.xx.10',mac='18-a9-05-4c-6f-1a',mask='255.255.255.0',net='192.xx.xx.0',use='cluster_interconnect'

..

2012-07-16 16:20:23.632: [GIPCDMON][1108175168] gipcdMonitorSaveInfMetrics: inf[ 0] bond0 - rank 0, avgms 30000000000.000000 [ 15 / 0 / 0 ]

2012-07-16 16:20:23.632: [GIPCDMON][1108175168] gipcdMonitorSaveInfMetrics: saving: bond0:0



OS messages could be different depending on platform     

hp-ux


Aug 22 19:48:12 mosr4d1 vmunix: APA/LM: FOG:lan900 is down

Aug 22 19:48:12 mosr4d1 cmnetd[3940]: lan900 is down at the data link layer.

Aug 22 19:48:12 mosr4d1 cmnetd[3940]: lan900 failed.

Aug 22 19:48:12 mosr4d1 cmnetd[3940]: Subnet 192.57.253.0 down




AIX


LABEL:  ECH_CHAN_RCVRY

..

Date/Time:       Wed Aug  8 10:57:05 2012

..

Description

ETHERCHANNEL RECOVERY

..

Detail Data

A primary adapter in the EtherChannel recovered: returning to main channel


---------------------------------------------------------------------------


LABEL:  GOENT_RCVRY_EXIT

..

Date/Time:       Wed Aug  8 10:57:05 2012

..

Description

ETHERNET NETWORK RECOVERY MODE



---------------------------------------------------------------------------

LABEL:  ECH_TOTAL_CHAN_FAIL

..

Date/Time:       Wed Aug  8 10:30:26 2012

..

Description

TOTAL ETHERCHANNEL FAILURE

..

Detail Data

Whole EtherChannel is down: backup adapter failed while operating on backup


---------------------------------------------------------------------------

LABEL:  GOENT_LINK_DOWN

..

Date/Time:       Wed Aug  8 10:30:26 2012

..

Description

ETHERNET DOWN


---------------------------------------------------------------------------

LABEL:  ECH_CHAN_FAIL

..

Date/Time:       Wed Aug  8 10:30:23 2012

..

Description

ETHERCHANNEL FAILOVER

..

Detail Data

All primary EtherChannel adapters failed: switching over to backup adapter






CAUSE

Refer to Solution section




SOLUTION

#1: Due to bug 13653178 - Node cannot join the cluster after reboot or "interconnect restored"

Duplicates bug 14281269 bug 14486138 bug 14534071 bug 13638540 bug 14403174 bug 15941470 bug 15922981


The bug is fixed in 11.2.0.3.5 PSU and 11.2.0.4. The bug fix has been superseded by Bug 16547309. It has been fixed in 11.2.0.3.8 PSU, 11.2.0.4, 12.1.0.1.3 PSU and 12.1.0.2.


It's recommended to apply latest patch set and patch set update. For more information, please check out Note 1564555.1 11.2.0.3 PSU5/PSU6/PSU7 or 12.1.0.1 CSSD Fails to Start if Multicast Fails on Private Network.




#2: Due to bug 13899736 - Node cannot join the cluster after reboot or "interconnect restored"

Duplicates bug 15881474 bug 15827981 bug 14791084 bug 14532170 bug 16391695 bug 16464829 bug 16484124 bug 16560299


The bug is fixed in 11.2.0.3 GI PSU6, 11.2.0.4, 12.1.0.2 onward, it's recommended to apply latest patch set and patch set update.




The workaround for both #1 and #2 is:


Step1. On surviving node, during non-peak time if possible, kill gipcd.bin process (kill -15 <gipcd.bin ospid>)


NOTE:  In 11.2 will also lead to death of evmd.bin, crsd.bin and ctssd.bin processes. None of these processes are fatal. Clusterware will respawn all them automatically.


Step2. Once gipcd.bin, evmd.bin, crsd.bin and ctssd.bin processes have been re-spawned on the surviving node, verify whether other nodes join cluster.


Step3. Most of the time, GI will start, but in case it does not, re-start GI on the other nodes with crsctl command.


Step4. Finally, if GI is still not starting, as a last resort, restart GI on the surviving node.




If GI is on 11.2.0.3PSU5 to 11.2.0.3PSU7 or 12.1.0.1.0 to 12.1.0.1 PSU2, please ensure multicast is working on the private network.








REFERENCES


NOTE:1564555.1 - 11.2.0.3 PSU5/PSU6/PSU7 or 12.1.0.1 CSSD Fails to Start if Multicast Fails on Private Network

BUG:16547309 - GIPC SHOWS RANK 0 OR -1 AFTER APPLIED PSU 11.2.0.3.5

NOTE:1352887.1 - 11gR2 Grid Infrastructure Node May not Join the Cluster After Evicted With Error sgipcnUdpSend "No buffer space available (74)"

NOTE:1374360.1 - HAIP Fails to Start as GIPC Has Rank 0 or -1 for Interconnect


分享到:  QQ好友和群QQ好友和群 QQ空间QQ空间 腾讯微博腾讯微博 腾讯朋友腾讯朋友
收藏收藏 支持支持 反对反对
回复

使用道具 举报

您需要登录后才可以回帖 登录 | 注册

本版积分规则

QQ|手机版|小黑屋|重庆思庄Oracle、Redhat认证学习论坛 ( 渝ICP备12004239号-4 )

GMT+8, 2024-6-17 22:55 , Processed in 0.100828 second(s), 21 queries .

重庆思庄学习中心论坛-重庆思庄科技有限公司论坛

© 2001-2020

快速回复 返回顶部 返回列表