Chongqing Sizhuang Oracle & RedHat Certification Learning Forum

[Installation] 11g RAC: root.sh fails on every non-first node

#1 · OP · Posted 2020-6-23 15:51:49
Adding Clusterware entries to oracle-ohasd.service
CRS-4402: The CSS daemon was started in exclusive mode but found an active CSS daemon on node rac-r-1, number 1, and is terminating
An active cluster was found during exclusive startup, restarting to join the cluster
Start of resource "ora.asm" failed
CRS-2672: Attempting to start 'ora.asm' on 'rac-r-2'
CRS-5017: The resource action "ora.asm start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to "(:CLSN00107:)" in "/u01/app/11.2.0/grid/log/rac-r-2/agent/ohasd/oraagent_grid/oraagent_grid.log".
CRS-2674: Start of 'ora.asm' on 'rac-r-2' failed
CRS-2679: Attempting to clean 'ora.asm' on 'rac-r-2'
CRS-2681: Clean of 'ora.asm' on 'rac-r-2' succeeded
CRS-4000: Command Start failed, or completed with errors.
Failed to start Oracle Grid Infrastructure stack
Failed to start ASM at /u01/app/11.2.0/grid/crs/install/crsconfig_lib.pm line 1339.
/u01/app/11.2.0/grid/perl/bin/perl -I/u01/app/11.2.0/grid/perl/lib -I/u01/app/11.2.0/grid/crs/install /u01/app/11.2.0/grid/crs/install/rootcrs.pl execution failed

#2 · OP · Posted 2020-6-23 15:52:17
Starting ASM manually also fails.
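
A minimal sketch of what such a manual start attempt might look like (assuming the grid user environment points at the 11.2 Grid home from the log above, and that +ASM2 is the ASM instance name on the second node):

# As the grid user on rac-r-2, try starting ASM through the OHASD resource:
$ /u01/app/11.2.0/grid/bin/crsctl start res ora.asm -init

# Or connect directly and start the instance:
$ export ORACLE_HOME=/u01/app/11.2.0/grid
$ export ORACLE_SID=+ASM2
$ sqlplus / as sysasm
SQL> startup
SQL> -- in this scenario the instance terminates while trying to join the
SQL> -- cluster (PMON error 481 / ORA-03113), matching the root.sh output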



#3 · OP · Posted 2020-6-23 15:53:05
Quote (郑全, posted 2020-6-23 15:52): Starting ASM manually also fails.

ASM on Non-First Node (Second or Others) Fails to Start: PMON (ospid: nnnn): terminating the instance due to error 481 (Doc ID 1383737.1)

In this Document

Purpose
Scope
Details
Case1: link local IP (169.254.x.x) is being used by other adapter/network
Case2: firewall exists between nodes on private network (iptables etc)
Case3: HAIP is up on some nodes but not on all
Case4: HAIP is up on all nodes but some do not have route info
Case5: HAIP is up on all nodes and route info is present but HAIP is not pingable
References


Applies to:
Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Cloud Exadata Service - Version N/A and later
Oracle Database Cloud Schema Service - Version N/A and later
Information in this document applies to any platform.
Purpose
This note lists common causes of ASM startup failure with the following error on a non-first node (second or others):
  • alert_<ASMn>.log from non-first node
lmon registered with NM - instance number 2 (internal mem no 1)
Tue Dec 06 06:16:15 2011
System state dump requested by (instance=2, osid=19095 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /g01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_diag_19138.trc
Tue Dec 06 06:16:15 2011
PMON (ospid: 19095): terminating the instance due to error 481
Dumping diagnostic data in directory=[cdmp_20111206061615], requested by (instance=2, osid=19095 (PMON)), summary=[abnormal instance termination].
Tue Dec 06 06:16:15 2011
ORA-1092 : opitsk aborting process

Note: the ASM instance terminates shortly after "lmon registered with NM"
If ASM on the non-first node was running previously, the following will likely be in the alert.log from when it originally failed:
..
IPC Send timeout detected. Sender: ospid 32231 [oracle@ftdcslsedw01b (PING)]
..
ORA-29740: evicted by instance number 1, group incarnation 10
..

  • diag trace from non-first ASM (+ASMn_diag_<pid>.trc)
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE])

  • alert_<ASMn>.log from first node
LMON (ospid: 15986) detects hung instances during IMR reconfiguration
LMON (ospid: 15986) tries to kill the instance 2 in 37 seconds.
Please check instance 2's alert log and LMON trace file for more details.
..
Remote instance kill is issued with system inc 64
Remote instance kill map (size 1) : 2
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
Reconfiguration started (old inc 64, new inc 66)

If the issue happens while running a root script (root.sh or rootupgrade.sh) as part of the Grid Infrastructure installation/upgrade process, the following symptoms will be present:
  • root script screen output
Start of resource "ora.asm" failed

CRS-2672: Attempting to start 'ora.asm' on 'racnode1'
CRS-5017: The resource action "ora.asm start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to "(:CLSN00107:)" in "/ocw/grid/log/racnode1/agent/ohasd/oraagent_grid/oraagent_grid.log".
CRS-2674: Start of 'ora.asm' on 'racnode1' failed
..
Failed to start ASM at /g01/app/11.2.0.3/crs/install/crsconfig_lib.pm line 1272
  • $GRID_HOME/cfgtoollogs/crsconfig/rootcrs_<nodename>.log
2011-11-29 15:56:48: Executing cmd: /g01/app/11.2.0.3/bin/crsctl start resource ora.asm -init
..
>  CRS-2672: Attempting to start 'ora.asm' on 'racnode1'
>  CRS-5017: The resource action "ora.asm start" encountered the following error:
>  ORA-03113: end-of-file on communication channel
>  Process ID: 0
>  Session ID: 0 Serial number: 0
>  . For details refer to "(:CLSN00107:)" in "/g01/app/11.2.0.3/log/racnode1/agent/ohasd/oraagent_grid/oraagent_grid.log".
>  CRS-2674: Start of 'ora.asm' on 'racnode1' failed
>  CRS-2679: Attempting to clean 'ora.asm' on 'racnode1'
>  CRS-2681: Clean of 'ora.asm' on 'racnode1' succeeded
..
>  CRS-4000: Command Start failed, or completed with errors.
>End Command output
2011-11-29 15:59:00: Executing cmd: /g01/app/11.2.0.3/bin/crsctl check resource ora.asm -init
2011-11-29 15:59:00: Executing cmd: /g01/app/11.2.0.3/bin/crsctl status resource ora.asm -init
2011-11-29 15:59:01: Checking the status of ora.asm
..
2011-11-29 15:59:53: Start of resource "ora.asm" failed
  • For 12.1.0.2, the root.sh on the 2nd node could report:
    PRVG-6056 : Insufficient ASM instances found.  Expected 2 but found 1, on nodes "racnode2".

Scope
Details
Case1: link local IP (169.254.x.x) is being used by other adapter/network
Symptoms:
  • $GRID_HOME/log/<nodename>/alert<nodename>.log
[/ocw/grid/bin/orarootagent.bin(4813)]CRS-5018:(:CLSN00037:) Removed unused HAIP route:  169.254.x.x / 255.255.255.0 / 0.0.0.0 / usb0
  • OS messages (optional)
Dec  6 06:11:14 racnode1 dhclient: DHCPREQUEST on usb0 to 255.255.255.255 port 67
Dec  6 06:11:14 racnode1 dhclient: DHCPACK from 169.254.x.x
  • ifconfig -a
..
usb0      Link encap:Ethernet  HWaddr E6:1F:13:AD:EE:D3
        inet addr:169.254.x.x  Bcast:169.254.95.255  Mask:255.255.255.0
..

Note: it's usb0 in this case, but it can be any other adapter that uses a link local address

Solution:

A link local IP must not be used by any other network on the cluster nodes. In this case, a USB network device obtained the IP 169.254.x.x from a DHCP server, which disrupted HAIP routing; the solution is to blacklist the device in udev so that it is not activated automatically.
The Dell iDRAC service module may use a link local address; engage Dell to change the subnet.
On the Sun T series, by default, ILOM (adapter name usbecm0) uses a link local address; engage Oracle Support for advice.
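
One possible way to keep such a device from being activated is to prevent its kernel driver from loading at all (a sketch; cdc_ether is only an example module name and the file name is arbitrary, so identify the real driver for the adapter first):

# Find the driver behind the offending adapter (usb0 here):
# ethtool -i usb0
#
# Then blacklist that driver, e.g. in /etc/modprobe.d/blacklist-usbnic.conf:
blacklist cdc_ether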



Case2: firewall exists between nodes on private network (iptables etc)
No firewall is allowed on the private network (cluster_interconnect) between nodes, including software firewalls such as iptables, ipmon etc.
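
To verify that no firewall rules are active on the private network, something along these lines can be run on each node as root (a sketch for Linux with iptables; adjust for the distribution and firewall product in use):

# service iptables status      # RHEL/OL 5/6 style service status check
# iptables -L -n               # list active filter rules; no blocking rules should appear
# chkconfig --list iptables    # confirm the firewall is not set to start at boot
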
Case3: HAIP is up on some nodes but not on all
Symptoms:
  • alert_<+ASMn>.log for some instances
Cluster communication is configured to use the following interface(s) for this instance
10.x.x.x
  • alert_<+ASMn>.log for other instances
Cluster communication is configured to use the following interface(s) for this instance
169.254.x.x

Note: some instances are using HAIP while others are not, so they cannot talk to each other
Solution:

The solution is to bring up HAIP on all nodes.

To find out HAIP status, execute the following on all nodes:
$GRID_HOME/bin/crsctl stat res ora.cluster_interconnect.haip -init

If it's offline, try to bring it up as root:
$GRID_HOME/bin/crsctl start res ora.cluster_interconnect.haip -init

If HAIP fails to start, refer to Note 1210883.1 for known issues. Once HAIP is restarted, ASM/DB instances need to be restarted to use HAIP; if OCR is on ASM DG, GI needs to be restarted.


If the "up node" is not using HAIP, and no outage is allowed, the workaround is to set the init.ora/spfile parameter cluster_interconnects to the private IP of each node to allow ASM/DB to come up on the "down node". Once a maintenance window is planned, the parameter must be removed to allow HAIP to work. A sketch of this workaround is shown below.
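
A hedged sketch of that workaround for the ASM instances (the private IP addresses are placeholders and the SIDs assume a two-node +ASM1/+ASM2 setup; run as the grid user and revert the setting in the maintenance window):

$ export ORACLE_HOME=/u01/app/11.2.0/grid
$ export ORACLE_SID=+ASM1
$ sqlplus / as sysasm
SQL> -- pin each ASM instance to its own private IP (placeholder addresses)
SQL> alter system set cluster_interconnects='192.168.10.1' scope=spfile sid='+ASM1';
SQL> alter system set cluster_interconnects='192.168.10.2' scope=spfile sid='+ASM2';
SQL> -- later, once HAIP is working again, remove the setting:
SQL> -- alter system reset cluster_interconnects scope=spfile sid='*';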

The following article may assist in determining the reason for the failure to start HAIP:

       note 1640865.1 - Known Issues: Grid Infrastructure Redundant Interconnect and ora.cluster_interconnect.haip
If the issue happened in the middle of GI upgrade, refer to:
      note 2063676.1 - rootupgrade.sh fails on node1 as HAIP was not starting from old home but starting from new home
Case4: HAIP is up on all nodes but some do not have route info
Symptoms:
  • alert_<+ASMn>.log for all instances
Cluster communication is configured to use the following interface(s) for this instance
169.254.x.x
  • "netstat -rn" for some nodes (surviving nodes) missing HAIP route
netstat -rn
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
<IP ADDRESS>     0.0.0.0         255.255.248.0   U         0 0          0 bond0
<IP ADDRESS>     0.0.0.0         255.255.255.0   U         0 0          0 bond2
0.0.0.0      <IP ADDRESS>     0.0.0.0         UG        0 0          0   bond0

The line for HAIP is missing, i.e.:

169.254.x.x     0.0.0.0         255.255.0.0     U         0 0          0 bond2

Note: As the HAIP route info is missing on some nodes, HAIP is not pingable; usually a newly restarted node will have the HAIP route info
Solution:

The solution is to manually add the HAIP route info on the nodes where it is missing:

4.1. Execute "netstat -rn" on any node that has HAIP route info and locate the following:
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 bond2
Note: the first field is the HAIP subnet ID and will start with 169.254.xxx.xxx, the third field is the HAIP subnet netmask, and the last field is the private network adapter name


4.2. Execute the following as root on the node that's missing HAIP route:
# route add -net <HAIP subnet ID> netmask <HAIP subnet netmask> dev <private network adapter>

i.e.

# route add -net 169.254.x.x netmask 255.255.0.0 dev bond2

4.3. Start ora.crsd as root on the node that's partially up:
# $GRID_HOME/bin/crsctl start res ora.crsd -init

The other workaround is to restart GI on the node that's missing the HAIP route with the "crsctl stop crs -f" and "crsctl start crs" commands as root, as sketched below.
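
For reference, that restart might look like this (run as root on the node that is missing the HAIP route; $GRID_HOME stands for the Grid Infrastructure home, e.g. /u01/app/11.2.0/grid in the original post):

# $GRID_HOME/bin/crsctl stop crs -f
# $GRID_HOME/bin/crsctl start crs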



Case5: HAIP is up on all nodes and route info is present but HAIP is not pingable
Symptom:
HAIP is present on both nodes and the route information is also present, but neither node can ping or traceroute the other node's HAIP address.
[oracle@racnode2 script]$ netstat -r
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.x.x     *               255.255.255.0   U         0 0          0 eth2
192.168.x.x     *               255.255.255.0   U         0 0          0 eth1
192.168.x.x     *               255.255.255.0   U         0 0          0 eth0
link-local      *               255.255.0.0     U         0 0          0 eth2
default         192.168.x.x     0.0.0.0         UG        0 0          0 eth0

[oracle@racnode1 trace]$ netstat -r
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.x.x     *               255.255.255.0   U         0 0          0 eth2
192.168.x.x     *               255.255.255.0   U         0 0          0 eth1
192.168.x.x     *               255.255.255.0   U         0 0          0 eth0
link-local      *               255.255.0.0     U         0 0          0 eth2
default         192.168.x.x     0.0.0.0         UG        0 0          0 eth0

[oracle@racnode2 script]$ ping 169.254.x.x
PING 169.254.x.x (169.254.x.x) 56(84) bytes of data.

^C
--- 169.254.x.x ping statistics ---
39 packets transmitted, 0 received, 100% packet loss, time 38841ms

[oracle@racnode1 trace]$ ping 169.254.x.x
PING 169.254.x.x (169.254.x.x) 56(84) bytes of data.

^C
--- 169.254.x.x ping statistics ---
35 packets transmitted, 0 received, 100% packet loss, time 34555ms


Solution:
For an OpenStack Cloud implementation, engage the system administrator to create another neutron port to map the link-local traffic. For other environments, engage the SysAdmin/NetworkAdmin to review the routing/network setup.







#4 · OP · Posted 2020-6-23 15:54:05
Final result: the failure was caused by a firewall on the private interconnect (heartbeat) network; after it was turned off, the problem was resolved.
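
For reference, turning the firewall off on each node was probably along these lines (a sketch for a RHEL/OL 6-style iptables setup; the exact commands used were not posted):

# as root, on every cluster node
# service iptables stop
# chkconfig iptables off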