重庆思庄 (Chongqing Sizhuang) Oracle / Redhat Certification Study Forum
Troubleshooting Instance Evictions (Instance terminates with ORA-29740, Instance Abort, Instance Kill)

Posted 2014-11-14 10:02:03
In this Document
Purpose
Troubleshooting Steps
  Background
  What is an instance eviction?
  Why do instances get evicted?
  How can I tell that I have had an instance eviction?
  What is the most common cause of instance evictions?
  Key files for troubleshooting instance evictions
  Steps to diagnose instance evictions
  Step 1. Look in the alert logs from all instances for eviction message.
  Step 2. For ora-29740, check the lmon traces for eviction reason.
  1. Find the reason for the reconfiguration.
  2. Understand the reconfiguration reason
  Step 3. Review alert logs for additional information.
  1. "IPC Send Timeout"
  2. "Waiting for clusterware split-brain resolution" or "Detected an inconsistent instance membership"
  3. "<PROCESS> detected no messaging activity from instance <n>"
  4. None of the above
  Step 4. Checks to carry out based on the findings of steps 1, 2, 3.
  4(a) - Network checks.
  4(b) - Check for OS hangs or severe resource contention at the OS level.
  4(c) - Check for database or process hang.
  Known Issues
References
APPLIES TO:

Oracle Database - Enterprise Edition - Version 9.2.0.1 and later
Information in this document applies to any platform.
PURPOSE

Purpose: Understanding and Troubleshooting Instance Evictions.

Symptoms of an instance eviction: Instance terminates with ORA-29740, Instance Abort, Instance Kill

TROUBLESHOOTING STEPS

Background

What is an instance eviction?

A RAC database has several instances.
In an instance eviction, one or more instances are abruptly aborted ("evicted").
The decision to evict these instances is made by mutual consensus of all the instances.

Why do instances get evicted?

To prevent problems from occurring that would affect the entire clustered database.
To evict an unresponsive instance instead of allowing a cluster-wide hang to occur.
To evict an instance that cannot communicate with the other instances, avoiding a "split brain" situation - in other words, to preserve cluster consistency.

How can I tell that I have had an instance eviction?

The instance will be shut down abruptly.

In most cases, the alert log will contain this message:

ORA-29740: evicted by instance number <n>, group incarnation <n>
In a few cases, the ORA-29740 message will not be present, and this message will show instead:

"Received an instance abort message from instance 1"
What is the most common cause of instance evictions?

The most common reason is a communications failure.
The Oracle background processes communicate with each other across the private interconnect.
If the other instances cannot communicate with one instance, that instance is evicted.
This is known as a communications reconfiguration.
The chief causes of communications failure are network issues and OS load issues.

Key files for troubleshooting instance evictions

1. Alert log from each instance.
2. LMON trace file from each instance.

Steps to diagnose instance evictions

1. Look in the alert logs from all instances for eviction message.
2. For ora-29740, check the lmon traces for eviction reason.
3. Review alert logs for additional information.
4. Checks to carry out based on the findings of steps 1, 2, 3.

Step 1. Look in the alert logs from all instances for eviction message.

Look for the following messages in the alert log:

a) Look for ora-29740 in the alert log of the instance that got restarted.

Example:

ORA-29740: evicted by instance number 2, group incarnation 24
 

If ora-29740 is found, this means that LMON of the evicted instance terminated it.
 

b) If no ora-29740, look for the following messages:

In the evicted instance:

"Received an instance abort message from instance 1"
In the surviving instance that issued the kill:

"Remote instance kill is issued"
This means that the instance was evicted by another instance, but its LMON did not terminate it with ora-29740.

This is usually an indication that LMON on the evicted instance was busy or not progressing.  If you see this symptom (Received an instance abort / Remote instance kill is issued), carry out the checks in Step 4(b) and 4(c).
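The searches in Step 1 amount to one pattern scan across every instance's alert log. As an illustrative sketch (the helper name and the idea of passing log paths as arguments are ours, not part of the original note), a single grep can surface all three messages at once:

```shell
#!/bin/sh
# Hypothetical helper: scan the alert logs given as arguments for the
# eviction-related messages discussed in Step 1, with line numbers.
scan_for_eviction() {
    grep -nE 'ORA-29740|Received an instance abort message|Remote instance kill is issued' "$@"
}

# Example invocation (path is a placeholder for your diag destination):
# scan_for_eviction /u01/app/oracle/diag/rdbms/*/*/trace/alert_*.log
```

Run it against the alert log of every instance, since the evicted and surviving instances log different messages.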
 

Step 2. For ora-29740, check the lmon traces for eviction reason.

1. Find the reason for the reconfiguration.

Check the lmon traces for all instances for a line with "kjxgrrcfgchk: Initiating reconfig".
This will give a reason code such as "kjxgrrcfgchk: Initiating reconfig, reason 3".

Note: make sure that the timestamp of this line is shortly before the time of the ORA-29740.
There is a reconfiguration every time an instance joins or leaves the cluster (reason 1 or 2),
so make sure that you have found the right reconfiguration in the LMON trace.
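As a sketch of this search (the function name is ours and the trace path is a placeholder), grep with line numbers makes it easier to check the timestamps just above each hit against the ORA-29740 time:

```shell
#!/bin/sh
# Hypothetical helper: list every reconfiguration line in the LMON trace
# files given as arguments; the -n line numbers help locate the
# surrounding timestamps in the trace.
find_reconfig() {
    grep -n 'kjxgrrcfgchk: Initiating reconfig' "$@"
}

# Example invocation (path is a placeholder):
# find_reconfig /u01/app/oracle/diag/rdbms/*/*/trace/*lmon*.trc
```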

2. Understand the reconfiguration reason

The reconfiguration reasons are:

Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend
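For scripted triage, the table above can be turned into a small lookup. This is only a convenience sketch (the function name is ours), with the codes and meanings taken verbatim from the list:

```shell
#!/bin/sh
# Map a kjxgrrcfgchk reason code to the meaning listed above.
reconfig_reason() {
    case "$1" in
        0) echo "No reconfiguration" ;;
        1) echo "The Node Monitor generated the reconfiguration" ;;
        2) echo "An instance death was detected" ;;
        3) echo "Communications Failure" ;;
        4) echo "Reconfiguration after suspend" ;;
        *) echo "Unknown reason code: $1" ;;
    esac
}
```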

For ORA-29740, by far the most common reconfiguration reason is Reason 3 (Communications Failure).
This means that the background processes of one or more instances have registered a communication problem with each other.

Note: All the instances of a RAC database need to be in constant communication with each other over the interconnect in order to preserve database integrity.
In 11.2, the alert log may also print the following for a Reason 3 reconfiguration:

Communications reconfiguration: instance_number <n>
 

If you see this symptom (Reason 3 or Communications reconfiguration), carry out the checks in Step 4(a) - Network checks.
 

If you find a different reconfiguration reason, double-check to make sure that you have the right reconfiguration, i.e. the last "kjxgrrcfgchk" message before the ORA-29740 occurred. See Document 219361.1 for more information on the other reconfiguration reasons.

Step 3. Review alert logs for additional information.

Look for any of the following messages in the alert log of any instance, shortly before the eviction:

1. "IPC Send Timeout"

Example: Instance 1's alert log shows:
IPC Send timeout detected. Sender: ospid 1519
Receiver: inst 8 binc 997466802 ospid 23309

This means that Instance 1's process with OS pid 1519 was trying to send a message to Instance 8's process with OS pid 23309, and timed out while waiting for an acknowledgement from Instance 8 ospid 23309.

To find out which background process corresponds to each ospid, look BACKWARDS in the corresponding alert log to the PRIOR startup. The ospid's of all background processes are listed at instance startup.

Example:
Thu Apr 25 16:35:41 2013
LMON started with pid=11, OS id=15510
Thu Apr 25 16:35:41 2013
LMD0 started with pid=12, OS id=15512
Thu Apr 25 16:35:41 2013
LMS0 started with pid=13, OS id=15514 at elevated priority

Broadly speaking, there are two kinds of reasons to see IPC send timeout messages in the alert log:
(1) A network problem with communication over the interconnect, so the IPC message does not get through.
(2) The sender or receiver process is not progressing. This could be caused by an OS load or scheduling problem, or by a database/process hang, or by the process being blocked at the DB wait level.

If you see this symptom, carry out all of the checks in Section 4. 
* At the OS level and/or hanganalyze level, focus particularly on the PIDs printed in the IPC send timeout.
* Also, check the trace files for the processes whose PIDs are printed in the IPC send timeout.
* In 11.1 and above, also check the LMHB trace with a focus on these processes.
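The backward search for the ospid can be scripted. This sketch (the helper name is ours) takes the alert log and an OS pid from the IPC send timeout and returns the most recent matching startup line, which names the background process:

```shell
#!/bin/sh
# Hypothetical helper: map an OS pid from an "IPC Send timeout" message
# back to the background process name, using the "started with pid=...,
# OS id=..." lines written to the alert log at every instance startup.
# tail -1 keeps only the most recent startup, since OS pids are reused
# across restarts.
ospid_to_bg() {
    # $1 = alert log file, $2 = OS pid
    grep -E "started with pid=[0-9]+, OS id=$2( |\$)" "$1" | tail -1
}
```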
 

2. "Waiting for clusterware split-brain resolution" or "Detected an inconsistent instance membership"

These messages are sometimes seen in a communications reconfiguration.

Example 1:

Mon Dec 07 19:43:07 2011
Communications reconfiguration: instance_number 2
Mon Dec 07 19:43:07 2011
Trace dumping is performing id=[cdmp_20091207194307]
Waiting for clusterware split-brain resolution
Example 2:

Thu Mar 07 17:08:03 2013
Detected an inconsistent instance membership by instance 2

Either of these messages indicates a split-brain situation. This indicates a sustained and severe problem with communication between instances over the interconnect.

See the following note to understand split-brain further:
Document 1425586.1 - What is Split Brain in Oracle Clusterware and Real Application Cluster

If you see this symptom, carry out the checks in step 4(a) - Network.
 

3. "<PROCESS> detected no messaging activity from instance <n>"

Example:

LMS0 (ospid: 2431) has detected no messaging activity from instance 1
LMS0 (ospid: 2431) issues an IMR to resolve the situation

This means that the background process (LMS0 in the above example) has not received any messages from the other instance for a sustained period of time, and is initiating Instance Membership Recovery (IMR). It is a strong indication that either there are network problems on the interconnect, or the other instance is hung.

If you see this symptom, carry out the checks in step 4(a) first; if no issues are found, check 4(b) and 4(c).
 

4. None of the above

If none of the above messages are seen in the alert log, but you have seen ora-29740 in the alert log, then carry out all the checks in section 4, starting with 4(a) - Network checks.

Step 4. Checks to carry out based on the findings of steps 1, 2, 3.

Note: In the following, OSW refers to OS Watcher (Document 301137.1), and CHM refers to Cluster Health Monitor (Document 1328466.1).

If you are experiencing repeated instance evictions, you will need to be able to retrospectively examine the OS statistics from the time of the eviction. If CHM is available on your platform and version, you can use CHM; make sure to review the results before they expire out of the archive. Otherwise, Oracle Support recommends that you install and run OS Watcher to facilitate diagnosis.

 

4(a) - Network checks.

* Check the network and make sure there are no network errors, such as UDP errors, IP packet loss, or interface failures.

* Check the network configuration to make sure that it is set up correctly on all nodes.
   For example, the MTU size must be the same on all nodes, and the switch must support an MTU size of 9000 if jumbo frames are used.
   
* Check archived "netstat" results in OSW or CHM. By default, the database communicates over the interconnect using UDP. Look for any increase in IP or UDP errors, drops, fragments not reassembled, etc.

* If OSW is in use, check archived "oswprvtnet" for any interruption in the traceroutes over private interconnect. See Document 301137.1 for more information.
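On Linux, one way to eyeball the UDP/IP counters mentioned above is to filter netstat's statistics output down to the error lines. This filter is a sketch (the function name is ours; the counter names are those printed by Linux net-tools netstat), intended to be compared across two OSW/CHM snapshots - on a healthy interconnect the counters should stay flat between snapshots:

```shell
#!/bin/sh
# Keep only the error/drop/fragmentation counters from netstat -s
# style output read on stdin.
udp_errors() {
    grep -Ei 'receive errors|buffer errors|fragments dropped|reassembl'
}

# Example invocation:
# netstat -su | udp_errors
```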

4(b) - Check for OS hangs or severe resource contention at the OS level.

* Check archived vmstat and top results in OSW or CHM to see if the server had a CPU or memory load problem, a network problem, or spinning LMD or LMS processes.
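As a rough sketch of that review (assuming the default Linux vmstat layout where CPU idle is the 15th column - adjust the column number for your platform), the archived samples can be scanned for intervals of near-zero idle CPU, the kind of scheduling starvation that delays LMON/LMS:

```shell
#!/bin/sh
# Flag vmstat samples where CPU idle fell below 5%.  Column 15 is "id"
# in the default Linux layout: r b swpd free buff cache si so bi bo in cs us sy id wa
flag_cpu_starvation() {
    awk '$15 ~ /^[0-9]+$/ && $15 + 0 < 5 { print "possible CPU starvation: " $0 }'
}

# Example invocation (path is a placeholder for an OSW archive file):
# flag_cpu_starvation < oswvmstat/node1_vmstat_13.04.25.dat
```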

4(c) - Check for database or process hang.

* Check the alert log to see if any hanganalyze dump was taken prior to the ORA-29740, as instance or process hangs can trigger an automatic hanganalyze dump. If a hanganalyze dump was taken, see Document 390374.1 for more information on interpreting the dump.

* Check the alert log, or with the DBA, to see if a systemstate dump was taken prior to the ORA-29740. If so, Oracle Support can assist in analysing the systemstate dump.

* Check archived OS statistics in OSW or CHM to see if any LM* background process was spinning.

 

Known Issues

Document 1440892.1 - 11gR2: LMON received an instance eviction notification from instance n

REFERENCES

NOTE:1425586.1 - What is Split Brain in Oracle Clusterware and Real Application Cluster
NOTE:1374110.1 - Top 5 issues for Instance Eviction
NOTE:1440892.1 - 11gR2: LMON received an instance eviction notification from instance n
NOTE:390374.1 - Oracle Performance Diagnostic Guide (OPDG)
NOTE:219361.1 - Troubleshooting ORA-29740 in a RAC Environment