Troubleshooting: Missing Automatic Workload Repository (AWR) Snapshots and Other Collection Issues (Doc ID 1301503.1)
In this Document
Purpose
Troubleshooting Steps
Detecting Missing Snapshots
Snapshot creation does not start
Database Settings and Status
STATISTICS_LEVEL setting
Database is Closed
Database Open Mode
Snapshot creation starts but does not complete
General Reasons
ADDM takes a significant amount of time to complete
Workarounds
Complications
Timing information for each individual table flush
Known Issues
APPLIES TO:
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Exadata Express Cloud Service - Version N/A and later
Oracle Database Backup Service - Version N/A and later
Oracle Database Cloud Exadata Service - Version N/A and later
Oracle Database Cloud Schema Service - Version N/A and later
Information in this document applies to any platform.
Oracle RDBMS Automatic Workload Repository
PURPOSE
The purpose of this document is to describe the diagnostic data that can be collected and the corrective steps that can be taken when encountering issues with slow, hung or missing AWR snapshots.
AWR snapshots may fail to create for a number of reasons, the main two being:
Snapshot creation does not start
Snapshot creation starts but does not complete
TROUBLESHOOTING STEPS
Detecting Missing Snapshots
If AWR snapshots aren't created as expected, check if DBA_HIST_SNAPSHOT shows snapshots in the expected time range:
col instance_number format 999 heading 'Inst'
col startup_time format a15 heading 'Startup'
col begin_interval_time format a15 heading 'Begin snap'
col end_interval_time format a15 heading 'End Snap'
col flush_elapsed format a20 heading 'flush elapsed'
col error_count format 9999 heading 'Err#'
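With those formats in place, a query along the following lines (a sketch; the two-day window is only an example and can be widened) lists the recent snapshots and shows any gaps:
SQL> select snap_id, instance_number, startup_time,
            begin_interval_time, end_interval_time,
            flush_elapsed, error_count
     from dba_hist_snapshot
     where begin_interval_time > systimestamp - interval '2' day
     order by snap_id, instance_number;
If the expected time range returns no rows, snapshots are missing and the checks below apply.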
Snapshot creation does not start
There are a number of reasons for this, including:
Database settings and status
MMON activity becoming suspended (snapshot generation is part of MMON activity, so snapshots do not start).
MMON can be suspended because of:
A general issue that prevents all snapshot generation. Check whether manual generation works (see the example after this list) and address any errors if it does not.
Excessive SYSAUX space usage
MMON-related errors such as ORA-12751 causing MMON activity to become suspended. For some of the potential reasons, see:
Document 761298.1 Troubleshooting ORA-12751 "cpu time or run time policy violation" Errors
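To check whether manual snapshot generation works, create a snapshot explicitly (the same call is used again later in this note):
SQL> exec dbms_workload_repository.create_snapshot;

PL/SQL procedure successfully completed.
If this raises an error instead, resolve that error first.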
Start out by verifying whether snapshots are actually starting. You need to know the time of the last snapshot and the configured snapshot interval, which you can find with the following SQL:
col systimestamp form a35
col most_recent_snap_time form a25
col snap_interval form a17
SQL> select systimestamp, most_recent_snap_time, snap_interval from wrm$_wr_control where dbid = (select dbid from v$database);
SYSTIMESTAMP MOST_RECENT_SNAP_TIME SNAP_INTERVAL
----------------------------------- ------------------------- --------------
28-DEC-09 01.15.25.098600 PM -08:00 24-DEC-09 01.00.13.418 PM 01:00:00.0
Snapshots should happen every hour (+0 01:00:00). AWR tries to take the snapshots at the top of the hour (08:00, 09:00, 10:00, ...). That means that the first snapshot after startup, or the first automatic snapshot following a manual snapshot, may appear to be "late" or "early". Common reasons snapshots might not be collected are given below.
Database Settings and Status
STATISTICS_LEVEL setting
If the parameter STATISTICS_LEVEL is set to BASIC then snapshots will not be taken automatically.
SQL> show parameter statistic
NAME TYPE VALUE
------------------------------------ ----------- -------------------
optimizer_use_pending_statistics boolean FALSE
statistics_level string TYPICAL
timed_os_statistics integer 0
timed_statistics boolean TRUE
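If STATISTICS_LEVEL is found to be BASIC, setting it back to TYPICAL (the default) re-enables automatic snapshot collection. The parameter is dynamic, so no restart is required:
SQL> alter system set statistics_level = typical;

System altered.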
Parameters with "SWRF", "AWR" or "SYSAUX" in their names could also have an impact. If any of these are set, find out why; if they are no longer required, try to unset them. A query for finding such parameters is shown below.
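Most parameters of this kind are hidden (underscore) parameters and do not appear in V$PARAMETER. A query such as the following (a sketch; it must be run as SYS, and the X$ views are internal and unsupported) lists their current values:
SQL> select i.ksppinm name, v.ksppstvl value
     from x$ksppi i, x$ksppcv v
     where i.indx = v.indx
     and (lower(i.ksppinm) like '%swrf%'
       or lower(i.ksppinm) like '%awr%'
       or lower(i.ksppinm) like '%sysaux%');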
Database is Closed
In the case shown here:
SYSTIMESTAMP MOST_RECENT_SNAP_TIME SNAP_INTERVAL
----------------------------------- ------------------------- --------------
28-DEC-09 01.15.25.098600 PM -08:00 24-DEC-09 01.00.13.418 PM 01:00:00.0
the last time a snapshot was taken was 24th Dec 01:00 PM, but the current date is 28th December. This means that the last snapshot was taken over 4 days ago. This happened because the database had just been opened after having been closed for several days:
SQL> alter session set nls_date_format = 'DD-MON-YYYY HH24:MI:SS';
Session altered.
SQL> select open_time from v$thread ;
OPEN_TIME
--------------------
28-DEC-2009 13:11:46
The instance alert log will confirm this.
Database Open Mode
Some open modes may also disable automatic snapshots. For example, in restricted mode or read-only mode, snapshots are disabled. Check the database open mode:
SQL> select open_mode from v$database;
OPEN_MODE
----------
READ WRITE
If the above isn't sufficient to resolve the issue, customers may need to gather traces and send them to Oracle Support. Often, it's best to get the MMON action trace ("28" below) and the snapshot flush trace ("10" below).
Note: On RAC, please run the commands on each instance
Gathering MMON action trace and snapshot flush trace:
SQL> alter session set "_swrf_test_action" = 28;
Session altered.
SQL> alter session set "_swrf_test_action" = 10;
Session altered.
Wait for at least an hour, and then gather the MMON and M00x traces. To disable the traces again, use the following syntax:
SQL> alter session set "_swrf_test_action" = 29;
Session altered.
SQL> alter session set "_swrf_test_action" = 11;
Session altered.
With tracing enabled, the MMON trace file shows the start and completion messages for each auto-flush and auto-purge cycle:
KEWRAFM: Beginning one MMON Auto-Flush cycle ...
Finished one MMON Auto-Flush cycle.
*** 2011-02-17 06:28:37.057
KEWRAFM: Beginning one MMON Auto-Flush cycle ...
Finished one MMON Auto-Flush cycle.
*** 2011-02-17 06:29:22.192
KEWRAPM: Beginning one MMON Auto-Purge cycle ...
KEWRAPM: Finished one MMON Auto-Purge cycle.
The MMON worker (M00x) trace file output will show the begin and end times of each step of the AWR statistics collection:
*** MODULE NAME:(MMON_SLAVE) 2011-02-17 07:00:16.734
*** ACTION NAME:(Auto-Flush Slave Action) 2011-02-17 07:00:16.734
Snapshot creation starts but does not complete
General Reasons
One common source of confusion is that the view DBA_HIST_SNAPSHOT filters out snapshots that did not complete successfully.
That means that DBA_HIST_SNAPSHOT is not a reliable source for investigating this problem. Instead, we can refer to the information in WRM$_SNAPSHOT. For instance, let's look at the last 10 snapshots in the database. First, run this query to verify whether WRM$_SNAPSHOT_DETAILS exists in the release in question:
select owner, object_name, object_type from dba_objects where object_name LIKE UPPER('WRM$%')
and object_type='TABLE';
WRM$_SNAPSHOT_DETAILS does not exist in an 11.1.0.7 database but does exist in 11.2.0.3.
If this object does not exist in a given release, any SQL referencing it will raise an error when executed.
col instance_number form 999 head INST
col begin_interval_time form a25
col flush_elapsed form a17
col status form 999
col error_count form 999 head ERR
col snap_flag form 999 head SNAP
SQL> select * from
(select snap_id,
instance_number,
begin_interval_time,
flush_elapsed,
status,
error_count,
snap_flag
from wrm$_snapshot
where dbid = (select dbid from v$database)
order by snap_id desc)
where rownum <= 10
order by snap_id
/
SNAP_ID INST BEGIN_INTERVAL_TIME FLUSH_ELAPSED STATUS ERR SNAP
---------- ---- ------------------------- ----------------- ------ ---- ---
91 1 29-DEC-09 11.00.35.080 AM 00:00:01.2 0 0 0
92 1 29-DEC-09 12.00.55.772 PM 00:00:27.3 0 0 0
93 1 29-DEC-09 02.45.44.000 PM 00:00:06.1 0 0 1
94 1 29-DEC-09 02.50.33.906 PM 00:00:11.7 0 0 0
95 1 29-DEC-09 03.00.46.341 PM 00:00:01.4 0 0 0
96 1 29-DEC-09 03.10.49.601 PM 00:00:00.9 0 0 0
97 1 29-DEC-09 03.20.53.111 PM 00:00:00.5 0 1 0
98 1 29-DEC-09 03.30.56.494 PM 00:00:00.1 0 0 0
99 1 29-DEC-09 03.41.00.179 PM 00:00:00.2 0 1 0
100 1 29-DEC-09 03.50.04.773 PM 00:00:02.1 0 1 0
In this case, we can see that a few of the snapshots had errors (ERROR_COUNT > 0), but they still completed (STATUS = 0 and FLUSH_ELAPSED is not NULL).
To investigate the errors, start with the WRM$_SNAP_ERROR table:
SQL> select * from wrm$_snap_error
where dbid = (select dbid from v$database)
order by snap_id;
SNAP_ID DBID INST TABLE_NAME ERROR_NUMBER
---------- ---------- ----- ------------------------------ ------------
97 581521084 1 WRH$_SQL_PLAN 942
99 581521084 1 WRH$_SQL_PLAN 942
100 581521084 1 WRH$_SQL_PLAN 942
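ERROR_NUMBER holds the Oracle error number: 942 above is ORA-00942 ("table or view does not exist"). On Unix platforms the oerr utility translates the number into the message text:
$ oerr ora 942
00942, 00000, "table or view does not exist"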
In most cases, the M00x process will also write some information about the error to a file, either an incident dump or a regular trace file. From 11g onwards, incident dumps should always be collected for service requests where available. If there are no errors but the status is non-zero, it means that the snapshot ran for too long and was terminated, or that it is still running. In those cases, it is useful to trace the snapshots to get timing information.
If this happens with automatic snapshots, you may find that manual snapshots actually complete but take a long time to do so. Tracing with normal SQL trace will tend to just show high CPU usage:
begin
dbms_workload_repository.create_snapshot();
end;
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 485.98 485.98 0 1 34 1
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2 485.98 485.98 0 1 34 1
This is not particularly useful in terms of diagnostics. You need to get further into the processing to determine the source of the issue. The easiest way to trace the snapshots is the snapshot flush trace shown earlier:
Note: On RAC, please run the command on each instance
SQL> alter session set "_swrf_test_action" = 10;
Session altered.
This will write some information (about 120 rows per snapshot) into the M00x process trace, including some timing information.
The output will be similar to the following:
...
[Top SQL Selection], btime: 1961091529, etime:1961095597, Elapsed=4068
[WRH$_SQL_BIND_METADATA Attr Selection], btime: 1961095597, etime:1961095598, Elapsed=1
[WRH$_SQL_BIND_METADATA Update], btime: 1961095598, etime:1961095598, Elapsed=0
[WRH$_SQL_BIND_METADATA Insert], btime: 1961095598, etime:1961095599, Elapsed=1
[WRH$_SQL_BIND_METADATA Total], btime: 1961095597, etime:1961095599, Elapsed=2
[WRH$_SQL_PLAN Attr Selection], btime: 1961095599, etime:1961095601, Elapsed=2
[WRH$_SQL_PLAN Update], btime: 1961095601, etime:1961095602, Elapsed=1
[WRH$_SQL_PLAN Insert], btime: 1961095602, etime:1961095604, Elapsed=2
[WRH$_SQL_PLAN Total], btime: 1961095599, etime:1961095604, Elapsed=5
[WRH$_OPTIMIZER_ENV Attr Selection], btime: 1961095604, etime:1961095604, Elapsed=0
[WRH$_OPTIMIZER_ENV Update], btime: 1961095604, etime:1961095604, Elapsed=0
[WRH$_OPTIMIZER_ENV Insert], btime: 1961095604, etime:1961095604, Elapsed=0
[WRH$_OPTIMIZER_ENV Total], btime: 1961095604, etime:1961095604, Elapsed=0
[WRH$_SQLTEXT Attr Selection], btime: 1961095604, etime:1961095605, Elapsed=1
[WRH$_SQLTEXT Update], btime: 1961095605, etime:1961095605, Elapsed=0
[WRH$_SQLTEXT Insert], btime: 1961095605, etime:1961095605, Elapsed=0
[WRH$_SQLTEXT Total], btime: 1961095604, etime:1961095605, Elapsed=1
[WRH$_SQLSTAT All Stats], btime: 1961095605, etime:1961095618, Elapsed=13
[WRH$_SQLSTAT Total], btime: 1961095605, etime:1961095618, Elapsed=13
[WRH$_SQL_SUMMARY Total], btime: 1961095618, etime:1961095618, Elapsed=0
[SQL Group Total], btime: 1961091529, etime:1961095618, Elapsed=4089
[Snapshot Total], btime: 1961089445, etime:1961095618, Elapsed=6173
To interpret the output, look at the [Snapshot Total] at the end of the report and then see which line makes up the largest proportion of it. In this case, the [Top SQL Selection] step accounts for most of the time:
[Top SQL Selection], btime: 1961091529, etime:1961095597, Elapsed=4068
[WRH$_SQL_BIND_METADATA Attr Selection], btime: 1961095597, etime:1961095598, Elapsed=1
...
[WRH$_SQL_SUMMARY Total], btime: 1961095618, etime:1961095618, Elapsed=0
[SQL Group Total], btime: 1961091529, etime:1961095618, Elapsed=4089
[Snapshot Total], btime: 1961089445, etime:1961095618, Elapsed=6173
You can use this information to look for similar bugs or other occurrences.
If more details are required, then SQL tracing can be used to collect more information. There are two ways of getting SQL trace for snapshots. The best way is usually to set module/action-specific tracing for the two flush actions:
begin
dbms_monitor.serv_mod_act_trace_enable(service_name=>'SYS$BACKGROUND',
module_name=>'MMON_SLAVE',
action_name=>'Auto-Flush Slave Action');
dbms_monitor.serv_mod_act_trace_enable(service_name=>'SYS$BACKGROUND',
module_name=>'MMON_SLAVE',
action_name=>'Remote-Flush Slave Action');
end;
/
To disable the tracing later, use "dbms_monitor.serv_mod_act_trace_disable" instead.
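For example, mirroring the enable calls above:
begin
dbms_monitor.serv_mod_act_trace_disable(service_name=>'SYS$BACKGROUND',
module_name=>'MMON_SLAVE',
action_name=>'Auto-Flush Slave Action');
dbms_monitor.serv_mod_act_trace_disable(service_name=>'SYS$BACKGROUND',
module_name=>'MMON_SLAVE',
action_name=>'Remote-Flush Slave Action');
end;
/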
An alternative way is to use an AWR test event:
alter session set events 'immediate trace name awr_test level 3';
To disable that form of tracing later, use "awr_test level 4" instead.
For completeness, there are two other forms of tracing that can also be interesting.
# trace AWR snapshots
alter session set events 'immediate trace name awr_test level 1';
# trace AWR purging
alter session set events 'immediate trace name awr_test level 2';
# trace AWR SQL
alter session set events 'immediate trace name awr_test level 3';
# turn off all of the above, if set
alter session set events 'immediate trace name awr_test level 4';
Note: These commands will take effect for the entire instance, even though the syntax used is "alter session".
ADDM takes a significant amount of time to complete
There's one additional scenario where snapshots don't finish properly. At the end of the snapshot itself, ADDM (the Automatic Database Diagnostic Monitor) is started to analyze ASH data to identify potential performance issues. It is possible for this to take a significant amount of time to complete, preventing the snapshot from completing in a timely manner. To determine the cause, gather SQL trace for the process using the 10046 trace event and run tkprof to see if there are any unusual amounts of waits or any plans that look bad. See:
Document 376442.1 How To Collect 10046 Trace (sql_trace) Diagnostics for Performance Issues
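Because the flush is performed by a background process (MMON or an M00x worker), the 10046 trace usually has to be attached from another session. A sketch using oradebug, assuming the worker's OS process ID has been identified from V$PROCESS:
SQL> oradebug setospid <ospid>
SQL> oradebug event 10046 trace name context forever, level 12
(wait for a snapshot attempt to run)
SQL> oradebug event 10046 trace name context off
SQL> oradebug tracefile_name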
Workarounds
When you review the trace information, you might find a single table that seems to be causing a problem. If so, you can try disabling the flushing of that table to verify whether snapshots then complete normally. Here is an example of how to disable flushing of the table WRH$_TEMPSTATXS. Note that on RAC this command should be run on all instances in the cluster.
alter system set "_awr_disabled_flush_tables" = 'wrh$_tempstatxs';
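To re-enable flushing of the table afterwards, reset the parameter (an assumption based on standard parameter handling; verify the behaviour on your release before relying on it):
alter system reset "_awr_disabled_flush_tables";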
Complications
In some cases, the SQL trace will show a lot of 'reliable message' waits. That just means that the local node (the one that started the snapshot) is waiting for the remote nodes to complete. The real issue at that point will be in the trace files from the remote nodes.
In some cases, all local and remote nodes will have completed their snapshots successfully, but the local node doesn't realize that, and keeps waiting for 'reliable message' until it hits the 15-minute timeout. This is believed to be caused by a bug, but it has not yet been resolved.
If this is the case, each node will have a successfully completed snapshot (STATUS=0) with an elapsed time less than the timeout (FLUSH_ELAPSED < 00:15:00.00).
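A quick way to confirm that pattern is to compare the per-instance rows of the affected snapshot in WRM$_SNAPSHOT (the same table queried earlier; &snap_id is a substitution variable for the snapshot in question):
SQL> select instance_number, status, flush_elapsed
     from wrm$_snapshot
     where dbid = (select dbid from v$database)
     and snap_id = &snap_id
     order by instance_number;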
Timing information for each individual table flush
Release 11.2.0.2 introduces a new table, WRM$_SNAPSHOT_DETAILS, which contains timing information for each individual table flush.
This can be used to determine which table is taking a lot of time. For instance, sometimes a single table takes close to 15 minutes, causing the entire flush to fail.
SQL> set pagesize 999
SQL> column name format a28
SQL> column time format a29
SQL> variable dbid number
SQL> exec select dbid into :dbid from v$database;

PL/SQL procedure successfully completed.

SQL> variable snap_id number
SQL> exec select max(snap_id) into :snap_id from wrm$_snapshot where dbid=:dbid;

PL/SQL procedure successfully completed.
SQL> select table_name_kewrtb name, end_time-begin_time time
2 from wrm$_snapshot_details, x$kewrtb
3 where snap_id = :snap_id
4 and dbid = :dbid
5 and table_id = table_id_kewrtb
6 order by table_id;
In one observed case, WRH$_PARAMETER_NAME took a lot more time than the other tables, but still not enough to be a problem.
Please note that WRH$_SQLTEXT was flushed twice. That's expected behaviour.
Sometimes the following error may be encountered:
KEWROCISTMTEXEC - encountered error: (ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired
The following query identified the session responsible:
select inst_id,sid,serial#,status,state,event,sql_id,program,sql_exec_start,module,action from gv$session where program like '%M00%' order by inst_id;
7 3282 2154 ACTIVE WAITING direct path write 2g9g42n7afpfs oracle@<host> (M003) 2020-04-13 01:36 MMON_SLAVE ASH Emergency-Flush
Killing the M003 worker process resolved the issue and automatic snapshots resumed.
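In that example the session could be terminated with ALTER SYSTEM KILL SESSION, built from the SID, SERIAL# and instance number returned by the query above (the M00x workers are transient slaves and are respawned as needed):
SQL> alter system kill session '3282,2154,@7';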
Known Issues
Document 1392603.1 AWR or STATSPACK Snapshot collection extremely slow in 11gR2
Document 1466333.1 AWR Snapshots Missing Intermittently Accompanied by Spike in CPU from MMON Process
Document 2023652.1 AWR Snapshots Are Not Being Created Because MMON Is Not Being Respawned
Document 2043531.1 AWR Snapshots Fail to Generate With Associated "Suspending MMON action '%s' for 82800 seconds" alert log messages and Underlying ORA-12751 Errors
Document 1999635.1 AWR/ASH/ADDM Reports are not Generated and may Raise ORA-13707 After Changing control_management_pack_access from NONE to DIAGNOSTIC+TUNING Without Restarting the Database
Document 18148383.8 Bug 18148383 - AWR snapshots stopped, MMON hung with "EMON to process ntnfs"
Document 1668096.1 AWR Snapshots Failing with ORA-00001: unique constraint (SYS.wrh$_rule_set_pk) violated