该等待事件的 %DB Time 占比高于 10%
成为 Top 1 或 Top 2 等待事件
需要准确定位阻塞源(X 锁持有方),最大限度避免该类等待事件反复出现
二、问题现象确认
2.1 检查 DB Time 确认负载异常时段
小贴士: 执行前需确认 AWR 快照已开启,且问题时段有快照覆盖。可通过 SELECT snap_id, begin_interval_time FROM dba_hist_snapshot ORDER BY snap_id 确认。
col awrrpt_name for a56
col dtpc for 999999.99
col date_time for a18
WITH timemodel_tmp
AS (SELECT sn.begin_interval_time, sn.end_interval_time, ss.stat_name, cp.pdb_name,
os.value cpu_cnt, ss.VALUE e_value, ss.snap_id e_snap_id, i.instance_number,
LAG (ss.VALUE, 1) OVER (partition by ss.instance_number, ss.stat_name ORDER BY ss.snap_id) b_value,
LAG (ss.snap_id, 1) OVER (partition by ss.instance_number, ss.stat_name ORDER BY ss.snap_id) b_snap_id
FROM dba_hist_con_sys_time_model ss, dba_hist_snapshot sn, DBA_HIST_OSSTAT os,
(select dbid, instance_number, row_number()over(partition by dbid, instance_number order by STARTUP_TIME desc) rn from DBA_HIST_DATABASE_INSTANCE) i
, cdb_pdbs cp
WHERE sn.begin_interval_time between to_date('&begin_time','yyyy-mm-dd hh24:mi') and to_date('&end_time','yyyy-mm-dd hh24:mi')
AND ss.snap_id = sn.snap_id AND ss.dbid = sn.dbid
AND ss.snap_id = os.snap_id AND ss.dbid = os.dbid
AND ss.instance_number = sn.instance_number AND ss.instance_number = os.instance_number
AND ss.dbid = (SELECT dbid FROM v$database)
AND ss.instance_number = i.instance_number AND ss.dbid = i.dbid AND i.rn=1
AND ss.stat_name in ( 'DB time', 'DB CPU') AND os.stat_name = 'NUM_CPUS'),
timemodel as (select begin_interval_time, end_interval_time, pdb_name, cpu_cnt, instance_number,
max(case stat_name when 'DB time' then e_value end) dt_e_value, max(case stat_name when 'DB CPU' then e_value end) ct_e_value,
max(case stat_name when 'DB time' then e_snap_id end) e_snap_id,
max(case stat_name when 'DB time' then b_value end) dt_b_value, max(case stat_name when 'DB CPU' then b_value end) ct_b_value,
max(case stat_name when 'DB time' then b_snap_id end) b_snap_id
from timemodel_tmp
group by begin_interval_time, end_interval_time, pdb_name, cpu_cnt, instance_number),
candi_tmp as (
SELECT TO_CHAR (BEGIN_INTERVAL_TIME, 'mm-dd hh24:mi')
|| TO_CHAR (END_INTERVAL_TIME, '~hh24:mi') date_time,
ROUND ((dt_e_value - NVL (dt_b_value, 0))/1000000
/ ((cast (end_interval_time as date) - cast(begin_interval_time as date))*24*60*60), 2) db_times,
ROUND ((dt_e_value - NVL (dt_b_value, 0))/1000000
/ ((cast (end_interval_time as date) - cast(begin_interval_time as date))*24*60*60) / cpu_cnt, 2) db_times_per_cpu,
ROUND ((ct_e_value - NVL (ct_b_value, 0))/1000000
/ ((cast (end_interval_time as date) - cast(begin_interval_time as date))*24*60*60), 2) cpu_times,
ROUND ((ct_e_value - NVL (ct_b_value, 0))/1000000
/ ((cast (end_interval_time as date) - cast(begin_interval_time as date))*24*60*60) / cpu_cnt, 2) cpu_times_per_cpu,
b_snap_id, e_snap_id, BEGIN_INTERVAL_TIME, END_INTERVAL_TIME, pdb_name, instance_number
FROM timemodel
WHERE (dt_e_value - NVL (dt_b_value, 0)) > 0 AND NVL (dt_b_value, 0) > 0),
candi as (select t.*, row_number()over(partition by instance_number order by BEGIN_INTERVAL_TIME desc) rn
from candi_tmp t)
select date_time, instance_number inst, db_times dt, db_times_per_cpu dtpc, cpu_times ct, cpu_times_per_cpu ctpc, b_snap_id, e_snap_id,
'awr_'||pdb_name||instance_number||'_'||TO_CHAR (BEGIN_INTERVAL_TIME, 'yyyy-mm-dd-hh24mi')||TO_CHAR (END_INTERVAL_TIME, '~hh24mi')||'_'||to_char(db_times)||'_'||to_char(cpu_times)||'.html' awrrpt_name
from candi c
order by BEGIN_INTERVAL_TIME, instance_number;
注意: 该查询使用了 cdb_pdbs 视图,仅适用于 Oracle 12c 及以上版本的 CDB 架构。非 CDB 环境需移除 cdb_pdbs cp 相关的 JOIN 和 pdb_name 字段。
2.2 确认问题时段等待事件及对应 SQL
ASH 区间等待排名
col event for a35
col MACHINE for a35
col program for a40
select *
from (select /*+ parallel(a,4) */
a.event, a.machine, a.program, a.sql_id, count(*) as ev_cnt
from dba_hist_active_sess_history a
where a.sample_time between
to_date('&begin_time', 'yyyy-mm-dd hh24:mi') and
to_date('&end_time', 'yyyy-mm-dd hh24:mi')
and event != 'SQL*Net message from dblink'
group by a.event, a.machine, a.program, a.sql_id
having count(*) > 200)
order by ev_cnt desc, machine, sql_id;
小贴士: having count(*) > 200 是过滤低频事件的阈值,可根据实际情况调整。如果问题时段较短,建议降低阈值(如 > 50)以避免遗漏。
确认问题时段的等待事件
select a.event,
count(*) cnt,
round(avg(a.time_waited) / 1000, 0) avg_wait_ms,
round(max(a.time_waited) / 1000, 0) max_wait_ms
from gv$active_session_history a
where a.sample_time between
to_date('&begin_time', 'yyyy-mm-dd hh24:mi') and
to_date('&end_time', 'yyyy-mm-dd hh24:mi')
group by event
order by cnt desc;
注意: gv$active_session_history 仅保存在内存中,默认保留最近 1 小时数据。如果问题时段已过去,请使用 dba_hist_active_sess_history(基于 AWR 快照)。
等待事件对应 SQL、阻塞源会话查询
set lines 300 pages 999
col module for a20
col event for a25
col P1TEXT for a25
col P2TEXT for a25
col P3TEXT for a25
select INST_ID, sql_id, module, event, P1TEXT, P1, P2TEXT, P2, P3TEXT, P3, BLOCKING_INSTANCE, BLOCKING_SESSION, count(*) cnt
from gv$active_session_history
where sample_time between
to_date('&begin_time', 'yyyy-mm-dd hh24:mi') and
to_date('&end_time', 'yyyy-mm-dd hh24:mi')
and event = '&event'
group by INST_ID, sql_id, module, event, P1TEXT, P1, P2TEXT, P2, P3TEXT, P3, BLOCKING_INSTANCE, BLOCKING_SESSION
order by cnt desc;
P1/P2/P3 参数含义(library cache: mutex X):
步骤 1:查找等待会话的 SADDR
select sid, saddr from v$session where event = 'library cache lock';
步骤 2:查找等待会话请求的锁及对象
select kgllkhdl Handle, kgllkreq Request, kglnaobj Object
from x$kgllk
where kgllkses = '&saddr' -- 替换为步骤1查到的SADDR
and kgllkreq > 0;
输出说明:
列名 含义
HANDLE 对象的 handle 地址,与 Systemstate 中的 handle address 对应
REQUEST 请求的锁模式(>0 表示正在请求,2=S,3=X)
Object 对象名称的前 80 个字符
步骤 3:查找阻塞源会话
select kgllkses saddr, kgllkhdl handle, kgllkmod mod, kglnaobj object
from x$kgllk lock_a
where kgllkmod > 0
and exists (select lock_b.kgllkhdl
from x$kgllk lock_b
where kgllkses = '&blocked_saddr' -- 替换为被阻塞会话的SADDR
and lock_a.kgllkhdl = lock_b.kgllkhdl
and kgllkreq > 0);
kgllkmod 值含义: 1=NULL,2=S(共享),3=X(独占)
步骤 4:查询阻塞源会话详细信息
select sid, username, terminal, program
from v$session
where saddr = '&blocking_saddr'; -- 替换为步骤3查到的阻塞源SADDR
步骤 5(可选):查找被该阻塞源阻塞的所有会话
select sid, username, terminal, program
from v$session
where saddr in
(select kgllkses
from x$kgllk lock_a
where kgllkreq > 0
and exists (select lock_b.kgllkhdl
from x$kgllk lock_b
where kgllkses = '&blocking_saddr' -- 阻塞源的SADDR
and lock_a.kgllkhdl = lock_b.kgllkhdl
and kgllkreq = 0));
方法二:DIA0 Trace 文件分析(适用于 RAC 和单机)
小贴士: 此方法无需手动执行 dump,直接读取 DIA0 进程自动生成的 Trace 文件,资源消耗较低。
-- 查找 DIA0 进程的 Trace 文件路径
select value from v$diag_info where name = 'Diag Trace';
步骤 2:在 Trace 文件中搜索阻塞链
搜索关键字:Chains most likely to have caused the hang
步骤 6:根据 PID 查询阻塞源会话信息
alter session set nls_date_format = 'yyyy-mm-dd hh24:mi:ss';
set lines 1000
col machine for a25
col username for a10
col program for a18
col kill_spid for a30
col event for a40
col sql_id for a15
col wait_class for a15
select a.inst_id,
a.sid,
a.machine,
a.username,
a.program,
a.status,
a.logon_time,
a.sql_exec_start,
a.prev_sql_id,
a.sql_id,
a.state,
(case
when a.state = 'WAITING' then
a.seconds_in_wait
when a.state = 'WAITED KNOWN TIME' then
a.wait_time
else
NULL
end) wait_time_sec,
a.blocking_instance,
a.blocking_session,
a.wait_class,
a.event,
a.last_call_et,
'kill -9 ' || spid as kill_spid
from gv$session a, gv$process b
where a.inst_id = b.inst_id
and a.paddr = b.addr
and b.pid = &pid;
注意: kill -9 为 Linux 命令,Windows 环境下应使用 orakill [ORACLE_SID] [SPID]。
步骤 1:获取阻塞会话的 OS PID
select spid from v$process where addr =
(select paddr from v$session where sid = &blocking_sid);
步骤 2:生成 Errorstack
$ sqlplus '/ as sysdba'
oradebug setospid &spid -- 替换为步骤1查到的OS PID
oradebug dump errorstack 3
-- 等待 1 分钟
oradebug dump errorstack 3
-- 等待 1 分钟
oradebug dump errorstack 3
exit
步骤 3:分析 Trace 文件
生成的 Trace 文件的 stack 可用于匹配已知问题。系统状态和 errorstacks 不容易理解,可能需要开 SR 进行 Trace 分析。
四、library cache: mutex X 热对象检查
4.1 获取 Top 10 mutex X 等待的 P1 参数值
select *
from (select p1,
sql_id,
count(*),
(ratio_to_report(count(*)) over()) * 100 pct
from dba_hist_active_sess_history
where event = 'library cache: mutex X'
and snap_id between &begin_snapid and &end_snapid
and dbid = &dbid
group by p1, sql_id
order by count(*) desc)
where rownum <= 10;
4.2 根据 P1 值检查 Mutex 针对的对象
判断规则(参考 MOS Doc ID 1357946.1):