重庆思庄Oracle、KingBase、PostgreSQL、Redhat认证学习论坛

 找回密码
 注册

QQ登录

只需一步,快速开始

搜索
查看: 68|回复: 0
打印 上一主题 下一主题

PostgreSQL DBA 巡检必备:68 个常用 SQL 脚本

[复制链接]
跳转到指定楼层
楼主
发表于 2026-5-17 21:18:25 | 只看该作者 回帖奖励 |正序浏览 |阅读模式
前言
作为 PostgreSQL DBA,日常巡检是保障数据库稳定运行的核心工作之一。本文参考开源巡检脚本 pg_health_check.sh v1.0,将其内嵌的 68 个 SQL 语句按功能模块系统化整理,每个 SQL 均附带详细注释说明、输出字段解读及告警阈值建议。

全文共 16 个巡检模块,覆盖系统信息、基础信息、空间使用、连接与会话、性能采样、共享缓冲区、VACUUM 与膨胀、事务与年龄、WAL 与归档、锁等待、索引分析、用户权限、安全配置、错误日志、备份策略、健康总结等维度。


一、PostgreSQL 基础信息采集
1.1 查询数据库版本
-- 输出版本号、编译选项、操作系统信息等完整字符串
SELECT version();
输出示例:

PostgreSQL 14.12 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514, 64-bit
1.2 查询实例唯一标识
-- system_identifier 是初始化时生成的全局唯一标识
-- 用于确认主备是否来自同一实例、恢复后标识是否变化
SELECT system_identifier FROM pg_control_system();
小贴士: 该值在 pg_control 文件中,实例整个生命周期不变,可用于校验 PITR 恢复后的实例身份。

1.3 数据库列表及大小
-- datname: 数据库名
-- size: 可读格式的大小
-- conn_limit: 连接数上限 (-1 表示无限制)
-- age: 事务 ID 年龄(越大越接近回卷风险)
-- age_remain: 距离 2^31 上限的剩余空间
SELECT datname AS database_name,
    pg_size_pretty(pg_database_size(oid)) AS size,
    datconnlimit AS conn_limit,
    age(datfrozenxid) AS age,
    2^31 - age(datfrozenxid) AS age_remain
FROM pg_database
ORDER BY pg_database_size(oid) DESC;
关键指标说明:

字段        含义        告警阈值
age        事务年龄        > 2 亿 (2×10⁸) 需关注
age_remain        剩余可用事务 ID        < 5 亿需紧急处理
1.4 已安装扩展
-- 查看所有非系统内置扩展及其所属数据库和版本
SELECT d.datname AS database_name,
    e.extname AS extension_name,
    e.extversion AS version
FROM pg_extension e
JOIN pg_database d ON true
WHERE e.extname NOT IN ('pg_catalog', 'information_schema')
ORDER BY d.datname, e.extname;
1.5 用户定义的表空间
-- spcname: 表空间名称
-- location: 文件系统路径(NULL 表示默认表空间)
-- size: 表空间总占用大小
SELECT spcname AS tablespace_name,
    pg_tablespace_location(oid) AS location,
    pg_size_pretty(pg_tablespace_size(oid)) AS size
FROM pg_tablespace
ORDER BY pg_tablespace_size(oid) DESC;
1.6 用户修改过的参数(非默认值)
-- 筛选 source 不为 default 和 override 的参数
-- 即用户通过 postgresql.conf、ALTER SYSTEM 或 ALTER SET 修改过的参数
SELECT name AS parameter,
    setting AS current_value,
    unit,
    boot_val AS default_value,
    source
FROM pg_settings
WHERE source NOT IN ('default', 'override')
ORDER BY name;
DBA 必看: 这是快速了解实例有哪些自定义配置的最佳方式,建议每次巡检必查。

1.7 数据库级或角色级定制参数
-- 通过 ALTER DATABASE ... SET 或 ALTER ROLE ... SET 设置的参数
-- 这些参数仅对特定数据库或角色生效
SELECT d.datname AS database_name,
    r.rolname AS role_name,
    p.setconfig AS settings
FROM pg_db_role_setting p
LEFT JOIN pg_database d ON p.setdatabase = d.oid
LEFT JOIN pg_roles r ON p.setrole = r.oid;
1.8 对象统计概览
-- 统计用户创建的各类对象数量
-- relkind: r=普通表, v=视图, i=索引, S=序列, m=物化视图
SELECT '普通表' AS type, COUNT(*) AS cnt
FROM pg_class WHERE relkind = 'r' AND relnamespace > 11
UNION ALL
SELECT '视图', COUNT(*) FROM pg_class WHERE relkind = 'v' AND relnamespace > 11
UNION ALL
SELECT '索引', COUNT(*) FROM pg_class WHERE relkind = 'i' AND relnamespace > 11
UNION ALL
SELECT '序列', COUNT(*) FROM pg_class WHERE relkind = 'S' AND relnamespace > 11
UNION ALL
SELECT '物化视图', COUNT(*) FROM pg_class WHERE relkind = 'm' AND relnamespace > 11
UNION ALL
SELECT '触发器', COUNT(*) FROM pg_trigger WHERE tgisinternal = false;
二、空间使用分析
2.1 表空间大小排名
-- 查看所有表空间的磁盘占用情况
SELECT spcname AS tablespace,
    pg_size_pretty(pg_tablespace_size(oid)) AS total_size,
    CASE WHEN pg_tablespace_location(oid) IS NULL
         THEN 'default location' ELSE pg_tablespace_location(oid)
    END AS location
FROM pg_tablespace
ORDER BY pg_tablespace_size(oid) DESC;
2.2 数据库大小排名
-- 排除模板库,按大小降序排列
SELECT datname AS database_name,
    pg_size_pretty(pg_database_size(oid)) AS size,
    pg_database_size(oid) AS size_bytes
FROM pg_database
WHERE datname NOT IN ('template0', 'template1')
ORDER BY size_bytes DESC;
2.3 TOP 20 大表
-- 排除系统 schema,返回表大小和估算行数
-- reltuples 是优化器统计值,可能与实际行数有偏差(需 ANALYZE 后准确)
SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    pg_size_pretty(pg_relation_size(c.oid)) AS table_size,
    pg_relation_size(c.oid) AS size_bytes,
    c.reltuples::bigint AS estimated_rows
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
ORDER BY pg_relation_size(c.oid) DESC
LIMIT 20;
2.4 TOP 20 大索引
-- 返回索引所属表、索引名、大小等信息
-- 通过 pg_index 关联索引与基表的关系
SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    i.relname AS index_name,
    pg_size_pretty(pg_relation_size(i.oid)) AS index_size,
    pg_relation_size(i.oid) AS size_bytes
FROM pg_class i
JOIN pg_namespace n ON n.oid = i.relnamespace
JOIN pg_index ix ON ix.indexrelid = i.oid
JOIN pg_class c ON c.oid = ix.indrelid
WHERE i.relkind = 'i'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
ORDER BY pg_relation_size(i.oid) DESC
LIMIT 20;
2.5 表膨胀率概览(索引/表大小比)
-- index_to_table_ratio 反映索引占表的比例
-- 比例过高可能存在冗余索引,过低可能缺少必要索引
-- 仅统计大于 1MB 的表
SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    pg_size_pretty(pg_relation_size(c.oid)) AS current_size,
    ROUND(100.0 * pg_relation_size(c.oid) /
        NULLIF(pg_total_relation_size(c.oid) - pg_relation_size(c.oid), 0), 2) AS index_to_table_ratio
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
  AND pg_relation_size(c.oid) > 1024 * 1024
ORDER BY pg_relation_size(c.oid) DESC
LIMIT 20;
三、连接与会话分析
3.1 当前连接状态汇总
-- state 可能值: active, idle, idle in transaction,
--   idle in transaction (aborted), fastpath function call, disabled
SELECT
    COALESCE(state, 'unknown') AS state,
    COUNT(*) AS connection_count
FROM pg_stat_activity
GROUP BY state
ORDER BY connection_count DESC;
状态含义速查:

state        含义        是否需要关注
active        正在执行查询        正常
idle        空闲等待新命令        正常
idle in transaction        事务中空闲        ⚠️ 可能持有锁
idle in transaction (aborted)        事务中出错未回滚        &#128308; 需立即处理
3.2 连接详情(活跃前 20,按等待时间排序)
-- seconds_in_state: 在当前状态持续的时间(秒)
-- current_query: 截取前 100 字符的当前 SQL
SELECT
    pid,
    usename AS username,
    datname AS database,
    state,
    client_addr,
    application_name,
    EXTRACT(EPOCH FROM (now() - state_change))::int AS seconds_in_state,
    LEFT(REPLACE(query, chr(10), ' '), 100) AS current_query
FROM pg_stat_activity
WHERE state IS NOT NULL
ORDER BY state_change ASC
LIMIT 20;
3.3 连接数统计与余量
-- max_connections: 最大允许连接数
-- reserved_for_superuser: 为超级用户保留的连接槽位
-- available_for_normal: 普通用户剩余可用连接数
SELECT
    MAX(setting)::int AS max_connections,
    (SELECT COUNT(*) FROM pg_stat_activity) AS current_connections,
    (SELECT setting FROM pg_settings WHERE name = 'superuser_reserved_connections')::int AS reserved_for_superuser,
    MAX(setting)::int - (SELECT COUNT(*) FROM pg_stat_activity) AS available_for_normal
FROM pg_settings
WHERE name = 'max_connections';
告警建议: 当 available_for_normal < 10 时应关注,< 0 则意味着普通用户已无法新建连接。

3.4 各用户连接数与限额
-- rolconnlimit: 角色级别的连接数限制 (-1 表示不限制)
-- current_connects: 当前该用户的实际连接数
SELECT
    a.rolname AS username,
    a.rolconnlimit AS conn_limit,
    COALESCE(b.connects, 0) AS current_connects
FROM pg_authid a
LEFT JOIN (
    SELECT usename, COUNT(*) AS connects
    FROM pg_stat_activity
    GROUP BY usename
) b ON a.rolname = b.usename
WHERE a.rolname NOT IN ('postgres')
ORDER BY b.connects DESC NULLS LAST;
3.5 各数据库连接数与限额
-- 类似用户维度,从数据库维度查看连接分布
SELECT
    a.datname AS database_name,
    a.datconnlimit AS conn_limit,
    COALESCE(b.connects, 0) AS current_connects
FROM pg_database a
LEFT JOIN (
    SELECT datname, COUNT(*) AS connects
    FROM pg_stat_activity
    GROUP BY datname
) b ON a.datname = b.datname
WHERE a.datname NOT IN ('template0', 'template1')
ORDER BY b.connects DESC NULLS LAST;
3.6 长时间空闲连接(空闲超 5 分钟)
-- 空闲超过 5 分钟的连接可能是应用连接池泄漏
-- 可考虑配置 idle_in_transaction_session_timeout 自动清理
SELECT
    pid,
    usename AS username,
    datname AS database,
    client_addr,
    application_name,
    EXTRACT(EPOCH FROM (now() - state_change))::int AS idle_seconds
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '5 minutes'
ORDER BY state_change ASC
LIMIT 20;
四、性能采样分析(两次快照对比)
原理说明: 本模块通过前后两次采样 pg_stat_database 的差值计算实时 QPS、TPS 和命中率,类似 Oracle AWR 快照机制。

4.1 采样 — 缓冲区块访问总量
-- 总缓冲区访问次数(命中 + 未命中)
-- 需间隔 10~30 秒采集两次,做差计算增量
SELECT blks_hit + blks_read FROM pg_stat_database WHERE datname = current_database();
4.2 采样 — 缓冲区命中次数
-- 仅缓冲区命中次数,与 4.1 配合计算命中率
SELECT blks_hit FROM pg_stat_database WHERE datname = current_database();
4.3 采样 — 元组变更总量
-- 插入 + 更新 + 删除的元组总数(累计值)
SELECT (n_tup_ins + n_tup_upd + n_tup_del) FROM pg_stat_database WHERE datname = current_database();
4.4 采样 — 事务提交回滚总数
-- 提交 + 回滚的事务总数(累计值)
SELECT (xact_commit + xact_rollback) FROM pg_stat_database WHERE datname = current_database();
使用方式: 间隔 10~30 秒后重复执行 4.1 ~ 4.4,通过差值计算:

Buffer Hit Ratio = hit_delta / blks_delta,建议 > 99%
TPS = xact_delta / interval_seconds
Tuple Changes/sec = tup_delta / interval_seconds
4.5 数据库级性能指标汇总
-- 单次查询即可获取当前数据库的关键性能累计指标
-- 注意是累计值,需两次采样做差才有实时意义
SELECT
    datname,
    xact_commit,
    xact_rollback,
    blks_read,
    blks_hit,
    ROUND(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) AS cache_hit_ratio,
    tup_returned, tup_fetched, tup_inserted, tup_updated, tup_deleted,
    conflicts,
    deadlocks,
    blk_read_time::int AS read_ms,
    blk_write_time::int AS write_ms,
    temp_files,
    temp_bytes
FROM pg_stat_database
WHERE datname = current_database();
关键指标告警阈值:

指标        建议范围        告警条件
cache_hit_ratio        > 99%        < 95% 需增加 shared_buffers
deadlocks        0        > 0 需排查业务逻辑
conflicts        低        备库冲突过高需关注
temp_files        0        大量临时文件说明 work_mem 不足
五、共享缓冲区与后台写入(BGWriter)
5.1 内存与 IO 相关关键参数
-- 这些参数直接影响内存使用和磁盘写入策略
SELECT name, setting, unit, short_desc
FROM pg_settings
WHERE name IN (
    'shared_buffers',           -- 共享缓冲区大小
    'effective_cache_size',     -- 优化器估计的可用缓存
    'work_mem',                 -- 排序/Hash 操作内存上限
    'maintenance_work_mem',     -- 维护操作(VACUUM/CREATE INDEX)内存
    'temp_buffers',             -- 临时表使用的缓冲区
    'max_connections',          -- 最大连接数
    'bgwriter_delay',           -- BGWriter 轮询间隔(ms)
    'bgwriter_lru_maxpages',    -- BGWriter 每次 LRU 刷写的最大页数
    'bgwriter_lru_multiplier',  -- BGWriter 刷写倍率
    'bgwriter_flush_after',     -- BGWriter 累积刷写字节阈值
    'checkpoint_completion_target', -- 检查点完成目标
    'checkpoint_timeout',       -- 检查点间隔时间
    'min_wal_size',             -- WAL 最小大小
    'max_wal_size',             -- WAL 最大大小
    'wal_buffers'               -- WAL 缓冲区大小
)
ORDER BY name;
参数调优建议:

参数        推荐值        说明
shared_buffers        物理内存的 25%        不超过 40%,Linux 上受 shmmax 限制
effective_cache_size        物理内存的 50%~75%        仅供优化器参考,不实际分配
work_mem        (shared_buffers / max_connections) × 0.25        并发高时需适当降低
maintenance_work_mem        256MB~1GB        VACUUM 和 CREATE INDEX 使用
5.2 BGWriter 统计信息
-- 查看后台写入进程的完整统计信息
-- checkpoints_timed: 计划内检查点次数
-- checkpoints_req:  请求触发的检查点次数(过多说明 WAL 写入过快)
-- buffers_checkpoint: 检查点刷写的缓冲区数
-- buffers_clean: BGWriter 主动刷写的缓冲区数
-- buffers_backend: 后端进程自行刷写的缓冲区数(过高说明 BGWriter 不够积极)
SELECT * FROM pg_stat_bgwriter;
进阶技巧: buffers_backend 过高意味着后端进程经常需要自己刷脏页,说明 BGWriter 不够积极。可适当调低 bgwriter_delay 或调高 bgwriter_lru_maxpages。

5.3 数据库级缓冲区命中统计
-- 逐库查看缓存命中率、死锁、IO 时间等
SELECT
    s.datname AS database,
    s.numbackends AS backends,
    s.xact_commit AS commits,
    s.xact_rollback AS rollbacks,
    s.blks_read,
    s.blks_hit,
    ROUND(100.0 * s.blks_hit / NULLIF(s.blks_hit + s.blks_read, 0), 2) AS cache_hit_ratio,
    s.tup_returned, s.tup_fetched, s.tup_inserted, s.tup_updated, s.tup_deleted,
    s.conflicts,
    s.deadlocks,
    s.blk_read_time::int AS read_ms,
    s.blk_write_time::int AS write_ms,
    s.temp_files AS temp_file_count,
    s.temp_bytes AS temp_bytes
FROM pg_stat_database s
WHERE s.datname NOT IN ('template0', 'template1')
ORDER BY s.datname;
六、VACUUM 与膨胀分析
6.1 Autovacuum 配置参数
-- 检查自动清理是否开启及相关阈值设置
SELECT name, setting, short_desc
FROM pg_settings
WHERE name IN (
    'autovacuum',                    -- 是否开启自动清理
    'autovacuum_max_workers',        -- 自动清理工作进程数
    'autovacuum_naptime',            -- 自动清理轮询间隔
    'autovacuum_vacuum_threshold',   -- 触发 VACUUM 的死元组数阈值
    'autovacuum_vacuum_scale_factor',-- 触发 VACUUM 的死元组比例
    'autovacuum_analyze_threshold',  -- 触发 ANALYZE 的变更行数阈值
    'autovacuum_analyze_scale_factor',-- 触发 ANALYZE 的变更比例
    'autovacuum_vacuum_cost_delay',  -- 自动清理的延迟(ms)
    'autovacuum_vacuum_cost_limit',  -- 自动清理的成本限制
    'vacuum_cleanup_index_scale_factor' -- 索引清理的比例因子
)
ORDER BY name;
关键参数建议:

参数        默认值        建议值        说明
autovacuum        on        on        生产环境必须开启
autovacuum_vacuum_scale_factor        0.2        0.05~0.1        大表建议更低
autovacuum_max_workers        3        4~6        表多时适当增加
autovacuum_naptime        60s        30s~60s        高频写入库可缩短
6.2 表膨胀 Top15(需 VACUUM)
-- dead_ratio_pct: 死元组占活元组的百分比
-- 值过高说明 VACUUM 不及时,需要关注
SELECT
    schemaname,
    relname AS table_name,
    n_dead_tup AS dead_tuples,
    n_live_tup AS live_tuples,
    CASE WHEN n_live_tup > 0
         THEN ROUND(100.0 * n_dead_tup / n_live_tup, 2)
         ELSE 0
    END AS dead_ratio_pct,
    last_vacuum,
    last_autovacuum,
    vacuum_count,
    autovacuum_count
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC
LIMIT 15;
告警阈值: dead_ratio_pct > 10% 需要关注,> 50% 应立即 VACUUM。

6.3 未及时 VACUUM 的表
-- 超过 7 天未执行 VACUUM 且死元组 > 10000 的表
-- "从未 VACUUM" 的标记说明自动清理从未触及该表
SELECT
    schemaname,
    relname AS table_name,
    n_dead_tup,
    last_autovacuum,
    CASE WHEN last_autovacuum IS NULL AND last_vacuum IS NULL
         THEN '从未 VACUUM'
         ELSE ''
    END AS note
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
  AND (last_autovacuum IS NULL OR last_autovacuum < now() - interval '7 days')
  AND (last_vacuum IS NULL OR last_vacuum < now() - interval '7 days')
ORDER BY n_dead_tup DESC
LIMIT 10;
6.4 表膨胀详细分析(膨胀倍数 Top10)
-- bloat_ratio 反映索引空间与表空间的比例
-- 比例异常高说明索引可能严重膨胀
-- 仅分析大于 10MB 的表
SELECT
    schemaname,
    relname AS table_name,
    pg_size_pretty(pg_relation_size(relid)) AS table_size,
    ROUND(100.0 * pg_relation_size(relid) /
        NULLIF(pg_total_relation_size(relid) - pg_relation_size(relid), 0), 2) AS bloat_ratio
FROM pg_stat_user_tables
WHERE pg_relation_size(relid) > 10 * 1024 * 1024
ORDER BY bloat_ratio DESC NULLS LAST
LIMIT 10;
6.5 索引膨胀 Top10
-- 扫描次数少于 100 次但体积超过 1MB 的索引
-- avg_bytes_per_scan: 每次扫描平均消耗字节数,过大说明索引效率低
SELECT
    n.nspname AS schemaname,
    i.relname AS index_name,
    c.relname AS table_name,
    pg_size_pretty(pg_relation_size(i.oid)) AS index_size,
    s.idx_scan,
    s.idx_tup_read,
    s.idx_tup_fetch,
    pg_relation_size(i.oid) / NULLIF(s.idx_scan + 1, 0) AS avg_bytes_per_scan
FROM pg_stat_user_indexes s
JOIN pg_class i ON i.oid = s.indexrelid
JOIN pg_class c ON c.oid = s.relid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE s.idx_scan < 100
  AND pg_relation_size(i.oid) > 1024 * 1024
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(i.oid) DESC
LIMIT 10;
VACUUM 模块建议:

Dead tuples 过多会导致查询变慢,建议调整 autovacuum_vacuum_scale_factor(推荐 0.05~0.1)
大表建议设置单独的 autovacuum_vacuum_cost_delay 或手工定期 VACUUM
未使用索引占用空间且影响写入性能,建议评估后删除
七、数据库年龄与事务分析
7.1 数据库年龄
-- current_age: 事务 ID 年龄,即距离冻结点的距离
-- age_remain: 距离 2^31 上限的剩余事务数
-- 当 age 接近 2^31 时会发生事务 ID 回卷,数据库将只读
SELECT
    datname AS database_name,
    age(datfrozenxid) AS current_age,
    2^31 - age(datfrozenxid) AS age_remain,
    CASE WHEN age(datfrozenxid) > 2^30
         THEN '⚠️ 告警: 年龄即将耗尽'
         WHEN age(datfrozenxid) > 2^29
         THEN '⚠️ 警告: 年龄接近上限'
         ELSE '正常'
    END AS status
FROM pg_database
WHERE datname NOT IN ('template0', 'template1')
ORDER BY age(datfrozenxid) DESC;
&#128308; XID 回卷风险是 PostgreSQL 最严重的故障之一! 当 age_remain 接近 0 时,数据库将停止接受写入。务必定期检查此指标。

7.2 表年龄 Top10
-- 表级别的冻结年龄,age > 100 万才显示
-- last_vacuum_time: 最近一次 VACUUM(含自动)的时间
SELECT
    n.nspname AS schemaname,
    c.relname AS table_name,
    c.relkind,
    age(c.relfrozenxid) AS current_age,
    2^31 - age(c.relfrozenxid) AS age_remain,
    GREATEST(s.last_vacuum, s.last_autovacuum) AS last_vacuum_time
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
LEFT JOIN pg_stat_user_tables s ON s.relid = c.oid
WHERE c.relkind IN ('r', 't', 'm')
  AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
  AND age(c.relfrozenxid) > 1000000
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;
7.3 长事务(运行超 30 分钟)
-- 长事务会阻止 VACUUM 清理死元组,是膨胀的主要诱因
-- backend_xid / backend_xmin: 表示该后端正在持有事务快照
SELECT
    pid,
    usename AS username,
    datname AS database,
    state,
    EXTRACT(EPOCH FROM (now() - xact_start))::int / 60 AS xact_minutes,
    EXTRACT(EPOCH FROM (now() - query_start))::int / 60 AS query_minutes,
    LEFT(REPLACE(query, chr(10), ' '), 120) AS query
FROM pg_stat_activity
WHERE state NOT IN ('idle')
  AND (backend_xid IS NOT NULL OR backend_xmin IS NOT NULL)
  AND xact_start < now() - interval '30 minutes'
ORDER BY xact_start;
处理建议: 确认后可使用 SELECT pg_terminate_backend(pid) 终止长事务,同时排查应用代码中的事务边界问题。

7.4 两阶段提交(2PC)预提交事务
-- 2PC 事务长时间未提交会阻止 VACUUM,导致表膨胀
-- prepared_minutes: 预提交事务已等待的分钟数
SELECT
    transaction,
    gid,
    prepared,
    owner,
    database AS db_name,
    EXTRACT(EPOCH FROM (now() - prepared)) / 60 AS prepared_minutes
FROM pg_prepared_xacts
ORDER BY prepared;
7.5 复制槽位(可能导致 WAL 积压)
-- 不活跃的复制槽会阻止 WAL 文件回收,导致 pg_wal 膨胀
-- lag_bytes: 当前 WAL 位置与槽位重启 LSN 的字节差
SELECT
    slot_name,
    plugin,
    slot_type,
    COALESCE(d.datname, 'N/A') AS database_name,
    active,
    restart_lsn,
    confirmed_flush_lsn,
    CASE WHEN restart_lsn IS NOT NULL AND pg_is_in_recovery() = false
         THEN pg_size_pretty(
             pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
         )
         ELSE NULL
    END AS lag_bytes
FROM pg_replication_slots rs
LEFT JOIN pg_database d ON rs.database::oid = d.oid;
⚠️ 重要: active = false 的复制槽是最常见的 WAL 膨胀原因,必须及时清理:SELECT pg_drop_replication_slot('slot_name');

八、WAL 与归档分析
8.1 WAL 配置参数
-- 核心参数解读:
--   wal_level: replica 支持流复制,logical 支持逻辑解码
--   archive_mode: on 才能实现 PITR
--   max_wal_size: 控制 WAL 总量上限,过小导致频繁检查点
SELECT name, setting, short_desc
FROM pg_settings
WHERE name IN (
    'wal_level', 'max_wal_senders', 'max_replication_slots',
    'wal_keep_size', 'min_wal_size', 'max_wal_size',
    'archive_mode', 'archive_command', 'archive_timeout',
    'wal_compression', 'wal_log_hints', 'wal_init_zero'
)
ORDER BY name;
8.2 归档统计
-- archived_count: 已成功归档的 WAL 数
-- failed_count: 归档失败次数(非 0 需立即排查)
-- last_archived_wal / last_archived_time: 最近归档信息
SELECT * FROM pg_stat_archiver;
8.3 当前 WAL 位置
-- current_lsn: 当前写入的 WAL 位置
-- current_wal_file: 当前正在写入的 WAL 文件名
-- bytes_in_wal: 自起始以来的 WAL 字节数
SELECT
    pg_current_wal_lsn() AS current_lsn,
    pg_walfile_name(pg_current_wal_lsn()) AS current_wal_file,
    pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0'::pg_lsn) AS bytes_in_wal;
8.4 流复制状态(主库视角)
-- 主库查看所有备库的复制延迟
-- lag_bytes: 主库已发送但备库尚未回放的 WAL 字节数
-- sync_state: sync=同步, async=异步, potential=候选同步
SELECT
    client_addr,
    usename,
    application_name,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
    sync_state
FROM pg_stat_replication;
备库视角:

-- 在备库上执行,查看 WAL 接收和回放状态
-- replay_lag: 接收到回放的延迟
SELECT
    pid,
    status,
    receive_start_lsn,
    received_lsn,
    last_msg_send_time,
    last_msg_receipt_time,
    latest_end_lsn,
    slot_name,
    pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replay_lag
FROM pg_stat_wal_receiver;
九、锁等待分析
9.1 当前锁统计
-- 按 locktype / relation / mode 分组统计锁数量
-- locktype 类型: relation, extend, page, tuple, transactionid, virtualxid, object, userlock, advisory
SELECT
    locktype,
    database::regclass AS database,
    relation::regclass AS relation,
    page, tuple,
    virtualxid, transactionid,
    granted,
    mode,
    COUNT(*) AS lock_count
FROM pg_locks
WHERE locktype NOT IN ('virtualxid')
GROUP BY 1,2,3,4,5,6,7,8,9
ORDER BY lock_count DESC, granted
LIMIT 30;
9.2 等待中的锁(谁阻塞了谁)
-- 找到被阻塞的会话及其阻塞源
-- waiting_pid: 被阻塞的会话 PID
-- blocking_pid: 持有锁的会话 PID
SELECT
    w.pid AS waiting_pid,
    w.locktype AS waiting_locktype,
    COALESCE(w.relation::regclass::text, w.transactionid::text, w.virtualxid) AS waiting_object,
    w.mode AS waiting_mode,
    w.granted AS waiting_granted,
    b.pid AS blocking_pid,
    b.usename AS blocking_user,
    LEFT(REPLACE(b.query, chr(10), ' '), 80) AS blocking_query
FROM pg_locks w
JOIN pg_stat_activity b ON b.pid = w.pid
WHERE NOT w.granted
ORDER BY w.pid;
9.3 长时间持锁(超 10 秒)
-- 已授予但持锁时间超 10 秒的会话
-- 可能影响其他事务的并发性能
SELECT
    l.pid,
    l.locktype,
    l.mode,
    l.relation::regclass,
    l.granted,
    a.usename,
    a.datname,
    a.query,
    EXTRACT(EPOCH FROM (now() - a.xact_start))::int AS xact_seconds,
    EXTRACT(EPOCH FROM (now() - a.query_start))::int AS query_seconds
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE l.granted
  AND a.query_start < now() - interval '10 seconds'
ORDER BY a.query_start;
9.4 死锁检测
-- 查看各数据库的死锁累计次数
-- 仅显示死锁数 > 0 的数据库
SELECT
    datname,
    SUM(deadlocks)::int AS deadlock_count
FROM pg_stat_database
GROUP BY datname
HAVING SUM(deadlocks) > 0;
死锁排查建议: 死锁通常是应用事务逻辑问题(交叉更新、缺少 SELECT FOR UPDATE 等),需在应用层修复。deadlocks 是累计值,重启后清零。

十、索引分析
10.1 未使用索引 Top20
-- idx_scan = 0 说明自上次统计重置后从未使用过
-- 体积 > 64KB 才显示(过滤小索引)
-- 这些索引只占用空间、拖慢 DML,建议评估删除
SELECT
    n.nspname AS schemaname,
    c.relname AS table_name,
    i.relname AS index_name,
    pg_size_pretty(pg_relation_size(i.oid)) AS index_size,
    s.idx_scan,
    s.idx_tup_read,
    s.idx_tup_fetch
FROM pg_stat_user_indexes s
JOIN pg_class i ON i.oid = s.indexrelid
JOIN pg_class c ON c.oid = s.relid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE s.idx_scan = 0
  AND pg_relation_size(i.oid) > 65536
  AND i.relname NOT LIKE 'pg_toast_%'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(i.oid) DESC
LIMIT 20;
10.2 索引使用率偏低(扫描 1~100 次但占用空间大)
-- bytes_per_scan: 每次扫描消耗的字节数
-- 体积 > 10MB 且扫描极少的索引性价比极低
SELECT
    n.nspname AS schemaname,
    c.relname AS table_name,
    i.relname AS index_name,
    pg_size_pretty(pg_relation_size(i.oid)) AS index_size,
    s.idx_scan,
    pg_relation_size(i.oid) / NULLIF(s.idx_scan, 0) AS bytes_per_scan
FROM pg_stat_user_indexes s
JOIN pg_class i ON i.oid = s.indexrelid
JOIN pg_class c ON c.oid = s.relid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE s.idx_scan BETWEEN 1 AND 100
  AND pg_relation_size(i.oid) > 10 * 1024 * 1024
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(i.oid) DESC
LIMIT 10;
10.3 重复索引
-- 查找同一列集上有多个相同定义的非唯一、非主键索引
-- 重复索引纯属浪费,应删除多余的那个
SELECT
    t.relname AS table_name,
    array_agg(DISTINCT i.indexrelid::regclass) AS duplicate_indexes,
    pg_get_indexdef(i.indexrelid) AS index_definition,
    COUNT(*) AS count
FROM pg_index i
JOIN pg_class t ON t.oid = i.indrelid
WHERE i.indisunique = false
  AND i.indisprimary = false
GROUP BY t.relname, i.indexrelid, pg_get_indexdef(i.indexrelid)
HAVING COUNT(*) > 1;
10.4 无效索引
-- indisvalid = false 的索引是 CONCURRENTLY 创建失败后遗留的
-- 不会被查询使用,但仍然占用磁盘空间
-- 处理: DROP INDEX CONCURRENTLY index_name; 然后重建
SELECT
    n.nspname AS schemaname,
    c.relname AS table_name,
    i.relname AS invalid_index_name,
    ix.indexrelid::regclass AS invalid_index
FROM pg_index ix
JOIN pg_class i ON i.oid = ix.indexrelid
JOIN pg_class c ON c.oid = ix.indrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE ix.indisvalid = false
  AND n.nspname NOT IN ('pg_catalog', 'information_schema');
10.5 索引过多的表(> 5 个索引)
-- 索引过多会加重写入开销和表膨胀
-- 建议单表索引数控制在 5 个以内
SELECT
    n.nspname AS schemaname,
    c.relname AS table_name,
    COUNT(*) AS index_count,
    pg_size_pretty(SUM(pg_relation_size(i.oid))) AS total_index_size,
    pg_size_pretty(pg_relation_size(c.oid)) AS table_size
FROM pg_stat_user_indexes s
JOIN pg_class i ON i.oid = s.indexrelid
JOIN pg_class c ON c.oid = s.relid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
GROUP BY n.nspname, c.relname, c.oid
HAVING COUNT(*) > 5
ORDER BY COUNT(*) DESC
LIMIT 15;
十一、用户与权限审计
11.1 用户列表
-- rolsuper: 是否超级用户
-- rolcanlogin: 是否允许登录
-- rolvaliduntil: 密码有效期
-- password_status: 自动判断密码是否过期/即将过期
SELECT
    rolname AS username,
    rolsuper AS is_superuser,
    rolinherit AS inherits,
    rolcreaterole AS can_create_role,
    rolcreatedb AS can_create_db,
    rolcanlogin AS can_login,
    rolreplication AS is_replication,
    rolconnlimit AS conn_limit,
    rolvaliduntil AS valid_until,
    CASE WHEN rolvaliduntil < now() THEN '⚠️ 已过期'
         WHEN rolvaliduntil < now() + interval '30 days' THEN '⚠️ 即将过期'
         ELSE '正常'
    END AS password_status
FROM pg_roles
WHERE rolname NOT IN ('pg_read_all_settings', 'pg_read_all_stats',
                      'pg_stat_scan_tables', 'pg_signal_backend')
ORDER BY rolcanlogin DESC, rolname;
11.2 超级用户检查
-- 生产环境应尽量减少超级用户数量
-- 应用账号不应使用 superuser 权限
SELECT
    rolname AS superuser_name,
    oid,
    rolvaliduntil
FROM pg_authid
WHERE rolsuper = true
ORDER BY rolname;
11.3 密码过期检查
-- 检查所有可登录用户的密码有效期
-- 永不过期(NULL) 的用户建议设置有效期
SELECT
    rolname AS username,
    rolvaliduntil,
    CASE WHEN rolvaliduntil IS NULL THEN '永不过期'
         WHEN rolvaliduntil < now() THEN '⚠️ 已过期'
         WHEN rolvaliduntil < now() + interval '30 days' THEN '⚠️ 30天内过期'
         ELSE '正常'
    END AS status
FROM pg_roles
WHERE rolcanlogin = true
  AND rolname != 'postgres'
ORDER BY rolvaliduntil;
11.4 public 权限检查
-- 检查是否有对 public 开放了非默认权限
-- public 权限过宽是常见安全隐患
SELECT
    n.nspname AS schema_name,
    c.relname AS object_name,
    c.relkind AS object_type,
    '非默认权限' AS privilege_type
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_default_acl d ON d.defaclnamespace = n.oid
WHERE d.defaclacl::text ~ '.*=.*/.*'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema');
11.5 默认权限检查
-- 查看 ALTER DEFAULT PRIVILEGES 设置的默认权限
-- 确保新建对象不会自动继承过宽权限
SELECT n.nspname, d.defaclnamespace::regnamespace AS schema_name,
    d.defaclobjtype, d.defaclacl
FROM pg_default_acl d
JOIN pg_namespace n ON n.oid = d.defaclnamespace
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
LIMIT 20;
十二、安全配置检查
12.1 安全相关参数
-- SSL、密码加密、日志审计等安全参数一览
SELECT name, setting, short_desc
FROM pg_settings
WHERE name IN (
    'ssl', 'ssl_ca_file', 'ssl_cert_file', 'ssl_key_file', 'ssl_ciphers',
    'password_encryption', 'bcrypt_rounds',
    'shared_preload_libraries',
    'log_connections', 'log_disconnections', 'log_duration',
    'log_line_prefix', 'log_statement', 'log_min_duration_statement',
    'log_checkpoints', 'log_lock_waits', 'log_temp_files',
    'row_security', 'enable_row_security'
)
ORDER BY name;
12.2 SSL 配置检查
-- 当 ssl = on 时查看 SSL 证书配置
-- 生产环境强烈建议开启 SSL 加密
SELECT name, setting FROM pg_settings
WHERE name LIKE 'ssl_%' AND setting != '';
12.3 密码加密方式
-- password_encryption 应为 scram-sha-256
-- md5 已不安全,新版本默认 scram-sha-256
-- 升级步骤: SET password_encryption = 'scram-sha-256'; 然后 ALTER USER xxx PASSWORD 'xxx';
SELECT name, setting FROM pg_settings
WHERE name = 'password_encryption';
安全加固清单:

检查项        推荐值        风险等级
ssl        on        &#128308; 生产必须
password_encryption        scram-sha-256        &#128308; md5 已不安全
log_connections        on        &#128993; 审计需要
log_disconnections        on        &#128993; 审计需要
log_statement        ddl        &#128993; 至少记录 DDL
log_min_duration_statement        1000 (ms)        &#128994; 慢查询审计
十三、错误与慢查询日志
13.1 日志配置参数
-- logging_collector: 是否开启日志收集器
-- log_min_duration_statement: 慢查询阈值(-1 表示关闭)
-- log_rotation_age/size: 日志轮转策略
SELECT name, setting, short_desc
FROM pg_settings
WHERE name IN (
    'log_destination', 'logging_collector', 'log_directory',
    'log_filename', 'log_rotation_age', 'log_rotation_size',
    'log_min_messages', 'log_min_error_statement', 'log_min_duration_statement',
    'log_connections', 'log_disconnections', 'log_duration',
    'log_line_prefix', 'log_statement', 'log_temp_files',
    'log_lock_waits', 'log_checkpoints', 'log_autovacuum_min_duration'
)
ORDER BY name;
13.2 pg_stat_statements Top 20(需安装扩展)
-- 按 total_exec_time 降序排列,定位最耗时的 SQL
-- calls: 执行次数
-- mean_exec_time: 平均执行时间(ms)
-- shared_blks_read/hit: 共享缓冲区读写块数
-- 需先: CREATE EXTENSION pg_stat_statements;
-- 并在 shared_preload_libraries 中加入 pg_stat_statements
SELECT
    LEFT(query, 150) AS query,
    calls,
    total_exec_time,
    mean_exec_time,
    max_exec_time,
    rows,
    shared_blks_hit,
    shared_blks_read,
    shared_blks_written
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
安装提示: 需在 postgresql.conf 中设置 shared_preload_libraries = 'pg_stat_statements',然后重启实例并执行 CREATE EXTENSION pg_stat_statements;

十四、备份策略检查
本模块为 OS 层面检查,无 SQL 脚本,通过 shell 命令验证备份状态。

检查清单:

检查项        方法        说明
备份目录        ls -la /data/backup/        确认备份文件存在且更新
最近备份时间        stat /data/backup/latest.dump        48 小时内有新备份
crontab 定时任务        crontab -l \| grep pg_dump        确认有定时备份
WAL 归档        ls $PGDATA/pg_wal/archive_status/        无 .ready 残留
备份恢复测试        定期执行 PITR 演练        至少每季度一次
十五、健康检查汇总
以下 7 条汇总 SQL 快速评估实例整体健康状态。

15.1 Autovacuum 是否开启
SELECT setting FROM pg_settings WHERE name = 'autovacuum';
-- 期望值: on
15.2 归档模式是否开启
SELECT setting FROM pg_settings WHERE name = 'archive_mode';
-- 期望值: on(生产环境)
15.3 SSL 是否启用
SELECT setting FROM pg_settings WHERE name = 'ssl';
-- 期望值: on(生产环境)
15.4 密码加密方式
SELECT setting FROM pg_settings WHERE name = 'password_encryption';
-- 期望值: scram-sha-256
15.5 连接数余量
SELECT
    MAX(setting)::int AS max_connections,
    (SELECT COUNT(*) FROM pg_stat_activity) AS current_connections
FROM pg_settings
WHERE name = 'max_connections';
-- current / max < 80% 为健康
15.6 缓冲区命中率
SELECT ROUND(100.0 * SUM(blks_hit) / NULLIF(SUM(blks_hit) + SUM(blks_read), 0), 2) AS hit_ratio_pct
FROM pg_stat_database
WHERE datname = current_database();
-- > 99% 优秀,> 95% 正常,< 90% 需优化
15.7 死锁次数
SELECT SUM(deadlocks)::int AS total_deadlocks FROM pg_stat_database;
-- 期望值: 0(累计值,重启后清零)
健康检查汇总速查表
检查项        正常值        告警条件        处理建议
Autovacuum        on        off        立即开启
Archive Mode        on        off        开启归档支持 PITR
SSL        on        off        生产环境必须开启
密码加密        scram-sha-256        md5        升级加密方式
连接数使用率        < 80%        > 90%        增加 max_connections 或优化连接池
Buffer 命中率        > 99%        < 95%        增加 shared_buffers 或优化查询
死锁次数        0        > 0        排查应用事务逻辑
数据库年龄        < 10 亿        > 15 亿        执行 VACUUM FREEZE
复制槽 Lag        < 1GB        > 10GB        检查备库状态或清理无效槽
未归档 WAL        0        > 0        排查归档命令
总结 & 注意事项
执行权限:本文所有 SQL 均需要超级用户(postgres)或 pg_read_all_stats 角色权限才能获取完整信息
性能影响:所有查询均基于统计视图(pg_stat_*、pg_settings),对生产库无性能影响;仅性能采样模块需要 10~30 秒间隔
版本差异:部分视图在 PostgreSQL 12 以下版本不可用(如 pg_stat_wal_receiver 在 9.6+ 可用,pg_control_system() 在 12+ 可用)
统计重置:pg_stat_* 视图为累计值,实例重启后清零;可使用 pg_stat_reset() 手动重置
XID 回卷:这是 PostgreSQL 特有的严重风险,Oracle/MySQL DBA 转型时需重点关注,务必确保 autovacuum 正常运行
WAL 膨胀:不活跃的复制槽和失败的归档是 WAL 膨胀的两大主因,巡检时重点检查
连接泄漏:idle in transaction 是 PG 特有的问题(Oracle 中不存在),建议配置 idle_in_transaction_session_timeout
定期巡检:建议每日自动执行巡检脚本,周报对比关键指标变化趋势
版本差异说明:

PostgreSQL 12+:所有 SQL 均可直接使用
PostgreSQL 9.6~11:pg_control_system() 不可用,需用 pg_controldata 命令替代;wal_keep_size 参数名为 wal_keep_segments
PostgreSQL 14+:pg_stat_statements 中 total_time 更名为 total_exec_time

分享到:  QQ好友和群QQ好友和群 QQ空间QQ空间 腾讯微博腾讯微博 腾讯朋友腾讯朋友
收藏收藏 支持支持 反对反对
回复

使用道具 举报

您需要登录后才可以回帖 登录 | 注册

本版积分规则

QQ|手机版|小黑屋|重庆思庄Oracle、Redhat认证学习论坛 ( 渝ICP备12004239号-4 )

GMT+8, 2026-5-30 18:59 , Processed in 0.214464 second(s), 22 queries .

重庆思庄学习中心论坛-重庆思庄科技有限公司论坛

© 2001-2020

快速回复 返回顶部 返回列表