prometheus 监控 mysql

郑全 · 发表于 2024-9-28 12:46:35

本帖最后由郑全于 2024-9-28 13:12 编辑

一.环境
prometheus监控服务器：192.168.133.110
被监控服务器--mysql 机器：192.168.133.191

二、prometheus安装
假设目前已经安装好，步骤略。

三、在被监控服务器上安装 mysqld_exporter
3.1.创建监控用户
   # mysql mysql -u root -p
   Enter password:

   # 创建用户：
   CREATE USER 'exporter'@'%' IDENTIFIED BY 'password' WITH MAX_USER_CONNECTIONS 3;
   GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
   flush privileges;

   # 验证登录
   mysql mysql -uexporter -p

3.2.安装mysqld_exporter
   1.下载软件 mysqld_exporter

   wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.15.1/mysqld_exporter-0.15.1.linux-amd64.tar.gz
   tar vxf mysqld_exporter-0.14.0.linux-amd64.tar.gz
   mkdir /opt/prometheus/
   mv mysqld_exporter-0.14.1.linux-amd64 /opt/prometheus/mysqld_exporter

   2.创建用户

   useradd -M -s /usr/sbin/nologin prometheus

   3.创建连接数据库文件
   cat >/opt/prometheus/mysqld_exporter/.mysqld_exporter.cnf << "EOF"

   [client]
      user=exporter
      password=password
      host=192.168.133.191
      port=3306
      EOF

   4.更改exporter 文件夹权限

      chmod 600 /opt/prometheus/mysqld_exporter/.mysqld_exporter.cnf

      chown prometheus:prometheus -R /opt/prometheus

   5.创建systemd服务
cat > /etc/systemd/system/mysqld_exporter.service <<"EOF"
[Unit]
Description=Prometheus MySQL Exporter
After=network.target

[Service]
Type=simple
User=prometheus
Group=prometheus
Restart=always
ExecStart=/opt/prometheus/mysqld_exporter/mysqld_exporter \
--config.my-cnf=/opt/prometheus/mysqld_exporter/.mysqld_exporter.cnf \
--collect.auto_increment.columns \
--collect.info_schema.processlist \
--collect.binlog_size \
--collect.info_schema.tablestats \
--collect.info_schema.innodb_metrics \
--collect.info_schema.userstats \
--collect.info_schema.tables \
--collect.perf_schema.tablelocks \
--collect.perf_schema.indexiowaits \
--collect.perf_schema.tableiowaits \
--collect.slave_status

[Install]
WantedBy=multi-user.target
EOF

6.启动mysqld_exporter
systemctl daemon-reload
systemctl start mysqld_exporter
systemctl enable mysqld_exporter

7.参数解释

Name	MySQL Version	Description
collect.auto_increment.columns	5.1	从 information_schema 收集 auto_increment 列和最大值.
collect.binlog_size	5.1	收集所有注册的binlog文件的当前大小
collect.engine_innodb_status	5.1	收集SHOW ENGINE INNODB STATUS
collect.engine_tokudb_status	5.6	收集SHOW ENGINE TOKUDB STATUS .
collect.global_status	5.1	收集SHOW GLOBAL STATUS（默认启用）
collect.global_variables	5.1	收集SHOW GLOBAL VARIABLES（默认启用）
collect.info_schema.clientstats	5.5	如果以 userstat=1 运行，设置为 true 以收集客户端统计信息
collect.info_schema.innodb_metrics	5.6	从 information_schema.innodb_metrics收集指标
collect.info_schema.innodb_tablespaces	5.7	从information_schema.innodb_sys_tablespaces收集指标
collect.info_schema.innodb_cmp	5.5	从information_schema.innodb_cmp收集 InnoDB 压缩表指标。
collect.info_schema.innodb_cmpmem	5.5	从information_schema.innodb_cmpmem缓冲池压缩指标。
collect.info_schema.processlist	5.1	从 information_schema.processlist 收集线程状态计数
collect.info_schema.processlist.min_time	5.1	线程必须处于要计算的每个状态的最短时间。（默认值：0）
collect.info_schema.query_response_time	5.5	如果 query_response_time_stats 为 ON，则收集查询响应时间分布。
collect.info_schema.replica_host	5.6	从 information_schema.replica_host_status 收集指标。
collect.info_schema.tables	5.1	从information_schema.tables收集指标。
collect.info_schema.tables.databases	5.1	要为其收集表统计信息的数据库列表，或为所有
collect.info_schema.tablestats	5.1	如果以 userstat=1 运行，设置为 true 以收集表统计信息。
collect.info_schema.schemastats	5.1	如果以 userstat=1 运行，设置为 true 以收集架构统计信息
collect.info_schema.userstats	5.1	如果以 userstat=1 运行，设置为 true 以收集用户统计信息。
collect.mysql.user	5.5	从 mysql.user 表中收集数据
collect.perf_schema.eventsstatements	5.6	从 performance_schema.events_statements_summary_by_digest 收集指标。
collect.perf_schema.eventsstatements.digest_text_limit	5.6	规范化语句文本的最大长度。（默认值：120）
collect.perf_schema.eventsstatements.limit	5.6	按响应时间限制事件语句摘要的数量。（默认值：250）
collect.perf_schema.eventsstatements.timelimit	5.6	以秒为单位限制“last_seen”事件语句的存在时间。（默认值：86400）
collect.perf_schema.eventsstatementssum	5.7	从 performance_schema.events_statements_summary_by_digest 汇总收集指标。
collect.perf_schema.eventswaits	5.5	从 performance_schema.events_waits_summary_global_by_event_name 收集指标
collect.perf_schema.file_events	5.6	从 performance_schema.file_summary_by_event_name 收集指标
collect.perf_schema.file_instances	5.5	从 performance_schema.file_summary_by_instance 收集指标。
collect.perf_schema.file_instances.remove_prefix	5.5	删除 performance_schema.file_summary_by_instance 中的路径前缀。
collect.perf_schema.indexiowaits	5.6	从 performance_schema.table_io_waits_summary_by_index_usage 收集指标。
collect.perf_schema.memory_events	5.7	从 performance_schema.memory_summary_global_by_event_name 收集指标。
collect.perf_schema.memory_events.remove_prefix	5.7	删除 performance_schema.memory_summary_global_by_event_name 中的仪器前缀。
collect.perf_schema.tableiowaits	5.6	从 performance_schema.table_io_waits_summary_by_table 收集指标。
collect.perf_schema.tablelocks	5.6	从 performance_schema.table_lock_waits_summary_by_table 收集指标。
collect.perf_schema.replication_group_members	5.7	从 performance_schema.replication_group_members 收集指标。
collect.perf_schema.replication_group_member_stats	5.7	从 performance_schema.replication_group_member_stats 收集指标。
collect.perf_schema.replication_applier_status_by_worker	5.7	从 performance_schema.replication_applier_status_by_worker 收集指标。
collect.slave_status	5.1	从 SHOW SLAVE STATUS 收集（默认启用）
collect.slave_hosts	5.1	从 SHOW SLAVE HOSTS 收集
collect.heartbeat	5.1	从心跳收集
collect.heartbeat.database	5.1	从哪里收集心跳数据的数据库。（默认：心跳）
collect.heartbeat.table	5.1	从哪里收集心跳数据的表。（默认：心跳）
collect.heartbeat.utc	5.1	使用 UTC 作为当前服务器的时间戳（使用 --utc 调用 pt-heartbeat ）。（默认值：假）

四、PROMETHEUS配置
   在监控服务器上配置：
      4.1 配置prometheus去采集（拉取）mysql_exporter的监控样本数据

   #在scrape_configs(搜刮配置):下面增加如下配置：    cat >> /opt/prometheus/prometheus.yml << "EOF"
   - job_name: 'mysqld_exporter'
      static_configs:
         - targets: ['192.168.133.191:9104']
         labels:       instance: test服务器
   EOF

   4.2 重新加载配置并检查：
      http://192.168.133.191:9104/metrics

   4.3 常见监控指标

   mysql_up                                                                         # 服务器是否在线    mysql_global_status_uptime                                              # 运行时长，单位
   sdelta(mysql_global_status_bytes_received[1m])       # 网络接收的
   bytesdelta(mysql_global_status_bytes_sent[1m])          # 网络发送的

   bytesmysql_global_status_threads_connected                # 当前的客户端连接数
   mysql_global_variables_max_connections             # 允许的最大连接数
   mysql_global_status_threads_running                # 正在执行命令的客户端连接数，即非 sleep 状态
   delta(mysql_global_status_aborted_connects[1m])    # 客户端建立连接失败的连接数，比如登录失败
   delta(mysql_global_status_aborted_clients[1m])       # 客户端连接之后，未正常关闭的连接数

   delta(mysql_global_status_commands_total{command="xx"}[1m]) > 0    # 每分钟各种命令的次数
   delta(mysql_global_status_handlers_total{handler="xx"}[1m]) > 0    # 每分钟各种操作的次数
   delta(mysql_global_status_handlers_total{handler="commit"}[1m]) > 0 # 每分钟 commit 的次数
   delta(mysql_global_status_table_locks_immediate[1m])  # 请求获取锁，且立即获得的请求数
   delta(mysql_global_status_table_locks_waited[1m])    # 请求获取锁，但需要等待的请求数。该值越少越好

   delta(mysql_global_status_queries[1m])             # 每分钟的查询数
   delta(mysql_global_status_slow_queries[1m])          # 慢查询数。如果未启用慢查询日志，则为 0
   mysql_global_status_innodb_page_size                # innodb 数据页的大小，单位 bytes
   mysql_global_variables_innodb_buffer_pool_size       # innodb_buffer_pool 的限制体积
   mysql_global_status_buffer_pool_pages{state="data"} # 包含数据的数据页数，包括洁页、脏页
   mysql_global_status_buffer_pool_dirty_pages          # 脏页数

   锁指标

   mysql_global_status_innodb_row_lock_current_waits #当前正在等待的 InnoDB 行锁数量。
   mysql_global_status_innodb_row_lock_time #从服务器启动以来的总 InnoDB 行锁等待时间（以毫秒为单位）。
   mysql_global_status_innodb_row_lock_time_avg #每次等待 InnoDB 行锁的平均时间（以毫秒为单位）
   mysql_global_status_innodb_row_lock_time_max #单次等待 InnoDB 行锁的最长时间（以毫秒为单位）。
   mysql_global_status_innodb_row_lock_waits  #从服务器启动以来的总 InnoDB 行锁等待次数。

   4.4 grafana加载模板
      在Grafana官网中找到mysql的监控模板 Mysql Overview，ID为7362，使用ID的方式导入到Grafana中去
      // 如果是MYSQL8.0,可以加载 id：20016，信息更丰富。

五、告警

在Prometheus的报警系统中，是分为2个部分的，规则是配置是在prometheus中的， prometheus组件完成报警推送给alertmanager的，
alertmanager然后管理这些报警信息，包括静默、抑制、聚合和通过电子邮件、on-call通知系统和聊天平台等方法发送通知。

主要步骤如下，安装在Prometheus服务器即可
安装和部署alertmanager
在prometheus中配置alertmanager的地址信息。
在prometheus中设置报警规则
在alertmanager配置接受者信息等

5.1 安装告警 alertmanager
   下载：https://github.com/prometheus/al ... .linux-amd64.tar.gz

   tar xvf alertmanager-0.27.0.linux-amd64.tar.gz
      mv alertmanager-0.27.0.linux-amd64 /opt/prometheus/
      cd /opt/prometheus/
      ln -s alertmanager-0.27.0.linux-amd64 alertmanager

5.2 开机自启动
      vim /etc/systemd/system/alertmanager.service
   [Unit]
   Description=alertmanager
   After=network.target

   [Service]
   User=root
   Group=root
   WorkingDirectory=/opt/prometheus/alertmanager
   ExecStart=/opt/prometheus/alertmanager/alertmanager  --log.level=debug --log.format=json

   [Install]
   WantedBy=multi-user.target

   # 通知systemd重新加载配置文件,记得重启服务器即可
   systemctl daemon-reload
   #启动
   systemctl enable alertmanager
   systemctl start alertmanager
   systemctl status alertmanager

5.3 集成alertmanager
   部署完毕alertmanager，需要告知prometheus告警信息推送的位置，通过如下配置即可完成。相对比较简单。
   vi /opt/prometheus/prometheus.yml
   # 修改此文件中的alerting的配置如下配置
   alerting:  alertmanagers:
      - static_configs:
         - targets:
            - 192.168.133.110:9093

   配置完毕后重启prometheus服务。
   systemctl restart prometheus

5.4 创建报警规则
   为了能先走通流程，这里的报警规则先弄一个简单一点的。
   vi prometheus.yml

   rule_files:
      - "rules/*rules.yml"
      # - "second_rules.yml"

   cd /opt/prometheus/
   mkdir rules
   vi node_rules.yml

   groups:
   - name: MySQL  rules:
   - alert: MysqlDown
      expr: mysql_up == 0
      for: 30s
      labels:
         severity: critical
      annotations:
         summary: "MySQL Down,实例:{{ $labels.instance }}"
         description: "MySQL_exporter连不上MySQL了，当前状态为：{{ $value }}"

      - alert: MysqlTooManyConnections
      expr: max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 80
      for: 2m
      labels:
         severity: warning
      annotations:
         summary: "Mysql连接数过多告警,实例:{{ $labels.instance }}"
         description: "MySQL连接数>80%,当前值：{{ $value }}"

      - alert: MysqlHighThreadsRunning
         expr: max_over_time(mysql_global_status_threads_running[1m]) > 20
         for: 2m
         labels:
         severity: warning
         annotations:
         summary: "Mysql运行的线程过多,实例:{{ $labels.instance }}"
         description: "Mysql运行的线程 > 20，当前运行的线程：{{ $value }}"

      - alert: MysqlSlowQueries
         expr: increase(mysql_global_status_slow_queries[2m]) > 0
         for: 2m
         labels:
            severity: warning
         annotations:
            summary: "Mysql慢日志告警,实例:{{ $labels.instance }}"
            description: "MySQL在过去2分钟有新的{{ $value }}条慢查询"

      - alert: MysqlInnodbLogWaits
         expr: rate(mysql_global_status_innodb_log_waits[15m]) > 10
         for: 0m
         labels:
            severity: warning
         annotations:
            summary: "MySQL innodb日志等待,实例:{{ $labels.instance }}"
            description: "MySQL innodb日志写入停滞，当前值： {{ $value }}"

      - alert: MysqlRestarted
         expr: mysql_global_status_uptime < 60
         for: 0m
         labels:
            severity: info
         annotations:
            summary: "MySQL 重启,实例:{{ $labels.instance }}"
            description: "不到一分钟前，MySQL重启过"

      - alert: RowLockCurrentWaits
         expr: mysql_global_status_innodb_row_lock_current_waits > 0
         for: 1m
         labels:
            severity: info
         annotations:
            summary: "MySQL有锁等待,实例:{{ $labels.instance }}"
            description: "当前有{{ $value }}个锁等待"

重启prometheus服务，可以在web界面看到如下信息

   systemctl restart prometheus

   http://192.168.133.110:9090/alerts?search=

到此，prometheus 简单的监控就完成了。

郑全 · 发表于 2024-9-28 13:13:09

登录/注册后可看大图

郑全 · 发表于 2024-9-28 13:17:41

郑全发表于 2024-9-28 13:13

登录/注册后可看大图

帐号		自动登录	找回密码
密码			注册

[安装] prometheus 监控 mysql