PVE 8 Cluster-Wide Observability in Practice: Prometheus + Grafana in 30 Minutes, with Alerts on Key Metrics

I. Why Prometheus Is Worth Adding

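Proxmox VE 8 ships with a web UI, but its graphs cover one node or one guest at a time, and the built-in RRD history loses resolution quickly beyond the most recent hour.
In production, that leaves you blind in situations like these: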

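• At 3 a.m. host CPU spikes: which guest saturated it first?
• Did the backup job really succeed? If it failed, nobody is told.
• Shared storage is down to 5% free, and live migration fails outright.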

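Prometheus pulls nodes, virtual machines, storage, network, and backups into time series, Grafana shows them on a single screen, and alerts go out through WeCom, DingTalk, or email, so the operations team can actually sleep.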

II. Architecture at a Glance (Three Containers)

    +------------------+        +----------------+        +---------------+
    |  prometheus-pve- |        |  Prometheus    |        |  Grafana      |
    |  exporter        |--9221--|  (scrape)      |--9090--|  (dashboard)  |
    |  one instance    |        |  one instance  |        |  one instance |
    +------------------+        +----------------+        +---------------+
            |                         |                         |
            v                         v                         v
     PVE 8 API (HTTPS 8006)      TSDB storage           dashboards/alerts

Notes

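• A single exporter instance is enough: it collects every node of the PVE 8 cluster in one pass through the /cluster/resources endpoint.
• Prometheus and Grafana can be your company's existing instances, or you can start fresh ones with Docker.
• All traffic stays on the internal network, and the token used on port 8006 is read-only, which keeps the exposure small.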

III. Three Preparation Steps on the PVE 8 Side (5 Minutes)

1. Create a least-privilege account

    pveum user add exporter@pve --comment "Prometheus monitoring"
    pveum acl modify / --users exporter@pve --roles PVEAuditor   # read-only is enough

2. Issue a long-lived token (no password needed on each call)

    pveum user token add exporter@pve monitoring --privsep 0
    # Note the returned secret value (a UUID); it is displayed only once. Example:
    # b766788a-4828-4f38-1234-xxxxxxxxxxxx

3. Make sure time zones are consistent across nodes (otherwise displayed times will not line up)
```
timedatectl set-timezone Asia/Shanghai
```
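
With the account and token in place, it can help to confirm that the token authenticates against the API before the exporter gets involved. The sketch below builds the `Authorization` header in the `PVEAPIToken=USER@REALM!TOKENID=SECRET` form used for Proxmox API tokens and queries /cluster/resources. The function names, the example host, and the `PVE_TOKEN_SECRET` variable are placeholders introduced here, not part of the original setup.

```shell
# Build the Authorization header for a Proxmox VE API token.
# Arguments: user@realm, token id, token secret (the UUID printed at creation).
pve_auth_header() {
  printf 'Authorization: PVEAPIToken=%s!%s=%s\n' "$1" "$2" "$3"
}

# Query /cluster/resources with that header.
# Arguments: host, user@realm, token id, token secret.
# -k skips TLS verification; only reasonable with the default self-signed certificate.
pve_cluster_resources() {
  curl -sk -H "$(pve_auth_header "$2" "$3" "$4")" \
    "https://$1:8006/api2/json/cluster/resources"
}

# Example (hypothetical host; substitute your own values):
# pve_cluster_resources 192.0.2.10 exporter@pve monitoring "$PVE_TOKEN_SECRET"
```

A JSON document listing nodes, guests, and storages suggests the token and the PVEAuditor role are working; an authentication error points back at the token ID or secret.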

IV. Start the Exporter (One Docker Command)

Create /root/pve.yml:

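    default:
      user: exporter@pve                   # user@realm, without the token part
      token_name: monitoring               # token ID created in the previous section
      token_value: b766788a-4828-4f38-1234-xxxxxxxxxxxx
      verify_ssl: false                    # PVE 8 uses a self-signed certificate by default

Collectors are switched on or off with `--collector.<name>` / `--no-collector.<name>` flags on the exporter command line rather than in this file; run the exporter with `--help` to see which ones your version provides, including any for snapshots or backups.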

Start the container:

    docker run -d --restart=always --name pve-exporter \
      -p 9221:9221 \
      -v /root/pve.yml:/etc/prometheus/pve.yml \
      prompve/prometheus-pve-exporter:3.5.1

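Open http://<docker-host>:9221/pve?target=<PVE node IP> in a browser. Plain-text metrics mean the PVE 8 data path works. Inside a container published with `-p`, `localhost` refers to the container itself, so `target=localhost` only works when the exporter shares the PVE host's network (for example with `--network host`).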
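
To check that the single exporter really covers the whole cluster, one option is to count the `pve_up` series by resource type. This is a minimal sketch: `count_up_by_type` is a helper name made up here, and it assumes the `id` label has the `type/name` shape (`node/...`, `qemu/...`, `lxc/...`) used throughout this guide.

```shell
# Read Prometheus text-format metrics on stdin and print "<type> <count>"
# for every resource type that has a pve_up series.
count_up_by_type() {
  grep '^pve_up{' \
    | sed -n 's|.*id="\([a-z]*\)/.*|\1|p' \
    | sort | uniq -c | awk '{ print $2, $1 }'
}

# Example (replace the placeholders with your own addresses):
# curl -s 'http://<docker-host>:9221/pve?target=<PVE node IP>' | count_up_by_type
```

If the node count is lower than the number of cluster members, the exporter is probably not seeing the full cluster.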

V. Prometheus Configuration (scrape Section)

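Edit prometheus.yml and add the job below. If the file already has a `scrape_configs:` key, append only the `- job_name` entry under it: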

    scrape_configs:
      - job_name: 'pve8'
        metrics_path: '/pve'
        scrape_interval: 30s
        params:
          target: ['<PVE node IP>']       # any one cluster node; the exporter reads the whole cluster through it
        static_configs:
          - targets: ['localhost:9221']   # exporter address as seen from Prometheus
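
For debugging, it helps to know which URL this job makes Prometheus request: the scheme (http by default), the entry from `targets`, the `metrics_path`, and the `params` as a query string. The sketch below composes that URL so it can be fetched by hand; `scrape_url` is a hypothetical helper and the addresses are examples.

```shell
# Compose the URL Prometheus requests for this job.
# Arguments: exporter address (host:port), metrics path, value of the "target" param.
scrape_url() {
  printf 'http://%s%s?target=%s\n' "$1" "$2" "$3"
}

# Example: fetch exactly what Prometheus would fetch (192.0.2.10 is a placeholder node IP).
# curl -s "$(scrape_url localhost:9221 /pve 192.0.2.10)" | head
```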

Reload Prometheus (the HTTP reload endpoint is available only when Prometheus runs with --web.enable-lifecycle):

curl -X POST http://localhost:9090/-/reload

Open the Prometheus UI → Status → Targets. Seeing pve8 (1/1 up) means the scrape works.

VI. Grafana Dashboard (Import ID 10347)

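1. Grafana → Import → ID 10347 (a community Proxmox VE dashboard built for this exporter's metrics).
2. Select the Prometheus data source. The panels should populate right away: node CPU, VM memory, storage space, network throughput, backup calendar, and cluster alerts.
3. Consider enabling alerts for "free storage < 15%" and "VM stopped", with WeCom or DingTalk as the notification channel.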

VII. Five Core Alert Rules (Copy and Use)

Save the rules as /etc/prometheus/rules/pve8.yml, make sure prometheus.yml lists that file under `rule_files`, then reload:

    groups:
    - name: pve8_alerts
      rules:
      - alert: PVE8_VmDown
        # Label regexes are fully anchored, so the alternation must be grouped.
        expr: pve_up{id=~"(qemu|lxc)/.*"} == 0
        for: 2m
        annotations:
          summary: "PVE8 guest {{ $labels.id }} is stopped or unreachable"
      - alert: PVE8_NodeCpuHigh
        expr: pve_cpu_usage_ratio{id=~"node/.*"} > 0.9
        for: 5m
        annotations:
          summary: "PVE8 node {{ $labels.id }} CPU usage > 90%"
      - alert: PVE8_StorageFull
        expr: |
          (pve_disk_usage_bytes{id=~"storage/.*"} /
           pve_disk_size_bytes{id=~"storage/.*"}) > 0.85
        for: 5m
        annotations:
          summary: "PVE8 storage {{ $labels.id }} has less than 15% free space"
      - alert: PVE8_MemLeak
        expr: |
          (pve_memory_usage_bytes{id=~"(lxc|qemu)/.*"} /
           pve_memory_size_bytes{id=~"(lxc|qemu)/.*"}) > 0.95
        for: 10m
        annotations:
          summary: "PVE8 guest {{ $labels.id }} memory usage > 95%, possible leak"
      - alert: PVE8_BackupStale
        # Metric name as used in this guide; confirm it appears in your exporter's /pve output.
        expr: time() - pve_backup_last_end_timestamp > 86400
        for: 0m
        annotations:
          summary: "PVE8 {{ $labels.id }} has had no successful backup for over 24 h"
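
Prometheus anchors the regex of a `=~` matcher at both ends, so how the alternation in the `id` selectors is grouped decides which series match. The sketch below emulates that anchoring with `grep -E` to compare an ungrouped and a grouped pattern; `anchored_match` is a helper introduced here, and `grep` uses POSIX ERE rather than Prometheus's RE2, which behaves the same for patterns this simple.

```shell
# Print "yes" if VALUE matches REGEX when the regex is anchored at both ends,
# otherwise "no". Arguments: REGEX, VALUE.
anchored_match() {
  if printf '%s\n' "$2" | grep -Eq "^(${1})\$"; then
    echo yes
  else
    echo no
  fi
}

anchored_match 'qemu|lxc/.*' 'qemu/100'     # no:  means "qemu" exactly, or "lxc/..."
anchored_match '(qemu|lxc)/.*' 'qemu/100'   # yes: either prefix, then "/..."
anchored_match '(qemu|lxc)/.*' 'lxc/200'    # yes
```

An ungrouped pattern silently drops every QEMU guest from an alert, so the alert never fires for them.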

VIII. Three PromQL Queries for Routine Checks

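Scenario	Query
Which VMs are stopped?	`pve_up{id=~"qemu/.*"} == 0`
No backup finished in the past 7 days?	`time() - pve_backup_last_end_timestamp > 7*86400`
Storage growth over 7 days (GiB)?	`(pve_disk_usage_bytes{id=~"storage/.*"} - pve_disk_usage_bytes{id=~"storage/.*"} offset 7d) / 1024^3`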
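
The arithmetic behind the storage queries is easy to reproduce by hand when a number on the dashboard looks off. This is a small sketch with two helper names invented for illustration: one converts a byte difference to GiB (1 GiB = 1024^3 bytes), the other turns used and total bytes into the percentage that the 0.85 threshold compares against.

```shell
# Growth between two byte counts, in GiB. Arguments: bytes now, bytes before.
growth_gib() {
  awk -v now="$1" -v before="$2" \
    'BEGIN { printf "%.2f\n", (now - before) / (1024 * 1024 * 1024) }'
}

# Used share of a storage as a percentage. Arguments: used bytes, total bytes.
used_percent() {
  awk -v used="$1" -v size="$2" \
    'BEGIN { printf "%.1f\n", used / size * 100 }'
}

growth_gib 3221225472 1073741824   # 3 GiB now vs 1 GiB a week ago -> 2.00
used_percent 850 1000              # -> 85.0
```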

IX. FAQ (Tested on PVE 8)

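Problem	Fix
Exporter returns 401	The token is wrong; check the token ID (`user@realm!tokenid`) and its secret value
Network/disk I/O all zero	The VM is not using VirtIO devices; switch to a VirtIO network adapter and `VirtIO-SCSI`
Backup metrics missing	The backup collector is not enabled; check the exporter's collector settings
Timestamps off by 8 h	Node time zones differ; align them with `timedatectl`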
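
For the 401 case, a common slip is passing only `user@realm` where the full token ID is expected, or the reverse. The sketch below checks that a string has the `user@realm!tokenid` shape; `is_full_tokenid` is a hypothetical helper, and the pattern is a simplification that only checks for the separators.

```shell
# Succeed (exit status 0) if the argument looks like user@realm!tokenid.
is_full_tokenid() {
  printf '%s\n' "$1" | grep -Eq '^[^@!]+@[^@!]+![^@!]+$'
}

# Example:
# is_full_tokenid 'exporter@pve!monitoring' && echo "full token ID"
# is_full_tokenid 'exporter@pve' || echo "user only, token part missing"
```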