## 1. Core Architecture Overview

```plaintext
┌─────────────────────────────────────────────────────────────────┐
│                     Prometheus Architecture                     │
├─────────────────────────────────────────────────────────────────┤
│  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐                 │
│  │Exporter│  │Exporter│  │Exporter│  │Exporter│  (:9100/9104)   │
│  └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘                 │
│      └───────────┴─────┬─────┴───────────┘                      │
│                        │ Pull (15s)                             │
│                        ▼                                        │
│                 ┌──────────────┐                                │
│                 │  Prometheus  │──┐                             │
│                 │    Server    │  │    ┌──────────────┐         │
│                 │ ┌──────────┐ │  └───▶│ Alertmanager │──▶notify│
│                 │ │   TSDB   │ │       └──────────────┘         │
│                 │ └──────────┘ │                                │
│                 └──────┬───────┘                                │
│                        ▼                                        │
│                 ┌──────────────┐  (:3000)                       │
│                 │   Grafana    │                                │
│                 └──────────────┘                                │
└─────────────────────────────────────────────────────────────────┘
```

Flow: each exporter exposes `/metrics` → Prometheus pulls on a fixed interval → the TSDB stores the samples → Alertmanager sends alerts → Grafana visualizes the data.

## 2. Deployment

### Docker Compose

```yaml
# docker-compose.yml
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d
      - --web.enable-lifecycle        # enables POST /-/reload
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.1.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123   # change before exposing Grafana
    volumes:
      - grafana_data:/var/lib/grafana
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.6.1
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    command:
      - --path.procfs=/host/proc
      - --path.sysfs=/host/sys
      - --path.rootfs=/host
      # "$$" is Compose escaping for a literal "$"
      - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/host:ro
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
```

## 3. Core Configuration

### prometheus.yml in Detail

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: prod

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - rules/*.yml

scrape_configs:
  # Prometheus scrapes itself
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  # Static target with an extra label
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]
        labels:
          env: prod

  # File-based service discovery
  - job_name: file_sd
    file_sd_configs:
      - files:
          - targets/*.json
        refresh_interval: 30s

  # Kubernetes pods that carry the annotation prometheus.io/scrape: "true"
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"

  # Relabeling demo
  - job_name: relabel_demo
    static_configs:
      - targets: ["192.168.1.100:8080"]
    relabel_configs:
      # Copy the host part of host:port into the instance label
      - source_labels: [__address__]
        regex: '([^:]+):(\d+)'
        target_label: instance
        replacement: "${1}"
      # Attach a fixed label
      - target_label: env
        replacement: prod
      # Drop discovery metadata labels
      - regex: __meta_.*
        action: labeldrop
```
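The first rule of the `relabel_demo` job can be hard to reason about without running Prometheus. The sketch below models a `replace` relabel rule in Python, assuming the intended pattern is `([^:]+):(\d+)`. The function name `relabel_replace` is hypothetical, and Python's `re` module only approximates the RE2 syntax Prometheus uses, so treat this as an illustration rather than the actual implementation.

```python
import re


def relabel_replace(labels, source_labels, regex, target_label,
                    replacement="$1", separator=";"):
    """Simplified model of a relabel rule with action: replace (hypothetical helper)."""
    # Prometheus joins the source label values with the separator...
    value = separator.join(labels.get(name, "") for name in source_labels)
    # ...and anchors the regex at both ends, so it must match the whole value.
    match = re.fullmatch(regex, value)
    if match is None:
        return dict(labels)  # no match: the label set is left unchanged
    # Translate ${1} / $1 references into Python's \g<1> template syntax.
    template = re.sub(r"\$\{?(\w+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


target = {"__address__": "192.168.1.100:8080"}
relabeled = relabel_replace(
    target, ["__address__"], r"([^:]+):(\d+)", "instance", "${1}"
)
print(relabeled["instance"])
```

Because the match is anchored, a pattern without the `+` quantifiers would match only a single character on each side of the colon and the rule would silently do nothing for a real address.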
## 4. Exporter Deployment

### Installing node_exporter

```bash
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xzf node_exporter-1.6.1.linux-amd64.tar.gz
sudo mv node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/

# Write the systemd unit file
sudo tee /etc/systemd/system/node-exporter.service > /dev/null <<'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node-exporter
```

### Common Exporters

| Exporter | Port | Monitored target | Key metrics |
|---|---|---|---|
| node_exporter | 9100 | Linux | cpu/mem/disk/net |
| windows_exporter | 9182 | Windows | iis/sqlserver |
| mysqld_exporter | 9104 | MySQL | queries/connections |
| postgres_exporter | 9187 | PostgreSQL | queries/buffers |
| redis_exporter | 9121 | Redis | memory/commands |
| blackbox_exporter | 9115 | HTTP/TCP | probe_success |
| cadvisor | 8080 | Docker | container_* |

## 5. PromQL Basics

```promql
# Instant-vector expressions
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)   # CPU usage (%)
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))         # memory usage (%)

# Functions over range vectors
rate(node_cpu_seconds_total{mode="user"}[5m])   # per-second average rate over the window
increase(http_requests_total[1h])               # total increase over the window

# Aggregation
sum by (instance, job) (rate(node_cpu_seconds_total[5m]))
count(node_cpu_seconds_total)
max by (service) (http_request_duration_seconds_bucket)

# Other functions
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600)   # predicted value 4 hours ahead
irate(node_cpu_seconds_total{mode="user"}[5m])                           # instantaneous rate (last two samples)
label_replace(up{job="node"}, "hostname", "$1", "instance", "([^:]+):.*")
```
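To make the difference between `rate`, `irate`, and `increase` concrete, here is a deliberately simplified Python model of what they compute over the samples in one window. The function names are hypothetical, and the real PromQL functions also extrapolate to the window boundaries, which this sketch leaves out.

```python
def counter_increase(samples):
    """Increase of a counter over (timestamp, value) samples, tolerating resets."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole new value counts.
        total += (curr - prev) if curr >= prev else curr
    return total


def simple_rate(samples):
    """Per-second average rate between the first and last sample in the window."""
    duration = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / duration


def simple_irate(samples):
    """Per-second rate computed from the last two samples only."""
    return simple_rate(samples[-2:])


# Made-up samples scraped every 15 s; the counter resets between t=15 and t=30.
window = [(0, 100.0), (15, 130.0), (30, 10.0), (45, 40.0)]
print(counter_increase(window), simple_rate(window), simple_irate(window))
```

The averaged rate smooths over the whole window, while the instantaneous rate reacts only to the newest pair of samples, which is why `irate` suits fast-moving graphs and `rate` suits alerting.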
## 6. Alertmanager Alerting Configuration

### Alerting Rules

```yaml
# rules/alerts.yml
groups:
  - name: node_alerts
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"

      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80%, current: {{ $value | printf \"%.2f\" }}%"

      - alert: LowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 3m
        labels:
          severity: warning

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 2m
        labels:
          severity: critical
```

### Alertmanager Configuration

```yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: "smtp.qq.com:587"
  smtp_from: "alert@example.com"
  smtp_auth_password: "xxxxxx"

route:
  group_by: [alertname, cluster]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: default-receiver
  routes:
    # Critical alerts go to a dedicated receiver with a shorter wait
    - match:
        severity: critical
      receiver: critical-receiver
      group_wait: 10s

receivers:
  - name: default-receiver
    email_configs:
      - to: "ops@example.com"
        send_resolved: true
    slack_configs:
      # Needs a webhook URL (slack_api_url in global, or api_url here)
      - channel: "#alerts"
        send_resolved: true

  - name: critical-receiver
    webhook_configs:
      - url: http://dingtalk:8060/dingtalk/webhook

# While a critical alert fires, mute warning alerts for the same instance
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [instance]
```
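The routing tree and the inhibit rule above interact in a way that is easy to misread, so here is a rough Python model of both. It assumes exact-match label matchers only and ignores `continue`, grouping, and timing; `ROUTE`, `pick_receiver`, and `is_inhibited` are hypothetical names for illustration, not Alertmanager APIs.

```python
# Mirror of the route section in alertmanager.yml (matchers and receivers only).
ROUTE = {
    "receiver": "default-receiver",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "critical-receiver"},
    ],
}


def matches(labels, matcher):
    """True if every key/value pair in the matcher is present in the labels."""
    return all(labels.get(key) == value for key, value in matcher.items())


def pick_receiver(route, labels):
    """Walk the tree depth-first; the first matching child route wins."""
    for child in route.get("routes", []):
        if matches(labels, child.get("match", {})):
            return pick_receiver(child, labels)
    return route["receiver"]


def is_inhibited(alert, firing, source_match, target_match, equal):
    """True if a firing source alert shares all `equal` labels with a target alert."""
    if not matches(alert, target_match):
        return False
    return any(
        matches(source, source_match)
        and all(source.get(name) == alert.get(name) for name in equal)
        for source in firing
    )


node_down = {"alertname": "NodeDown", "severity": "critical", "instance": "a:9100"}
high_cpu = {"alertname": "HighCPU", "severity": "warning", "instance": "a:9100"}

print(pick_receiver(ROUTE, node_down))
print(pick_receiver(ROUTE, high_cpu))
print(is_inhibited(high_cpu, [node_down], {"severity": "critical"},
                   {"severity": "warning"}, ["instance"]))
```

In this model a `HighCPU` warning on a node that is already reported down is suppressed, which is the point of the rule: the critical alert already tells the operator what is wrong with that instance.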
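## 7. Common Commands and API

```bash
# Hot-reload the configuration (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# TSDB operations (the admin endpoints require --web.enable-admin-api)
curl http://localhost:9090/api/v1/status/tsdb
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="test"}'
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones

# HTTP API
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="node"}'
curl -G http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=up{job="node"}' \
  --data-urlencode 'start=2024-01-01T00:00:00Z' \
  --data-urlencode 'end=2024-01-01T01:00:00Z' \
  --data-urlencode 'step=60s'
curl http://localhost:9090/api/v1/targets
curl http://localhost:9090/api/v1/alerts
curl http://localhost:9090/api/v1/rules
```

## 8. Troubleshooting

### Problem 1: Target Down

```bash
# List targets that Prometheus considers down
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down")'
# Check network reachability
nc -zv <target_ip> <port>
# Check the exporter logs
docker logs node-exporter
# Fetch the metrics endpoint directly
curl http://target:9100/metrics
```

### Problem 2: Missing Metrics

```bash
# Search the known metric names
curl -s http://localhost:9090/api/v1/label/__name__/values | jq '.data[]' | grep -i metric
# Query the metric directly
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=metric_name_total'
# Inspect the labels attached to each target
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].labels'
```

### Problem 3: Alerts Not Firing

```bash
# Show the loaded alerting rules and their state
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.type=="alerting")'
# Check Alertmanager status and active silences
curl -s http://localhost:9093/api/v2/status
curl -s http://localhost:9093/api/v2/silences
```

| Check | Purpose |
|---|---|
| `up{job="xxx"}` | Confirm target status |
| `rate(x[5m]) > 0` | Verify that the metric exists and is changing |
| `ALERTS{alertname="xxx"}` | Check alert state |
| `promtool check config prometheus.yml` | Validate the configuration file |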
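When `jq` is not available, the same "which targets are down" check can be scripted. The sketch below uses only the Python standard library; `fetch_targets` and `down_targets` are hypothetical helper names, the base URL assumes the Compose setup from this guide, and the `sample` payload is hand-written in the shape of the `/api/v1/targets` response rather than captured output.

```python
import json
import urllib.request


def fetch_targets(base_url="http://localhost:9090"):
    """GET /api/v1/targets and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def down_targets(payload):
    """Return (job, scrape URL, last error) for every active target that is down."""
    return [
        (t["labels"].get("job", ""), t["scrapeUrl"], t.get("lastError", ""))
        for t in payload["data"]["activeTargets"]
        if t["health"] == "down"
    ]


# Hand-written sample in the shape of the API response (not real output).
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "prometheus"}, "health": "up", "lastError": "",
             "scrapeUrl": "http://localhost:9090/metrics"},
            {"labels": {"job": "node"}, "health": "down",
             "lastError": "connection refused",
             "scrapeUrl": "http://node-exporter:9100/metrics"},
        ]
    },
}

for job, url, error in down_targets(sample):
    print(f"{job}\t{url}\t{error}")
```

Against a live server, replace `sample` with `fetch_targets()`; the `lastError` field usually points straight at the cause (DNS failure, refused connection, or timeout).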
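## 9. Best Practices

### Naming Conventions

```yaml
# Metric name: domain_subsystem_name_unit
node_memory_MemAvailable_bytes
http_request_duration_seconds

# Labels: app_name, env, region, cluster, instance
# Avoid high-cardinality labels such as user_id or ip
```

### Federation

```yaml
- job_name: federate
  honor_labels: true          # keep the labels of the federated series
  metrics_path: /federate
  params:
    "match[]":
      - '{__name__=~".+"}'    # pulls every series; narrow this selector in production
  static_configs:
    - targets:
        - prometheus-prod:9090
        - prometheus-prod2:9090
```

### High Availability

```plaintext
┌─────────────┐    ┌─────────────┐
│ Prometheus  │    │ Prometheus  │    # both replicas write
│   Primary   │    │   Replica   │
└──────┬──────┘    └──────┬──────┘
       └────────┬─────────┘
                ▼
       ┌─────────────────┐
       │ Thanos Receiver │             # unified storage
       └─────────────────┘
```

### Remote Storage

```yaml
remote_write:
  - url: http://thanos-receive:19291/api/v1/receive
    queue_config:
      capacity: 10000
      max_shards: 30

remote_read:
  - url: http://thanos-query:10912/api/v1/read
    read_recent: true
```

### Performance Optimization

- Label cardinality: keep each metric below roughly 100,000 label combinations.
- Scrape interval: 5s for high-frequency targets, 60s for low-frequency ones.
- Recording rules: pre-aggregate complex queries.
- Storage cleanup: choose a reasonable retention period.
- Federation partitioning: split Prometheus instances by service domain.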
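The label-cardinality guideline is easier to apply with a quick estimate. The sketch below computes the worst-case number of series for one metric as the product of the distinct values per label; the label counts are invented for illustration, and `SERIES_BUDGET` simply encodes the 100,000 figure from the list above.

```python
from math import prod

SERIES_BUDGET = 100_000  # guideline from this section, not a hard Prometheus limit


def worst_case_series(label_value_counts):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical fleet: 50 hosts, 16 CPUs each, 8 CPU modes.
cpu_metric = {"instance": 50, "cpu": 16, "mode": 8}
# The same metric after adding a per-user label with 10,000 distinct users.
with_user_id = {**cpu_metric, "user_id": 10_000}

for name, labels in [("cpu_metric", cpu_metric), ("with_user_id", with_user_id)]:
    series = worst_case_series(labels)
    verdict = "ok" if series <= SERIES_BUDGET else "over budget"
    print(f"{name}: {series} series ({verdict})")
```

Real series counts are usually lower than this bound because not every label combination occurs, but the product shows how a single unbounded label such as `user_id` dominates everything else.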