
Prometheus Introduction and Monitoring Platform Deployment

1. Core Architecture Overview

```plaintext
┌─────────────────────────────────────────────────────────────────┐
│                     Prometheus Architecture                     │
├─────────────────────────────────────────────────────────────────┤
│  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐                 │
│  │Exporter│  │Exporter│  │Exporter│  │Exporter│  (:9100/9104)   │
│  └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘                 │
│      └───────────┴───────────┴───────────┘                      │
│                        │ Pull (15s)                             │
│                        ▼                                        │
│                 ┌──────────────┐                                │
│                 │  Prometheus  │──┐                             │
│                 │    Server    │  │   ┌──────────────┐          │
│                 │ ┌──────────┐ │  └──▶│ Alertmanager │──▶notify │
│                 │ │   TSDB   │ │      └──────────────┘          │
│                 │ └──────────┘ │                                │
│                 └──────┬───────┘                                │
│                        ▼                                        │
│                 ┌──────────────┐  (:3000)                       │
│                 │   Grafana    │                                │
│                 └──────────────┘                                │
└─────────────────────────────────────────────────────────────────┘
```

Flow: exporters expose /metrics → Prometheus pulls on a fixed interval → the TSDB stores the samples → Alertmanager sends alerts → Grafana visualizes the data.

2. Deployment and Installation

Docker Compose

```yaml
# docker-compose.yml
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d
      - --web.enable-lifecycle        # enables POST /-/reload
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.1.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123   # change this outside of a demo
    volumes:
      - grafana_data:/var/lib/grafana
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.6.1
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    command:
      - --path.procfs=/host/proc
      - --path.sysfs=/host/sys
      - --path.rootfs=/host
      # "$$" is Compose's escape for a literal "$"
      - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/host:ro
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
```

3. Core Configuration

prometheus.yml explained

```yaml
# prometheus.yml
global:
  scrape_interval: 15s        # how often targets are scraped
  evaluation_interval: 15s    # how often rules are evaluated
  external_labels:
    cluster: prod

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - rules/*.yml               # resolved relative to this config file

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]
        labels:
          env: prod

  # File-based service discovery
  - job_name: file_sd
    file_sd_configs:
      - files:
          - targets/*.json
        refresh_interval: 30s

  # Kubernetes pods that carry the annotation prometheus.io/scrape: "true"
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"

  - job_name: relabel_demo
    static_configs:
      - targets: ["192.168.1.100:8080"]
    relabel_configs:
      # Copy the host part of __address__ into the instance label
      - source_labels: [__address__]
        regex: '([^:]+):(\d+)'
        target_label: instance
        replacement: "${1}"
      # Attach a fixed label
      - target_label: env
        replacement: prod
      # Drop service-discovery metadata labels
      - regex: __meta_.*
        action: labeldrop
```

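The file_sd job above reads targets/*.json, but the article does not show what such a file contains. File-based discovery expects a JSON list of groups, each with a "targets" list and an optional "labels" map. The following is a minimal Python sketch that builds and writes such a file; the host names, the targets/node.json path, and both helper function names are assumptions made for illustration.

```python
import json
import os
import tempfile


def build_target_groups(hosts, port=9100, labels=None):
    """Return the list-of-groups structure that file_sd_configs reads."""
    targets = [f"{host}:{port}" for host in hosts]
    return [{"targets": targets, "labels": dict(labels or {})}]


def write_target_file(path, groups):
    """Write the JSON via a temp file and rename, so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # The ".tmp" suffix keeps the temp file outside the targets/*.json glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.replace(tmp_path, path)
```

Calling write_target_file("targets/node.json", build_target_groups(["web-1", "web-2"], labels={"env": "prod"})) would produce a file that the file_sd job picks up on its next refresh, without reloading Prometheus.
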
4. Exporter Deployment

Installing node_exporter

```bash
# Download and install the binary
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xzf node_exporter-1.6.1.linux-amd64.tar.gz
sudo mv node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/

# Register it as a systemd service
sudo tee /etc/systemd/system/node-exporter.service <<'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node-exporter
```

Common exporters

| Exporter          | Port | Target     | Key metrics         |
|-------------------|------|------------|---------------------|
| node_exporter     | 9100 | Linux      | cpu/mem/disk/net    |
| windows_exporter  | 9182 | Windows    | iis/sqlserver       |
| mysqld_exporter   | 9104 | MySQL      | queries/connections |
| postgres_exporter | 9187 | PostgreSQL | queries/buffers     |
| redis_exporter    | 9121 | Redis      | memory/commands     |
| blackbox_exporter | 9115 | HTTP/TCP   | probe_success       |
| cadvisor          | 8080 | Docker     | container_*         |

5. PromQL Query Basics

```promql
# Instant vectors
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)   # CPU usage (%)
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))         # memory usage (%)

# Range vectors (the [5m] / [1h] selectors) as function input
rate(node_cpu_seconds_total{mode="user"}[5m])   # per-second rate of change
increase(http_requests_total[1h])                # increase over the window

# Aggregation
sum by (instance, job) (rate(node_cpu_seconds_total[5m]))
count(node_cpu_seconds_total)
max by (service) (http_request_duration_seconds_bucket)

# Functions
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600)   # linear prediction 4 hours ahead
irate(node_cpu_seconds_total{mode="user"}[5m])                           # instantaneous rate from the last two samples
label_replace(up{job="node"}, "hostname", "$1", "instance", "([^:]+):.*")
```

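To make rate() and increase() less opaque, here is a simplified Python sketch of the idea behind them: sum the increases between consecutive counter samples, treat any drop as a counter reset, and divide by the elapsed time. This is an illustration only. The real PromQL functions also extrapolate to the window boundaries, which this sketch deliberately leaves out, and the function names are made up for the example.

```python
def counter_increase(samples):
    """Total increase across (timestamp, value) samples.

    A value lower than its predecessor is treated as a counter reset,
    so the new value itself counts as the increase since the reset.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples):
    """Per-second average increase between the first and last sample."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return counter_increase(samples) / elapsed
```

For the samples [(0, 100), (15, 130), (30, 10), (45, 40)] the counter resets between the second and third sample, so the increase is 30 + 10 + 30 = 70 and the rate is 70 / 45 per second. Reset handling is the reason rate() should be applied to the raw counter before any aggregation such as sum.
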
6. Alertmanager Alerting Configuration

Alerting rules

```yaml
# rules/alerts.yml
groups:
  - name: node_alerts
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"

      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'CPU usage above 80% (current: {{ $value | printf "%.2f" }}%)'

      - alert: LowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 3m
        labels:
          severity: warning

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 2m
        labels:
          severity: critical
```

Alertmanager configuration

```yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: "smtp.qq.com:587"
  smtp_from: alert@example.com
  smtp_auth_password: xxxxxx

route:
  group_by: [alertname, cluster]
  group_wait: 30s         # wait before sending the first notification for a new group
  group_interval: 5m      # wait before notifying about new alerts added to a group
  repeat_interval: 4h     # wait before re-sending a notification that is still firing
  receiver: default-receiver
  routes:
    - match:
        severity: critical
      receiver: critical-receiver
      group_wait: 10s

receivers:
  - name: default-receiver
    email_configs:
      - to: ops@example.com
        send_resolved: true
    slack_configs:
      # Needs a webhook URL: api_url here or slack_api_url under global
      - channel: "#alerts"
        send_resolved: true
  - name: critical-receiver
    webhook_configs:
      - url: http://dingtalk:8060/dingtalk/webhook

# Suppress warning alerts while a critical alert fires for the same instance
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [instance]
```

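The routing tree and the inhibit rule above are easier to reason about with a toy model. The Python sketch below is a deliberately simplified approximation, not Alertmanager's implementation: it supports only exact-match `match` blocks, picks the first matching child route, and ignores `continue`, regex matchers, and grouping. ROUTE and INHIBIT_RULE mirror the configuration above; the function names and alert labels are hypothetical.

```python
ROUTE = {
    "receiver": "default-receiver",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "critical-receiver"},
    ],
}

INHIBIT_RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["instance"],
}


def _matches(labels, match):
    """True when every key/value in `match` is present in `labels`."""
    return all(labels.get(key) == value for key, value in match.items())


def pick_receiver(labels, route, inherited=None):
    """Walk the tree depth-first; the first matching child wins, else the parent's receiver."""
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        if _matches(labels, child.get("match", {})):
            return pick_receiver(labels, child, receiver)
    return receiver


def is_inhibited(target, firing, rule):
    """True when another firing alert matches the source side and agrees on the `equal` labels."""
    if not _matches(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target or not _matches(source, rule["source_match"]):
            continue
        if all(source.get(name) == target.get(name) for name in rule["equal"]):
            return True
    return False
```

Under this model, a NodeDown alert with severity critical goes to critical-receiver, and a HighCPU warning for the same instance is suppressed for as long as NodeDown is firing. A warning on a different instance is still delivered to default-receiver.
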
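7. Common Commands and API

```bash
# Hot-reload the configuration (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# TSDB operations (the admin endpoints require --web.enable-admin-api)
curl http://localhost:9090/api/v1/status/tsdb
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="test"}'
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones

# HTTP API
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="node"}'
curl -G http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=up{job="node"}' \
  --data-urlencode 'start=2024-01-01T00:00:00Z' \
  --data-urlencode 'end=2024-01-01T01:00:00Z' \
  --data-urlencode 'step=60s'
curl http://localhost:9090/api/v1/targets
curl http://localhost:9090/api/v1/alerts
curl http://localhost:9090/api/v1/rules
```

8. Troubleshooting Common Problems

Problem 1: Target Down

```bash
# List targets whose last scrape failed
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health == "down")'
# Check network reachability (replace target_ip and port)
nc -zv target_ip port
# Check the exporter itself
docker logs node-exporter
curl http://target:9100/metrics
```

Problem 2: Missing metrics

```bash
# Search all known metric names
curl -s http://localhost:9090/api/v1/label/__name__/values | jq '.data[]' | grep -i metric
# Query the metric directly
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=metric_name_total'
# Inspect the labels attached to each target
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].labels'
```

Problem 3: Alerts not firing

```bash
# List the alerting rules and their state
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.type == "alerting")'
# Check Alertmanager status and active silences
# (v2 API; the v1 API was removed in Alertmanager 0.27)
curl -s http://localhost:9093/api/v2/status
curl -s http://localhost:9093/api/v2/silences
```

| Check                   | Purpose                          |
|-------------------------|----------------------------------|
| up{job="xxx"}           | Confirm the target state         |
| rate(x[5m]) > 0         | Verify that the metric exists    |
| ALERTS{alertname="xxx"} | Check the alert state            |
| promtool check config   | Validate the configuration file  |
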
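The jq filter from the Target Down check can also be expressed in a few lines of Python when jq is not installed. The sketch below assumes the documented /api/v1/targets response shape (data.activeTargets[] with health, labels, and lastError). The SAMPLE payload is hand-written for illustration, and the function names are not part of any Prometheus client library.

```python
import json
import urllib.request


def fetch_targets(base_url="http://localhost:9090", timeout=5):
    """GET /api/v1/targets and return the decoded JSON payload."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


def down_targets(payload):
    """Return (job, instance, lastError) for each active target whose health is 'down'."""
    rows = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") == "down":
            labels = target.get("labels", {})
            rows.append((labels.get("job"), labels.get("instance"), target.get("lastError", "")))
    return rows


# Trimmed, hand-written example of the response shape (values are illustrative).
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "prometheus", "instance": "localhost:9090"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "node", "instance": "node-exporter:9100"},
             "health": "down", "lastError": "connection refused"},
        ]
    },
}
```

Against a live server, down_targets(fetch_targets()) returns one tuple per failing target. The lastError field usually narrows the problem down to DNS, a refused connection, or a timeout before you need to reach for nc.
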
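9. Best Practices

Naming conventions

```yaml
# Metric name: domain_subsystem_name_unit
node_memory_MemAvailable_bytes
http_request_duration_seconds
# Labels: app_name, env, region, cluster, instance
# Avoid high-cardinality labels (user_id, ip, etc.)
```

Federation

```yaml
- job_name: federate
  metrics_path: /federate
  params:
    "match[]": ['{__name__=~".+"}']   # pulls every series; narrow this in production
  static_configs:
    - targets:
        - prometheus-prod:9090
        - prometheus-prod2:9090
```

High availability

```plaintext
┌─────────────┐     ┌─────────────┐
│ Prometheus  │     │ Prometheus  │   # dual write
│   Primary   │     │   Replica   │
└──────┬──────┘     └──────┬──────┘
       └─────────┬─────────┘
                 ▼
         ┌───────────────┐
         │Thanos Receiver│            # unified storage
         └───────────────┘
```

Remote storage

```yaml
remote_write:
  - url: http://thanos-receive:19291/api/v1/receive
    queue_config:
      capacity: 10000
      max_shards: 30
remote_read:
  - url: http://thanos-query:10912/api/v1/read
    read_recent: true
```

Performance optimization

- Label cardinality: avoid exceeding 100,000 label combinations.
- Scrape interval: 5s for high-frequency targets, 60s for low-frequency ones.
- Recording rules: pre-aggregate complex queries.
- Storage cleanup: set a reasonable retention period.
- Federation partitioning: split Prometheus instances by service domain.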
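The cardinality guideline above is easier to apply with a quick estimate. The number of time series for one metric is bounded by the product of the distinct values of each label, so a single unbounded label can blow the budget on its own. The Python sketch below computes that upper bound; the label names and counts in the usage note are hypothetical, and real series counts are usually lower because not every combination occurs.

```python
import math

SERIES_BUDGET = 100_000  # the ceiling suggested in the list above


def max_series(label_value_counts):
    """Upper bound on series for one metric: product of distinct values per label."""
    return math.prod(label_value_counts.values())


def over_budget(label_value_counts, budget=SERIES_BUDGET):
    """True when the worst-case series count exceeds the budget."""
    return max_series(label_value_counts) > budget
```

With {"instance": 50, "method": 5, "status": 8} the bound is 2,000 series, well within budget. Adding a user_id label with 10,000 distinct values raises it to 20,000,000, which is why identifiers like user_id or ip belong in logs rather than in labels.
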