1. What a "cry wolf" incident taught me
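In 2018 I lived through an incident I will never forget.
At 2 a.m. my phone started buzzing nonstop as alert texts poured in one after another: "Redis connection count over limit", "Database CPU at 99%", "API response time over 5 seconds". According to the alerting system, production was down.
I scrambled out of bed to deal with it, and found that nothing was wrong. Redis was healthy, the database was healthy, and API response times were in the tens of milliseconds.
The next day I asked around. It turned out that an ops colleague had run database maintenance at 1 a.m., which triggered a flood of transient alerts that were then delivered in one batch at 2 a.m. Every one of them was noise: a system under maintenance is expected to look abnormal.
From then on, our team started thinking seriously about how an alerting system should be designed: what kind of alert is actually worth sending?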
äºŒã€æŒ‡æ ‡ä½“系设计:让系统å¯è§
åšç›‘控,首å
ˆè¦æ¸
楚监控什么。业界有两个ç»å
¸çš„监控方法论:RED方法和USE方法。
2.1 RED方法(é¢å‘æœåŠ¡ï¼‰
é€‚ç”¨äºŽæ— çŠ¶æ€æœåŠ¡ï¼ˆå¦‚HTTP API):
- Rate:请求速率(QPS/TPS)
- Error:错误率
- Duration:å“应时间分布(p50/p90/p99)
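To make the three RED signals concrete, here is a minimal sketch that derives them from raw request records for one time window. The `Request` record shape, the function names, and the choice to count only 5xx responses as errors are assumptions made for illustration; in practice Prometheus computes these from counters and histograms.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One observed HTTP request (hypothetical record shape)."""
    status: int        # HTTP status code
    duration_s: float  # response time in seconds


def percentile(sorted_values, p):
    """Nearest-rank percentile (p is an integer 1..100) of a sorted, non-empty list."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def red_summary(requests, window_s):
    """Compute Rate, Errors and Duration for the requests seen in one window."""
    durations = sorted(r.duration_s for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx only
    summary = {
        "rate_qps": len(requests) / window_s,
        "error_ratio": errors / len(requests) if requests else 0.0,
    }
    for p in (50, 90, 99):
        summary[f"p{p}_s"] = percentile(durations, p) if durations else None
    return summary


# 100 requests in a 10-second window: 90 fast successes, 10 slow failures
sample = [Request(200, 0.02)] * 90 + [Request(500, 1.0)] * 10
report = red_summary(sample, window_s=10)
```

Note how the p50 and p90 stay low while the p99 exposes the slow failures, which is why the distribution matters more than the average.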
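2.2 The USE method (resource-oriented)
Suited to system resources such as CPU, memory, and disk:
- Utilization: how busy the resource is
- Saturation: how much work is waiting that the resource cannot serve yet
- Errors: error count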
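The USE checks can be sketched the same way for a single resource. Everything here is illustrative: the `ResourceSample` fields, the function names, and the 0.8 utilization limit are assumptions, not part of any monitoring tool's API.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One sampling window for a resource (hypothetical record shape)."""
    busy_s: float     # time the resource spent doing work
    window_s: float   # length of the sampling window
    queue_len: int    # work items waiting for the resource
    error_count: int  # errors reported by the resource


def use_summary(sample):
    """Compute Utilization, Saturation and Errors for one sample."""
    return {
        "utilization": sample.busy_s / sample.window_s,
        "saturation": sample.queue_len,
        "errors": sample.error_count,
    }


def use_findings(summary, max_utilization=0.8):
    """List the USE checks that this resource fails."""
    findings = []
    if summary["utilization"] > max_utilization:
        findings.append("high utilization")
    if summary["saturation"] > 0:
        findings.append("work is queueing")
    if summary["errors"] > 0:
        findings.append("errors reported")
    return findings


cpu = use_summary(ResourceSample(busy_s=54.0, window_s=60.0, queue_len=3, error_count=0))
problems = use_findings(cpu)
```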
2.3 Our metrics
The metric set we ended up with:

    # Microservice metrics collection config (Prometheus format)
    # Nested keys join with "_" to form the metric name, e.g. http_requests_total

    # Business metrics
    app_business:
      order_count:
        type: counter
        description: Number of orders created
        labels: [service, status]
      payment_amount:
        type: counter
        description: Payment amount
        labels: [service, payment_type]
      active_users:
        type: gauge
        description: Number of active users
        labels: [service]

    # HTTP metrics
    http_requests:
      total:
        type: counter
        description: Total HTTP requests
        labels: [method, path, status]
      duration_seconds:
        type: histogram
        description: HTTP response time
        labels: [method, path]
        buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]

    # JVM metrics
    jvm:
      memory_used_bytes:
        type: gauge
        description: JVM memory in use
        labels: [area, service]
      gc_count:
        type: counter
        description: Number of GC runs
        labels: [gc_type]
      thread_count:
        type: gauge
        description: Number of live threads
        labels: [thread_type]

    # Database metrics
    database:
      connections:
        type: gauge
        description: Number of database connections
        labels: [pool_name]
      query_duration_seconds:
        type: histogram
        description: SQL execution time
        labels: [operation, table]

    # Cache metrics
    redis:
      commands_total:
        type: counter
        description: Total Redis commands
        labels: [command, status]
      keyspace_keys:
        type: gauge
        description: Number of keys
        labels: [db]
      memory_used_bytes:
        type: gauge
        description: Memory used by Redis

2.4 Key alert thresholds

    # Prometheus alerting rules (firing alerts are handed to Alertmanager)
    groups:
      - name: business_alerts
        rules:
          - alert: HighErrorRate
            # Aggregate per service so that $labels.service is available below
            expr: |
              sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
                / sum by (service) (rate(http_requests_total[5m])) > 0.05
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Service {{ $labels.service }} error rate is too high"
              description: "Error rate over the last 5 minutes is above 5%, current value: {{ $value | humanizePercentage }}"
          - alert: OrderCountDrop
            expr: |
              sum(increase(app_business_order_count[10m])) < 100
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Order volume dropped abnormally"
              description: "Fewer than 100 orders in the last 10 minutes, possible business problem"
      - name: infrastructure_alerts
        rules:
          - alert: HighCPUUsage
            expr: |
              100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "CPU usage is too high"
              description: "Server {{ $labels.instance }} CPU usage is above 80%"
          - alert: JVMHeapMemoryHigh
            expr: |
              jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} > 0.85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "JVM heap usage is too high"
              description: "Service {{ $labels.service }} heap usage is above 85%"
          - alert: DatabaseConnectionPoolExhausted
            expr: |
              datasource_connections_active / datasource_connections_max > 0.9
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Database connection pool is about to be exhausted"
              description: "Connection pool {{ $labels.pool_name }} usage is above 90%"

3. Grafana dashboard design
Alerts alone are not enough. You also need dashboards that let the team take in the state of the system at a glance.
3.1 Layered dashboard layout

    ┌────────────────────────────────────────────────────────────────┐
    │ System overview                                                │
    ├──────────────┬──────────────┬──────────────┬───────────────────┤
    │ Request QPS  │ Error rate   │ Avg latency  │ Online users      │
    │ ▓▓▓▓▓░░░     │ 0.12% ✓      │ 45ms ✓       │ 12,345            │
    ├──────────────┴──────────────┴──────────────┴───────────────────┤
    │ Service health                                                 │
    ├──────────────┬──────────────┬──────────────┬───────────────────┤
    │ Order svc    │ Payment svc  │ User svc     │ Product svc       │
    │ ● OK 42ms    │ ● OK 38ms    │ ● OK 25ms    │ ● OK 55ms         │
    ├──────────────┴──────────────┴──────────────┴───────────────────┤
    │ Infrastructure status                                          │
    ├──────────────┬──────────────┬──────────────┬───────────────────┤
    │ CPU: 45%     │ Memory: 62%  │ Disk: 78%    │ Network: OK       │
    │ ▓▓▓▓░░░░░    │ ▓▓▓▓▓▓░░░    │ ▓▓▓▓▓▓▓▓░░   │ ● OK              │
    └──────────────┴──────────────┴──────────────┴───────────────────┘

3.2 Grafana dashboard JSON

    {
      "dashboard": {
        "title": "Microservice monitoring dashboard",
        "tags": ["microservice", "production"],
        "timezone": "Asia/Shanghai",
        "panels": [
          {
            "title": "Service request QPS",
            "type": "graph",
            "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{service=~\"$service\"}[1m])) by (service)",
                "legendFormat": "{{service}}"
              }
            ],
            "alert": {
              "name": "Abnormal QPS alert",
              "conditions": [
                {
                  "evaluator": {"params": [10], "type": "lt"},
                  "operator": {"type": "and"},
                  "query": {"params": ["A", "5m", "now"]},
                  "reducer": {"type": "avg"}
                }
              ],
              "frequency": "1m",
              "noDataState": "no_data"
            }
          },
          {
            "title": "Service error rate",
            "type": "stat",
            "gridPos": {"x": 12, "y": 0, "w": 6, "h": 4},
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "thresholds": {
                  "mode": "absolute",
                  "steps": [
                    {"color": "green", "value": null},
                    {"color": "yellow", "value": 1},
                    {"color": "red", "value": 5}
                  ]
                },
                "unit": "percent",
                "decimals": 2
              }
            }
          },
          {
            "title": "JVM heap usage",
            "type": "gauge",
            "gridPos": {"x": 18, "y": 0, "w": 6, "h": 4},
            "targets": [
              {
                "expr": "jvm_memory_used_bytes{area=\"heap\"} / jvm_memory_max_bytes{area=\"heap\"} * 100"
              }
            ],
            "fieldConfig": {
              "defaults": {
                "thresholds": {
                  "mode": "percentage",
                  "steps": [
                    {"color": "green", "value": null},
                    {"color": "yellow", "value": 70},
                    {"color": "red", "value": 85}
                  ]
                },
                "unit": "percent",
                "max": 100
              }
            }
          },
          {
            "title": "Database connection pool",
            "type": "graph",
            "gridPos": {"x": 0, "y": 8, "w": 12, "h": 8},
            "targets": [
              {
                "expr": "datasource_connections_active{pool_name=~\"$pool\"}",
                "legendFormat": "Active connections"
              },
              {
                "expr": "datasource_connections_idle{pool_name=~\"$pool\"}",
                "legendFormat": "Idle connections"
              },
              {
                "expr": "datasource_connections_max{pool_name=~\"$pool\"}",
                "legendFormat": "Max connections"
              }
            ]
          }
        ]
      }
    }

4. Alert severity levels and noise reduction
4.1 Severity levels
We classify alerts into five levels:
| Level | Name | Definition | Response time | Notification |
|---|---|---|---|---|
| P0 | Highest | Core business completely unavailable | Within 5 minutes | Phone call + SMS + DingTalk |
| P1 | High | Core functionality degraded | Within 15 minutes | SMS + DingTalk |
| P2 | Medium | Non-core functionality abnormal | Within 1 hour | DingTalk |
| P3 | Low | Potential risk | During working hours | Email |
| P4 | Info | Inconsequential notice | No action | Log |
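The table translates directly into a lookup structure that notification code can consult. The sketch below is one possible encoding; `SeverityPolicy`, `POLICIES`, and the helper names are hypothetical, and the channel strings are placeholders rather than real integrations.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SeverityPolicy:
    name: str
    response_minutes: Optional[int]  # None means no fixed deadline
    channels: Tuple[str, ...]


# One entry per row of the severity table
POLICIES = {
    "P0": SeverityPolicy("highest", 5, ("phone", "sms", "dingtalk")),
    "P1": SeverityPolicy("high", 15, ("sms", "dingtalk")),
    "P2": SeverityPolicy("medium", 60, ("dingtalk",)),
    "P3": SeverityPolicy("low", None, ("email",)),
    "P4": SeverityPolicy("info", None, ("log",)),
}


def channels_for(level):
    """Return the notification channels for a severity level."""
    return POLICIES[level].channels


def is_overdue(level, minutes_since_fired):
    """True when an alert has waited longer than its response deadline."""
    deadline = POLICIES[level].response_minutes
    return deadline is not None and minutes_since_fired > deadline
```

Keeping the policy in data rather than in branching code makes it easy to review the levels alongside the alert rules.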
4.2 Noise reduction strategy
Alert storms are the biggest enemy. We rely on Alertmanager's grouping and inhibition features:

    # Alertmanager configuration
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alert@example.com'

    # Alert routing
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s        # wait 30 seconds to collect a group
      group_interval: 5m     # send updates for a group at most every 5 minutes
      repeat_interval: 4h    # re-send a still-firing alert every 4 hours
      receiver: 'default'
      routes:
        # Per-service routing. Listed first because the first matching route
        # wins; after the severity routes it would never be reached.
        - match:
            service: order-service
          receiver: 'order-team'
          routes:
            - match:
                severity: critical
              receiver: 'order-team-critical'
              group_wait: 0s
        # P0/P1 alerts are sent immediately
        - match:
            severity: critical
          receiver: 'critical-alerts'
          group_wait: 0s
          repeat_interval: 1h
        # P2 alerts are batched before sending
        - match:
            severity: warning
          receiver: 'warning-alerts'
          group_wait: 1m
          repeat_interval: 4h

    receivers:
      # Placeholders so every routed receiver exists; notification settings omitted
      - name: 'default'
      - name: 'order-team'
      - name: 'order-team-critical'
      - name: 'critical-alerts'
        webhook_configs:
          # Phone call (via Tencent Cloud voice service)
          - url: 'http://alert-phone.example.com/call'
            send_resolved: true
          # DingTalk (Alertmanager webhooks always POST JSON)
          - url: 'http://dingtalk.example.com/webhook'
            send_resolved: true
            max_alerts: 10
        # Email
        email_configs:
          - to: 'oncall@example.com'
            send_resolved: true
      - name: 'warning-alerts'
        # DingTalk only: no phone call, no email
        webhook_configs:
          - url: 'http://dingtalk.example.com/webhook-warning'

    # Inhibition rules
    inhibit_rules:
      # When a server is down, suppress every other alert from the same instance
      - source_match:
          alertname: 'ServerDown'
        target_match_re:
          alertname: '.+'
        equal: ['cluster', 'instance']
      # When a whole cluster is down, suppress every other alert from that cluster
      - source_match:
          alertname: 'ClusterDown'
        target_match_re:
          alertname: '.+'
        equal: ['cluster']

5. Pitfalls: hard-won lessons from building the alerting system
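Pitfall 1: Alert storms create a cry-wolf effect
This is the biggest pit we fell into.
Once, a database primary/replica failover triggered several hundred alerts. The ops engineers drowned in them and missed the one that really mattered: an application server's disk was full.
Fix:
- Use Alertmanager's `group_by` to aggregate alerts of the same kind
- Set a sensible `group_wait` (30 seconds) so alerts are not fragmented
- Configure inhibition rules so that an upstream failure suppresses downstream alerts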
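Pitfall 2: False alerts trigger unnecessary emergency responses
Some metrics are inherently spiky (a momentary CPU burst, for example), but our alert rules had no sensible `for` duration, so any transient fluctuation fired an alert.
Fix:
- Every alert must set a `for` duration (typically 5 minutes) so that transient fluctuations do not fire it
- Add progressive alerting for key metrics: warning first, then critical if the condition persists longer
- Review alert rules regularly and remove the ones that only produce noise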
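The effect of a `for` duration combined with progressive severity can be sketched as follows. This is a simplified model of the idea, assuming evenly collected `(timestamp, value)` samples; the function name and thresholds are hypothetical and it does not reproduce Prometheus's exact evaluation.

```python
def alert_state(samples, threshold, warn_after_s, crit_after_s):
    """Return "ok", "warning" or "critical" as of the latest sample.

    samples: list of (timestamp_s, value) pairs in time order.
    The threshold must be exceeded on every sample of an unbroken run;
    the length of that run decides the severity.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
        else:
            breach_start = None  # any dip below the threshold resets the clock
    if breach_start is None:
        return "ok"
    held_for = samples[-1][0] - breach_start
    if held_for >= crit_after_s:
        return "critical"
    if held_for >= warn_after_s:
        return "warning"
    return "ok"  # condition is true but has not held long enough yet


spike = [(0, 50), (60, 95), (120, 50), (180, 55)]
sustained = [(t, 90) for t in range(0, 360, 60)]    # above threshold for 300 s
prolonged = [(t, 90) for t in range(0, 960, 60)]    # above threshold for 900 s
states = [
    alert_state(s, threshold=80, warn_after_s=300, crit_after_s=900)
    for s in (spike, sustained, prolonged)
]
```

The one-sample spike never alerts, the 5-minute breach becomes a warning, and only the 15-minute breach escalates to critical.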
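Pitfall 3: Nobody handles alerts at night
Once, a P2 alert fired at 3 a.m. Only one person was notified, and that person was asleep. The problem was not discovered until the next morning.
Fix:
- Set up an on-call rotation that makes clear who is on duty in each time slot
- P0/P1 alerts must go out by phone call and must be acknowledged
- Outside working hours, P2 alerts go only to the on-call engineer, without disturbing anyone else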
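These rules amount to a small routing function. The sketch below is one way to express them; the rotation, the engineer names, the 09:00 to 18:00 working hours, and sending in-hours P2 alerts to the whole team are all assumptions made for illustration, and weekends are ignored.

```python
from datetime import datetime, time

# Hypothetical rotation: (shift start, shift end, engineer)
ROTATION = [
    (time(9, 0), time(18, 0), "alice"),
    (time(18, 0), time(9, 0), "bob"),  # overnight shift wraps past midnight
]
TEAM = ["alice", "bob", "carol"]
WORK_START, WORK_END = time(9, 0), time(18, 0)


def on_call(now):
    """Return the engineer whose shift covers `now`."""
    t = now.time()
    for start, end, engineer in ROTATION:
        if start <= end:
            covers = start <= t < end
        else:
            covers = t >= start or t < end  # shift crosses midnight
        if covers:
            return engineer
    raise LookupError("no on-call engineer covers this time")


def recipients(level, now):
    """Decide who is notified, how, and whether an acknowledgement is required."""
    duty = on_call(now)
    if level in ("P0", "P1"):
        return {"notify": [duty], "channel": "phone", "needs_ack": True}
    if level == "P2":
        in_hours = WORK_START <= now.time() < WORK_END
        notify = TEAM if in_hours else [duty]
        return {"notify": notify, "channel": "dingtalk", "needs_ack": False}
    raise ValueError(f"this sketch only routes P0-P2, got {level}")


night_p2 = recipients("P2", datetime(2024, 1, 10, 3, 0))
day_p2 = recipients("P2", datetime(2024, 1, 10, 11, 0))
night_p0 = recipients("P0", datetime(2024, 1, 10, 3, 0))
```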
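Pitfall 4: A single notification channel
At first we sent alerts by email only, and found that:
- Nobody reads email when an alert is urgent
- Email delays slowed down the response
Fix: set up multiple notification channels:
- Phone call: P0, must be answered
- SMS: P0/P1
- DingTalk/Feishu group: all levels
- Email: P3/P4, for the record only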
6. Case study: how a financial company built a complete observability stack
The company (we will call it Company A) is a consumer finance startup. In 2020 they started building their observability stack from scratch.
ç¬¬ä¸€é˜¶æ®µï¼šåªæœ‰åŸºç¡€ç›‘控(2020å¹´Q1)
å½“æ—¶ä»–ä»¬çš„ç›‘æŽ§çŠ¶æ€æ˜¯ï¼š
- åªæœ‰æœåŠ¡å™¨åŸºç¡€ç›‘æŽ§ï¼ˆCPUã€å†
å˜ã€ç£ç›˜ï¼‰ - 没有任何应用层监控
- 告è¦åªæœ‰é‚®ä»¶
- æ¯å¤©æ—©ä¸Šçœ‹ä¸€æ¬¡ç›‘控大盘
问题:ç»å¸¸æ”¶åˆ°ç”¨æˆ·æŠ•诉"系统æ
¢äº†",但开å‘团队完å
¨ä¸çŸ¥é“å‘生了什么。
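Phase 2: introducing APM (Q2 2020)
They adopted SkyWalking APM and gained distributed tracing:
- They could see the call relationships between services
- They found a large number of slow SQL queries
- They could see the response time distribution of every endpoint
Improvement: they could now locate problems, but they were still reactive, always learning about a problem only after it had happened.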
Phase 3: a complete observability stack (Q3-Q4 2020)
They built out the full "three pillars" of observability:
Logs: ELK Stack
- Structured logs (JSON format)
- A TraceID carried through every log line
- Log aggregation and search
æŒ‡æ ‡ï¼šPrometheus + Grafana
- RED方法覆盖所有HTTPæœåŠ¡
- USE方法覆盖所有基础设施
- è‡ªå®šä¹‰ä¸šåŠ¡æŒ‡æ ‡
链路:SkyWalking
- å
¨é“¾è·¯è¿½è¸ª - 拓扑图自动生æˆ
- æ
¢æœåŠ¡åˆ†æž
告è¦ï¼šAlertManager + 钉钉
- 分级告è¦ï¼ˆP0-P4)
- 告è¦èšåˆå’Œæ”¶æ•›
- å€¼çæœºåˆ¶
Results:
- MTTD (mean time to detect) dropped from 4 hours to 5 minutes
- MTTR (mean time to recover) dropped from 2 hours to 30 minutes
- User complaints that "the system is slow" fell by 80%
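For readers who want to track the same two numbers, the sketch below shows one way to compute them from incident records. The `Incident` fields and the sample timestamps are invented for illustration, and measuring MTTR from the start of the fault (rather than from detection) is an assumption, since teams define it both ways.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the fault began
    detected: datetime  # when an alert or a person noticed it
    resolved: datetime  # when service was restored


def mean_delta(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time to detect: fault start to detection."""
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents):
    """Mean time to recover: fault start to restoration (assumed definition)."""
    return mean_delta([i.resolved - i.started for i in incidents])


# Invented sample data
history = [
    Incident(datetime(2020, 10, 1, 10, 0), datetime(2020, 10, 1, 10, 4), datetime(2020, 10, 1, 10, 30)),
    Incident(datetime(2020, 10, 8, 14, 0), datetime(2020, 10, 8, 14, 6), datetime(2020, 10, 8, 14, 30)),
]
detect_time = mttd(history)
recover_time = mttr(history)
```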
ä¸ƒã€æ€»ç»“与æ€è€ƒ
监控告è¦ä½“系建设的å
³é”®è¦ç‚¹ï¼š
- **分层监控**:基础设施层ã€åº”用层ã€ä¸šåŠ¡å±‚éƒ½è¦è¦†ç›–
- é»„é‡‘æŒ‡æ ‡ï¼šå»¶è¿Ÿã€æµé‡ã€é”™è¯¯çއã€é¥±å’Œåº¦æ˜¯æ ¸å¿ƒ
- **告è¦åˆ†çº§**ï¼šä¸æ˜¯æ‰€æœ‰å‘Šè¦éƒ½ä¸€æ ·é‡è¦ï¼Œè¦åˆ†çº§å¤„ç†
- å‘Šè¦æ”¶æ•›ï¼šé¿å
告è¦é£Žæš´æ·¹æ²¡çœŸæ£é‡è¦çš„å‘Šè¦ - **å€¼çæœºåˆ¶**:确ä¿å‘Šè¦æœ‰äººå“应,ä¸èƒ½çŸ³æ²‰å¤§æµ·
- æŒç»ä¼˜åŒ–:定期Review告è¦è§„则,æ¸
ç†æ— 效告è¦
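A lesson learned the hard way:
The point of an alerting system is not to "detect every problem" but to "detect the problems that truly need human intervention". Too many alerts are just as harmful as too few.
Questions for you:
- Does your team's alerting have a "cry wolf" problem?
- If you received a P2 alert in the middle of the night, what would you do?
Personal opinion, for reference only.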