
Deployment Guide: Kubernetes Monitoring and Log Collection System

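1. Environment Preparation

1.1 Check the Kubernetes cluster status

# Check node status
kubectl get nodes -o wide

# Check control-plane component status
# (componentstatuses is deprecated since Kubernetes v1.19 and may return incomplete data)
kubectl get cs

# Check the available storage classes
kubectl get storageclass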

1.2 Create the working directories

# Create the working directory and its subdirectories
mkdir -p k8s-monitoring
cd k8s-monitoring
mkdir -p manifests logs

2. Deploying the Resource Monitoring System

2.1 Create the monitoring namespace

kubectl create namespace monitoring

2.2 Prepare the monitoring configuration

2.2.1 Create the Prometheus Stack values file
cat > prometheus-values.yaml << 'EOF'
# Replace these placeholders (listed here without angle brackets) before installing:
#   INTERNAL_REGISTRY - address of the internal image registry
#   STORAGE_CLASS     - name of an existing StorageClass
#   GRAFANA_PASSWORD  - Grafana administrator password
global:
  imageRegistry: "<INTERNAL_REGISTRY>"
  # The regcred pull secret must already exist in the monitoring namespace
  imagePullSecrets: ["regcred"]

prometheus:
  prometheusSpec:
    retention: "10d"
    scrapeInterval: "30s"
    evaluationInterval: "30s"
    resources:
      requests:
        memory: "400Mi"
        cpu: "200m"
      limits:
        memory: "2Gi"
        cpu: "1000m"
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: "<STORAGE_CLASS>"
          resources:
            requests:
              storage: "50Gi"
    # Select every ServiceMonitor, PodMonitor and rule in the cluster,
    # not only the ones created by this Helm release
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    podMonitorSelectorNilUsesHelmValues: false
    podMonitorSelector: {}
    ruleSelectorNilUsesHelmValues: false
    ruleSelector: {}

kube-state-metrics:
  resources:
    requests:
      memory: "32Mi"
      cpu: "10m"
    limits:
      memory: "128Mi"
      cpu: "100m"

# Values for the node exporter subchart go under the prometheus-node-exporter key
prometheus-node-exporter:
  resources:
    requests:
      memory: "30Mi"
      cpu: "10m"
    limits:
      memory: "50Mi"
      cpu: "200m"

grafana:
  adminUser: "admin"
  adminPassword: "<GRAFANA_PASSWORD>"
  persistence:
    enabled: true
    size: "10Gi"
    storageClassName: "<STORAGE_CLASS>"

alertmanager:
  enabled: false
EOF

# Replace the placeholders with sed (or edit the file by hand).
# The values below are examples only; do not use admin123 as a real password.
sed -i 's/<INTERNAL_REGISTRY>/registry.internal.company.com/g' prometheus-values.yaml
sed -i 's/<STORAGE_CLASS>/standard/g' prometheus-values.yaml
sed -i 's/<GRAFANA_PASSWORD>/admin123/g' prometheus-values.yaml
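Because the placeholders are substituted with sed, a typo in one of the patterns leaves a literal <...> token in the values file without any error. The sketch below is one way to catch that before installing; find_placeholders is a hypothetical helper name, and it assumes placeholders are always written as upper-case words in angle brackets.

```shell
# Print every remaining <UPPER_CASE> placeholder as "line:match".
# Returns 1 if any placeholder is left, 0 if the file is clean.
find_placeholders() {
  # -o prints only the matched text, -n prefixes the line number
  if grep -on '<[A-Z_][A-Z_]*>' "$1"; then
    return 1
  fi
  return 0
}
```

Running find_placeholders prometheus-values.yaml after the sed commands should print nothing and return 0; the same check can be applied to fluent-bit-daemonset.yaml later on.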

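2.3 Install the Prometheus Stack

# 1. On a machine with internet access, add the chart repository and download the chart package
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm pull prometheus-community/kube-prometheus-stack --version 45.0.0

# 2. Transfer the chart package to the internal network
#    (the next step assumes it is in the current directory)

# 3. Unpack and install
tar -xzf kube-prometheus-stack-45.0.0.tgz
helm install prometheus-stack ./kube-prometheus-stack \
  -n monitoring \
  -f prometheus-values.yaml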

2.4 Verify the monitoring installation

# Watch pod status (press Ctrl+C once every pod is Running)
kubectl get pods -n monitoring -w

# Then run the following checks
kubectl get all -n monitoring

# Check the PersistentVolumeClaims
kubectl get pvc -n monitoring

# Test the Prometheus service
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 &
# Open http://localhost:9090 in a browser

# Test the Grafana service
kubectl port-forward -n monitoring svc/prometheus-stack-grafana 3000:80 &
# Open http://localhost:3000 in a browser
# Username: admin, password: <GRAFANA_PASSWORD>

# Stop the port forwards
kill %1 %2
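Watching the pod list by eye works for a first install, but a scripted check is easier to repeat. The following is a minimal sketch, not part of the original procedure: all_pods_ready is a hypothetical helper that reads the output of kubectl get pods --no-headers on stdin and assumes the default column order (NAME, READY, STATUS).

```shell
# Succeeds only if every listed pod is Completed, or Running with all containers ready.
# Fails on empty input, since "no pods" usually means the install did not work.
all_pods_ready() {
  awk '
    {
      split($2, r, "/")   # READY column, for example "2/3"
      ok = ($3 == "Completed") || ($3 == "Running" && r[1] == r[2])
      if (!ok) { print "not ready: " $1 " (" $2 ", " $3 ")"; bad = 1 }
    }
    END { exit (bad || NR == 0) }
  '
}

# Usage (requires cluster access):
#   kubectl get pods -n monitoring --no-headers | all_pods_ready && echo "all ready"
```

The helper only inspects text, so it can be used for the logging and test-monitoring namespaces as well.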

3. Deploying the Log Collection System

3.1 Create the logging namespace

kubectl create namespace logging

3.2 Deploy the RBAC permissions

3.2.1 Create the ServiceAccount and ClusterRole
cat > fluent-bit-rbac.yaml << 'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit-read
rules:
  - apiGroups: [""]
    resources:
      - namespaces
      - pods
      # The log subresource is named pods/log (singular)
      - pods/log
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-read
subjects:
  - kind: ServiceAccount
    name: fluent-bit
    namespace: logging
EOF

kubectl apply -f fluent-bit-rbac.yaml

3.3 Create the Fluent Bit configuration

3.3.1 Create the ConfigMap
cat > fluent-bit-configmap.yaml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         5
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020
        # Enables the /api/v1/health endpoint used by the DaemonSet probes
        Health_Check  On

    @INCLUDE input-kubernetes.conf
    @INCLUDE filter-kubernetes.conf
    @INCLUDE output-file.conf

  input-kubernetes.conf: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        # The docker parser expects JSON log lines (Docker runtime);
        # nodes running containerd or CRI-O need a CRI parser instead
        Parser            docker
        # /var/log is mounted read-only, so the offset database lives on the writable storage mount
        DB                /var/log/flb-storage/flb_kube.db
        DB.Sync           Normal
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10

  filter-kubernetes.conf: |
    [FILTER]
        Name                 kubernetes
        Match                kube.*
        Kube_URL             https://kubernetes.default.svc:443
        Kube_CA_File         /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File      /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix      kube.var.log.containers.
        Merge_Log            On
        Merge_Log_Key        log_processed
        Keep_Log             Off
        K8S-Logging.Parser   On
        K8S-Logging.Exclude  Off
        Labels               On
        Annotations          Off

    [FILTER]
        Name   modify
        Match  *
        Add    node_name ${NODE_NAME}
        Add    host_ip ${HOST_IP}

  output-file.conf: |
    [OUTPUT]
        Name         file
        Match        *
        Path         /var/log/k8s-logs/
        # No File option is set, so each output file is named after the record tag:
        # kube.var.log.containers.<pod>_<namespace>_<container>-<container-id>.log
        # The plain format writes one JSON record per line
        Format       plain
        Retry_Limit  False

  parsers.conf: |
    [PARSER]
        Name             docker
        Format           json
        Time_Key         time
        Time_Format      %Y-%m-%dT%H:%M:%S.%LZ
        Time_Keep        On
        Decode_Field_As  escaped_utf8 log do_next
        Decode_Field_As  json log
EOF

kubectl apply -f fluent-bit-configmap.yaml
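When the file output has no File option, Fluent Bit names each output file after the record tag, so the files under /var/log/k8s-logs/ carry the pod, namespace and container in their names. The sketch below shows one way to split such a name apart; parse_log_name is a hypothetical helper, and the name layout it expects (kube.var.log.containers.<pod>_<namespace>_<container>-<container-id>.log) is an assumption based on the tail input's kube.* tag and the usual /var/log/containers naming.

```shell
# Print "<namespace> <pod> <container>" for a tag-derived log file name.
parse_log_name() {
  name=${1##*/}                          # drop any directory part
  name=${name#kube.var.log.containers.}  # drop the tag prefix
  name=${name%.log}                      # drop the extension
  pod=${name%%_*}                        # text before the first "_"
  rest=${name#*_}
  namespace=${rest%%_*}                  # text between the first and second "_"
  rest=${rest#*_}
  container=${rest%-*}                   # strip the trailing "-<container-id>"
  printf '%s %s %s\n' "$namespace" "$pod" "$container"
}
```

This relies on pod and namespace names never containing an underscore and on the container ID never containing a hyphen, which holds for valid Kubernetes names and hexadecimal IDs.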

3.4 Deploy the Fluent Bit DaemonSet

3.4.1 Create the DaemonSet
cat > fluent-bit-daemonset.yaml << 'EOF'
# Replace INTERNAL_REGISTRY (in angle brackets below) with the internal image registry address
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      k8s-app: fluent-bit-logging
  template:
    metadata:
      labels:
        k8s-app: fluent-bit-logging
    spec:
      # The API token and ca.crt are provided by the ServiceAccount volume
      # that Kubernetes mounts automatically for this account
      serviceAccountName: fluent-bit
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: fluent-bit
          image: <INTERNAL_REGISTRY>/fluent/fluent-bit:2.1.9
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 2020
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
          resources:
            requests:
              memory: "50Mi"
              cpu: "10m"
            limits:
              memory: "200Mi"
              cpu: "500m"
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: fluent-bit-config
              mountPath: /fluent-bit/etc/
            - name: flb-storage
              mountPath: /var/log/flb-storage/
            # Writable mount for the file output (/var/log itself is read-only)
            - name: k8s-logs
              mountPath: /var/log/k8s-logs/
          livenessProbe:
            httpGet:
              path: /api/v1/health
              port: 2020
            initialDelaySeconds: 30
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /api/v1/health
              port: 2020
            initialDelaySeconds: 5
            periodSeconds: 10
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: fluent-bit-config
          configMap:
            name: fluent-bit-config
        - name: flb-storage
          hostPath:
            path: /var/log/flb-storage
            type: DirectoryOrCreate
        - name: k8s-logs
          hostPath:
            path: /var/log/k8s-logs
            type: DirectoryOrCreate
EOF

# Replace the image registry placeholder
sed -i 's/<INTERNAL_REGISTRY>/registry.internal.company.com/g' fluent-bit-daemonset.yaml

kubectl apply -f fluent-bit-daemonset.yaml

3.5 Configure log storage on the nodes

3.5.1 Run the log directory setup on every node
# Create the setup script
cat > setup-node-logs.sh << 'EOF'
#!/bin/bash
# Create the log storage directories
LOG_DIR="/var/log/k8s-logs"
FLB_STORAGE_DIR="/var/log/flb-storage"

mkdir -p "$LOG_DIR" "$FLB_STORAGE_DIR"
chmod 755 "$LOG_DIR" "$FLB_STORAGE_DIR"

# Create the logrotate configuration
cat > /etc/logrotate.d/k8s-pod-logs << 'LOGROTATE_EOF'
# With dateext and this dateformat, archives are named <file>.log-YYYYMMDD.gz
/var/log/k8s-logs/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 0644 root root
    dateext
    dateformat -%Y%m%d
    sharedscripts
    postrotate
        find /var/log/k8s-logs/ -name "*.log-*.gz" -mtime +60 -delete
    endscript
}
LOGROTATE_EOF

echo "Node log configuration complete"
echo "Log directory: $LOG_DIR"
echo "Fluent Bit storage directory: $FLB_STORAGE_DIR"
EOF

# Make the script executable
chmod +x setup-node-logs.sh

# Copy the script to every node and run it
# Note: this requires SSH access to all nodes
# Example (replace with your own node list):
NODES="node1 node2 node3"
for NODE in $NODES; do
  scp setup-node-logs.sh "$NODE":/tmp/
  ssh "$NODE" "sudo /tmp/setup-node-logs.sh"
done

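3.6 Verify the log collection deployment

# Check the DaemonSet status
kubectl get daemonset -n logging

# Check pod status
kubectl get pods -n logging -o wide

# View the Fluent Bit logs
kubectl logs -n logging -l k8s-app=fluent-bit-logging --tail=20

# Check the Fluent Bit configuration
# (read from the ConfigMap; the standard fluent-bit image may not include a shell or cat for kubectl exec)
kubectl describe configmap fluent-bit-config -n logging

# Test the Fluent Bit health endpoint
# (this guide creates no Service for Fluent Bit, so forward to one pod directly)
FLB_POD=$(kubectl get pod -n logging -l k8s-app=fluent-bit-logging -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n logging "pod/$FLB_POD" 2020:2020 &
sleep 2
curl http://localhost:2020/api/v1/health
kill %1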
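A DaemonSet is healthy only when the number of ready pods equals the number of nodes it should run on. As a rough sketch of that comparison, the hypothetical helper ds_fully_ready below takes the two counts as arguments; the usage comment assumes the DaemonSet name and namespace from section 3.4.

```shell
# Succeeds when at least one pod is desired and every desired pod is ready.
ds_fully_ready() {
  desired=${1:-0}
  ready=${2:-0}
  [ "$desired" -gt 0 ] && [ "$desired" -eq "$ready" ]
}

# Usage (requires cluster access):
#   ds_fully_ready $(kubectl get daemonset fluent-bit -n logging \
#     -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}') \
#     && echo "Fluent Bit is ready on every scheduled node"
```

For an interactive check, kubectl rollout status daemonset/fluent-bit -n logging waits for the same condition.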

4. Functional Verification

4.1 Create a test application

# Create the test namespace
kubectl create namespace test-monitoring

# Deploy the test application
# (nginx:alpine and busybox must be available from a registry the nodes can reach)
cat > test-deployment.yaml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app
  namespace: test-monitoring
spec:
  replicas: 3
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
        - name: nginx
          image: nginx:alpine
          ports:
            - containerPort: 80
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "200m"
        - name: log-generator
          image: busybox
          command: ["sh", "-c"]
          args:
            - |
              # Write one line to stdout every 10 seconds
              counter=0
              while true; do
                echo "Test log message $counter at $(date)"
                counter=$((counter+1))
                sleep 10
              done
          resources:
            requests:
              memory: "64Mi"
              cpu: "50m"
            limits:
              memory: "128Mi"
              cpu: "100m"
EOF

kubectl apply -f test-deployment.yaml

# Create a Service for the test application
kubectl expose deployment test-app -n test-monitoring --port=80

4.2 Verify monitoring

# Wait for the pods to start (press Ctrl+C once they are Running)
kubectl get pods -n test-monitoring -w

# Inspect the metrics
# 1. Open Prometheus
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 &
# In the Prometheus UI, query:
# container_cpu_usage_seconds_total{namespace="test-monitoring"}
# container_memory_working_set_bytes{namespace="test-monitoring"}

# 2. Open Grafana
kubectl port-forward -n monitoring svc/prometheus-stack-grafana 3000:80 &
# Log in to Grafana and open the bundled Kubernetes dashboards

# Stop the port forwards
kill %1 %2

4.3 Verify log collection

# Find the nodes the test pods run on
kubectl get pods -n test-monitoring -o wide

# Log in to one of those nodes and look at the log files
# (this example assumes a pod is running on node1)
ssh node1 "ls -la /var/log/k8s-logs/ | head -10"
ssh node1 "tail -f /var/log/k8s-logs/*test-app*.log"

# Alternatively, check what Fluent Bit itself reports
kubectl logs -n logging -l k8s-app=fluent-bit-logging --tail=50 | grep -i test-app

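4.4 Verify scaling up and down

# Scale the test application up
kubectl scale deployment test-app -n test-monitoring --replicas=5

# Check that logs from the new pods are collected
kubectl get pods -n test-monitoring -o wide
# Log in to the nodes hosting the new pods and look at the log files

# Scale the test application down
kubectl scale deployment test-app -n test-monitoring --replicas=2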
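After scaling up, checking each new pod's log file by hand gets tedious. The sketch below is one possible shortcut: missing_pod_logs is a hypothetical helper that reads pod names on stdin and prints those with no matching file in a directory. It assumes the log file names contain the pod name followed by an underscore, as tag-derived names do.

```shell
# Print each pod name from stdin that has no matching *.log file in the given directory.
missing_pod_logs() {
  dir=$1
  while IFS= read -r pod; do
    [ -n "$pod" ] || continue
    # An unmatched glob stays literal, so test whether the first result exists
    set -- "$dir"/*"$pod"_*.log
    [ -e "$1" ] || printf '%s\n' "$pod"
  done
}

# Usage on a node, with pod names supplied from a machine that has kubectl:
#   kubectl get pods -n test-monitoring --field-selector spec.nodeName=node1 \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \
#     | ssh node1 "$(declare -f missing_pod_logs 2>/dev/null); missing_pod_logs /var/log/k8s-logs"
```

The ssh line in the usage comment depends on bash's declare -f to ship the function to the node; copying the function into a small script on the node works just as well. A pod that started only seconds ago may be listed until Fluent Bit has flushed its first records.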

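5. Cleaning Up Test Resources

# Remove the test application
kubectl delete namespace test-monitoring

# Optional: remove the monitoring and logging systems
# helm uninstall prometheus-stack -n monitoring
# kubectl delete namespace monitoring
# kubectl delete namespace logging
# kubectl delete clusterrolebinding fluent-bit-read
# kubectl delete clusterrole fluent-bit-read

# Clean the log directories on the nodes (if needed)
# Run on each node:
# sudo rm -rf /var/log/k8s-logs/*
# sudo rm -rf /var/log/flb-storage/*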

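6. Maintenance Command Reference

6.1 Monitoring system maintenance

# Check the state of the monitoring components
kubectl get all -n monitoring

# Check Prometheus storage usage
# (the pod name depends on the Helm release name, so look it up by label)
PROM_POD=$(kubectl get pod -n monitoring -l app.kubernetes.io/name=prometheus -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n monitoring "$PROM_POD" -c prometheus -- df -h

# Restart Prometheus (if needed)
kubectl delete pod -n monitoring -l app.kubernetes.io/name=prometheus

# Apply changed values (run from the directory that holds the unpacked chart)
helm upgrade prometheus-stack ./kube-prometheus-stack -n monitoring -f prometheus-values.yaml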

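6.2 Logging system maintenance

# Check the Fluent Bit status
kubectl get daemonset -n logging
kubectl get pods -n logging -o wide

# Restart Fluent Bit (rolling restart)
kubectl rollout restart daemonset fluent-bit -n logging

# View the log collection metrics
# (this guide creates no Service for Fluent Bit, so forward to one pod directly)
FLB_POD=$(kubectl get pod -n logging -l k8s-app=fluent-bit-logging -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n logging "pod/$FLB_POD" 2020:2020 &
sleep 2
curl http://localhost:2020/api/v1/metrics
kill %1

# Check disk space (on each node)
df -h /var/log
du -sh /var/log/k8s-logs/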

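6.3 Log rotation management

# Force a log rotation (on each node)
sudo logrotate -f /etc/logrotate.d/k8s-pod-logs

# Dry run: show what logrotate would do without changing any files
sudo logrotate -d /etc/logrotate.d/k8s-pod-logs

# Remove old logs by hand (live logs untouched for 30 days, archives older than 60 days)
find /var/log/k8s-logs/ -name "*.log" -mtime +30 -delete
find /var/log/k8s-logs/ -name "*.log-*.gz" -mtime +60 -delete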
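Since find with -delete is irreversible, it can help to list the candidates first. The sketch below is a dry-run variant under one assumption: archives follow the name.log-YYYYMMDD.gz pattern that the dateext and dateformat -%Y%m%d settings in the logrotate configuration produce. list_old_archives is a hypothetical helper name.

```shell
# Print (but do not delete) rotated archives last modified more than N days ago.
list_old_archives() {
  dir=$1
  days=$2
  find "$dir" -type f -name '*.log-*.gz' -mtime +"$days" -print
}

# Usage on a node:
#   list_old_archives /var/log/k8s-logs 60
```

Once the listed files look right, the same find expression with -delete in place of -print removes them.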

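7. Troubleshooting

7.1 Common checks

# 1. A pod fails to start
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

# 2. Image pull failures
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events

# 3. Storage volume problems
kubectl describe pvc <pvc-name> -n <namespace>

# 4. Permission problems
kubectl auth can-i get pods --as=system:serviceaccount:logging:fluent-bit

# 5. Network connectivity problems
#    (needs curl inside the container; the standard fluent-bit image may not include it,
#     in which case a debug image variant is required)
kubectl exec -n logging <fluent-bit-pod> -- curl -k https://kubernetes.default.svc:443/healthz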

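7.2 Troubleshooting log collection

# Check the Fluent Bit configuration
# (read from the ConfigMap; the standard fluent-bit image may not include cat for kubectl exec)
kubectl describe configmap fluent-bit-config -n logging

# Check log file permissions (needs ls inside the container, for example a debug image variant)
kubectl exec -n logging <fluent-bit-pod> -- ls -la /var/log/containers/

# Enable debug mode:
# edit the ConfigMap, set Log_Level to debug, then restart the DaemonSet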

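8. Upgrading and Extending

8.1 Upgrade the monitoring system

# Show the current version
helm list -n monitoring

# Upgrade to a new version
# (in an offline environment, pull the new chart on a connected machine as in section 2.3,
#  unpack it, and pass ./kube-prometheus-stack instead of the repository reference)
helm repo update
helm upgrade prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring -f prometheus-values.yaml

# Roll back to revision 1 (if needed)
helm rollback prometheus-stack 1 -n monitoring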

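8.2 Extend log collection

# Raise the Fluent Bit resource limits:
# edit fluent-bit-daemonset.yaml, change the resources section, then apply the file

# Add new log filtering rules:
# edit fluent-bit-configmap.yaml and add the new filters to filter-kubernetes.conf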

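Deployment Completion Checklist

  • All Prometheus Stack pods are running
  • Grafana is reachable and login works
  • Prometheus returns monitoring metrics
  • Fluent Bit is running on every node
  • The /var/log/k8s-logs directory exists on every node
  • The logrotate configuration is installed
  • Logs from the test application are collected
  • Monitoring metrics are displayed correctly
  • The scaling test passes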

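Notes

  1. All images must be imported into the internal image registry in advance
  2. Configure the StorageClass to match your environment
  3. Always change the Grafana administrator password in production
  4. Adjust the resource requests and limits to the size of the cluster
  5. Check disk space regularly so that logs do not fill the disk
  6. Back up important configuration regularly (for example, Grafana dashboards)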
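The disk space check in note 5 can be automated on each node. The following is a minimal sketch rather than a full monitoring solution: disk_usage_percent and usage_at_least are hypothetical helper names, and the 80 percent threshold in the usage comment is only an example value.

```shell
# Print the used-space percentage (integer) of the file system that holds a path.
disk_usage_percent() {
  # -P forces the single-line POSIX format; column 5 is the capacity, for example "42%"
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage of the file system holding $1 is at or above $2 percent.
usage_at_least() {
  [ "$(disk_usage_percent "$1")" -ge "$2" ]
}

# Usage, for example from a cron job on each node:
#   usage_at_least /var/log 80 && echo "WARNING: /var/log is at $(disk_usage_percent /var/log)%"
```

The node exporter installed in section 2 exposes the same information as metrics, so a Prometheus alert is an alternative once Alertmanager is enabled.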

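This guide covers every step needed to deploy the monitoring and log collection systems from scratch. Replace the placeholders with values for your environment and run the commands in order.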
