Kubernetes Node Affinity and Scheduling Policy Optimization
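Introduction
In a Kubernetes cluster, Pod scheduling is a multi-step process that involves resource allocation, node selection, and load balancing. Node affinity is a scheduling mechanism that lets you control where Pods are placed based on node labels and match conditions. This article covers how node affinity works, how to configure it, and how to optimize scheduling policies around it.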
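1. Node Scheduling Overview
1.1 The Kubernetes Scheduling Flow
The Kubernetes scheduler places a Pod on a suitable node in three steps:
- Filtering: discard every node that cannot satisfy the Pod's resource requests and hard constraints
- Scoring: rank the remaining nodes and pick the one with the highest score
- Binding: bind the Pod to the selected node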
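The three phases above can be sketched in a few lines of Python. This is a deliberately simplified model, not the real scheduler: the `Node` and `Pod` classes, the single CPU resource, and the `(weight, key, value)` form of soft preferences are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: dict
    free_cpu: float  # cores not yet requested by other Pods (hypothetical field)


@dataclass
class Pod:
    name: str
    cpu_request: float
    required_labels: dict = field(default_factory=dict)  # hard constraints
    preferred: list = field(default_factory=list)        # (weight, key, value) soft terms


def filter_nodes(pod, nodes):
    """Filter phase: keep nodes that fit the request and carry every required label."""
    return [
        n for n in nodes
        if n.free_cpu >= pod.cpu_request
        and all(n.labels.get(k) == v for k, v in pod.required_labels.items())
    ]


def score_node(pod, node):
    """Score phase: add up the weights of the preferred terms the node matches."""
    return sum(w for w, k, v in pod.preferred if node.labels.get(k) == v)


def schedule(pod, nodes):
    """Return the chosen node name, or None if no node is feasible (Pod stays Pending)."""
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None
    best = max(feasible, key=lambda n: score_node(pod, n))
    best.free_cpu -= pod.cpu_request  # bind phase: reserve the resources on the node
    return best.name


nodes = [
    Node("node-1", {"zone": "us-west-2a", "disk-type": "hdd"}, 4.0),
    Node("node-2", {"zone": "us-west-2a", "disk-type": "ssd"}, 2.0),
    Node("node-3", {"zone": "us-west-2b", "disk-type": "ssd"}, 8.0),
]
pod = Pod("my-app", 1.0, {"zone": "us-west-2a"}, [(50, "disk-type", "ssd")])
print(schedule(pod, nodes))  # node-2
```

Here node-3 is removed during filtering because it is in the wrong zone, and node-2 wins the scoring phase because it matches the SSD preference. The real scheduler runs many filter and score plugins and normalizes their results, so treat this only as a mental model.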
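1.2 Types of Scheduling Constraints
| Constraint | Description | Example |
|---|---|---|
| NodeSelector | Simple exact match on node labels | `nodeSelector: { zone: us-west-2a }` |
| NodeAffinity | More expressive node affinity rules | Supports several match operators, hard and soft rules |
| PodAffinity | Affinity between Pods | Schedule into the same node or topology domain |
| PodAntiAffinity | Anti-affinity between Pods | Avoid scheduling onto the same node |
| Taints/Tolerations | Node taints and Pod tolerations | Control which Pods may be scheduled onto a node |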
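2. Configuring Node Affinity
2.1 NodeSelector Configuration
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  # Every key/value pair must match a label on the node
  nodeSelector:
    zone: us-west-2a
    instance-type: c5.large
  containers:
    - name: app
      image: my-app:latest
```
2.2 NodeAffinity Configuration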
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  affinity:
    nodeAffinity:
      # Hard rule: the Pod is only scheduled onto nodes in one of these zones
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: zone
                operator: In
                values:
                  - us-west-2a
                  - us-west-2b
      # Soft rules: matching nodes score higher, weighted from 1 to 100
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: instance-type
                operator: In
                values:
                  - c5.large
                  - c5.xlarge
        - weight: 50
          preference:
            matchExpressions:
              - key: disk-type
                operator: In
                values:
                  - ssd
  containers:
    - name: app
      image: my-app:latest
```
2.3 Affinity Operators
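| Operator | Description |
|---|---|
| In | The label value is in the given list |
| NotIn | The label value is not in the given list |
| Exists | The label key exists (the value is ignored) |
| DoesNotExist | The label key does not exist |
| Gt | The label value is greater than the given value (integer label values only) |
| Lt | The label value is less than the given value (integer label values only) |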
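To make the operator semantics concrete, here is a small Python sketch that evaluates `matchExpressions` against a plain dict of node labels. The function names are hypothetical and this is not scheduler code. It assumes the documented combination rules: expressions inside one term are ANDed, and multiple `nodeSelectorTerms` are ORed.

```python
def match_expression(labels, key, operator, values=()):
    """Evaluate one matchExpressions entry against a node's label dict."""
    present = key in labels
    if operator == "In":
        return present and labels[key] in values
    if operator == "NotIn":
        # A node that lacks the label altogether also satisfies NotIn.
        return not present or labels[key] not in values
    if operator == "Exists":
        return present
    if operator == "DoesNotExist":
        return not present
    if operator in ("Gt", "Lt"):
        # Gt/Lt take exactly one value; both sides are compared as integers.
        if not present or len(values) != 1:
            return False
        try:
            actual, bound = int(labels[key]), int(values[0])
        except ValueError:
            return False
        return actual > bound if operator == "Gt" else actual < bound
    raise ValueError(f"unknown operator: {operator}")


def match_term(labels, expressions):
    """All expressions within a single term must match (logical AND)."""
    return all(match_expression(labels, **e) for e in expressions)


def match_node_selector(labels, terms):
    """A node qualifies if any one term matches (logical OR)."""
    return any(match_term(labels, t) for t in terms)


node_labels = {"zone": "us-west-2a", "cpu-count": "8"}
terms = [
    [{"key": "zone", "operator": "In", "values": ["us-west-2a", "us-west-2b"]},
     {"key": "cpu-count", "operator": "Gt", "values": ["4"]}],
]
print(match_node_selector(node_labels, terms))  # True
```

The `cpu-count` label is made up for the example. Note that label values are always strings, which is why `Gt` and `Lt` have to parse them before comparing.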
3. Pod Affinity and Anti-Affinity
3.1 PodAffinity Configuration
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  affinity:
    podAffinity:
      # Hard rule: run on a node that already hosts a Pod labeled app=database
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - database
          topologyKey: kubernetes.io/hostname
      # Soft rule: prefer the same zone as Pods labeled team=backend
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: team
                  operator: In
                  values:
                    - backend
            topologyKey: topology.kubernetes.io/zone
  containers:
    - name: app
      image: my-app:latest
```
3.2 PodAntiAffinity Configuration
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  labels:
    app: my-app  # label matched by the anti-affinity selector below
spec:
  affinity:
    podAntiAffinity:
      # Hard rule: never share a node with another Pod labeled app=my-app
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - my-app
          topologyKey: kubernetes.io/hostname
      # Soft rule: prefer a different zone from Pods labeled team=frontend
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: team
                  operator: In
                  values:
                    - frontend
            topologyKey: topology.kubernetes.io/zone
  containers:
    - name: app
      image: my-app:latest
```
4. Taints and Tolerations
4.1 Taint Configuration
```shell
# Add taints
kubectl taint nodes node-1 key=value:NoSchedule
kubectl taint nodes node-2 key=value:PreferNoSchedule
kubectl taint nodes node-3 key=value:NoExecute
# Inspect the taints on a node
kubectl describe node node-1 | grep Taints
# Remove a taint (note the trailing "-")
kubectl taint nodes node-1 key=value:NoSchedule-
```
4.2 Toleration Configuration
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  tolerations:
    - key: "key"
      operator: "Equal"
      value: "value"
      effect: "NoSchedule"
    - key: "key"
      operator: "Equal"
      value: "value"
      effect: "NoExecute"
      # Stay on the node for at most one hour after the taint is added
      tolerationSeconds: 3600
  containers:
    - name: app
      image: my-app:latest
```
4.3 Taint Effects Compared
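| Effect | Description |
|---|---|
| NoSchedule | Pods without a matching toleration are not scheduled onto the node |
| PreferNoSchedule | The scheduler tries to avoid placing Pods without a matching toleration on the node, but may still do so |
| NoExecute | Pods without a matching toleration are not scheduled onto the node, and such Pods already running there are evicted |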
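The sketch below models how a toleration is matched against a taint and how the three effects differ. It uses plain dicts shaped like the manifests above; the function names are hypothetical, and `tolerationSeconds` is ignored to keep the example short.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False  # an empty effect on the toleration matches every effect
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False  # an empty key (used with Exists) matches every key
    if toleration.get("operator", "Equal") == "Exists":
        return True   # Exists ignores the value
    return toleration.get("value", "") == taint.get("value", "")


def untolerated(tolerations, taints, effects):
    """List the taints with one of the given effects that no toleration matches."""
    return [
        t for t in taints
        if t["effect"] in effects and not any(tolerates(tol, t) for tol in tolerations)
    ]


def can_schedule(tolerations, taints):
    """NoSchedule and NoExecute block placement; PreferNoSchedule only lowers the score."""
    return not untolerated(tolerations, taints, {"NoSchedule", "NoExecute"})


def should_evict(tolerations, taints):
    """Only an untolerated NoExecute taint evicts a Pod that is already running."""
    return bool(untolerated(tolerations, taints, {"NoExecute"}))


taints = [{"key": "key", "value": "value", "effect": "NoSchedule"}]
tolerations = [{"key": "key", "operator": "Equal", "value": "value", "effect": "NoSchedule"}]
print(can_schedule(tolerations, taints))  # True
print(can_schedule([], taints))           # False
print(should_evict([], taints))           # False
```

The last line shows the practical difference between the effects: adding a `NoSchedule` taint to a node keeps new Pods away but leaves running Pods untouched.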
5. Optimizing Scheduling Policies
5.1 Optimizing Node Selection
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          # Both expressions in this term must match
          - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values:
                  - linux
              # Assumes worker nodes carry this label; it must be applied
              # by the cluster operator or the distribution
              - key: node-role.kubernetes.io/worker
                operator: Exists
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-west-2a
        - weight: 50
          preference:
            matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                  - c5.large
                  - c5.xlarge
  containers:
    - name: app
      image: my-app:latest
```
5.2 Topology Spread Constraints
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:
        # Replica counts may differ by at most 1 between zones
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app
        # Replica counts may differ by at most 1 between nodes
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: app
          image: my-app:latest
```
5.3 Resource-Aware Scheduling
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: my-app:latest
      resources:
        # The scheduler places the Pod based on requests, not limits
        requests:
          cpu: "1"
          memory: "1Gi"
        limits:
          cpu: "2"
          memory: "2Gi"
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                  - c5.xlarge
                  - c5.2xlarge
```
6. Scheduler Configuration
6.1 Custom Scheduler Configuration
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-scheduler-config
data:
  config.yaml: |
    # v1 is served since Kubernetes 1.25; v1beta3 was removed in 1.29
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    profiles:
      # The scheduler name is set per profile, not at the top level
      - schedulerName: custom-scheduler
        pluginConfig:
          # The resource scoring plugin is named NodeResourcesFit
          - name: NodeResourcesFit
            args:
              scoringStrategy:
                type: LeastAllocated
                resources:
                  - name: cpu
                    weight: 1
                  - name: memory
                    weight: 1
```
6.2 Using the Custom Scheduler
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      # Must match a schedulerName defined in a scheduler profile
      schedulerName: custom-scheduler
      containers:
        - name: app
          image: my-app:latest
```
7. Scheduling Monitoring and Tuning
7.1 Monitoring Scheduling Metrics
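```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-scheduler-monitor
spec:
  selector:
    matchLabels:
      component: kube-scheduler
  endpoints:
    # Must match the port name on the Service that exposes kube-scheduler;
    # adjust it (and add TLS/auth settings) if your cluster serves metrics over HTTPS
    - port: http
      path: /metrics
      interval: 30s
```
7.2 Scheduling Alert Rules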
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: scheduler-alerts
spec:
  groups:
    - name: scheduler.rules
      rules:
        # kube_pod_status_phase is exported by kube-state-metrics
        - alert: HighPendingPods
          expr: sum(kube_pod_status_phase{phase="Pending"}) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High number of pending pods"
            description: "{{ $value }} pods are pending"
        # scheduler_e2e_scheduling_duration_seconds was removed from recent releases;
        # scheduler_scheduling_attempt_duration_seconds replaces it
        - alert: SchedulerHighLatency
          expr: histogram_quantile(0.99, sum(rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m])) by (le)) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High scheduler latency"
            description: "99th percentile scheduling latency is {{ $value }}s"
```
8. Best Practices
8.1 Managing Node Labels
```shell
# Add node labels
kubectl label nodes node-1 zone=us-west-2a
kubectl label nodes node-1 instance-type=c5.large
kubectl label nodes node-1 role=worker
# Show node labels as columns
kubectl get nodes -L zone,instance-type,role
# Remove a node label (note the trailing "-")
kubectl label nodes node-1 zone-
```
8.2 Hybrid-Cloud Scheduling Strategy
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  affinity:
    nodeAffinity:
      # Hard rule: run only on AWS or GCP nodes
      # (cloud.provider is a custom label applied by the operator)
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: cloud.provider
                operator: In
                values:
                  - aws
                  - gcp
      # Soft rule: prefer AWS when both are available
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: cloud.provider
                operator: In
                values:
                  - aws
  containers:
    - name: app
      image: my-app:latest
```
8.3 Configuring Scheduling Priority
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
description: "High priority pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
description: "Low priority pods"
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  # A higher value is scheduled first and may preempt lower-priority Pods
  priorityClassName: high-priority
  containers:
    - name: app
      image: my-app:latest
```
9. Summary
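Node affinity is a core part of the Kubernetes scheduling mechanism. Combining node affinity, Pod affinity and anti-affinity, and taints with tolerations gives you fine-grained control over where Pods run.
In production, derive the scheduling policy from the workload's actual requirements, and combine it with topology spread constraints and resource-aware scheduling to improve both cluster resource utilization and application availability. Back this up with scheduling monitoring and alerting so that scheduling problems are detected and resolved early.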
