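04-Scheduling Control

A deep dive into Kubernetes resource management, affinity, and taint and toleration policies

Learning Objectives

After working through this module, you will understand:

How the Kubernetes scheduler works
How to manage resource requests and limits
Node affinity and Pod affinity
Taints and tolerations
Topology-aware scheduling
How to troubleshoot scheduling failures

I. Scheduler Architecture Overview

Core Scheduler Components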

┌────────────────────────────────────────────────────────────┐
│                    Kubernetes Scheduler                    │
├────────────────────────────────────────────────────────────┤
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐           │
│  │    Queue    │ │   Filter    │ │    Score    │           │
│  │  (pending)  │ │(predicates) │ │(priorities) │           │
│  └─────────────┘ └─────────────┘ └─────────────┘           │
│  ┌──────────────────────────────────────────────────────┐  │
│  │                         Bind                         │  │
│  └──────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────┘

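Scheduling Flow in Detail

graph TD
    A[Pod created] --> B[Added to scheduling queue]
    B --> C[Filter stage]
    C --> D[Score stage]
    D --> E[Select best node]
    E --> F[Bind to node]
    F --> G[kubelet starts Pod]

Scheduling Stages

Stage	Name	Function	Output
Queue	Scheduling queue	Holds Pods waiting to be scheduled	Ordered list of Pods
Filter	Filtering (formerly "predicates")	Removes nodes that cannot run the Pod	List of candidate nodes
Score	Scoring (formerly "priorities")	Scores each candidate node	Ranked candidate nodes
Bind	Binding	Picks the highest-scoring node and binds the Pod to it	Pod is scheduled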
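The Filter, Score, and Bind stages can be sketched as a small pipeline. This is a toy model, not the scheduler's actual code: the `Node` fields, the `schedule` function, and the "most free capacity left" scoring rule are assumptions made for illustration, while the real scheduler combines many weighted plugins.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_cpu_m: int   # unrequested CPU in millicores (hypothetical field)
    free_mem_mi: int  # unrequested memory in MiB (hypothetical field)


def schedule(cpu_m: int, mem_mi: int, nodes: List[Node]) -> Optional[str]:
    """Return the name of the chosen node, or None if the Pod stays Pending."""
    # Filter: keep only nodes whose free capacity covers the Pod's requests.
    feasible = [n for n in nodes
                if n.free_cpu_m >= cpu_m and n.free_mem_mi >= mem_mi]
    if not feasible:
        return None

    # Score: toy rule that favors the node with the largest free fraction
    # left after placement.
    def score(n: Node) -> float:
        cpu_left = (n.free_cpu_m - cpu_m) / max(n.free_cpu_m, 1)
        mem_left = (n.free_mem_mi - mem_mi) / max(n.free_mem_mi, 1)
        return cpu_left + mem_left

    # Bind: the highest-scoring node wins.
    return max(feasible, key=score).name


nodes = [Node("node1", 500, 1024), Node("node2", 2000, 4096), Node("node3", 100, 8192)]
print(schedule(500, 512, nodes))   # node2 (node3 is filtered out, node1 scores lower)
print(schedule(4000, 512, nodes))  # None (no node passes the filter)
```

When no node survives the filter, the Pod stays in the queue, which is what a Pending Pod with a FailedScheduling event looks like from the outside.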

II. Resource Management

2.1 Resource Types

CPU

resources:
  requests:
    cpu: "500m"      # 0.5 CPU core
  limits:
    cpu: "1000m"     # 1 CPU core

Memory

resources:
  requests:
    memory: "512Mi"  # 512 MiB
  limits:
    memory: "1Gi"    # 1 GiB

Ephemeral Storage

resources:
  requests:
    ephemeral-storage: "1Gi"
  limits:
    ephemeral-storage: "2Gi"

GPU (extended resource)

resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1
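To make the units above concrete, here is a minimal sketch that converts quantity strings into plain numbers. The function names are hypothetical, and the parser covers only the common suffixes used in this module (it ignores, for example, exponent notation and the larger suffixes).

```python
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "500m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" or "1G" to bytes."""
    # Check binary suffixes first: "Mi" (2**20) and "M" (10**6) are different units.
    for suffix, factor in _BINARY.items():
        if quantity.endswith(suffix):
            return int(float(quantity[:-2]) * factor)
    for suffix, factor in _DECIMAL.items():
        if quantity.endswith(suffix):
            return int(float(quantity[:-1]) * factor)
    return int(quantity)  # a bare number is already in bytes


print(cpu_to_millicores("500m"), cpu_to_millicores("1"))  # 500 1000
print(memory_to_bytes("512Mi"))                           # 536870912
print(memory_to_bytes("1Gi"), memory_to_bytes("1G"))      # 1073741824 1000000000
```

The last line shows why the suffix matters: `1Gi` is about 7% larger than `1G`.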

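2.2 QoS Classes

Kubernetes assigns each Pod a QoS class automatically, based on its resource settings:

QoS class	Condition	Priority	Description
Guaranteed	Every container sets CPU and memory requests equal to its limits	Highest	Resources are guaranteed; last to be evicted
Burstable	At least one container sets a CPU or memory request or limit, but the Pod does not qualify as Guaranteed	Medium	Can burst above its requests, up to its limits
BestEffort	No container sets any CPU or memory request or limit	Lowest	Best effort; first to be evicted under node pressure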
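The classification rules in the table can be sketched as a short function. This is an illustrative approximation rather than the kubelet's implementation: `qos_class` is a hypothetical name, and the sketch compares quantity strings literally, so "100m" and "0.1" would be treated as different values.

```python
from typing import Dict, List


def qos_class(containers: List[Dict]) -> str:
    """Classify a Pod from the `resources` section of each container."""
    any_set = False
    guaranteed = True
    for c in containers:
        res = c.get("resources") or {}
        requests = res.get("requests") or {}
        limits = res.get("limits") or {}
        for name in ("cpu", "memory"):
            lim = limits.get(name)
            # A limit without a request implies request == limit.
            req = requests.get(name, lim)
            if req is not None or lim is not None:
                any_set = True
            if lim is None or req != lim:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


same = {"cpu": "100m", "memory": "128Mi"}
print(qos_class([{"resources": {"requests": same, "limits": same}}]))  # Guaranteed
print(qos_class([{"resources": {"requests": same,
                                "limits": {"cpu": "200m", "memory": "256Mi"}}}]))  # Burstable
print(qos_class([{"name": "app"}]))  # BestEffort
```

Note that a single container without CPU and memory limits is enough to move a multi-container Pod from Guaranteed to Burstable.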

QoS Examples

# Guaranteed QoS
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      limits:
        cpu: "100m"
        memory: "128Mi"

# Burstable QoS
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      limits:
        cpu: "200m"
        memory: "256Mi"

# BestEffort QoS
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-pod
spec:
  containers:
  - name: app
    image: nginx
    # no resources section

2.3 Resource Quota Management

ResourceQuota (per-namespace totals)

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "10"
    persistentvolumeclaims: "4"
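A quota check is simple arithmetic: current usage plus the new Pod must stay within every `hard` value. The sketch below assumes quantities already normalized to millicores and MiB; `exceeded` and the `used` figures are hypothetical and only illustrate the idea. It also leaves out one real-world rule: once a quota covers CPU or memory requests or limits, Pods that omit those values are typically rejected in that namespace.

```python
from typing import Dict, List


def exceeded(hard: Dict[str, int], used: Dict[str, int], pod: Dict[str, int]) -> List[str]:
    """Return the quota keys a new Pod would push past `hard` (empty means admitted)."""
    return [k for k, cap in hard.items() if used.get(k, 0) + pod.get(k, 0) > cap]


# Hard limits from the compute-quota example, in millicores and MiB.
hard = {"requests.cpu": 4000, "requests.memory": 8192,
        "limits.cpu": 8000, "limits.memory": 16384, "pods": 10}
# Hypothetical current usage in the namespace.
used = {"requests.cpu": 3800, "requests.memory": 4096,
        "limits.cpu": 6000, "limits.memory": 8192, "pods": 7}
# The Pod being created.
pod = {"requests.cpu": 500, "requests.memory": 512,
       "limits.cpu": 1000, "limits.memory": 1024, "pods": 1}

print(exceeded(hard, used, pod))  # ['requests.cpu'] because 3800 + 500 > 4000
```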

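LimitRange (per-container defaults and bounds)

apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
  namespace: production
spec:
  limits:
  # One entry per type: defaults and bounds for containers are merged here
  - type: Container
    default:            # limit applied when a container sets none
      memory: "512Mi"
    defaultRequest:     # request applied when a container sets none
      memory: "256Mi"
    max:
      memory: "1Gi"
    min:
      memory: "128Mi"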
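How defaults and bounds interact is easier to see in code. The following is a simplified sketch of the behavior as I understand it, using the memory values from the LimitRange above in MiB; `apply_limit_range` is a hypothetical helper, not a Kubernetes API.

```python
from typing import Optional, Tuple

# Values from the mem-limit-range example, in MiB.
LIMIT_RANGE = {"default": 512, "defaultRequest": 256, "min": 128, "max": 1024}


def apply_limit_range(request: Optional[int], limit: Optional[int]) -> Tuple[int, int]:
    """Fill in missing memory values, then validate; returns (request, limit) in MiB."""
    if limit is None:
        limit = LIMIT_RANGE["default"]
        if request is None:
            request = LIMIT_RANGE["defaultRequest"]
    elif request is None:
        request = limit  # an explicit limit without a request sets request = limit
    if request < LIMIT_RANGE["min"]:
        raise ValueError("request below min")
    if limit > LIMIT_RANGE["max"]:
        raise ValueError("limit above max")
    if request > limit:
        raise ValueError("request exceeds limit")
    return request, limit


print(apply_limit_range(None, None))  # (256, 512)
print(apply_limit_range(None, 1024))  # (1024, 1024)
```

One pitfall this exposes: a container that sets only a request of 768 receives the default limit of 512 and is then rejected, because the request exceeds the limit.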

III. Node Selection Strategies

3.1 nodeSelector

Characteristic: exact match on node labels; the node must carry every label listed.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeSelector:
    disktype: ssd
    zone: us-west-1a
  containers:
  - name: nginx
    image: nginx

Setting node labels

# Add labels
kubectl label node node1 disktype=ssd
kubectl label node node1 zone=us-west-1a

# View labels
kubectl get nodes --show-labels

# Remove a label (note the trailing "-")
kubectl label node node1 disktype-

3.2 nodeAffinity (Node Affinity)

Characteristic: supports richer match expressions than nodeSelector.

Hard constraint (must be satisfied)

apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-hard
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values: ["ssd", "nvme"]
          - key: zone
            operator: NotIn
            values: ["us-west-1c"]
  containers:
  - name: nginx
    image: nginx

Soft constraint (preferred, not required)

apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-soft
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values: ["us-west-1a"]
      - weight: 30
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values: ["ssd"]
  containers:
  - name: nginx
    image: nginx

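Operators

Operator	Meaning	Example
In	Label value is in the list	`values: ["ssd", "nvme"]`
NotIn	Label value is not in the list	`values: ["hdd"]`
Exists	Label key is present	no values
DoesNotExist	Label key is absent	no values
Gt	Label value, parsed as an integer, is greater than the given value (nodeAffinity only)	`values: ["100"]`
Lt	Label value, parsed as an integer, is less than the given value (nodeAffinity only)	`values: ["1000"]`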
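The operator semantics, and the way soft-constraint weights add up, can be sketched in a few lines. The helper names are hypothetical and the scoring is simplified: the real scheduler normalizes the summed weights and combines them with other scoring plugins.

```python
from typing import Dict, List


def expr_matches(labels: Dict[str, str], expr: Dict) -> bool:
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    present = key in labels
    if op == "In":
        return present and labels[key] in values
    if op == "NotIn":
        return not present or labels[key] not in values
    if op == "Exists":
        return present
    if op == "DoesNotExist":
        return not present
    if op in ("Gt", "Lt"):
        try:
            have, want = int(labels[key]), int(values[0])
        except (KeyError, IndexError, ValueError):
            return False
        return have > want if op == "Gt" else have < want
    raise ValueError(f"unknown operator: {op}")


def term_matches(labels: Dict[str, str], expressions: List[Dict]) -> bool:
    """All expressions inside one term are ANDed."""
    return all(expr_matches(labels, e) for e in expressions)


def preferred_score(labels: Dict[str, str], preferences: List[Dict]) -> int:
    """Sum the weights of every preferred term the node satisfies."""
    return sum(p["weight"] for p in preferences
               if term_matches(labels, p["preference"]["matchExpressions"]))


required = [{"key": "disktype", "operator": "In", "values": ["ssd", "nvme"]},
            {"key": "zone", "operator": "NotIn", "values": ["us-west-1c"]}]
preferred = [
    {"weight": 50, "preference": {"matchExpressions": [
        {"key": "zone", "operator": "In", "values": ["us-west-1a"]}]}},
    {"weight": 30, "preference": {"matchExpressions": [
        {"key": "disktype", "operator": "In", "values": ["ssd"]}]}},
]

print(term_matches({"disktype": "ssd", "zone": "us-west-1a"}, required))     # True
print(term_matches({"disktype": "ssd", "zone": "us-west-1c"}, required))     # False
print(preferred_score({"disktype": "ssd", "zone": "us-west-1b"}, preferred))  # 30
```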

3.3 podAffinity (Pod Affinity)

Pod affinity (co-locate on the same node)

apiVersion: v1
kind: Pod
metadata:
  name: web-pod
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["redis"]
        topologyKey: kubernetes.io/hostname
  containers:
  - name: nginx
    image: nginx

Pod anti-affinity (spread across different nodes)

apiVersion: v1
kind: Pod
metadata:
  name: web-pod
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["web"]
        topologyKey: kubernetes.io/hostname
  containers:
  - name: nginx
    image: nginx

Soft (preferred) anti-affinity example

apiVersion: v1
kind: Pod
metadata:
  name: web-pod
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values: ["web"]
          topologyKey: kubernetes.io/hostname
  containers:
  - name: nginx
    image: nginx

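3.4 Topology Keys (topologyKey)

Topology key	Scope	Use case
kubernetes.io/hostname	Node	Same node / different nodes
topology.kubernetes.io/zone	Availability zone	Same zone / different zones (replaces the deprecated failure-domain.beta.kubernetes.io/zone)
topology.kubernetes.io/region	Region	Same region / different regions (replaces the deprecated failure-domain.beta.kubernetes.io/region)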
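The topology key decides what counts as "the same place". The sketch below checks a required anti-affinity rule against a candidate node, first per node and then per zone. All names and the cluster layout are hypothetical, the zone label used is the well-known `topology.kubernetes.io/zone`, and only `matchLabels`-style equality is modeled.

```python
from typing import Dict, List, Tuple


def violates_anti_affinity(candidate: str,
                           node_labels: Dict[str, Dict[str, str]],
                           pods: List[Tuple[str, Dict[str, str]]],
                           selector: Dict[str, str],
                           topology_key: str) -> bool:
    """True if a selected Pod already runs in the candidate's topology domain."""
    domain = node_labels[candidate].get(topology_key)
    if domain is None:
        return False  # simplification: nodes without the key form no domain
    for pod_node, pod_labels in pods:
        in_domain = node_labels[pod_node].get(topology_key) == domain
        selected = all(pod_labels.get(k) == v for k, v in selector.items())
        if in_domain and selected:
            return True
    return False


HOST, ZONE = "kubernetes.io/hostname", "topology.kubernetes.io/zone"
node_labels = {
    "node1": {HOST: "node1", ZONE: "zone-a"},
    "node2": {HOST: "node2", ZONE: "zone-a"},
    "node3": {HOST: "node3", ZONE: "zone-b"},
}
pods = [("node1", {"app": "web"})]  # one web Pod already runs on node1
selector = {"app": "web"}

print(violates_anti_affinity("node2", node_labels, pods, selector, HOST))  # False
print(violates_anti_affinity("node2", node_labels, pods, selector, ZONE))  # True
print(violates_anti_affinity("node3", node_labels, pods, selector, ZONE))  # False
```

With the hostname key, node2 is acceptable because it is a different node; with the zone key it is rejected because it shares zone-a with the existing Pod.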

IV. Taints and Tolerations

4.1 Taints

Purpose: keep Pods off specific nodes unless they explicitly tolerate the taint.

Setting taints

# Add taints
kubectl taint nodes node1 key1=value1:NoSchedule
kubectl taint nodes node1 key2=value2:NoExecute
kubectl taint nodes node1 key3=value3:PreferNoSchedule

# View taints
kubectl describe node node1 | grep Taint

# Remove a taint (note the trailing "-")
kubectl taint nodes node1 key1=value1:NoSchedule-

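Taint effects

Effect	Description	Impact
NoSchedule	Do not schedule new Pods	New Pods without a matching toleration are not scheduled here; running Pods are unaffected
PreferNoSchedule	Avoid scheduling if possible	The scheduler prefers other nodes but may still use this one
NoExecute	Do not schedule, and evict	New Pods without a matching toleration are not scheduled here, and running Pods without one are evicted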

4.2 Tolerations

Purpose: allow (but not force) a Pod to be scheduled onto a node with a matching taint.

Basic toleration

apiVersion: v1
kind: Pod
metadata:
  name: tolerant-pod
spec:
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"
  containers:
  - name: nginx
    image: nginx

Advanced tolerations

apiVersion: v1
kind: Pod
metadata:
  name: advanced-tolerant-pod
spec:
  tolerations:
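  # Tolerate every taint (no key, operator Exists); on its own this makes the entries below redundant
  - operator: "Exists"
  # Tolerate every taint with key "key1", whatever its value or effect
  - key: "key1"
    operator: "Exists"
  # Tolerate one specific taint for a limited time
  - key: "key2"
    operator: "Equal"
    value: "value2"
    effect: "NoExecute"
    tolerationSeconds: 300  # evicted 300 seconds after the taint is added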
  containers:
  - name: nginx
    image: nginx
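The matching rules behind these toleration variants can be sketched as follows. This is a simplified model of how a toleration is compared with a taint; `tolerates` and `untolerated` are hypothetical names, and `tolerationSeconds` is not modeled.

```python
from typing import Dict, List


def tolerates(toleration: Dict, taint: Dict) -> bool:
    """True if the toleration matches the taint."""
    # An empty effect matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An empty key matches every key (only meaningful with operator Exists).
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def untolerated(taints: List[Dict], tolerations: List[Dict]) -> List[Dict]:
    """Taints that none of the Pod's tolerations match."""
    return [t for t in taints if not any(tolerates(tol, t) for tol in tolerations)]


taint = {"key": "key1", "value": "value1", "effect": "NoSchedule"}

exact = {"key": "key1", "operator": "Equal", "value": "value1", "effect": "NoSchedule"}
any_value = {"key": "key1", "operator": "Exists"}
everything = {"operator": "Exists"}
wrong_value = {"key": "key1", "operator": "Equal", "value": "other", "effect": "NoSchedule"}

print(tolerates(exact, taint), tolerates(any_value, taint), tolerates(everything, taint))
# True True True
print(tolerates(wrong_value, taint))  # False
```

A Pod can be scheduled onto a node only when `untolerated` returns no taints with the NoSchedule or NoExecute effect.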

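4.3 Taints and Tolerations in Practice

Dedicated node pool

# Taint the GPU node
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule

# GPU Pod with a matching toleration (the nodeSelector assumes the node is also labeled gpu=true)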
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  nodeSelector:
    gpu: "true"
  containers:
  - name: gpu-app
    image: nvidia/cuda:11.0-base
    resources:
      limits:
        nvidia.com/gpu: 1

Maintenance mode

# Taint the node for maintenance
kubectl taint nodes node1 maintenance=true:NoExecute

# System Pod that tolerates the maintenance taint
apiVersion: v1
kind: Pod
metadata:
  name: system-pod
spec:
  tolerations:
  - key: "maintenance"
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
    tolerationSeconds: 3600  # evicted 1 hour after the taint is added
  containers:
  - name: system-app
    image: nginx

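V. Topology-Aware Scheduling

5.1 TopologySpreadConstraints

Purpose: control how Pods are spread across topology domains such as zones or nodes.

Basic configuration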

apiVersion: v1
kind: Pod
metadata:
  name: topology-pod
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
  containers:
  - name: nginx
    image: nginx

Advanced configuration

apiVersion: v1
kind: Pod
metadata:
  name: advanced-topology-pod
spec:
  topologySpreadConstraints:
  # Spread across availability zones
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
  # Spread across nodes
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web
  containers:
  - name: nginx
    image: nginx

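5.2 Topology Spread Parameters

Parameter	Description	Allowed values
maxSkew	Maximum allowed difference in the number of matching Pods between topology domains	Integer greater than 0
topologyKey	Node label key that defines the topology domains	Any node label key
whenUnsatisfiable	What to do when the constraint cannot be met	DoNotSchedule, ScheduleAnyway
labelSelector	Selects the Pods that are counted	Pod label selector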
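A worked example helps with `maxSkew`. In this sketch, skew is taken as the number of matching Pods in the target domain after placement minus the current minimum across all domains, and a domain is allowed only if that value does not exceed `maxSkew` (the `DoNotSchedule` behavior). The function names and zone names are hypothetical, and picking the first allowed domain is an arbitrary stand-in for scoring.

```python
from typing import Dict, List


def skew_if_placed(counts: Dict[str, int], domain: str) -> int:
    """Skew after adding one matching Pod to `domain`."""
    return counts[domain] + 1 - min(counts.values())


def allowed_domains(counts: Dict[str, int], max_skew: int) -> List[str]:
    """Domains where placing one more Pod keeps skew within max_skew."""
    return sorted(d for d in counts if skew_if_placed(counts, d) <= max_skew)


print(allowed_domains({"zone-a": 2, "zone-b": 1, "zone-c": 1}, max_skew=1))
# ['zone-b', 'zone-c']

# Place six Pods one at a time with maxSkew 1 across three empty zones.
counts = {"zone-a": 0, "zone-b": 0, "zone-c": 0}
for _ in range(6):
    choice = allowed_domains(counts, max_skew=1)[0]
    counts[choice] += 1
print(counts)  # {'zone-a': 2, 'zone-b': 2, 'zone-c': 2}
```

With `ScheduleAnyway`, the same skew value would only lower a domain's score instead of excluding it.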

5.3 Spread Strategies

Even spread

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: nginx
        image: nginx

VI. Command Cheat Sheet

Scheduling commands

# View node resource usage (kubectl top requires metrics-server)
kubectl top nodes
kubectl describe nodes

# View scheduling events
kubectl get events --sort-by=.lastTimestamp

# View node labels
kubectl get nodes --show-labels

# Label a node
kubectl label node <node-name> <key>=<value>

# Taint a node
kubectl taint node <node-name> <key>=<value>:<effect>

Resource management commands

# View resource quotas
kubectl get resourcequota
kubectl describe resourcequota <name>

# View limit ranges
kubectl get limitrange
kubectl describe limitrange <name>

# View Pod resource usage
kubectl top pods
kubectl describe pod <pod-name>

Scheduling debugging commands

# View scheduler logs (Pod name as on kubeadm-style clusters, where the scheduler runs as a static Pod)
kubectl logs -n kube-system kube-scheduler-<node>

# View scheduling details for a Pod
kubectl describe pod <pod-name> | grep -A 10 "Events"

# List events for a specific Pod
kubectl get events --field-selector involvedObject.name=<pod-name>
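VII. Key Interview Questions

Q1: How does the Kubernetes scheduler work?

Key points:

Filter stage: removes nodes that do not satisfy the Pod's requirements
Score stage: scores and ranks the remaining candidate nodes
Bind stage: picks the highest-scoring node and binds the Pod to it
Each stage is implemented by plugins in an extensible scheduling framework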

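Q2: How do you implement affinity-based scheduling for Pods?

Key points:

nodeAffinity: attracts Pods to nodes by node labels
podAffinity: places Pods near other Pods
podAntiAffinity: keeps Pods away from other Pods
All three support hard (required) and soft (preferred) constraints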

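Q3: What are taints and tolerations for?

Key points:

Taints: keep Pods off specific nodes
Tolerations: allow Pods to be scheduled onto tainted nodes
Three effects: NoSchedule, PreferNoSchedule, NoExecute
Typical uses: dedicated node pools and maintenance mode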

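Q4: How do you spread Pods evenly?

Key points:

Use topologySpreadConstraints
Set maxSkew to bound the imbalance
Choose an appropriate topologyKey
Combine several constraints for multi-level spreading (for example, zone and node)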

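Q5: Why do QoS classes matter?

Key points:

Guaranteed: resources are guaranteed, highest priority
Burstable: can burst above requests, medium priority
BestEffort: best effort, lowest priority
The class affects OOM-kill order and eviction order under node resource pressure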

VIII. Troubleshooting

Common Scheduling Problems

1. Pod stuck in Pending

# Inspect the Pod
kubectl describe pod <pod-name>

# Check scheduling events
kubectl get events --sort-by=.lastTimestamp

# Check node resource usage
kubectl top nodes

# Check node taints
kubectl describe node <node-name> | grep Taint

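2. Scheduling fails because of insufficient resources

# Check node resource usage and allocations
kubectl top nodes
kubectl describe nodes

# Check resource quotas
kubectl get resourcequota
kubectl describe resourcequota <name>

# Check the Pod's resource requests
kubectl describe pod <pod-name>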

3. Conflicting affinity constraints

# Check node labels
kubectl get nodes --show-labels

# Check Pod labels
kubectl get pods --show-labels

# Check the affinity configuration
kubectl get pod <pod-name> -o yaml | grep -A 20 affinity

4. Misconfigured taints or tolerations

# Check node taints
kubectl describe node <node-name> | grep Taint

# Check the Pod's tolerations
kubectl get pod <pod-name> -o yaml | grep -A 10 tolerations

# Check the scheduling events for untolerated taints
kubectl describe pod <pod-name>

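IX. Best Practices

Scheduling Strategy Recommendations

Resource management
- Set requests and limits to realistic values
- Use ResourceQuota to cap resource usage per namespace
- Monitor resource usage
Node management
- Use labels to classify nodes
- Apply taints deliberately
- Maintain nodes on a regular schedule
Affinity design
- Use Pod affinity for closely related services
- Use Pod anti-affinity for highly available services
- Use topology spread constraints where they fit
Performance optimization
- Avoid over-constraining Pods
- Prefer soft constraints over hard ones where possible
- Monitor scheduling performance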

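Production Recommendations

High-availability design
- Deploy across multiple availability zones
- Use topology spread constraints
- Configure Pod anti-affinity
Resource optimization
- Use VPA to adjust resources automatically
- Put a resource reclamation policy in place
- Monitor resource efficiency
Failure handling
- Configure Pod disruption budgets
- Use taints for maintenance
- Establish a failure recovery process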

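X. Summary

In this module you have covered:

How the Kubernetes scheduler works
Resource management and QoS classes
Node selection and affinity strategies
Taints and tolerations
Topology-aware scheduling
How to troubleshoot scheduling failures
Scheduling best practices

Next step: continue with 05-Release and Elasticity to learn how Kubernetes handles application rollouts and autoscaling.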