Part 4: A Deep Dive into the Kubernetes Scheduler
From resource allocation to priority-based preemption: the complete scheduling pipeline
Contents
Chapter 6: The Kubernetes Scheduler and Resource Allocation
6.1 Scheduler Workflow Overview
What the Kubernetes scheduler is responsible for:
deciding which Node each Pod runs on.
6.1.1 Overall Flow
Pod created → Pending state
↓
kube-scheduler watches for new Pods
↓
1️⃣ Filtering (predicates) → which nodes can run the Pod
2️⃣ Scoring (priorities) → which node fits best
↓
Bind → the kubelet on the chosen node starts the containers
6.1.2 Flow Diagram
API Server
↓
Scheduler watches for new Pods
↓
Filter phase
↓
Score phase
↓
Bind phase
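The three phases above can be strung together in a few lines of Python. This is only a sketch of the control flow, not the real implementation; the node shapes and toy plugins are invented for illustration:

```python
def schedule_one(pod, nodes, filters, scorers):
    """One scheduling cycle: Filter -> Score -> pick the best node to bind."""
    # Filter phase: keep only nodes that pass every predicate.
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None  # no feasible node: the Pod stays Pending
    # Score phase: NodeScore = sum of weighted plugin scores; highest wins.
    return max(feasible, key=lambda n: sum(w * s(pod, n) for s, w in scorers))

# Toy plugins: nodes are dicts carrying their free CPU in millicores.
filters = [lambda pod, n: n["free_cpu"] >= pod["cpu"]]
scorers = [(lambda pod, n: n["free_cpu"], 1)]  # emptiest node wins
nodes = [{"name": "node-1", "free_cpu": 300}, {"name": "node-2", "free_cpu": 1500}]
best = schedule_one({"cpu": 200}, nodes, filters, scorers)
print(best["name"])  # node-2
```

The real scheduler runs the same shape of loop per Pod, but the filters and scorers are pluggable framework plugins rather than lambdas.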
6.2 The Filter Phase (Predicates)
The scheduler first filters out nodes that cannot run the Pod.
6.2.1 Common Filter Rules (Predicates)
| Rule | Meaning | Example |
|---|---|---|
| PodFitsResources | Does the node have enough allocatable resources (CPU/memory)? | Checks that the Pod's requests fit |
| PodFitsHostPorts | Are the requested host ports free? | Checks for hostPort conflicts |
| PodFitsNodeName | Does the node match spec.nodeName? | Checks a pinned node |
| PodFitsAffinity | Are affinity/anti-affinity rules satisfied? | Checks node affinity |
| NodeUnschedulable | Is the node marked unschedulable? | Checks node state (cordon) |
| TaintsTolerations | Does the Pod tolerate the node's taints? | Checks taints vs. tolerations |
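As a rough illustration (not the scheduler's actual code), PodFitsResources boils down to comparing the Pod's requests against what is left of the node's allocatable capacity. Quantities here are plain numbers (millicores, bytes) instead of the real resource.Quantity type:

```python
def pod_fits_resources(pod_requests, node_allocatable, node_requested):
    """Return True if the Pod's requests fit into the node's remaining capacity."""
    for resource, requested in pod_requests.items():
        free = node_allocatable.get(resource, 0) - node_requested.get(resource, 0)
        if requested > free:
            return False  # node is filtered out in the Filter phase
    return True

# A node with 2000m CPU / 4Gi memory, of which 1800m / 1Gi is already requested:
allocatable = {"cpu": 2000, "memory": 4 * 1024**3}
requested = {"cpu": 1800, "memory": 1 * 1024**3}
print(pod_fits_resources({"cpu": 100, "memory": 128 * 1024**2}, allocatable, requested))  # True
print(pod_fits_resources({"cpu": 500, "memory": 128 * 1024**2}, allocatable, requested))  # False
```

Note that the comparison uses *requests*, not actual usage: a node running idle but fully-requested Pods still rejects new Pods.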
6.2.2 Custom Filter Rules
Note: the Policy-based configuration below is legacy. The Policy API was deprecated and removed in newer Kubernetes releases (v1.23+), where filter behavior is instead configured through plugins in a KubeSchedulerConfiguration.
apiVersion: v1
kind: ConfigMap
metadata:
name: scheduler-config
namespace: kube-system
data:
policy.cfg: |
{
"kind": "Policy",
"apiVersion": "v1",
"predicates": [
{"name": "PodFitsResources"},
{"name": "PodFitsHostPorts"},
{"name": "PodFitsNodeName"},
{"name": "PodFitsAffinity"},
{"name": "NodeUnschedulable"},
{"name": "TaintsTolerations"}
]
}
6.3 The Score Phase (Priorities)
After filtering, the scheduler "scores" each remaining feasible node.
6.3.1 Default Policies
| Policy | Description | Weight |
|---|---|---|
| LeastRequestedPriority | Prefer nodes with the fewest requested resources | 1 |
| BalancedResourceAllocation | Prefer balanced CPU/memory utilization | 1 |
| NodeAffinityPriority | Score by node-affinity match | 1 |
| ImageLocalityPriority | Prefer nodes that already hold the image locally | 1 |
6.3.2 Scoring Formula
NodeScore = ∑(Weight_i × PluginScore_i)
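To make the formula concrete, here is a small sketch (simplified, not the scheduler's source). LeastRequested scores a node per resource as (allocatable − requested) / allocatable × 100, averages over CPU and memory, and the final node score is the weight-multiplied sum over all scoring plugins:

```python
def least_requested_score(requests, allocatable):
    """Simplified LeastRequestedPriority: emptier nodes score higher (0-100)."""
    per_resource = [
        (allocatable[r] - requests[r]) * 100 // allocatable[r]
        for r in ("cpu", "memory")
    ]
    return sum(per_resource) // len(per_resource)

def node_score(plugin_scores, weights):
    """NodeScore = sum(Weight_i * PluginScore_i)."""
    return sum(weights[name] * score for name, score in plugin_scores.items())

# Node with 2000m CPU / 8Gi allocatable, 500m / 2Gi already requested:
scores = {
    "LeastRequested": least_requested_score({"cpu": 500, "memory": 2 * 1024**3},
                                            {"cpu": 2000, "memory": 8 * 1024**3}),
    "ImageLocality": 100,  # assumed: the image is already present on this node
}
print(node_score(scores, {"LeastRequested": 1, "ImageLocality": 1}))  # 175
```

Raising a plugin's weight in the config shifts the balance: for example, weighting ImageLocality higher makes the scheduler favor warm image caches over even spreading.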
6.3.3 Custom Scoring Rules
(Same legacy Policy format as in 6.2.2.)
apiVersion: v1
kind: ConfigMap
metadata:
name: scheduler-config
namespace: kube-system
data:
policy.cfg: |
{
"kind": "Policy",
"apiVersion": "v1",
"priorities": [
{"name": "LeastRequestedPriority", "weight": 1},
{"name": "BalancedResourceAllocation", "weight": 1},
{"name": "NodeAffinityPriority", "weight": 1},
{"name": "ImageLocalityPriority", "weight": 1}
]
}
6.4 The Bind Phase
Once a node has been selected, the scheduler calls:
Bind(pod, node)
The API Server then records the decision in the Pod's .spec.nodeName:
spec:
nodeName: node-1
After that, the kubelet on the node pulls the container images, starts the containers, mounts volumes, and sets up cgroups.
6.5 Kubernetes Scheduling Policy Types
| Policy | Meaning | Example | Typical Use |
|---|---|---|---|
| nodeName | Pin to a fixed node | spec.nodeName: node01 | Special hardware requirements |
| nodeSelector | Simple label matching | disktype: ssd | Simple node selection |
| nodeAffinity | Advanced affinity | Expression matching, weighted scoring | Complex node selection |
| podAffinity | Inter-Pod affinity | Co-locate in the same zone or on the same node | Co-locating related workloads |
| podAntiAffinity | Inter-Pod anti-affinity | Avoid placing replicas on the same node | High availability, spreading load |
| taints/tolerations | Taints and tolerations | Steer specific Pods onto specific nodes | Node isolation |
| priority/preemption | Priority and preemption | High-priority Pods may evict low-priority ones | Resource contention |
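Of these, taints and tolerations are the one mechanism where the node repels Pods rather than attracting them. A rough sketch of the matching rule (simplified from the full semantics, which also distinguish NoExecute and tolerationSeconds): a Pod may land on a node only if every taint is matched by some toleration:

```python
def tolerates(toleration, taint):
    """Does a single toleration match a single taint? (simplified)"""
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False  # empty key + Exists tolerates everything
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False  # empty effect matches all effects
    op = toleration.get("operator", "Equal")
    if op == "Exists":
        return True
    return toleration.get("value") == taint.get("value")

def pod_fits_taints(tolerations, taints):
    """Every taint on the node must be tolerated by at least one toleration."""
    return all(any(tolerates(tol, t) for tol in tolerations) for t in taints)

taints = [{"key": "gpu", "value": "true", "effect": "NoSchedule"}]
print(pod_fits_taints([], taints))                                          # False
print(pod_fits_taints([{"key": "gpu", "operator": "Equal",
                        "value": "true", "effect": "NoSchedule"}], taints))  # True
```

This is the logic exercised by the taints/tolerations experiment later in this chapter: the ordinary Pod fails the check and stays Pending, while the tolerating Pod passes.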
Visualizing the Scheduler Workflow
Experiment 1: Visualizing the Scheduling Process
#!/bin/bash
# Scheduling visualization script
echo "=== Scheduling visualization ==="
# 1. Create a test Pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
EOF
# 2. Watch scheduling events in the background
echo "Watching scheduling events..."
kubectl get events --watch --sort-by=.lastTimestamp &
EVENTS_PID=$!
# 3. Watch Pod status changes in the background
echo "Watching Pod status changes..."
kubectl get pod test-pod -w &
WATCH_PID=$!
sleep 15
kill "$EVENTS_PID" "$WATCH_PID" 2>/dev/null
# 4. Check the scheduler logs
echo "Checking scheduler logs..."
kubectl logs -n kube-system -l component=kube-scheduler --tail=50
# 5. Show the final scheduling result
echo "Final scheduling result:"
kubectl get pod test-pod -o wide
# 6. Clean up
kubectl delete pod test-pod
Experiment 2: Verifying the Scheduler Configuration
#!/bin/bash
# Scheduler configuration verification script
echo "=== Scheduler configuration verification ==="
# 1. Check the current scheduler configuration
echo "1. Current scheduler configuration:"
kubectl get configmap -n kube-system | grep scheduler
# 2. Check the scheduler logs
echo "2. Scheduler logs:"
kubectl logs -n kube-system -l component=kube-scheduler --tail=100
# 3. Check scheduler-related metrics
#    (note: this queries the API server's /metrics; kube-scheduler serves its own
#    metrics endpoint, typically on port 10259)
echo "3. Scheduler-related metrics:"
kubectl get --raw /metrics | grep scheduler
# 4. Measure scheduling performance with a 10-replica Deployment
echo "4. Scheduling performance test:"
time kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: test-deployment
spec:
replicas: 10
selector:
matchLabels:
app: test
template:
metadata:
labels:
app: test
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
EOF
# 5. Show the scheduling results
echo "5. Scheduling results:"
kubectl get pods -o wide
# 6. Clean up
kubectl delete deployment test-deployment
Resource Isolation in Practice
Experiment 1: CPU Cgroup Limits
#!/bin/bash
# CPU cgroup limit experiment
echo "=== CPU cgroup limit experiment ==="
# 1. Create a CPU-limited Pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: cpu-test
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
limits:
cpu: 200m
command: ["sh", "-c"]
args:
- |
while true; do
echo "CPU usage: \$(cat /proc/loadavg)"
sleep 1
done
EOF
# 2. Wait for the Pod to become ready
kubectl wait --for=condition=Ready pod cpu-test --timeout=60s
# 3. Check the Pod's resource usage
echo "Pod resource usage:"
kubectl top pod cpu-test
# 4. Inspect the cgroup CPU limits
#    (cgroup v1 paths; on cgroup v2 the equivalent is /sys/fs/cgroup/cpu.max)
echo "Cgroup configuration:"
kubectl exec cpu-test -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
kubectl exec cpu-test -- cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
# 5. Start a CPU-intensive task inside the container
echo "Starting a CPU-intensive task:"
kubectl exec cpu-test -- sh -c "yes > /dev/null &"
# 6. Monitor CPU usage (it should stay capped near the 200m limit)
echo "Monitoring CPU usage:"
for i in {1..10}; do
kubectl top pod cpu-test
sleep 2
done
# 7. Clean up
kubectl delete pod cpu-test
Experiment 2: Memory QoS Classes
#!/bin/bash
# Memory QoS class test
echo "=== Memory QoS class test ==="
# 1. Create one Pod at each QoS level
kubectl apply -f - <<EOF
# Guaranteed QoS
apiVersion: v1
kind: Pod
metadata:
name: guaranteed-pod
spec:
containers:
- name: test
image: nginx:alpine
# Guaranteed requires requests == limits for BOTH CPU and memory
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 250m
    memory: 256Mi
command: ["sh", "-c"]
args:
- |
echo "Guaranteed QoS Pod"
sleep 3600
---
# Burstable QoS
apiVersion: v1
kind: Pod
metadata:
name: burstable-pod
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
memory: 128Mi
limits:
memory: 512Mi
command: ["sh", "-c"]
args:
- |
echo "Burstable QoS Pod"
sleep 3600
---
# BestEffort QoS
apiVersion: v1
kind: Pod
metadata:
name: besteffort-pod
spec:
containers:
- name: test
image: nginx:alpine
command: ["sh", "-c"]
args:
- |
echo "BestEffort QoS Pod"
sleep 3600
EOF
# 2. Wait for the Pods to become ready
kubectl wait --for=condition=Ready pod guaranteed-pod burstable-pod besteffort-pod --timeout=60s
# 3. Check each Pod's QoS class
echo "QoS classes:"
kubectl get pod guaranteed-pod -o jsonpath='{.status.qosClass}{"\n"}'
kubectl get pod burstable-pod -o jsonpath='{.status.qosClass}{"\n"}'
kubectl get pod besteffort-pod -o jsonpath='{.status.qosClass}{"\n"}'
# 4. Check the OOM score adjustment of each container's init process
#    (/proc/1, not /proc/self, which would be the exec'd cat process)
echo "OOM score adjustments:"
kubectl exec guaranteed-pod -- cat /proc/1/oom_score_adj
kubectl exec burstable-pod -- cat /proc/1/oom_score_adj
kubectl exec besteffort-pod -- cat /proc/1/oom_score_adj
# 5. Check the memory limits
#    (cgroup v1 path; on cgroup v2 use /sys/fs/cgroup/memory.max)
echo "Memory limits:"
kubectl exec guaranteed-pod -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes
kubectl exec burstable-pod -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes
kubectl exec besteffort-pod -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes
# 6. Clean up
kubectl delete pod guaranteed-pod burstable-pod besteffort-pod
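The QoS class the experiment prints is derived purely from the containers' requests and limits. A minimal sketch of the classification rule (simplified; the real kubelet also folds in init containers and treats each resource individually):

```python
def qos_class(containers):
    """Classify a Pod as Guaranteed, Burstable, or BestEffort (simplified)."""
    # BestEffort: no container sets any requests or limits.
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    # Guaranteed: every container sets CPU and memory, with requests == limits.
    guaranteed = all(
        c.get("requests") and c.get("limits")
        and all(r in c["limits"] for r in ("cpu", "memory"))
        and all(c["limits"].get(r) == c["requests"].get(r) for r in ("cpu", "memory"))
        for c in containers
    )
    return "Guaranteed" if guaranteed else "Burstable"

print(qos_class([{"requests": {"cpu": "250m", "memory": "256Mi"},
                  "limits":   {"cpu": "250m", "memory": "256Mi"}}]))  # Guaranteed
print(qos_class([{"requests": {"memory": "128Mi"},
                  "limits":   {"memory": "512Mi"}}]))                 # Burstable
print(qos_class([{}]))                                                # BestEffort
```

Note the second case: setting only memory (no CPU) yields Burstable, not Guaranteed, which is why the experiment's Guaranteed Pod must specify both resources.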
Experiment 3: NUMA Affinity in Practice
#!/bin/bash
# NUMA affinity in practice
echo "=== NUMA affinity in practice ==="
# 1. Check the node's NUMA topology
#    (assumes SSH access to the node; otherwise run numactl on the node directly)
echo "Node NUMA topology:"
kubectl get nodes -o jsonpath='{.items[0].status.addresses[0].address}' | xargs -I {} ssh {} "numactl --hardware"
# 2. Create a Guaranteed Pod with integer CPU requests (exclusive cores require the
#    kubelet's CPU Manager static policy; also note nginx:alpine does not ship
#    numactl, so the numactl calls below assume an image that includes it)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: numa-pod
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 2
memory: 1Gi
limits:
cpu: 2
memory: 1Gi
command: ["sh", "-c"]
args:
- |
echo "NUMA Pod"
numactl --hardware
sleep 3600
nodeSelector:
kubernetes.io/hostname: node-1
EOF
# 3. Wait for the Pod to become ready
kubectl wait --for=condition=Ready pod numa-pod --timeout=60s
# 4. Check the NUMA binding inside the container
echo "NUMA binding:"
kubectl exec numa-pod -- numactl --show
# 5. Check the CPU binding of the container's init process
#    (use PID 1; "$$" would expand locally to this script's PID)
echo "CPU binding:"
kubectl exec numa-pod -- taskset -cp 1
# 6. Clean up
kubectl delete pod numa-pod
Advanced Scheduling Policies
Experiment 1: Complex Node Affinity Scenarios
#!/bin/bash
# Complex node affinity scenarios
echo "=== Complex node affinity scenarios ==="
# 1. Label the nodes (node names here assume the demo cluster)
kubectl label nodes node-1 disktype=ssd
kubectl label nodes node-2 disktype=hdd
kubectl label nodes node-1 zone=zone-a
kubectl label nodes node-2 zone=zone-b
# 2. Create a Pod combining required and preferred node affinity
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: affinity-pod
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: disktype
operator: In
values:
- ssd
- key: zone
operator: In
values:
- zone-a
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- node-1
EOF
# 3. Wait for the Pod to become ready
kubectl wait --for=condition=Ready pod affinity-pod --timeout=60s
# 4. Check the scheduling result (it should land on node-1)
echo "Scheduling result:"
kubectl get pod affinity-pod -o wide
# 5. Show the node labels
echo "Node labels:"
kubectl get nodes --show-labels
# 6. Clean up
kubectl delete pod affinity-pod
kubectl label nodes node-1 disktype-
kubectl label nodes node-2 disktype-
kubectl label nodes node-1 zone-
kubectl label nodes node-2 zone-
Experiment 2: High Availability with Pod Anti-Affinity
#!/bin/bash
# High availability with Pod anti-affinity
echo "=== High availability with Pod anti-affinity ==="
# 1. Create a Deployment with required anti-affinity (needs at least 3 schedulable nodes)
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: ha-deployment
spec:
replicas: 3
selector:
matchLabels:
app: ha-app
template:
metadata:
labels:
app: ha-app
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- ha-app
topologyKey: kubernetes.io/hostname
EOF
# 2. Wait for the Deployment to become available
kubectl wait --for=condition=Available deployment ha-deployment --timeout=60s
# 3. Show the Pod distribution
echo "Pod distribution:"
kubectl get pods -o wide
# 4. Verify the anti-affinity spread (expect one Pod per node)
echo "Pods per node:"
kubectl get pods -l app=ha-app -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c
# 5. Test failover by deleting one replica
echo "Testing failover:"
kubectl delete pod $(kubectl get pods -l app=ha-app -o jsonpath='{.items[0].metadata.name}')
# 6. Wait for the replacement Pod
kubectl wait --for=condition=Ready pod -l app=ha-app --timeout=60s
# 7. Show the new Pod distribution
echo "New Pod distribution:"
kubectl get pods -o wide
# 8. Clean up
kubectl delete deployment ha-deployment
Experiment 3: Taints and Tolerations in Production
#!/bin/bash
# Taints and tolerations in production
echo "=== Taints and tolerations in production ==="
# 1. Taint the nodes
kubectl taint nodes node-1 gpu=true:NoSchedule
kubectl taint nodes node-2 gpu=true:NoSchedule
# 2. Create an ordinary Pod (it should stay Pending)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: normal-pod
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
EOF
# 3. Give the scheduler a moment
sleep 10
# 4. Check the ordinary Pod's status (expect Pending)
echo "Ordinary Pod status:"
kubectl get pod normal-pod
# 5. Create a Pod that tolerates the taint
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
tolerations:
- key: gpu
operator: Equal
value: "true"
effect: NoSchedule
EOF
# 6. Wait for the Pod to become ready
kubectl wait --for=condition=Ready pod gpu-pod --timeout=60s
# 7. Check the scheduling result
echo "Scheduling result:"
kubectl get pod gpu-pod -o wide
# 8. Clean up
kubectl delete pod normal-pod gpu-pod
kubectl taint nodes node-1 gpu-
kubectl taint nodes node-2 gpu-
Experiment 4: Priority and Preemption
#!/bin/bash
# Priority and preemption
echo "=== Priority and preemption ==="
# 1. Create the PriorityClasses
kubectl apply -f - <<EOF
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 1000000
globalDefault: false
description: "High priority class"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: low-priority
value: 100
globalDefault: false
description: "Low priority class"
EOF
# 2. Create a low-priority Pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: low-priority-pod
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
priorityClassName: low-priority
EOF
# 3. Wait for the low-priority Pod to become ready
kubectl wait --for=condition=Ready pod low-priority-pod --timeout=60s
# 4. Create a high-priority Pod
#    (preemption only kicks in when the cluster is short on resources; on an idle
#    cluster both Pods simply run side by side)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: high-priority-pod
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
priorityClassName: high-priority
EOF
# 5. Wait for the high-priority Pod to become ready
kubectl wait --for=condition=Ready pod high-priority-pod --timeout=60s
# 6. Show the Pod status
echo "Pod status:"
kubectl get pods -o wide
# 7. Check the resolved priority values
echo "Priorities:"
kubectl get pod low-priority-pod -o jsonpath='{.spec.priority}{"\n"}'
kubectl get pod high-priority-pod -o jsonpath='{.spec.priority}{"\n"}'
# 8. Clean up
kubectl delete pod low-priority-pod high-priority-pod
kubectl delete priorityclass high-priority low-priority
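When a high-priority Pod cannot be scheduled, the scheduler looks for nodes where evicting lower-priority Pods would free enough room. A toy sketch of the victim-selection idea (greatly simplified: the real logic also honors PodDisruptionBudgets and affinity, and picks the node with least disruption):

```python
def pick_victims(pending_priority, pending_cpu, node_free_cpu, running_pods):
    """Evict lowest-priority Pods first until the pending Pod fits, or give up."""
    victims, free = [], node_free_cpu
    # Only Pods with strictly lower priority are candidates; try lowest first.
    for pod in sorted((p for p in running_pods if p["priority"] < pending_priority),
                      key=lambda p: p["priority"]):
        if free >= pending_cpu:
            break
        victims.append(pod["name"])
        free += pod["cpu"]
    return victims if free >= pending_cpu else None

running = [
    {"name": "low-priority-pod", "priority": 100, "cpu": 500},
    {"name": "mid-priority-pod", "priority": 5000, "cpu": 500},
]
# A priority-1000000 Pod needing 600m CPU on a nearly full node (200m free):
print(pick_victims(1000000, 600, 200, running))  # ['low-priority-pod']
```

This is why the experiment's two PriorityClasses matter: with 100 vs. 1000000, the low-priority Pod is the first (and here only) eviction candidate when resources run out.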
Setting Up the Lab Environment
Quick Setup Script
#!/bin/bash
# Kubernetes scheduler lab environment setup script
set -e
echo "Setting up the Kubernetes scheduler lab environment..."
# 1. Create the test cluster (requires kind)
kind create cluster --name scheduler-test --config - <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
networking:
podSubnet: "10.244.0.0/16"
serviceSubnet: "10.96.0.0/12"
EOF
# 2. Wait for the cluster to become ready
kubectl wait --for=condition=Ready node --all --timeout=60s
# 3. Deploy test tooling (the nixery.dev image provides stress-ng)
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: scheduler-tools
spec:
replicas: 1
selector:
matchLabels:
app: scheduler-tools
template:
metadata:
labels:
app: scheduler-tools
spec:
containers:
- name: tools
image: nixery.dev/shell/stress-ng
command: ["sleep", "3600"]
EOF
# 4. Create test workloads
kubectl apply -f - <<EOF
# Test workload 1: CPU-intensive
apiVersion: v1
kind: Pod
metadata:
name: cpu-intensive
spec:
containers:
- name: test
image: nixery.dev/shell/stress-ng
command: ["stress-ng", "--cpu", "2", "--timeout", "60s"]
resources:
requests:
cpu: 1
memory: 128Mi
limits:
cpu: 2
memory: 256Mi
---
# Test workload 2: memory-intensive
apiVersion: v1
kind: Pod
metadata:
name: memory-intensive
spec:
containers:
- name: test
image: nixery.dev/shell/stress-ng
command: ["stress-ng", "--vm", "1", "--vm-bytes", "256M", "--timeout", "60s"]
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 200m
memory: 512Mi
EOF
echo "Kubernetes scheduler lab environment is ready!"
echo "Run 'kubectl get nodes' to check cluster status"
echo "Run 'kubectl get pods' to check Pod status"
Test Case Collection
# Create the test cases
kubectl apply -f - <<EOF
# Test case 1: basic scheduling
apiVersion: v1
kind: Pod
metadata:
name: basic-scheduling
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
---
# Test case 2: resource limits
apiVersion: v1
kind: Pod
metadata:
name: resource-limited
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: 400m
memory: 512Mi
---
# Test case 3: node selection
apiVersion: v1
kind: Pod
metadata:
name: node-selected
spec:
containers:
- name: test
image: nginx:alpine
nodeSelector:
kubernetes.io/hostname: node-1
---
# Test case 4: affinity scheduling
apiVersion: v1
kind: Pod
metadata:
name: affinity-scheduled
spec:
containers:
- name: test
image: nginx:alpine
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- node-1
EOF
Command Cheat Sheet
Scheduling Management Commands
| Command | Purpose | Example |
|---|---|---|
| `kubectl get pods -o wide` | Show Pod scheduling results | `kubectl get pods -o wide` |
| `kubectl describe pod <pod>` | Show scheduling details for a Pod | `kubectl describe pod my-pod` |
| `kubectl get nodes -o wide` | Show node status | `kubectl get nodes -o wide` |
| `kubectl describe node <node>` | Show detailed node information | `kubectl describe node node-1` |
| `kubectl top node` | Show node resource usage | `kubectl top node` |
| `kubectl top pod` | Show Pod resource usage | `kubectl top pod` |
Scheduling Policy Commands
| Command | Purpose | Example |
|---|---|---|
| `kubectl label node <node> <key>=<value>` | Label a node | `kubectl label node node-1 disktype=ssd` |
| `kubectl taint node <node> <key>=<value>:<effect>` | Taint a node | `kubectl taint node node-1 gpu=true:NoSchedule` |
| `kubectl get priorityclass` | List priority classes | `kubectl get priorityclass` |
| `kubectl get events --sort-by=.lastTimestamp` | Show scheduling events | `kubectl get events --sort-by=.lastTimestamp` |
Resource Management Commands
| Command | Purpose | Example |
|---|---|---|
| `kubectl get resourcequota` | List resource quotas | `kubectl get resourcequota` |
| `kubectl get limitrange` | List limit ranges | `kubectl get limitrange` |
| `kubectl describe resourcequota <quota>` | Show quota details | `kubectl describe resourcequota default` |
| `kubectl describe limitrange <limit>` | Show limit-range details | `kubectl describe limitrange default` |
Scheduler Debugging Commands
| Command | Purpose |
|---|---|
| `kubectl logs -n kube-system -l component=kube-scheduler` | Show scheduler logs |
| `kubectl get --raw /metrics \| grep scheduler` | Show scheduler-related API server metrics |
| `kubectl get configmap -n kube-system \| grep scheduler` | Find scheduler ConfigMaps |