HiHuo

Part 4: Deep Dive into the Kubernetes Scheduler

The complete scheduling system, from resource allocation to priority preemption

Table of Contents

  • Chapter 6: The Kubernetes Scheduler and Resource Allocation
  • Visualizing the Scheduler Workflow
  • Resource Isolation in Practice
  • Advanced Scheduling Policies
  • Setting Up the Lab Environment
  • Command Cheat Sheet

Chapter 6: The Kubernetes Scheduler and Resource Allocation

6.1 Scheduler Workflow Overview

The scheduler's job in one sentence:

Decide which Node each Pod runs on.

6.1.1 Overall Flow

Pod created → Pending state
    ↓
kube-scheduler watches for new Pods
    ↓
1️⃣ Filtering (predicates) → which nodes can run the Pod
2️⃣ Scoring (priorities) → which node fits it best
    ↓
Bind → the kubelet on that node starts the containers

6.1.2 Flow Diagram

API Server
    ↓
Scheduler watches new Pods
    ↓
Filter phase
    ↓
Score phase
    ↓
Bind phase
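The three phases can be sketched in a few lines of shell. This is a toy model, not the real scheduler: the node names, capacities, and request size below are made up, and scoring uses a simplified least-requested formula.

```shell
#!/bin/sh
# Toy scheduler pass: each entry is "name:allocatable_cpu_m:requested_cpu_m"
# (all numbers are made up for illustration)
NODES="node-1:4000:3800 node-2:4000:1000 node-3:2000:1500"
POD_REQUEST_M=500   # the incoming Pod requests 500m CPU

best_node=""
best_score=-1
for n in $NODES; do
  name=${n%%:*}; rest=${n#*:}
  alloc=${rest%%:*}; used=${rest#*:}
  free=$((alloc - used))
  # Filter phase: drop nodes that cannot fit the request
  [ "$free" -lt "$POD_REQUEST_M" ] && continue
  # Score phase: LeastRequested-style score on a 0-100 scale (emptier = higher)
  score=$(( (free - POD_REQUEST_M) * 100 / alloc ))
  if [ "$score" -gt "$best_score" ]; then
    best_score=$score
    best_node=$name
  fi
done
# Bind phase: in a real cluster this would set .spec.nodeName
echo "bind to $best_node (score $best_score)"
```

Here node-1 is filtered out (only 200m free), and node-2 wins the score round.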

6.2 The Filter Phase (Predicates)

The scheduler first filters out nodes that cannot host the Pod.

6.2.1 Common Filter Rules (Predicates)

| Rule | Meaning | Example |
| --- | --- | --- |
| PodFitsResources | Does the node have enough CPU/memory? | Checks that requests fit |
| PodFitsHostPorts | Are the requested host ports free? | Checks for hostPort conflicts |
| HostName | Does the node match spec.nodeName? | Checks the pinned node |
| MatchNodeSelector | Does nodeSelector/node affinity match? | Checks node affinity |
| MatchInterPodAffinity | Are inter-Pod (anti-)affinity rules satisfied? | Checks affinity terms |
| CheckNodeUnschedulable | Is the node cordoned? | Checks node status |
| PodToleratesNodeTaints | Can the Pod tolerate the node's taints? | Checks tolerations |

6.2.2 Customizing the Filter Rules

Note: the Policy API below is the legacy configuration mechanism; it was deprecated and removed in Kubernetes 1.23 in favor of KubeSchedulerConfiguration plugins.

apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-config
  namespace: kube-system
data:
  policy.cfg: |
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "predicates": [
        {"name": "PodFitsResources"},
        {"name": "PodFitsHostPorts"},
        {"name": "HostName"},
        {"name": "MatchNodeSelector"},
        {"name": "MatchInterPodAffinity"},
        {"name": "CheckNodeUnschedulable"},
        {"name": "PodToleratesNodeTaints"}
      ]
    }
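On current releases the same customization is done with a KubeSchedulerConfiguration file passed to kube-scheduler via --config. A minimal sketch (the temp-file path is only for illustration; on a real control-plane host the file lives next to the kube-scheduler manifest):

```shell
# Modern replacement for the legacy Policy API: a KubeSchedulerConfiguration
# file handed to kube-scheduler via --config (path here is illustrative).
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    filter:
      enabled:
      - name: NodeResourcesFit      # replaces PodFitsResources
      - name: NodePorts             # replaces PodFitsHostPorts
      - name: TaintToleration       # replaces PodToleratesNodeTaints
EOF
echo "wrote $CONF"
```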

6.3 The Score Phase (Priorities)

After filtering, the scheduler scores each remaining feasible node.

6.3.1 Default Policies

| Policy | Description | Weight |
| --- | --- | --- |
| LeastRequestedPriority | Prefer nodes with the least requested resources | 1 |
| BalancedResourceAllocation | Prefer balanced CPU/memory utilization | 1 |
| NodeAffinityPriority | Prefer nodes matching preferred node affinity | 1 |
| ImageLocalityPriority | Prefer nodes that already have the image | 1 |

6.3.2 Scoring Formula

NodeScore = ∑(Weight_i × PluginScore_i)
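In shell, the weighted sum works out like this (the per-plugin scores are made-up example values on the usual 0–100 scale):

```shell
# NodeScore = Σ(Weight_i × PluginScore_i)
# Each entry is "plugin:weight:score"; scores are made-up examples.
plugins="LeastRequested:1:60 BalancedAllocation:1:80 NodeAffinity:1:100 ImageLocality:1:0"

node_score=0
for p in $plugins; do
  rest=${p#*:}            # strip the plugin name
  weight=${rest%%:*}
  score=${rest#*:}
  node_score=$((node_score + weight * score))
done
echo "NodeScore = $node_score"
```

The node with the highest total wins the Bind phase; raising a weight makes that plugin dominate the sum.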

6.3.3 Customizing the Scoring Rules (legacy Policy API)

apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-config
  namespace: kube-system
data:
  policy.cfg: |
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "priorities": [
        {"name": "LeastRequestedPriority", "weight": 1},
        {"name": "BalancedResourceAllocation", "weight": 1},
        {"name": "NodeAffinityPriority", "weight": 1},
        {"name": "ImageLocalityPriority", "weight": 1}
      ]
    }

6.4 Binding and Execution

Once a node is selected, the scheduler issues:

Bind(pod, node)

The API Server then sets the Pod's .spec.nodeName:

spec:
  nodeName: node-1

From there the kubelet on that node pulls images, mounts volumes, configures cgroups, and starts the containers.

6.5 Kubernetes Scheduling Policy Types

| Policy | Meaning | Example | Typical Use |
| --- | --- | --- | --- |
| nodeName | Pin to a specific node | spec.nodeName: node01 | Special hardware |
| nodeSelector | Simple label matching | disktype: ssd | Basic node selection |
| nodeAffinity | Advanced affinity | Expression matching, weighted scoring | Complex node selection |
| podAffinity | Inter-Pod affinity | Co-locate in the same zone/node | Co-located deployments |
| podAntiAffinity | Inter-Pod anti-affinity | Avoid landing on the same node | Spreading for HA |
| taints/tolerations | Taints and tolerations | Only tolerating Pods land on tainted nodes | Node isolation |
| priority/preemption | Priority and preemption | High-priority Pods can evict low-priority ones | Resource contention |

Visualizing the Scheduler Workflow

Experiment 1: Watching a Pod Get Scheduled

#!/bin/bash
# Visualize the scheduling process

echo "=== Visualizing the scheduling process ==="

# 1. Create a test Pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: test
    image: nginx:alpine
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi
EOF

# 2. Watch scheduling events in the background
echo "Watching scheduling events..."
kubectl get events --watch --sort-by=.lastTimestamp &
EVENTS_PID=$!

# 3. Watch the Pod's status transitions in the background
echo "Watching Pod status transitions..."
kubectl get pod test-pod -w &
WATCH_PID=$!

# Give the scheduler a moment, then stop the watchers
sleep 15
kill "$EVENTS_PID" "$WATCH_PID" 2>/dev/null

# 4. Inspect the scheduler logs
echo "Scheduler logs:"
kubectl logs -n kube-system -l component=kube-scheduler --tail=50

# 5. Show the final placement
echo "Final scheduling result:"
kubectl get pod test-pod -o wide

# 6. Clean up
kubectl delete pod test-pod

实验2:调度器配置验证

#!/bin/bash
# 调度器配置验证脚本

echo "=== 调度器配置验证 ==="

# 1. 查看当前调度器配置
echo "1. 查看当前调度器配置:"
kubectl get configmap -n kube-system | grep scheduler

# 2. 查看调度器日志
echo "2. 查看调度器日志:"
kubectl logs -n kube-system -l component=kube-scheduler --tail=100

# 3. 查看调度器指标
echo "3. 查看调度器指标:"
kubectl get --raw /metrics | grep scheduler

# 4. 测试调度性能
echo "4. 测试调度性能:"
time kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deployment
spec:
  replicas: 10
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - name: test
        image: nginx:alpine
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
EOF

# 5. 查看调度结果
echo "5. 查看调度结果:"
kubectl get pods -o wide

# 6. 清理
kubectl delete deployment test-deployment

Resource Isolation in Practice

Experiment 1: CPU Cgroup Limits

#!/bin/bash
# CPU cgroup limit experiment

echo "=== CPU cgroup limit experiment ==="

# 1. Create a CPU-limited Pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cpu-test
spec:
  containers:
  - name: test
    image: nginx:alpine
    resources:
      requests:
        cpu: 100m
      limits:
        cpu: 200m
    command: ["sh", "-c"]
    args:
    - |
      while true; do
        echo "CPU usage: \$(cat /proc/loadavg)"
        sleep 1
      done
EOF

# 2. Wait for the Pod to become Ready
kubectl wait --for=condition=Ready pod cpu-test --timeout=60s

# 3. Check the Pod's resource usage
echo "Pod resource usage:"
kubectl top pod cpu-test

# 4. Inspect the cgroup configuration
# Note: these are cgroup v1 paths; on cgroup v2 the limit lives in
# /sys/fs/cgroup/cpu.max instead.
echo "Cgroup configuration:"
kubectl exec cpu-test -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
kubectl exec cpu-test -- cat /sys/fs/cgroup/cpu/cpu.cfs_period_us

# 5. Start a CPU-intensive task
echo "Starting a CPU-intensive task:"
kubectl exec cpu-test -- sh -c "yes > /dev/null &"

# 6. Monitor CPU usage (should be throttled at the 200m limit)
echo "Monitoring CPU usage:"
for i in {1..10}; do
  kubectl top pod cpu-test
  sleep 2
done

# 7. Clean up
kubectl delete pod cpu-test
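The two cgroup values read in step 4 encode the CPU limit: millicores = cfs_quota_us × 1000 / cfs_period_us. A quick check against this experiment's 200m limit (the numbers below are what the kubelet writes for it, not values read from a live Pod):

```shell
# For a 200m CPU limit the kubelet sets quota=20000us against the default
# 100000us period; recover the millicore value from the two cgroup files.
quota_us=20000      # what cpu.cfs_quota_us would show for this Pod
period_us=100000    # what cpu.cfs_period_us would show
limit_m=$((quota_us * 1000 / period_us))
echo "CPU limit: ${limit_m}m"
```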

Experiment 2: Memory QoS Classes

#!/bin/bash
# Memory QoS class experiment

echo "=== Memory QoS class experiment ==="

# 1. Create Pods at each QoS level
kubectl apply -f - <<EOF
# Guaranteed QoS: every container needs requests == limits for BOTH
# CPU and memory, so CPU is set here as well
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: test
    image: nginx:alpine
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 100m
        memory: 256Mi
    command: ["sh", "-c"]
    args:
    - |
      echo "Guaranteed QoS Pod"
      sleep 3600
---
# Burstable QoS: requests set, limits higher
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
  - name: test
    image: nginx:alpine
    resources:
      requests:
        memory: 128Mi
      limits:
        memory: 512Mi
    command: ["sh", "-c"]
    args:
    - |
      echo "Burstable QoS Pod"
      sleep 3600
---
# BestEffort QoS: no requests or limits at all
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-pod
spec:
  containers:
  - name: test
    image: nginx:alpine
    command: ["sh", "-c"]
    args:
    - |
      echo "BestEffort QoS Pod"
      sleep 3600
EOF

# 2. Wait for the Pods
kubectl wait --for=condition=Ready pod guaranteed-pod burstable-pod besteffort-pod --timeout=60s

# 3. Check each Pod's QoS class
echo "QoS classes:"
kubectl get pod guaranteed-pod -o jsonpath='{.status.qosClass}{"\n"}'
kubectl get pod burstable-pod -o jsonpath='{.status.qosClass}{"\n"}'
kubectl get pod besteffort-pod -o jsonpath='{.status.qosClass}{"\n"}'

# 4. Check the OOM score adjustments
# /proc/1 is the container's main process; /proc/self would show the
# exec'd cat process instead
echo "oom_score_adj values:"
kubectl exec guaranteed-pod -- cat /proc/1/oom_score_adj
kubectl exec burstable-pod -- cat /proc/1/oom_score_adj
kubectl exec besteffort-pod -- cat /proc/1/oom_score_adj

# 5. Check the memory limits (cgroup v1 path; BestEffort shows the
# node-level maximum because it has no limit of its own)
echo "Memory limits:"
kubectl exec guaranteed-pod -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes
kubectl exec burstable-pod -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes
kubectl exec besteffort-pod -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes

# 6. Clean up
kubectl delete pod guaranteed-pod burstable-pod besteffort-pod
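The classification rule behind .status.qosClass can be written down directly. This is a simplified single-container sketch (the real kubelet evaluates every container and every resource in the Pod):

```shell
# Classify QoS the way the kubelet does, simplified to one container.
qos_class() {
  # args: cpu_request cpu_limit mem_request mem_limit ("" = unset)
  cr=$1; cl=$2; mr=$3; ml=$4
  if [ -z "$cr$cl$mr$ml" ]; then
    echo BestEffort                      # nothing set at all
  elif [ -n "$cr" ] && [ "$cr" = "$cl" ] && [ -n "$mr" ] && [ "$mr" = "$ml" ]; then
    echo Guaranteed                      # requests == limits for CPU and memory
  else
    echo Burstable                       # anything in between
  fi
}

qos_class 100m 100m 256Mi 256Mi   # like guaranteed-pod above
qos_class ""   ""   128Mi 512Mi   # like burstable-pod
qos_class ""   ""   ""    ""      # like besteffort-pod
```

This also shows why the Guaranteed Pod in the experiment must pin CPU as well as memory: equal memory values alone only earn Burstable.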

Experiment 3: NUMA Affinity in Practice

#!/bin/bash
# NUMA affinity experiment
# Note: exclusive CPU pinning requires the kubelet to run with the static
# CPU manager policy (--cpu-manager-policy=static), and the test image must
# contain numactl/taskset (nginx:alpine does not ship them).

echo "=== NUMA affinity experiment ==="

# 1. Check the node's NUMA topology (assumes SSH access to the node)
echo "Node NUMA topology:"
kubectl get nodes -o jsonpath='{.items[0].status.addresses[0].address}' | xargs -I {} ssh {} "numactl --hardware"

# 2. Create a Pod with integer CPU requests == limits (required for pinning)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: numa-pod
spec:
  containers:
  - name: test
    image: nginx:alpine
    resources:
      requests:
        cpu: 2
        memory: 1Gi
      limits:
        cpu: 2
        memory: 1Gi
    command: ["sh", "-c"]
    args:
    - |
      echo "NUMA Pod"
      numactl --hardware
      sleep 3600
  nodeSelector:
    kubernetes.io/hostname: node-1
EOF

# 3. Wait for the Pod
kubectl wait --for=condition=Ready pod numa-pod --timeout=60s

# 4. Check the NUMA binding
echo "NUMA binding:"
kubectl exec numa-pod -- numactl --show

# 5. Check the CPU affinity of the container's main process
# (PID 1 inside the container; a bare \$\$ would expand in the local shell)
echo "CPU affinity:"
kubectl exec numa-pod -- sh -c 'taskset -cp 1'

# 6. Clean up
kubectl delete pod numa-pod
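If the image lacks taskset, you can read Cpus_allowed_list from /proc/1/status instead. Parsing that line looks like this (the sample value 2-3 is what a Pod with two exclusive CPUs might show):

```shell
# Count CPUs from a /proc/1/status "Cpus_allowed_list" line.
# The sample value is illustrative, not read from a live Pod.
line="Cpus_allowed_list: 2-3"
list=${line#*:}
list=$(echo "$list" | tr -d '[:space:]')
count=0
IFS=','
for part in $list; do
  case $part in
    *-*) start=${part%-*}; end=${part#*-}; count=$((count + end - start + 1)) ;;
    *)   count=$((count + 1)) ;;
  esac
done
unset IFS
echo "exclusive CPUs: $count"
```

A narrow range like 2-3 indicates pinning; the full node range means the CPU manager did not grant exclusive cores.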

Advanced Scheduling Policies

Experiment 1: Complex Node Affinity

#!/bin/bash
# Complex node affinity scenario

echo "=== Complex node affinity scenario ==="

# 1. Label the nodes
kubectl label nodes node-1 disktype=ssd
kubectl label nodes node-2 disktype=hdd
kubectl label nodes node-1 zone=zone-a
kubectl label nodes node-2 zone=zone-b

# 2. Create a Pod combining required and preferred affinity
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: affinity-pod
spec:
  containers:
  - name: test
    image: nginx:alpine
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
          - key: zone
            operator: In
            values:
            - zone-a
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node-1
EOF

# 3. Wait for the Pod
kubectl wait --for=condition=Ready pod affinity-pod --timeout=60s

# 4. Check the placement (should land on node-1)
echo "Scheduling result:"
kubectl get pod affinity-pod -o wide

# 5. Show the node labels
echo "Node labels:"
kubectl get nodes --show-labels

# 6. Clean up (trailing '-' removes a label)
kubectl delete pod affinity-pod
kubectl label nodes node-1 disktype-
kubectl label nodes node-2 disktype-
kubectl label nodes node-1 zone-
kubectl label nodes node-2 zone-
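The required term above ANDs two matchExpressions. The In operator itself is plain set membership, which a few lines of shell can mimic (node_labels reuses the labels applied in step 1):

```shell
# Evaluate nodeAffinity "key In values" expressions against node labels.
# Labels mirror what step 1 applied to node-1.
node_labels="disktype=ssd zone=zone-a"

label_in() {  # label_in <key> <value...>: succeeds if a node label matches
  key=$1; shift
  for l in $node_labels; do
    [ "${l%%=*}" = "$key" ] || continue
    for v in "$@"; do
      [ "${l#*=}" = "$v" ] && return 0
    done
  done
  return 1
}

# requiredDuringScheduling: disktype In (ssd) AND zone In (zone-a)
if label_in disktype ssd && label_in zone zone-a; then
  echo "node passes required node affinity"
else
  echo "node filtered out"
fi
```

node-2 (disktype=hdd, zone=zone-b) fails the first expression, which is why only node-1 survives the Filter phase.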

Experiment 2: High Availability via Pod Anti-Affinity

#!/bin/bash
# High availability via Pod anti-affinity
# Note: with REQUIRED anti-affinity, any replica beyond the number of
# eligible nodes stays Pending

echo "=== High availability via Pod anti-affinity ==="

# 1. Create the anti-affinity Deployment
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ha-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ha-app
  template:
    metadata:
      labels:
        app: ha-app
    spec:
      containers:
      - name: test
        image: nginx:alpine
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - ha-app
            topologyKey: kubernetes.io/hostname
EOF

# 2. Wait for the Deployment
kubectl wait --for=condition=Available deployment ha-deployment --timeout=60s

# 3. Show the Pod distribution
echo "Pod distribution:"
kubectl get pods -o wide

# 4. Verify the spread (column 7 of 'get pods -o wide' is NODE)
echo "Pods per node:"
kubectl get pods -o wide | grep ha-app | awk '{print $7}' | sort | uniq -c

# 5. Test failover by deleting one replica
echo "Testing failover:"
kubectl delete pod $(kubectl get pods -l app=ha-app -o jsonpath='{.items[0].metadata.name}')

# 6. Wait for the replacement Pod
kubectl wait --for=condition=Ready pod -l app=ha-app --timeout=60s

# 7. Show the new distribution
echo "New Pod distribution:"
kubectl get pods -o wide

# 8. Clean up
kubectl delete deployment ha-deployment
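Step 4's uniq -c check can be turned into a pass/fail test: with required anti-affinity, the maximum replica count on any single node must be 1. Here with hard-coded pod→node pairs standing in for the live kubectl output:

```shell
# Verify the anti-affinity spread: no node may run two ha-app replicas.
# Placements below are made-up stand-ins for `kubectl get pods -o wide`.
placements="ha-1:worker-1 ha-2:worker-2 ha-3:worker-3"

max_per_node=$(printf '%s\n' $placements | cut -d: -f2 | sort | uniq -c |
               awk '{print $1}' | sort -rn | head -1)
echo "max replicas on a single node: $max_per_node"
```

Any value above 1 means the anti-affinity rule was not enforced (for example, because preferred rather than required anti-affinity was used).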

Experiment 3: Taints and Tolerations in Production

#!/bin/bash
# Taints and tolerations in production

echo "=== Taints and tolerations ==="

# 1. Taint the nodes
kubectl taint nodes node-1 gpu=true:NoSchedule
kubectl taint nodes node-2 gpu=true:NoSchedule

# 2. Create an ordinary Pod (should stay Pending)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: normal-pod
spec:
  containers:
  - name: test
    image: nginx:alpine
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
EOF

# 3. Give the scheduler a moment
sleep 10

# 4. Check the Pod status (expect Pending)
echo "Ordinary Pod status:"
kubectl get pod normal-pod

# 5. Create a Pod that tolerates the taint
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: test
    image: nginx:alpine
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
  tolerations:
  - key: gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
EOF

# 6. Wait for the tolerating Pod
kubectl wait --for=condition=Ready pod gpu-pod --timeout=60s

# 7. Check the placement
echo "Scheduling result:"
kubectl get pod gpu-pod -o wide

# 8. Clean up (trailing '-' removes the taint)
kubectl delete pod normal-pod gpu-pod
kubectl taint nodes node-1 gpu-
kubectl taint nodes node-2 gpu-
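The matching rule is small enough to spell out: a toleration matches a taint when the keys are equal, the effect matches (an empty toleration effect matches any), and the operator is Exists, or Equal with equal values. A simplified sketch:

```shell
# Does a toleration match a taint? Simplified version of the rule the
# scheduler applies for the PodToleratesNodeTaints check.
tolerates() {
  # args: taint_key taint_value taint_effect tol_key tol_op tol_value tol_effect
  tk=$1; tv=$2; te=$3; k=$4; op=$5; v=$6; e=$7
  [ "$k" = "$tk" ] || return 1
  [ -z "$e" ] || [ "$e" = "$te" ] || return 1   # empty effect matches all
  case $op in
    Exists) return 0 ;;
    Equal)  [ "$v" = "$tv" ] ;;
    *)      return 1 ;;
  esac
}

tolerates gpu true NoSchedule gpu Equal true NoSchedule && echo "scheduled"
tolerates gpu true NoSchedule gpu Equal false NoSchedule || echo "filtered out"
```

This is why normal-pod (no tolerations at all) never clears the Filter phase while gpu-pod does.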

Experiment 4: Priority and Preemption

#!/bin/bash
# Priority and preemption
# Note: preemption only triggers when the high-priority Pod cannot be
# scheduled any other way; on an idle cluster both Pods simply run side by side

echo "=== Priority and preemption ==="

# 1. Create the PriorityClasses
kubectl apply -f - <<EOF
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High priority class"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Low priority class"
EOF

# 2. Create a low-priority Pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: low-priority-pod
spec:
  containers:
  - name: test
    image: nginx:alpine
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
  priorityClassName: low-priority
EOF

# 3. Wait for the low-priority Pod
kubectl wait --for=condition=Ready pod low-priority-pod --timeout=60s

# 4. Create a high-priority Pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: high-priority-pod
spec:
  containers:
  - name: test
    image: nginx:alpine
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
  priorityClassName: high-priority
EOF

# 5. Wait for the high-priority Pod
kubectl wait --for=condition=Ready pod high-priority-pod --timeout=60s

# 6. Show the Pod status
echo "Pod status:"
kubectl get pods -o wide

# 7. Show the resolved priorities
echo "Resolved priorities:"
kubectl get pod low-priority-pod -o jsonpath='{.spec.priority}{"\n"}'
kubectl get pod high-priority-pod -o jsonpath='{.spec.priority}{"\n"}'

# 8. Clean up
kubectl delete pod low-priority-pod high-priority-pod
kubectl delete priorityclass high-priority low-priority
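When preemption does trigger, the scheduler looks for victims among Pods with strictly lower priority than the pending Pod, preferring the lowest. The core of that selection can be sketched like this, with made-up pod names and priorities:

```shell
# Preemption sketch: pick the lowest-priority running Pod that sits
# strictly below the incoming Pod's priority. All data is made up.
incoming_priority=1000000
running="pod-a:100 pod-b:500 pod-c:2000000"   # name:priority pairs

victim=""
victim_prio=$incoming_priority
for p in $running; do
  prio=${p#*:}
  # only Pods with strictly lower priority are preemption candidates
  if [ "$prio" -lt "$victim_prio" ]; then
    victim=${p%%:*}
    victim_prio=$prio
  fi
done
echo "would preempt: $victim (priority $victim_prio)"
```

pod-c survives because its priority exceeds the incoming Pod's; equal priorities are also safe, since only strictly lower ones qualify.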

Setting Up the Lab Environment

Quick Setup Script

#!/bin/bash
# Lab environment setup for the scheduler experiments

set -e

echo "Setting up the Kubernetes scheduler lab environment..."

# 1. Create the test cluster
kind create cluster --name scheduler-test --config - <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/12"
EOF

# 2. Wait for the nodes (allow extra time for image pulls)
kubectl wait --for=condition=Ready node --all --timeout=180s

# 3. Install the test tooling
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scheduler-tools
spec:
  replicas: 1
  selector:
    matchLabels:
      app: scheduler-tools
  template:
    metadata:
      labels:
        app: scheduler-tools
    spec:
      containers:
      - name: tools
        image: nixery.dev/shell/stress-ng
        command: ["sleep", "3600"]
EOF

# 4. Create the test workloads
kubectl apply -f - <<EOF
# Workload 1: CPU-intensive
apiVersion: v1
kind: Pod
metadata:
  name: cpu-intensive
spec:
  containers:
  - name: test
    image: nixery.dev/shell/stress-ng
    command: ["stress-ng", "--cpu", "2", "--timeout", "60s"]
    resources:
      requests:
        cpu: 1
        memory: 128Mi
      limits:
        cpu: 2
        memory: 256Mi
---
# Workload 2: memory-intensive
apiVersion: v1
kind: Pod
metadata:
  name: memory-intensive
spec:
  containers:
  - name: test
    image: nixery.dev/shell/stress-ng
    command: ["stress-ng", "--vm", "1", "--vm-bytes", "256M", "--timeout", "60s"]
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 200m
        memory: 512Mi
EOF

echo "Scheduler lab environment is ready!"
echo "Run 'kubectl get nodes' to check the cluster"
echo "Run 'kubectl get pods' to check the Pods"

Test Case Collection

# Create the test cases
kubectl apply -f - <<EOF
# Test case 1: basic scheduling
apiVersion: v1
kind: Pod
metadata:
  name: basic-scheduling
spec:
  containers:
  - name: test
    image: nginx:alpine
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
---
# Test case 2: resource limits
apiVersion: v1
kind: Pod
metadata:
  name: resource-limited
spec:
  containers:
  - name: test
    image: nginx:alpine
    resources:
      requests:
        cpu: 200m
        memory: 256Mi
      limits:
        cpu: 400m
        memory: 512Mi
---
# Test case 3: node selection
apiVersion: v1
kind: Pod
metadata:
  name: node-selected
spec:
  containers:
  - name: test
    image: nginx:alpine
  nodeSelector:
    kubernetes.io/hostname: node-1
---
# Test case 4: affinity scheduling
apiVersion: v1
kind: Pod
metadata:
  name: affinity-scheduled
spec:
  containers:
  - name: test
    image: nginx:alpine
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node-1
EOF

Command Cheat Sheet

Scheduling Commands

| Command | Purpose | Example |
| --- | --- | --- |
| kubectl get pods -o wide | Show Pod placement | kubectl get pods -o wide |
| kubectl describe pod <pod> | Show a Pod's scheduling details | kubectl describe pod my-pod |
| kubectl get nodes -o wide | Show node status | kubectl get nodes -o wide |
| kubectl describe node <node> | Show node details | kubectl describe node node-1 |
| kubectl top node | Show node resource usage | kubectl top node |
| kubectl top pod | Show Pod resource usage | kubectl top pod |

Scheduling Policy Commands

| Command | Purpose | Example |
| --- | --- | --- |
| kubectl label node <node> <key>=<value> | Label a node | kubectl label node node-1 disktype=ssd |
| kubectl taint node <node> <key>=<value>:<effect> | Taint a node | kubectl taint node node-1 gpu=true:NoSchedule |
| kubectl get priorityclass | List priority classes | kubectl get priorityclass |
| kubectl get events --sort-by=.lastTimestamp | Show scheduling events | kubectl get events --sort-by=.lastTimestamp |

Resource Management Commands

| Command | Purpose | Example |
| --- | --- | --- |
| kubectl get resourcequota | List resource quotas | kubectl get resourcequota |
| kubectl get limitrange | List limit ranges | kubectl get limitrange |
| kubectl describe resourcequota <quota> | Show quota details | kubectl describe resourcequota default |
| kubectl describe limitrange <limit> | Show limit range details | kubectl describe limitrange default |

Scheduler Debugging Commands

| Command | Purpose |
| --- | --- |
| kubectl logs -n kube-system -l component=kube-scheduler | Scheduler logs |
| kubectl get --raw /metrics \| grep scheduler | Scheduler-related metrics |
| kubectl get configmap -n kube-system \| grep scheduler | Scheduler ConfigMaps |
