Part 4: A Deep Dive into the Kubernetes Scheduler
From resource allocation to priority-based preemption: the complete scheduling pipeline
Contents
Chapter 6: The Kubernetes Scheduler and Resource Allocation
6.1 Scheduler Workflow Overview
What the Kubernetes scheduler is responsible for:
deciding which Node each Pod runs on.
6.1.1 Overall Flow
Pod created → Pending state
↓
kube-scheduler watches for new Pods
↓
1️⃣ Filtering (predicates) → which nodes can run the Pod
2️⃣ Scoring (priorities) → which node fits best
↓
Bind → the kubelet on the chosen node starts the containers
6.1.2 Flow Diagram
API Server
↓
Scheduler watches for new Pods
↓
Filter phase
↓
Score phase
↓
Bind phase
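The three phases above can be strung together in a few lines of Python. This is only a sketch of the control flow, not the real implementation; the node shapes and toy plugins are invented for illustration:

```python
def schedule_one(pod, nodes, filters, scorers):
    """One scheduling cycle: Filter -> Score -> pick the best node to bind."""
    # Filter phase: keep only nodes that pass every predicate.
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None  # no feasible node: the Pod stays Pending
    # Score phase: NodeScore = sum of weighted plugin scores; highest wins.
    return max(feasible, key=lambda n: sum(w * s(pod, n) for s, w in scorers))

# Toy plugins: nodes are dicts carrying their free CPU in millicores.
filters = [lambda pod, n: n["free_cpu"] >= pod["cpu"]]
scorers = [(lambda pod, n: n["free_cpu"], 1)]  # emptiest node wins
nodes = [{"name": "node-1", "free_cpu": 300}, {"name": "node-2", "free_cpu": 1500}]
best = schedule_one({"cpu": 200}, nodes, filters, scorers)
print(best["name"])  # node-2
```

The real scheduler runs the same shape of loop per Pod, but the filters and scorers are pluggable framework plugins rather than lambdas.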
6.2 The Filter Phase (Predicates)
The scheduler first filters out nodes that cannot run the Pod.
6.2.1 Common Filter Rules (Predicates)
| Rule | Meaning | Example |
|---|---|---|
| PodFitsResources | Does the node have enough allocatable resources (CPU/memory)? | Checks that the Pod's requests fit |
| PodFitsHostPorts | Are the requested host ports free? | Checks for hostPort conflicts |
| PodFitsNodeName | Does the node match spec.nodeName? | Checks a pinned node |
| PodFitsAffinity | Are affinity/anti-affinity rules satisfied? | Checks node affinity |
| NodeUnschedulable | Is the node marked unschedulable? | Checks node state (cordon) |
| TaintsTolerations | Does the Pod tolerate the node's taints? | Checks taints vs. tolerations |
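As a rough illustration (not the scheduler's actual code), PodFitsResources boils down to comparing the Pod's requests against what is left of the node's allocatable capacity. Quantities here are plain numbers (millicores, bytes) instead of the real resource.Quantity type:

```python
def pod_fits_resources(pod_requests, node_allocatable, node_requested):
    """Return True if the Pod's requests fit into the node's remaining capacity."""
    for resource, requested in pod_requests.items():
        free = node_allocatable.get(resource, 0) - node_requested.get(resource, 0)
        if requested > free:
            return False  # node is filtered out in the Filter phase
    return True

# A node with 2000m CPU / 4Gi memory, of which 1800m / 1Gi is already requested:
allocatable = {"cpu": 2000, "memory": 4 * 1024**3}
requested = {"cpu": 1800, "memory": 1 * 1024**3}
print(pod_fits_resources({"cpu": 100, "memory": 128 * 1024**2}, allocatable, requested))  # True
print(pod_fits_resources({"cpu": 500, "memory": 128 * 1024**2}, allocatable, requested))  # False
```

Note that the comparison uses *requests*, not actual usage: a node running idle but fully-requested Pods still rejects new Pods.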
6.2.2 Custom Filter Rules
Note: the Policy-based configuration below is legacy. The Policy API was deprecated and removed in newer Kubernetes releases (v1.23+), where filter behavior is instead configured through plugins in a KubeSchedulerConfiguration.
apiVersion: v1
kind: ConfigMap
metadata:
name: scheduler-config
namespace: kube-system
data:
policy.cfg: |
{
"kind": "Policy",
"apiVersion": "v1",
"predicates": [
{"name": "PodFitsResources"},
{"name": "PodFitsHostPorts"},
{"name": "PodFitsNodeName"},
{"name": "PodFitsAffinity"},
{"name": "NodeUnschedulable"},
{"name": "TaintsTolerations"}
]
}
6.3 The Score Phase (Priorities)
After filtering, the scheduler "scores" each remaining feasible node.
6.3.1 Default Policies
| Policy | Description | Weight |
|---|---|---|
| LeastRequestedPriority | Prefer nodes with the fewest requested resources | 1 |
| BalancedResourceAllocation | Prefer balanced CPU/memory utilization | 1 |
| NodeAffinityPriority | Score by node-affinity match | 1 |
| ImageLocalityPriority | Prefer nodes that already hold the image locally | 1 |
6.3.2 Scoring Formula
NodeScore = ∑(Weight_i × PluginScore_i)
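To make the formula concrete, here is a small sketch (simplified, not the scheduler's source). LeastRequested scores a node per resource as (allocatable − requested) / allocatable × 100, averages over CPU and memory, and the final node score is the weight-multiplied sum over all scoring plugins:

```python
def least_requested_score(requests, allocatable):
    """Simplified LeastRequestedPriority: emptier nodes score higher (0-100)."""
    per_resource = [
        (allocatable[r] - requests[r]) * 100 // allocatable[r]
        for r in ("cpu", "memory")
    ]
    return sum(per_resource) // len(per_resource)

def node_score(plugin_scores, weights):
    """NodeScore = sum(Weight_i * PluginScore_i)."""
    return sum(weights[name] * score for name, score in plugin_scores.items())

# Node with 2000m CPU / 8Gi allocatable, 500m / 2Gi already requested:
scores = {
    "LeastRequested": least_requested_score({"cpu": 500, "memory": 2 * 1024**3},
                                            {"cpu": 2000, "memory": 8 * 1024**3}),
    "ImageLocality": 100,  # assumed: the image is already present on this node
}
print(node_score(scores, {"LeastRequested": 1, "ImageLocality": 1}))  # 175
```

Raising a plugin's weight in the config shifts the balance: for example, weighting ImageLocality higher makes the scheduler favor warm image caches over even spreading.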
6.3.3 Custom Scoring Rules
(Same legacy Policy format as in 6.2.2.)
apiVersion: v1
kind: ConfigMap
metadata:
name: scheduler-config
namespace: kube-system
data:
policy.cfg: |
{
"kind": "Policy",
"apiVersion": "v1",
"priorities": [
{"name": "LeastRequestedPriority", "weight": 1},
{"name": "BalancedResourceAllocation", "weight": 1},
{"name": "NodeAffinityPriority", "weight": 1},
{"name": "ImageLocalityPriority", "weight": 1}
]
}
6.4 The Bind Phase
Once a node has been selected, the scheduler calls:
Bind(pod, node)
The API Server then records the decision in the Pod's .spec.nodeName:
spec:
nodeName: node-1
After that, the kubelet on the node pulls the container images, starts the containers, mounts volumes, and sets up cgroups.
6.5 Kubernetes Scheduling Policy Types
| Policy | Meaning | Example | Typical Use |
|---|---|---|---|
| nodeName | Pin to a fixed node | spec.nodeName: node01 | Special hardware requirements |
| nodeSelector | Simple label matching | disktype: ssd | Simple node selection |
| nodeAffinity | Advanced affinity | Expression matching, weighted scoring | Complex node selection |
| podAffinity | Inter-Pod affinity | Co-locate in the same zone or on the same node | Co-locating related workloads |
| podAntiAffinity | Inter-Pod anti-affinity | Avoid placing replicas on the same node | High availability, spreading load |
| taints/tolerations | Taints and tolerations | Steer specific Pods onto specific nodes | Node isolation |
| priority/preemption | Priority and preemption | High-priority Pods may evict low-priority ones | Resource contention |
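Of these, taints and tolerations are the one mechanism where the node repels Pods rather than attracting them. A rough sketch of the matching rule (simplified from the full semantics, which also distinguish NoExecute and tolerationSeconds): a Pod may land on a node only if every taint is matched by some toleration:

```python
def tolerates(toleration, taint):
    """Does a single toleration match a single taint? (simplified)"""
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False  # empty key + Exists tolerates everything
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False  # empty effect matches all effects
    op = toleration.get("operator", "Equal")
    if op == "Exists":
        return True
    return toleration.get("value") == taint.get("value")

def pod_fits_taints(tolerations, taints):
    """Every taint on the node must be tolerated by at least one toleration."""
    return all(any(tolerates(tol, t) for tol in tolerations) for t in taints)

taints = [{"key": "gpu", "value": "true", "effect": "NoSchedule"}]
print(pod_fits_taints([], taints))                                          # False
print(pod_fits_taints([{"key": "gpu", "operator": "Equal",
                        "value": "true", "effect": "NoSchedule"}], taints))  # True
```

This is the logic exercised by the taints/tolerations experiment later in this chapter: the ordinary Pod fails the check and stays Pending, while the tolerating Pod passes.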
Visualizing the Scheduler Workflow
Experiment 1: Visualizing the Scheduling Process
#!/bin/bash
# Scheduling visualization script
echo "=== Scheduling visualization ==="
# 1. Create a test Pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
EOF
# 2. Watch scheduling events in the background
echo "Watching scheduling events..."
kubectl get events --watch --sort-by=.lastTimestamp &
EVENTS_PID=$!
# 3. Watch Pod status changes in the background
echo "Watching Pod status changes..."
kubectl get pod test-pod -w &
WATCH_PID=$!
sleep 15
kill "$EVENTS_PID" "$WATCH_PID" 2>/dev/null
# 4. Check the scheduler logs
echo "Checking scheduler logs..."
kubectl logs -n kube-system -l component=kube-scheduler --tail=50
# 5. Show the final scheduling result
echo "Final scheduling result:"
kubectl get pod test-pod -o wide
# 6. Clean up
kubectl delete pod test-pod
Experiment 2: Verifying the Scheduler Configuration
#!/bin/bash
# Scheduler configuration verification script
echo "=== Scheduler configuration verification ==="
# 1. Check the current scheduler configuration
echo "1. Current scheduler configuration:"
kubectl get configmap -n kube-system | grep scheduler
# 2. Check the scheduler logs
echo "2. Scheduler logs:"
kubectl logs -n kube-system -l component=kube-scheduler --tail=100
# 3. Check scheduler-related metrics
#    (note: this queries the API server's /metrics; kube-scheduler serves its own
#    metrics endpoint, typically on port 10259)
echo "3. Scheduler-related metrics:"
kubectl get --raw /metrics | grep scheduler
# 4. Measure scheduling performance with a 10-replica Deployment
echo "4. Scheduling performance test:"
time kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: test-deployment
spec:
replicas: 10
selector:
matchLabels:
app: test
template:
metadata:
labels:
app: test
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
EOF
# 5. Show the scheduling results
echo "5. Scheduling results:"
kubectl get pods -o wide
# 6. Clean up
kubectl delete deployment test-deployment
Resource Isolation in Practice
Experiment 1: CPU Cgroup Limits
#!/bin/bash
# CPU cgroup limit experiment
echo "=== CPU cgroup limit experiment ==="
# 1. Create a CPU-limited Pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: cpu-test
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
limits:
cpu: 200m
command: ["sh", "-c"]
args:
- |
while true; do
echo "CPU usage: \$(cat /proc/loadavg)"
sleep 1
done
EOF
# 2. Wait for the Pod to become ready
kubectl wait --for=condition=Ready pod cpu-test --timeout=60s
# 3. Check the Pod's resource usage
echo "Pod resource usage:"
kubectl top pod cpu-test
# 4. Inspect the cgroup CPU limits
#    (cgroup v1 paths; on cgroup v2 the equivalent is /sys/fs/cgroup/cpu.max)
echo "Cgroup configuration:"
kubectl exec cpu-test -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
kubectl exec cpu-test -- cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
# 5. Start a CPU-intensive task inside the container
echo "Starting a CPU-intensive task:"
kubectl exec cpu-test -- sh -c "yes > /dev/null &"
# 6. Monitor CPU usage (it should stay capped near the 200m limit)
echo "Monitoring CPU usage:"
for i in {1..10}; do
kubectl top pod cpu-test
sleep 2
done
# 7. Clean up
kubectl delete pod cpu-test
Experiment 2: Memory QoS Classes
#!/bin/bash
# Memory QoS class test
echo "=== Memory QoS class test ==="
# 1. Create one Pod at each QoS level
kubectl apply -f - <<EOF
# Guaranteed QoS
apiVersion: v1
kind: Pod
metadata:
name: guaranteed-pod
spec:
containers:
- name: test
image: nginx:alpine
# Guaranteed requires requests == limits for BOTH CPU and memory
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 250m
    memory: 256Mi
command: ["sh", "-c"]
args:
- |
echo "Guaranteed QoS Pod"
sleep 3600
---
# Burstable QoS
apiVersion: v1
kind: Pod
metadata:
name: burstable-pod
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
memory: 128Mi
limits:
memory: 512Mi
command: ["sh", "-c"]
args:
- |
echo "Burstable QoS Pod"
sleep 3600
---
# BestEffort QoS
apiVersion: v1
kind: Pod
metadata:
name: besteffort-pod
spec:
containers:
- name: test
image: nginx:alpine
command: ["sh", "-c"]
args:
- |
echo "BestEffort QoS Pod"
sleep 3600
EOF
# 2. Wait for the Pods to become ready
kubectl wait --for=condition=Ready pod guaranteed-pod burstable-pod besteffort-pod --timeout=60s
# 3. Check each Pod's QoS class
echo "QoS classes:"
kubectl get pod guaranteed-pod -o jsonpath='{.status.qosClass}{"\n"}'
kubectl get pod burstable-pod -o jsonpath='{.status.qosClass}{"\n"}'
kubectl get pod besteffort-pod -o jsonpath='{.status.qosClass}{"\n"}'
# 4. Check the OOM score adjustment of each container's init process
#    (/proc/1, not /proc/self, which would be the exec'd cat process)
echo "OOM score adjustments:"
kubectl exec guaranteed-pod -- cat /proc/1/oom_score_adj
kubectl exec burstable-pod -- cat /proc/1/oom_score_adj
kubectl exec besteffort-pod -- cat /proc/1/oom_score_adj
# 5. Check the memory limits
#    (cgroup v1 path; on cgroup v2 use /sys/fs/cgroup/memory.max)
echo "Memory limits:"
kubectl exec guaranteed-pod -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes
kubectl exec burstable-pod -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes
kubectl exec besteffort-pod -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes
# 6. Clean up
kubectl delete pod guaranteed-pod burstable-pod besteffort-pod
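The QoS class the experiment prints is derived purely from the containers' requests and limits. A minimal sketch of the classification rule (simplified; the real kubelet also folds in init containers and treats each resource individually):

```python
def qos_class(containers):
    """Classify a Pod as Guaranteed, Burstable, or BestEffort (simplified)."""
    # BestEffort: no container sets any requests or limits.
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    # Guaranteed: every container sets CPU and memory, with requests == limits.
    guaranteed = all(
        c.get("requests") and c.get("limits")
        and all(r in c["limits"] for r in ("cpu", "memory"))
        and all(c["limits"].get(r) == c["requests"].get(r) for r in ("cpu", "memory"))
        for c in containers
    )
    return "Guaranteed" if guaranteed else "Burstable"

print(qos_class([{"requests": {"cpu": "250m", "memory": "256Mi"},
                  "limits":   {"cpu": "250m", "memory": "256Mi"}}]))  # Guaranteed
print(qos_class([{"requests": {"memory": "128Mi"},
                  "limits":   {"memory": "512Mi"}}]))                 # Burstable
print(qos_class([{}]))                                                # BestEffort
```

Note the second case: setting only memory (no CPU) yields Burstable, not Guaranteed, which is why the experiment's Guaranteed Pod must specify both resources.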
Experiment 3: NUMA Affinity in Practice
#!/bin/bash
# NUMA affinity in practice
echo "=== NUMA affinity in practice ==="
# 1. Check the node's NUMA topology
#    (assumes SSH access to the node; otherwise run numactl on the node directly)
echo "Node NUMA topology:"
kubectl get nodes -o jsonpath='{.items[0].status.addresses[0].address}' | xargs -I {} ssh {} "numactl --hardware"
# 2. Create a Guaranteed Pod with integer CPU requests (exclusive cores require the
#    kubelet's CPU Manager static policy; also note nginx:alpine does not ship
#    numactl, so the numactl calls below assume an image that includes it)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: numa-pod
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 2
memory: 1Gi
limits:
cpu: 2
memory: 1Gi
command: ["sh", "-c"]
args:
- |
echo "NUMA Pod"
numactl --hardware
sleep 3600
nodeSelector:
kubernetes.io/hostname: node-1
EOF
# 3. Wait for the Pod to become ready
kubectl wait --for=condition=Ready pod numa-pod --timeout=60s
# 4. Check the NUMA binding inside the container
echo "NUMA binding:"
kubectl exec numa-pod -- numactl --show
# 5. Check the CPU binding of the container's init process
#    (use PID 1; "$$" would expand locally to this script's PID)
echo "CPU binding:"
kubectl exec numa-pod -- taskset -cp 1
# 6. Clean up
kubectl delete pod numa-pod
Advanced Scheduling Policies
Experiment 1: Complex Node Affinity Scenarios
#!/bin/bash
# Complex node affinity scenarios
echo "=== Complex node affinity scenarios ==="
# 1. Label the nodes (node names here assume the demo cluster)
kubectl label nodes node-1 disktype=ssd
kubectl label nodes node-2 disktype=hdd
kubectl label nodes node-1 zone=zone-a
kubectl label nodes node-2 zone=zone-b
# 2. Create a Pod combining required and preferred node affinity
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: affinity-pod
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: disktype
operator: In
values:
- ssd
- key: zone
operator: In
values:
- zone-a
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- node-1
EOF
# 3. Wait for the Pod to become ready
kubectl wait --for=condition=Ready pod affinity-pod --timeout=60s
# 4. Check the scheduling result (it should land on node-1)
echo "Scheduling result:"
kubectl get pod affinity-pod -o wide
# 5. Show the node labels
echo "Node labels:"
kubectl get nodes --show-labels
# 6. Clean up
kubectl delete pod affinity-pod
kubectl label nodes node-1 disktype-
kubectl label nodes node-2 disktype-
kubectl label nodes node-1 zone-
kubectl label nodes node-2 zone-
Experiment 2: High Availability with Pod Anti-Affinity
#!/bin/bash
# High availability with Pod anti-affinity
echo "=== High availability with Pod anti-affinity ==="
# 1. Create a Deployment with required anti-affinity (needs at least 3 schedulable nodes)
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: ha-deployment
spec:
replicas: 3
selector:
matchLabels:
app: ha-app
template:
metadata:
labels:
app: ha-app
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- ha-app
topologyKey: kubernetes.io/hostname
EOF
# 2. Wait for the Deployment to become available
kubectl wait --for=condition=Available deployment ha-deployment --timeout=60s
# 3. Show the Pod distribution
echo "Pod distribution:"
kubectl get pods -o wide
# 4. Verify the anti-affinity spread (expect one Pod per node)
echo "Pods per node:"
kubectl get pods -l app=ha-app -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c
# 5. Test failover by deleting one replica
echo "Testing failover:"
kubectl delete pod $(kubectl get pods -l app=ha-app -o jsonpath='{.items[0].metadata.name}')
# 6. Wait for the replacement Pod
kubectl wait --for=condition=Ready pod -l app=ha-app --timeout=60s
# 7. Show the new Pod distribution
echo "New Pod distribution:"
kubectl get pods -o wide
# 8. Clean up
kubectl delete deployment ha-deployment
Experiment 3: Taints and Tolerations in Production
#!/bin/bash
# Taints and tolerations in production
echo "=== Taints and tolerations in production ==="
# 1. Taint the nodes
kubectl taint nodes node-1 gpu=true:NoSchedule
kubectl taint nodes node-2 gpu=true:NoSchedule
# 2. Create an ordinary Pod (it should stay Pending)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: normal-pod
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
EOF
# 3. Give the scheduler a moment
sleep 10
# 4. Check the ordinary Pod's status (expect Pending)
echo "Ordinary Pod status:"
kubectl get pod normal-pod
# 5. Create a Pod that tolerates the taint
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
tolerations:
- key: gpu
operator: Equal
value: "true"
effect: NoSchedule
EOF
# 6. Wait for the Pod to become ready
kubectl wait --for=condition=Ready pod gpu-pod --timeout=60s
# 7. Check the scheduling result
echo "Scheduling result:"
kubectl get pod gpu-pod -o wide
# 8. Clean up
kubectl delete pod normal-pod gpu-pod
kubectl taint nodes node-1 gpu-
kubectl taint nodes node-2 gpu-
Experiment 4: Priority and Preemption
#!/bin/bash
# Priority and preemption
echo "=== Priority and preemption ==="
# 1. Create the PriorityClasses
kubectl apply -f - <<EOF
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 1000000
globalDefault: false
description: "High priority class"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: low-priority
value: 100
globalDefault: false
description: "Low priority class"
EOF
# 2. Create a low-priority Pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: low-priority-pod
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
priorityClassName: low-priority
EOF
# 3. Wait for the low-priority Pod to become ready
kubectl wait --for=condition=Ready pod low-priority-pod --timeout=60s
# 4. Create a high-priority Pod
#    (preemption only kicks in when the cluster is short on resources; on an idle
#    cluster both Pods simply run side by side)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: high-priority-pod
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
priorityClassName: high-priority
EOF
# 5. Wait for the high-priority Pod to become ready
kubectl wait --for=condition=Ready pod high-priority-pod --timeout=60s
# 6. Show the Pod status
echo "Pod status:"
kubectl get pods -o wide
# 7. Check the resolved priority values
echo "Priorities:"
kubectl get pod low-priority-pod -o jsonpath='{.spec.priority}{"\n"}'
kubectl get pod high-priority-pod -o jsonpath='{.spec.priority}{"\n"}'
# 8. Clean up
kubectl delete pod low-priority-pod high-priority-pod
kubectl delete priorityclass high-priority low-priority
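When a high-priority Pod cannot be scheduled, the scheduler looks for nodes where evicting lower-priority Pods would free enough room. A toy sketch of the victim-selection idea (greatly simplified: the real logic also honors PodDisruptionBudgets and affinity, and picks the node with least disruption):

```python
def pick_victims(pending_priority, pending_cpu, node_free_cpu, running_pods):
    """Evict lowest-priority Pods first until the pending Pod fits, or give up."""
    victims, free = [], node_free_cpu
    # Only Pods with strictly lower priority are candidates; try lowest first.
    for pod in sorted((p for p in running_pods if p["priority"] < pending_priority),
                      key=lambda p: p["priority"]):
        if free >= pending_cpu:
            break
        victims.append(pod["name"])
        free += pod["cpu"]
    return victims if free >= pending_cpu else None

running = [
    {"name": "low-priority-pod", "priority": 100, "cpu": 500},
    {"name": "mid-priority-pod", "priority": 5000, "cpu": 500},
]
# A priority-1000000 Pod needing 600m CPU on a nearly full node (200m free):
print(pick_victims(1000000, 600, 200, running))  # ['low-priority-pod']
```

This is why the experiment's two PriorityClasses matter: with 100 vs. 1000000, the low-priority Pod is the first (and here only) eviction candidate when resources run out.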
Setting Up the Lab Environment
Quick Setup Script
#!/bin/bash
# Kubernetes scheduler lab environment setup script
set -e
echo "Setting up the Kubernetes scheduler lab environment..."
# 1. Create the test cluster (requires kind)
kind create cluster --name scheduler-test --config - <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
networking:
podSubnet: "10.244.0.0/16"
serviceSubnet: "10.96.0.0/12"
EOF
# 2. Wait for the cluster to become ready
kubectl wait --for=condition=Ready node --all --timeout=60s
# 3. Deploy test tooling (the nixery.dev image provides stress-ng)
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: scheduler-tools
spec:
replicas: 1
selector:
matchLabels:
app: scheduler-tools
template:
metadata:
labels:
app: scheduler-tools
spec:
containers:
- name: tools
image: nixery.dev/shell/stress-ng
command: ["sleep", "3600"]
EOF
# 4. Create test workloads
kubectl apply -f - <<EOF
# Test workload 1: CPU-intensive
apiVersion: v1
kind: Pod
metadata:
name: cpu-intensive
spec:
containers:
- name: test
image: nixery.dev/shell/stress-ng
command: ["stress-ng", "--cpu", "2", "--timeout", "60s"]
resources:
requests:
cpu: 1
memory: 128Mi
limits:
cpu: 2
memory: 256Mi
---
# Test workload 2: memory-intensive
apiVersion: v1
kind: Pod
metadata:
name: memory-intensive
spec:
containers:
- name: test
image: nixery.dev/shell/stress-ng
command: ["stress-ng", "--vm", "1", "--vm-bytes", "256M", "--timeout", "60s"]
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 200m
memory: 512Mi
EOF
echo "Kubernetes scheduler lab environment is ready!"
echo "Run 'kubectl get nodes' to check cluster status"
echo "Run 'kubectl get pods' to check Pod status"
Test Case Collection
# Create the test cases
kubectl apply -f - <<EOF
# Test case 1: basic scheduling
apiVersion: v1
kind: Pod
metadata:
name: basic-scheduling
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 100m
memory: 128Mi
---
# Test case 2: resource limits
apiVersion: v1
kind: Pod
metadata:
name: resource-limited
spec:
containers:
- name: test
image: nginx:alpine
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: 400m
memory: 512Mi
---
# Test case 3: node selection
apiVersion: v1
kind: Pod
metadata:
name: node-selected
spec:
containers:
- name: test
image: nginx:alpine
nodeSelector:
kubernetes.io/hostname: node-1
---
# Test case 4: affinity scheduling
apiVersion: v1
kind: Pod
metadata:
name: affinity-scheduled
spec:
containers:
- name: test
image: nginx:alpine
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- node-1
EOF
Command Cheat Sheet
Scheduling Management Commands
| Command | Purpose | Example |
|---|---|---|
| `kubectl get pods -o wide` | Show Pod scheduling results | `kubectl get pods -o wide` |
| `kubectl describe pod <pod>` | Show scheduling details for a Pod | `kubectl describe pod my-pod` |
| `kubectl get nodes -o wide` | Show node status | `kubectl get nodes -o wide` |
| `kubectl describe node <node>` | Show detailed node information | `kubectl describe node node-1` |
| `kubectl top node` | Show node resource usage | `kubectl top node` |
| `kubectl top pod` | Show Pod resource usage | `kubectl top pod` |
Scheduling Policy Commands
| Command | Purpose | Example |
|---|---|---|
| `kubectl label node <node> <key>=<value>` | Label a node | `kubectl label node node-1 disktype=ssd` |
| `kubectl taint node <node> <key>=<value>:<effect>` | Taint a node | `kubectl taint node node-1 gpu=true:NoSchedule` |
| `kubectl get priorityclass` | List priority classes | `kubectl get priorityclass` |
| `kubectl get events --sort-by=.lastTimestamp` | Show scheduling events | `kubectl get events --sort-by=.lastTimestamp` |
Resource Management Commands
| Command | Purpose | Example |
|---|---|---|
| `kubectl get resourcequota` | List resource quotas | `kubectl get resourcequota` |
| `kubectl get limitrange` | List limit ranges | `kubectl get limitrange` |
| `kubectl describe resourcequota <quota>` | Show quota details | `kubectl describe resourcequota default` |
| `kubectl describe limitrange <limit>` | Show limit-range details | `kubectl describe limitrange default` |
Scheduler Debugging Commands
| Command | Purpose |
|---|---|
| `kubectl logs -n kube-system -l component=kube-scheduler` | Show scheduler logs |
| `kubectl get --raw /metrics \| grep scheduler` | Show scheduler-related API server metrics |
| `kubectl get configmap -n kube-system \| grep scheduler` | Find scheduler ConfigMaps |