How K8s Manages GPUs: How Device Plugins Work

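When you run AI workloads on K8s, the first question is: how does K8s know that a machine has GPUs?

The answer is the Device Plugin.

Background

Out of the box, K8s schedules on built-in resources such as CPU and memory: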

resources:
  requests:
    cpu: "1"
    memory: "1Gi"

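A GPU is an "extended resource" that K8s does not know about by default. Until something advertises nvidia.com/gpu on a node, a Pod that asks for nvidia.com/gpu: 1 cannot be scheduled and stays Pending.

K8s needs a mechanism that tells it: this machine has GPUs, this is how many, and this is how to hand them out.

That is the job of the Device Plugin.

What Is a Device Plugin

Device Plugin is the K8s plugin mechanism that lets third-party devices (GPUs, FPGAs, NICs, and so on) plug into K8s.

How it works: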

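Device Plugin (e.g. NVIDIA)
       ↓
   Registers with the kubelet
       ↓
   Reports its devices (how many GPUs)
       ↓
   Pod requests a GPU → kubelet calls the Device Plugin to allocate
       ↓
   Device Plugin returns the allocation (device paths, environment variables)
       ↓
   Pod starts and can use the GPU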
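
The flow above can be sketched as a toy model in plain Go. This is not the kubelet's real code: the Kubelet type, its methods, and the device IDs are hypothetical names chosen for illustration, and the real exchange happens over gRPC rather than direct method calls.

```go
package main

import "fmt"

// Device is a simplified stand-in for one entry in the plugin's device list.
type Device struct {
	ID      string
	Healthy bool
}

// Kubelet is a toy model of the kubelet's device bookkeeping (hypothetical).
type Kubelet struct {
	devices   map[string][]Device        // resource name -> devices reported by the plugin
	allocated map[string]map[string]bool // resource name -> device IDs already handed out
}

func NewKubelet() *Kubelet {
	return &Kubelet{
		devices:   map[string][]Device{},
		allocated: map[string]map[string]bool{},
	}
}

// Register records that a plugin serves the given resource name.
func (k *Kubelet) Register(resource string) {
	k.allocated[resource] = map[string]bool{}
}

// Report replaces the device list for a resource (what ListAndWatch streams).
func (k *Kubelet) Report(resource string, devs []Device) {
	k.devices[resource] = devs
}

// Allocate picks n healthy, unallocated devices and marks them as in use.
func (k *Kubelet) Allocate(resource string, n int) ([]string, error) {
	used := k.allocated[resource]
	if used == nil {
		used = map[string]bool{}
		k.allocated[resource] = used
	}
	var picked []string
	for _, d := range k.devices[resource] {
		if len(picked) == n {
			break
		}
		if d.Healthy && !used[d.ID] {
			picked = append(picked, d.ID)
		}
	}
	if len(picked) < n {
		return nil, fmt.Errorf("insufficient %s: want %d, have %d", resource, n, len(picked))
	}
	for _, id := range picked {
		used[id] = true
	}
	return picked, nil
}

func main() {
	k := NewKubelet()
	k.Register("nvidia.com/gpu")
	k.Report("nvidia.com/gpu", []Device{
		{ID: "GPU-0", Healthy: true},
		{ID: "GPU-1", Healthy: true},
	})
	ids, err := k.Allocate("nvidia.com/gpu", 2)
	fmt.Println(ids, err)
	_, err = k.Allocate("nvidia.com/gpu", 1) // nothing left to hand out
	fmt.Println(err)
}
```

The second Allocate call fails because both devices are already assigned, which mirrors why a Pod stays Pending once a node's GPUs are used up.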

NVIDIA Device Plugin

The official GPU Device Plugin from NVIDIA.

Installation

Option 1: DaemonSet

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

Option 2: Helm

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm install nvidia-device-plugin nvdp/nvidia-device-plugin

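Prerequisites

NVIDIA driver: installed on every GPU node
NVIDIA Container Toolkit: lets the container runtime expose GPUs to containers
Container runtime configuration: containerd/Docker configured with the nvidia runtime

Verification

# Check the node's GPU resources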
kubectl describe node gpu-node-1 | grep nvidia

# Output (first line is Capacity, second is Allocatable):
#   nvidia.com/gpu: 8
#   nvidia.com/gpu: 8

If the GPU count shows up, the Device Plugin has registered and is reporting devices correctly.

Using GPUs

Requesting GPUs

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.0.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 2  # request 2 GPUs

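Caveats

GPUs must be specified in limits

# Correct (requests defaults to limits)
limits:
  nvidia.com/gpu: 1

# Wrong: requests without limits is rejected
# (if you set both, requests must equal limits)
requests:
  nvidia.com/gpu: 1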

GPUs cannot be overcommitted
- CPU allows requests < limits
- GPU requires requests = limits
- By default, one GPU goes to exactly one container

Fractional values are not supported
```
# Wrong
nvidia.com/gpu: 0.5
```
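
The three rules above can be captured in a small checker. This is a simplified model, not the API server's actual validation: validateGPU is a hypothetical helper, and it treats quantities as plain integers rather than full K8s quantity strings.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// validateGPU checks a container's GPU request and limit strings against the
// extended-resource rules. An empty string means the field is not set.
func validateGPU(request, limit string) error {
	if request == "" && limit == "" {
		return nil // the container does not ask for a GPU
	}
	if limit == "" {
		return errors.New("limit must be set when request is set")
	}
	lim, err := strconv.Atoi(limit) // rejects fractional values such as "0.5"
	if err != nil || lim < 0 {
		return fmt.Errorf("limit %q must be a non-negative integer", limit)
	}
	if request == "" {
		return nil // request defaults to the limit
	}
	req, err := strconv.Atoi(request)
	if err != nil || req != lim {
		return fmt.Errorf("request %q must equal limit %q", request, limit)
	}
	return nil
}

func main() {
	fmt.Println(validateGPU("", "1"))   // limits only
	fmt.Println(validateGPU("1", "1"))  // requests equal to limits
	fmt.Println(validateGPU("1", ""))   // requests only
	fmt.Println(validateGPU("", "0.5")) // fractional GPU
	fmt.Println(validateGPU("1", "2"))  // overcommit attempt
}
```

The first two calls pass; the last three return errors, one for each rule.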

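Device Plugin Architecture

Core Interfaces

The gRPC calls involved. Register is served by the kubelet and called by the plugin; ListAndWatch and Allocate are implemented by the plugin:

// Registration (the plugin calls this on the kubelet's Registration service)
Register(ctx, *RegisterRequest) (*Empty, error)

// Report the device list
ListAndWatch(*Empty, DevicePlugin_ListAndWatchServer) error

// Allocate devices
Allocate(ctx, *AllocateRequest) (*AllocateResponse, error)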

ListAndWatch

Streams device state to the kubelet for as long as the plugin runs:

func (p *NvidiaDevicePlugin) ListAndWatch(e *pluginapi.Empty,
    s pluginapi.DevicePlugin_ListAndWatchServer) error {

    // Send the initial GPU list to the kubelet
    if err := s.Send(&pluginapi.ListAndWatchResponse{Devices: p.getDevices()}); err != nil {
        return err
    }

    // Re-send the full list whenever something changes (GPU hot-plug, faults)
    for range p.eventChan {
        if err := s.Send(&pluginapi.ListAndWatchResponse{Devices: p.getDevices()}); err != nil {
            return err
        }
    }
    return nil
}

Allocate

Assigns devices to a container:

func (p *NvidiaDevicePlugin) Allocate(ctx context.Context,
    req *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {

    responses := &pluginapi.AllocateResponse{}

    // One response per container in the request. The values below are
    // hard-coded for illustration; a real plugin derives them from the
    // device IDs in each container request.
    for range req.ContainerRequests {
        response := &pluginapi.ContainerAllocateResponse{
            // Device paths
            Devices: []*pluginapi.DeviceSpec{
                {ContainerPath: "/dev/nvidia0", HostPath: "/dev/nvidia0"},
            },
            // Environment variables
            Envs: map[string]string{
                "NVIDIA_VISIBLE_DEVICES": "0,1",
            },
        }
        responses.ContainerResponses = append(responses.ContainerResponses, response)
    }

    return responses, nil
}

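GPU Health Checks

The Device Plugin monitors GPU health. A simplified sketch:

func (p *NvidiaDevicePlugin) healthCheck() {
    for {
        for _, device := range p.devices {
            // isHealthy is a hypothetical helper standing in for an NVML query;
            // the real plugin watches NVML events such as critical Xid errors
            if !p.isHealthy(device) {
                // Mark the device unhealthy so ListAndWatch reports it
                p.markUnhealthy(device)
            }
        }
        time.Sleep(time.Minute)
    }
}

An unhealthy GPU is dropped from the node's allocatable count, so no new Pod is assigned to it.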
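
To make the effect on scheduling concrete, here is a minimal sketch of the bookkeeping, assuming the kubelet counts every reported device toward capacity but only healthy ones toward allocatable. The Device type and both helper functions are hypothetical names for illustration.

```go
package main

import "fmt"

// Device is a simplified stand-in for one entry in the plugin's device list.
type Device struct {
	ID      string
	Healthy bool
}

// markUnhealthy returns a copy of devs with the given device flagged unhealthy.
func markUnhealthy(devs []Device, id string) []Device {
	out := make([]Device, len(devs))
	copy(out, devs)
	for i := range out {
		if out[i].ID == id {
			out[i].Healthy = false
		}
	}
	return out
}

// capacityAndAllocatable counts all devices and the healthy subset.
func capacityAndAllocatable(devs []Device) (capacity, allocatable int) {
	for _, d := range devs {
		capacity++
		if d.Healthy {
			allocatable++
		}
	}
	return capacity, allocatable
}

func main() {
	devs := []Device{
		{ID: "GPU-0", Healthy: true},
		{ID: "GPU-1", Healthy: true},
	}
	fmt.Println(capacityAndAllocatable(devs))

	// A fault on GPU-1 lowers allocatable but leaves capacity unchanged.
	devs = markUnhealthy(devs, "GPU-1")
	fmt.Println(capacityAndAllocatable(devs))
}
```

This is why a node can show a Capacity that is higher than its Allocatable for nvidia.com/gpu after a GPU fault.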

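Common Problems

1. GPU resources do not show up

kubectl describe node | grep nvidia
# No output

Troubleshooting:

# Check that the Device Plugin is running
kubectl get pods -n kube-system | grep nvidia

# Check its logs
kubectl logs -n kube-system nvidia-device-plugin-xxx

# Check the driver (run on the node itself)
nvidia-smi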

2. Pod cannot be scheduled

0/3 nodes are available: 3 Insufficient nvidia.com/gpu

Causes:

All GPUs are already in use
The request exceeds the number of GPUs on any single node
The nodes have no GPUs
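
The second cause is easy to miss: a Pod's GPUs must all come from one node, so free GPUs scattered across the cluster do not help. A minimal sketch of that per-node check follows; the Node type and feasible function are hypothetical and ignore every scheduling constraint other than GPU count.

```go
package main

import "fmt"

// Node holds the GPU counts relevant to the fit check.
type Node struct {
	Name        string
	Allocatable int // healthy GPUs advertised by the node
	InUse       int // GPUs already assigned to running Pods
}

// feasible returns the nodes that can satisfy a request for `want` GPUs.
func feasible(nodes []Node, want int) []string {
	var names []string
	for _, n := range nodes {
		if n.Allocatable-n.InUse >= want {
			names = append(names, n.Name)
		}
	}
	return names
}

func main() {
	nodes := []Node{
		{Name: "node-a", Allocatable: 8, InUse: 6},
		{Name: "node-b", Allocatable: 8, InUse: 6},
		{Name: "node-c", Allocatable: 0, InUse: 0}, // no GPUs at all
	}
	fmt.Println(feasible(nodes, 2)) // fits on node-a and node-b
	fmt.Println(feasible(nodes, 4)) // 4 GPUs are free in total, but no single node has 4
}
```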

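3. GPU not visible inside the container

# Run inside the container
nvidia-smi
# "command not found" or "no devices found"

Troubleshooting:

# Check the environment variable
echo $NVIDIA_VISIBLE_DEVICES

# Check the device mounts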
ls -la /dev/nvidia*
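
The two checks can be combined into a small diagnostic run inside the container. This is a sketch under assumptions: diagnose is a hypothetical helper, the messages are my own wording, and the treatment of the empty, void, and none values follows my understanding of the NVIDIA Container Toolkit conventions.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// diagnose interprets NVIDIA_VISIBLE_DEVICES together with the device nodes
// found under /dev and returns a short verdict.
func diagnose(visible string, devNodes []string) string {
	switch {
	case visible == "" || visible == "void":
		return "NVIDIA_VISIBLE_DEVICES is unset: the runtime was not asked to expose a GPU"
	case visible == "none":
		return "NVIDIA_VISIBLE_DEVICES=none: no GPU is exposed to this container"
	case len(devNodes) == 0:
		return "variable is set but /dev/nvidia* is missing: check the nvidia runtime configuration"
	default:
		return "ok"
	}
}

func main() {
	// Glob only fails on a malformed pattern, so the error is safe to ignore here.
	nodes, _ := filepath.Glob("/dev/nvidia*")
	fmt.Println(diagnose(os.Getenv("NVIDIA_VISIBLE_DEVICES"), nodes))
}
```

An "ok" verdict with nvidia-smi still missing usually points at the image rather than the plugin, since the binary has to be injected by the runtime or shipped in the image.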

Advanced Configuration

Time Slicing

Lets multiple Pods share one GPU:

# ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # advertise each GPU as 4 replicas

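Note: time slicing provides no real isolation. Pods sharing a GPU have no memory or fault isolation from each other; the setting only lets the scheduler oversubscribe the device.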
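
The effect of replicas can be sketched as a simple expansion of the device list. The advertise and physical functions and the "::" slot naming are assumptions made for this illustration, not the plugin's confirmed internals.

```go
package main

import (
	"fmt"
	"strings"
)

// advertise expands each physical GPU into `replicas` schedulable slots.
func advertise(gpus []string, replicas int) []string {
	var slots []string
	for _, g := range gpus {
		for i := 0; i < replicas; i++ {
			slots = append(slots, fmt.Sprintf("%s::%d", g, i))
		}
	}
	return slots
}

// physical maps a slot back to the GPU it shares with its sibling slots.
func physical(slot string) string {
	if i := strings.LastIndex(slot, "::"); i >= 0 {
		return slot[:i]
	}
	return slot
}

func main() {
	slots := advertise([]string{"GPU-0", "GPU-1"}, 4)
	fmt.Println(len(slots)) // 2 GPUs x 4 replicas

	// Two different slots can land on the same physical GPU, which is why
	// Pods holding them compete for the same memory and compute.
	fmt.Println(physical(slots[0]) == physical(slots[3]))
}
```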

MIG (Multi-Instance GPU)

A100 and H100 support hardware-level partitioning:

# Request a MIG device
resources:
  limits:
    nvidia.com/mig-3g.20gb: 1
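
The resource name encodes the slice size: 3g is the number of compute slices and 20gb is the memory. A minimal parser makes that explicit. It is a sketch that assumes the plain "<N>g.<M>gb" form shown above; parseMIGResource and MIGProfile are hypothetical names, and profile names with extra suffixes are not handled.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// MIGProfile describes the size of one MIG slice.
type MIGProfile struct {
	ComputeSlices int // the "3g" part of the name
	MemoryGB      int // the "20gb" part of the name
}

var migResource = regexp.MustCompile(`^nvidia\.com/mig-(\d+)g\.(\d+)gb$`)

// parseMIGResource extracts the slice size from a MIG resource name.
func parseMIGResource(name string) (MIGProfile, error) {
	m := migResource.FindStringSubmatch(name)
	if m == nil {
		return MIGProfile{}, fmt.Errorf("%q is not a MIG resource name", name)
	}
	slices, err := strconv.Atoi(m[1])
	if err != nil {
		return MIGProfile{}, err
	}
	mem, err := strconv.Atoi(m[2])
	if err != nil {
		return MIGProfile{}, err
	}
	return MIGProfile{ComputeSlices: slices, MemoryGB: mem}, nil
}

func main() {
	p, err := parseMIGResource("nvidia.com/mig-3g.20gb")
	fmt.Println(p.ComputeSlices, p.MemoryGB, err)

	_, err = parseMIGResource("nvidia.com/gpu") // a whole GPU, not a MIG slice
	fmt.Println(err)
}
```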

GPU partitioning gets its own post later in this series.

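Targeting a Specific GPU

By default you do not control which physical GPU a container gets: the kubelet picks from the free devices the plugin reported. To constrain the choice:

# Not recommended, but nodeSelector plus manual labels works
# (this selects nodes by label, not individual GPUs)
nodeSelector:
  gpu-type: A100

A better option is topology-aware allocation or the GPU Operator.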

GPU Operator

The NVIDIA GPU Operator is a more complete solution that automatically manages:

NVIDIA driver
Container Toolkit
Device Plugin
DCGM Exporter (monitoring)
MIG Manager

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator

Pros:

Highly automated
Manages the driver and the runtime together
Supports MIG

Cons:

More complex
Less flexible
Not a fit for every environment

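Summary

Device Plugin essentials:

What it is:

A K8s plugin mechanism
Lets K8s manage extended devices (GPUs)

Workflow:

The Device Plugin registers with the kubelet
It reports the GPU list
A Pod requests a GPU
The kubelet calls the Device Plugin to allocate
The plugin returns device paths and environment variables

Usage: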

resources:
  limits:
    nvidia.com/gpu: 1

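Common problems:

GPUs not showing up: check the Device Plugin and the driver
Scheduling failures: check whether enough GPUs are free
No GPU inside the container: check the environment variables and device mounts

Next up: Volcano, a batch scheduler for AI workloads.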