13 - AI Platform Integration

Kubernetes GPU Scheduling and MLOps in Depth

Learning Objectives

After working through this module you will be able to:

  • Configure and schedule GPU nodes
  • Deploy the NVIDIA Device Plugin
  • Train AI models (including distributed training)
  • Deploy model inference services
  • Work with the Kubeflow/KServe platform
  • Schedule batch workloads with Volcano
  • Build an end-to-end MLOps workflow

1. AI Platform Architecture Overview

The AI/ML Workflow

graph TD
    A[Data preparation] --> B[Model training]
    B --> C[Model validation]
    C --> D{Model evaluation}
    D -->|Fails| B
    D -->|Passes| E[Model registration]
    E --> F[Model deployment]
    F --> G[Online inference]
    G --> H[Monitoring & feedback]
    H --> A

The Kubernetes AI Ecosystem

┌─────────────────────────────────────────────────────────────┐
│                   Kubernetes AI Platform                    │
├─────────────────────────────────────────────────────────────┤
│  Training layer                                             │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐           │
│  │ Kubeflow    │ │ Volcano     │ │ Ray         │           │
│  │ Training    │ │ Batch       │ │ Distributed │           │
│  └─────────────┘ └─────────────┘ └─────────────┘           │
├─────────────────────────────────────────────────────────────┤
│  Inference layer                                            │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐           │
│  │ KServe      │ │ Triton      │ │ vLLM        │           │
│  │ Serving     │ │ Inference   │ │ LLM Serving │           │
│  └─────────────┘ └─────────────┘ └─────────────┘           │
├─────────────────────────────────────────────────────────────┤
│  Workflow layer                                             │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐           │
│  │ Argo        │ │ Kubeflow    │ │ MLflow      │           │
│  │ Workflows   │ │ Pipelines   │ │ Tracking    │           │
│  └─────────────┘ └─────────────┘ └─────────────┘           │
├─────────────────────────────────────────────────────────────┤
│  Resource layer                                             │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐           │
│  │ GPU         │ │ NPU         │ │ Storage     │           │
│  │ Scheduling  │ │ Scheduling  │ │ (PVC/S3)    │           │
│  └─────────────┘ └─────────────┘ └─────────────┘           │
└─────────────────────────────────────────────────────────────┘

2. GPU Resource Configuration

2.1 Deploying the NVIDIA Device Plugin

Install the NVIDIA Container Toolkit

# Install the NVIDIA Container Toolkit on each GPU node
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart containerd
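
If containerd registers the NVIDIA runtime as a named handler rather than as its default runtime, you can expose it to Pods through a RuntimeClass. A minimal sketch, assuming the handler is registered under the name nvidia (Pods would then set runtimeClassName: nvidia; this is unnecessary when the NVIDIA runtime is containerd's default):

# RuntimeClass pointing at the "nvidia" containerd handler (assumed handler name)
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia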

Deploy the NVIDIA Device Plugin

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: system-node-critical
      containers:
      - name: nvidia-device-plugin
        image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

2.2 NVIDIA GPU Operator (Recommended)

# Install the GPU Operator with Helm
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
---
# Add the Helm repository
# helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
# helm repo update
# helm install gpu-operator nvidia/gpu-operator -n gpu-operator

# GPU Operator configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-operator-config
  namespace: gpu-operator
data:
  config.yaml: |
    operator:
      defaultRuntime: containerd
    driver:
      enabled: true
      version: "535.104.05"
    toolkit:
      enabled: true
    devicePlugin:
      enabled: true
      config:
        name: "time-slicing-config"
        default: "any"
    dcgmExporter:
      enabled: true
    gfd:
      enabled: true
    migManager:
      enabled: false
    nodeStatusExporter:
      enabled: true
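
The devicePlugin.config block above points at a ConfigMap named time-slicing-config with a default key of any, but the ConfigMap itself is not shown. A minimal sketch of what it could look like, following the device plugin's time-slicing configuration format (the replica count is an assumption; it controls how many Pods may share one physical GPU, with no memory isolation between them):

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4   # each physical GPU is advertised as 4 schedulable nvidia.com/gpu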

2.3 GPU Node Labels and Taints

# Label the GPU nodes
kubectl label nodes gpu-node-1 gpu-type=nvidia-a100
kubectl label nodes gpu-node-1 gpu-count=8

# Add a GPU taint
kubectl taint nodes gpu-node-1 nvidia.com/gpu=true:NoSchedule

# Inspect GPU resources
kubectl describe node gpu-node-1 | grep -A 10 "Allocatable"

2.4 Pod Configuration for GPU Workloads

# Request a GPU
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1  # request 1 GPU
---
# Request multiple GPUs
apiVersion: v1
kind: Pod
metadata:
  name: multi-gpu-pod
spec:
  containers:
  - name: training
    image: pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 4  # request 4 GPUs
---
# Selecting a GPU type (with a node selector)
apiVersion: v1
kind: Pod
metadata:
  name: a100-gpu-pod
spec:
  nodeSelector:
    gpu-type: nvidia-a100
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: training
    image: pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 1
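
GPUs are usually the scarcest resource in the cluster, so it is common to cap how many each namespace may request. A minimal sketch of a ResourceQuota for a training namespace (the namespace name and the limit of 8 GPUs are assumptions):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-training
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # total GPUs all Pods in the namespace may request
    limits.nvidia.com/gpu: "8"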

3. Model Training

3.1 Single-Node Training Job

apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-training
  namespace: ml-training
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: pytorch
        image: pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime
        command:
        - python
        - train.py
        - --epochs=100
        - --batch-size=64
        - --model-dir=/output
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: 4
            memory: 16Gi
          requests:
            nvidia.com/gpu: 1
            cpu: 4
            memory: 16Gi
        volumeMounts:
        - name: training-data
          mountPath: /data
        - name: output
          mountPath: /output
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
      volumes:
      - name: training-data
        persistentVolumeClaim:
          claimName: training-data-pvc
      - name: output
        persistentVolumeClaim:
          claimName: model-output-pvc
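
The Job above mounts two PersistentVolumeClaims that must exist before the Job is created. A minimal sketch of those claims (the StorageClass name and sizes are assumptions; an RWX-capable class is useful if several training Pods read the same dataset):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-pvc
  namespace: ml-training
spec:
  accessModes: ["ReadWriteMany"]   # assumption: dataset shared by multiple Pods
  storageClassName: nfs-client     # assumption: an RWX-capable StorageClass
  resources:
    requests:
      storage: 500Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-output-pvc
  namespace: ml-training
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi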

3.2 Distributed Training (PyTorch DDP)

# PyTorchJob CRD
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-distributed-training
  namespace: ml-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime
            command:
            - python
            - -m
            - torch.distributed.run
            - --nproc_per_node=4
            - --nnodes=2
            - --node_rank=$(RANK)
            - --master_addr=$(MASTER_ADDR)
            - --master_port=$(MASTER_PORT)
            - train_distributed.py
            resources:
              limits:
                nvidia.com/gpu: 4
                cpu: 16
                memory: 64Gi
            volumeMounts:
            - name: training-data
              mountPath: /data
            - name: output
              mountPath: /output
          volumes:
          - name: training-data
            persistentVolumeClaim:
              claimName: training-data-pvc
          - name: output
            persistentVolumeClaim:
              claimName: model-output-pvc
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime
            command:
            - python
            - -m
            - torch.distributed.run
            - --nproc_per_node=4
            - --nnodes=2
            - --node_rank=$(RANK)
            - --master_addr=$(MASTER_ADDR)
            - --master_port=$(MASTER_PORT)
            - train_distributed.py
            resources:
              limits:
                nvidia.com/gpu: 4
                cpu: 16
                memory: 64Gi
            volumeMounts:
            - name: training-data
              mountPath: /data
            - name: output
              mountPath: /output
          volumes:
          - name: training-data
            persistentVolumeClaim:
              claimName: training-data-pvc
          - name: output
            persistentVolumeClaim:
              claimName: model-output-pvc

3.3 Batch Scheduling with Volcano

# Install Volcano
# kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml

# Volcano Job configuration
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ml-training-job
  namespace: ml-training
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    env: []
    svc: []
  policies:
  - event: PodEvicted
    action: RestartJob
  tasks:
  - name: master
    replicas: 1
    template:
      spec:
        containers:
        - name: training
          image: pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime
          command: ["python", "train.py"]
          resources:
            limits:
              nvidia.com/gpu: 1
              cpu: 4
              memory: 16Gi
        restartPolicy: OnFailure
  - name: worker
    replicas: 2
    template:
      spec:
        containers:
        - name: training
          image: pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime
          command: ["python", "train.py", "--worker"]
          resources:
            limits:
              nvidia.com/gpu: 2
              cpu: 8
              memory: 32Gi
        restartPolicy: OnFailure
---
# Volcano Queue configuration
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-queue
spec:
  weight: 1
  capability:
    cpu: "64"
    memory: "256Gi"
    nvidia.com/gpu: "16"
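
The Job earlier in this section never references this queue; a Volcano Job is bound to a queue through spec.queue (jobs without the field fall into the default queue). A minimal fragment showing only the relevant fields of the Job above:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ml-training-job
  namespace: ml-training
spec:
  schedulerName: volcano
  queue: ml-queue      # bind this job to the ml-queue defined above
  minAvailable: 3
  # ...tasks unchanged from the full example above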

4. Model Inference Services

4.1 Triton Inference Server

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference-server
  namespace: ml-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:23.10-py3
        args:
        - tritonserver
        - --model-repository=/models
        - --strict-model-config=false
        - --log-verbose=1
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 8001
          name: grpc
        - containerPort: 8002
          name: metrics
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: 4
            memory: 16Gi
          requests:
            nvidia.com/gpu: 1
            cpu: 4
            memory: 16Gi
        volumeMounts:
        - name: model-repository
          mountPath: /models
        livenessProbe:
          httpGet:
            path: /v2/health/live
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /v2/health/ready
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
      volumes:
      - name: model-repository
        persistentVolumeClaim:
          claimName: model-repository-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: triton-inference-service
  namespace: ml-inference
spec:
  selector:
    app: triton-server
  ports:
  - name: http
    port: 8000
    targetPort: 8000
  - name: grpc
    port: 8001
    targetPort: 8001
  - name: metrics
    port: 8002
    targetPort: 8002
  type: LoadBalancer
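
Triton exposes Prometheus metrics on the port named metrics (8002) above. If the Prometheus Operator is running in the cluster, a PodMonitor can scrape them; a minimal sketch (the scrape interval is an assumption):

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: triton-metrics
  namespace: ml-inference
spec:
  selector:
    matchLabels:
      app: triton-server       # matches the Pod template labels of the Deployment above
  podMetricsEndpoints:
  - port: metrics              # the container port named "metrics" (8002)
    interval: 15s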

4.2 KServe InferenceService

# Install KServe
# kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.11.0/kserve.yaml

# InferenceService configuration
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: bert-model
  namespace: ml-inference
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 10
    scaleTarget: 10
    scaleMetric: concurrency
    containerConcurrency: 10
    pytorch:
      storageUri: s3://models/bert-base-uncased
      resources:
        limits:
          nvidia.com/gpu: 1
          cpu: 4
          memory: 8Gi
        requests:
          nvidia.com/gpu: 1
          cpu: 2
          memory: 4Gi
      env:
      - name: STORAGE_URI
        value: s3://models/bert-base-uncased
---
# Using a custom container
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: custom-model
  namespace: ml-inference
spec:
  predictor:
    containers:
    - name: kserve-container
      image: myregistry.com/custom-predictor:v1.0
      resources:
        limits:
          nvidia.com/gpu: 1
          cpu: 4
          memory: 8Gi
      env:
      - name: MODEL_NAME
        value: custom-model
      - name: MODEL_PATH
        value: /mnt/models
      volumeMounts:
      - name: model-volume
        mountPath: /mnt/models
    volumes:
    - name: model-volume
      persistentVolumeClaim:
        claimName: model-pvc

4.3 LLM Inference with vLLM

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2
  namespace: ml-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama2
  template:
    metadata:
      labels:
        app: vllm-llama2
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model
        - meta-llama/Llama-2-7b-chat-hf
        - --tensor-parallel-size
        - "4"
        - --gpu-memory-utilization
        - "0.9"
        - --max-model-len
        - "4096"
        ports:
        - containerPort: 8000
          name: http
        resources:
          limits:
            nvidia.com/gpu: 4
            cpu: 16
            memory: 64Gi
          requests:
            nvidia.com/gpu: 4
            cpu: 16
            memory: 64Gi
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ml-inference
spec:
  selector:
    app: vllm-llama2
  ports:
  - port: 8000
    targetPort: 8000
  type: LoadBalancer

5. MLOps Workflows

5.1 Argo Workflows

# Install Argo Workflows
# kubectl create ns argo
# kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.5.0/install.yaml

# ML training workflow
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-training-pipeline-
  namespace: ml-workflows
spec:
  entrypoint: ml-pipeline
  volumeClaimTemplates:
  - metadata:
      name: workspace
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi
  
  templates:
  - name: ml-pipeline
    steps:
    - - name: data-preprocessing
        template: preprocess-data
    - - name: model-training
        template: train-model
    - - name: model-evaluation
        template: evaluate-model
    - - name: model-deployment
        template: deploy-model
        when: "{{steps.model-evaluation.outputs.parameters.accuracy}} > 0.9"
  
  - name: preprocess-data
    container:
      image: python:3.10
      command: [python]
      args:
      - -c
      - |
        import pandas as pd
        # data preprocessing logic
        print("Data preprocessing completed")
      volumeMounts:
      - name: workspace
        mountPath: /workspace
  
  - name: train-model
    container:
      image: pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime
      command: [python]
      args: ["/workspace/train.py"]
      resources:
        limits:
          nvidia.com/gpu: 2
          cpu: 8
          memory: 32Gi
      volumeMounts:
      - name: workspace
        mountPath: /workspace
  
  - name: evaluate-model
    container:
      image: pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime
      command: [python]
      args: ["/workspace/evaluate.py"]
      volumeMounts:
      - name: workspace
        mountPath: /workspace
    outputs:
      parameters:
      - name: accuracy
        valueFrom:
          path: /workspace/accuracy.txt
  
  - name: deploy-model
    resource:
      action: apply
      manifest: |
        apiVersion: serving.kserve.io/v1beta1
        kind: InferenceService
        metadata:
          name: trained-model
          namespace: ml-inference
        spec:
          predictor:
            pytorch:
              storageUri: s3://models/trained-model
              resources:
                limits:
                  nvidia.com/gpu: 1

5.2 Kubeflow Pipelines

# Kubeflow Pipelines example (Python SDK)
# pip install kfp

from kfp import dsl
from kfp import compiler

@dsl.component(
    base_image='pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime',
    packages_to_install=['pandas', 'scikit-learn']
)
def preprocess_data(input_path: str, output_path: str) -> str:
    import pandas as pd
    # data preprocessing
    df = pd.read_csv(input_path)
    # processing logic goes here
    df.to_csv(output_path, index=False)
    return output_path

@dsl.component(
    base_image='pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime'
)
def train_model(
    data_path: str,
    model_path: str,
    epochs: int = 100,
    batch_size: int = 64
) -> str:
    import torch
    # training logic goes here
    return model_path

@dsl.component(
    base_image='pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime'
)
def evaluate_model(model_path: str, test_data_path: str) -> float:
    # evaluation logic goes here (hard-coded placeholder value)
    accuracy = 0.95
    return accuracy

@dsl.component(base_image='python:3.10')
def deploy_model(model_path: str):
    # placeholder: in practice this would create or patch a KServe InferenceService
    print(f'Deploying model from {model_path}')

@dsl.pipeline(
    name='ML Training Pipeline',
    description='Complete ML training pipeline'
)
def ml_training_pipeline(
    input_data_path: str,
    model_output_path: str
):
    preprocess_task = preprocess_data(
        input_path=input_data_path,
        output_path='/tmp/processed_data.csv'
    )

    train_task = train_model(
        data_path=preprocess_task.output,
        model_path=model_output_path,
        epochs=100,
        batch_size=64
    ).set_gpu_limit(2).set_cpu_limit('8').set_memory_limit('32Gi')

    eval_task = evaluate_model(
        model_path=train_task.output,
        test_data_path='/data/test.csv'
    )

    # conditional deployment: only deploy if accuracy exceeds the threshold
    with dsl.Condition(eval_task.output > 0.9):
        deploy_task = deploy_model(model_path=train_task.output)

# Compile the pipeline
compiler.Compiler().compile(
    pipeline_func=ml_training_pipeline,
    package_path='ml_pipeline.yaml'
)

5.3 Model Tracking with MLflow

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-server
  namespace: ml-platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
      - name: mlflow
        image: ghcr.io/mlflow/mlflow:v2.8.0
        command:
        - mlflow
        - server
        - --host=0.0.0.0
        - --port=5000
        - --backend-store-uri=postgresql://mlflow:password@postgres:5432/mlflow
        - --default-artifact-root=s3://mlflow-artifacts
        ports:
        - containerPort: 5000
        env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-credentials
              key: access-key-id
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws-credentials
              key: secret-access-key
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 1000m
            memory: 2Gi
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-service
  namespace: ml-platform
spec:
  selector:
    app: mlflow
  ports:
  - port: 5000
    targetPort: 5000
  type: LoadBalancer
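
The MLflow Deployment above reads S3 credentials from a Secret named aws-credentials. A minimal sketch of that Secret (the values are placeholders to be replaced with real credentials, or with workload identity where available):

apiVersion: v1
kind: Secret
metadata:
  name: aws-credentials
  namespace: ml-platform
type: Opaque
stringData:
  access-key-id: "<YOUR_ACCESS_KEY_ID>"           # placeholder
  secret-access-key: "<YOUR_SECRET_ACCESS_KEY>"   # placeholder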

6. Vector Database Integration

6.1 Deploying Milvus

# Install Milvus with Helm
# helm repo add milvus https://milvus-io.github.io/milvus-helm/
# helm install milvus milvus/milvus -n milvus --create-namespace

apiVersion: v1
kind: Service
metadata:
  name: milvus
  namespace: ml-platform
spec:
  selector:
    app: milvus
  ports:
  - name: grpc
    port: 19530
    targetPort: 19530
  - name: metrics
    port: 9091
    targetPort: 9091
  type: LoadBalancer
---
# Milvus standalone configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: milvus-standalone
  namespace: ml-platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: milvus
  template:
    metadata:
      labels:
        app: milvus
    spec:
      containers:
      - name: milvus
        image: milvusdb/milvus:v2.3.0
        command: ["milvus", "run", "standalone"]
        ports:
        - containerPort: 19530
        - containerPort: 9091
        env:
        - name: ETCD_ENDPOINTS
          value: "etcd:2379"
        - name: MINIO_ADDRESS
          value: "minio:9000"
        resources:
          limits:
            cpu: 4
            memory: 8Gi
          requests:
            cpu: 2
            memory: 4Gi
        volumeMounts:
        - name: milvus-data
          mountPath: /var/lib/milvus
      volumes:
      - name: milvus-data
        persistentVolumeClaim:
          claimName: milvus-pvc

7. Command Cheat Sheet

GPU management commands

# Inspect GPU resources
kubectl describe nodes | grep -A 10 "Allocatable"
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'

# List Pods that request GPUs
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.limits.nvidia\.com/gpu}{"\n"}{end}'

# Check GPUs from inside a Pod
kubectl exec -it <gpu-pod> -- nvidia-smi

Training job management

# PyTorchJob
kubectl get pytorchjobs -A
kubectl describe pytorchjob <job-name>

# Volcano Job
kubectl get vcjobs -A
kubectl describe vcjob <job-name>

# Tail training logs
kubectl logs -f <training-pod>

Inference service management

# KServe InferenceService
kubectl get inferenceservices -A
kubectl describe inferenceservice <service-name>

# Test an inference service
curl -X POST http://<inference-service>/v1/models/<model-name>:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[1.0, 2.0, 3.0]]}'

8. Core Interview Q&A

Q1: How do you schedule GPU resources in Kubernetes?

Key points:

  • Install the NVIDIA Device Plugin or the GPU Operator
  • Nodes must have the NVIDIA driver and Container Toolkit installed
  • Pods request GPUs through nvidia.com/gpu
  • Use node selectors to target a specific GPU type
  • Use taints and tolerations to isolate GPU nodes

Q2: How does distributed training differ from single-node training?

Key points:

  • Distributed training spans multiple nodes/GPUs
  • It needs a distributed communication backend (NCCL/Gloo)
  • It is typically run through Operators such as PyTorchJob/TFJob
  • Master/Worker roles must be configured
  • It uses data-parallel or model-parallel strategies

Q3: How do KServe and Triton differ?

Key points:

  • KServe: a high-level abstraction that supports multiple frameworks
  • Triton: NVIDIA's inference server, tuned for performance
  • KServe provides autoscaling
  • Triton supports model ensembles and dynamic batching
  • The two can be combined

Q4: How do you optimize AI inference performance?

Key points:

  • Model quantization (INT8/FP16)
  • Request batching
  • Model caching
  • GPU sharing or MIG
  • Dedicated inference engines (e.g. TensorRT)

Q5: What are the core components of MLOps?

Key points:

  • Data management and version control
  • Model training and experiment tracking
  • Model registry and version management
  • Model deployment and serving
  • Monitoring and feedback loops
  • CI/CD integration

9. Best Practices

GPU usage recommendations

  1. Resource isolation

    • Separate training and inference with namespaces
    • Configure resource quotas
    • Dedicate GPU nodes with taints
    • Allocate GPUs per team/project
  2. Cost optimization

    • Spot instances for training
    • On-demand instances for inference
    • GPU sharing and MIG
    • Autoscaling
  3. Performance optimization

    • Data preloading and caching
    • Mixed-precision training
    • Gradient accumulation
    • Model parallelism

MLOps adoption recommendations

  1. Version control

    • Data versioning (DVC)
    • Model versioning (MLflow)
    • Code versioning (Git)
    • Configuration versioning
  2. Automation

    • CI/CD pipelines
    • Automated testing
    • Automated deployment
    • Automated monitoring
  3. Monitoring and governance

    • Model performance monitoring
    • Data drift detection
    • A/B testing
    • Model explainability

10. Summary

By completing this module, you have learned:

  • GPU node configuration and scheduling
  • The NVIDIA Device Plugin and GPU Operator
  • Single-node and distributed training
  • Model inference serving (Triton/KServe/vLLM)
  • Kubeflow and Argo Workflows
  • Batch scheduling with Volcano
  • End-to-end MLOps workflows
  • Vector database integration

Congratulations on finishing the entire Kubernetes handbook!

You now have a complete body of Kubernetes knowledge, from the basics to advanced topics, and you are ready to:

  • Build and operate production-grade Kubernetes clusters
  • Design highly available, secure, and scalable architectures
  • Apply DevOps and MLOps best practices
  • Troubleshoot and resolve complex problems

Suggestions for continued learning:

  • Follow Kubernetes community developments
  • Contribute to open-source projects
  • Practice more real-world scenarios
  • Share your experience and knowledge

Keep it up!
