
Deploying LLM Inference Services on Kubernetes: From Zero to 10 Million Calls a Day

1. Background and Challenges

1.1 Business Scale

Our LLM inference service has grown rapidly since launch:

  • Daily calls: 10M+
  • Peak QPS: 5,000+
  • Model sizes: 7B to 70B parameters
  • GPU cluster: 100+ A100/A800 cards

1.2 Core Challenges

| Area | Problem | Impact |
| --- | --- | --- |
| GPU scheduling | Resource fragmentation, low utilization | 30%+ cost waste |
| Concurrency | Request queuing, high response latency | Poor user experience |
| Cost control | GPU spend is 80%+ of total cost | ROI pressure |
| Stability | Frequent OOM and Pod evictions | Availability < 99% |

2. GPU Scheduling Architecture

2.1 Overall Architecture

2.2 GPU Resource Isolation

Comparison of options

| Approach | GPU utilization | Isolation | Complexity | Best for |
| --- | --- | --- | --- | --- |
| Whole-GPU allocation | 60-70% | Strong | Low | Large models (70B+) |
| MPS sharing | 75-85% | Medium | Medium | Mid-size models (7B-13B) |
| MIG slicing | 80-90% | Strong | High | Mixed workloads |
| Time-slicing | 85-95% | Weak | Medium | Dev/test environments |

We adopted a hybrid of MIG slicing plus whole-GPU allocation:

# GPU Operator config - enable MIG
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-operator-config
  namespace: gpu-operator
data:
  config.yaml: |
    migManager:
      enabled: true
      config:
        name: "default-mig-config"
        default: "all-balanced"  # 均衡切分策略
    devicePlugin:
      config:
        name: "mig-config"
        default: "all-1g.10gb"   # 默认1g.10gb切片

MIG configuration script

#!/bin/bash
# Configure A100 MIG instances

NODE_NAME="gpu-node-1"

# Enable MIG mode
kubectl exec -it nvidia-mig-manager-daemonset-xxxxx -n gpu-operator -- \
  nvidia-smi -i 0 -mig 1

# Create MIG GPU instances - seven 1g.10gb instances
kubectl exec -it nvidia-mig-manager-daemonset-xxxxx -n gpu-operator -- \
  nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb -C

# Create compute instances
for i in {0..6}; do
  kubectl exec -it nvidia-mig-manager-daemonset-xxxxx -n gpu-operator -- \
    nvidia-smi mig -cci -gi $i
done

# Verify the MIG configuration
kubectl exec -it nvidia-mig-manager-daemonset-xxxxx -n gpu-operator -- \
  nvidia-smi -L

2.3 Node Affinity and Taints

# GPU node group labels
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-prod-1
  labels:
    node-role.kubernetes.io/gpu: "true"
    gpu.nvidia.com/class: "A100-SXM4-80GB"
    gpu.nvidia.com/mig.capable: "true"
    workload.type: "llm-inference"  # workload type
    model.size: "70b"                # largest model size supported
---
# Taints to keep non-GPU workloads off these nodes
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-prod-1
spec:
  taints:
  - effect: NoSchedule
    key: nvidia.com/gpu
    value: "true"
  - effect: NoSchedule
    key: workload.type
    value: "llm-inference"

2.4 Scheduling Strategy

We use the Volcano scheduler for gang scheduling and queue management:

# Volcano Queue configuration
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: llm-inference-queue
spec:
  weight: 1
  capability:
    nvidia.com/gpu: 40  # the queue may use at most 40 GPUs
  reclaimable: true
---
# PodGroup - gang scheduling
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: vllm-workers
  namespace: llm-inference
spec:
  minMember: 4              # at least 4 Pods must be schedulable together
  queue: llm-inference-queue
  priorityClassName: high-priority
---
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-70b
  namespace: llm-inference
spec:
  replicas: 8
  template:
    metadata:
      annotations:
        scheduling.volcano.sh/group-name: "vllm-workers"
    spec:
      schedulerName: volcano
      priorityClassName: high-priority
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: model.size
                operator: In
                values: ["70b", "any"]
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["vllm-llama2-70b"]
              topologyKey: kubernetes.io/hostname
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: vllm-server
        image: vllm/vllm-openai:v0.3.0
        command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        args:
        - --model=/models/llama2-70b
        - --tensor-parallel-size=4  # tensor parallelism
        - --max-num-seqs=256        # max concurrent sequences
        - --gpu-memory-utilization=0.95
        resources:
          limits:
            nvidia.com/gpu: 4  # 4-way tensor parallel across 4 GPUs
          requests:
            nvidia.com/gpu: 4
            memory: "200Gi"
        volumeMounts:
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: llm-models-pvc
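
Once the Deployment is up, a quick smoke test of the OpenAI-compatible endpoint helps confirm the rollout before wiring up load balancing. A minimal sketch, assuming the in-cluster Service name vllm-service in the llm-inference namespace (defined in section 6.1) and the model path used above; adjust both to your environment.

# Minimal smoke test for the vLLM OpenAI-compatible API (illustrative; URL and model path are assumptions)
import requests

BASE_URL = "http://vllm-service.llm-inference:8000"  # assumed in-cluster Service address

payload = {
    "model": "/models/llama2-70b",   # must match the --model argument above
    "prompt": "Kubernetes is",
    "max_tokens": 64,
    "temperature": 0.8,
}

resp = requests.post(f"{BASE_URL}/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])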

3. Model Concurrency Optimization

3.1 Choosing an Inference Framework

Results of our own comparison (model: Llama2-13B, 512 input tokens, 256 output tokens); a minimal harness for this kind of measurement is sketched after the selection list:

| Framework | Throughput | P99 latency | GPU utilization | Memory | Notes |
| --- | --- | --- | --- | --- | --- |
| vLLM | 2400 req/s | 180ms | 92% | 18GB | PagedAttention, open source |
| TGI | 1800 req/s | 220ms | 85% | 20GB | Official HuggingFace server |
| Triton | 2100 req/s | 200ms | 88% | 19GB | Multi-framework support |
| TensorRT-LLM | 2600 req/s | 150ms | 95% | 16GB | Best performance, closed source |

Final choice:

  • Primary framework: vLLM (best price/performance)
  • Backup: Triton + TensorRT-LLM (for critical workloads)
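
This is an illustrative load-generation sketch, not our exact benchmark tooling; the endpoint address, model path, prompt construction, and concurrency level are all assumptions to adjust.

# Tiny latency/throughput harness against an OpenAI-compatible endpoint (illustrative only)
import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://vllm-service.llm-inference:8000/v1/completions"   # assumed in-cluster address
PROMPT = "word " * 512                                           # roughly 512 input tokens
N_REQUESTS, CONCURRENCY = 200, 32

def one_request(_):
    t0 = time.perf_counter()
    r = requests.post(URL, json={"model": "/models/llama2-13b",
                                 "prompt": PROMPT, "max_tokens": 256}, timeout=120)
    r.raise_for_status()
    return time.perf_counter() - t0

t_start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(one_request, range(N_REQUESTS)))
elapsed = time.perf_counter() - t_start

p99 = latencies[int(0.99 * (len(latencies) - 1))]
print(f"throughput = {N_REQUESTS / elapsed:.1f} req/s, p99 = {p99 * 1000:.0f} ms")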

3.2 Core vLLM Configuration

# vLLM engine setup
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama2-13b",
    tensor_parallel_size=2,           # tensor parallelism degree
    pipeline_parallel_size=1,         # pipeline parallelism degree
    max_num_seqs=256,                 # max concurrent sequences
    max_num_batched_tokens=16384,     # max tokens per batch
    gpu_memory_utilization=0.95,      # fraction of GPU memory to use
    swap_space=16,                    # CPU-GPU swap space (GB)
    enforce_eager=False,              # keep CUDA graphs enabled
    trust_remote_code=True,
    dtype="float16",
    quantization="awq",               # AWQ quantization
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=512,
    repetition_penalty=1.1,
)

Key vLLM tuning parameters (note: stock vLLM takes most of these as CLI flags such as --max-num-seqs; the VLLM_* environment variables below only take effect if your entrypoint maps them to those flags)

# Production vLLM Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-production
spec:
  template:
    spec:
      containers:
      - name: vllm
        env:
        # Core performance parameters
        - name: VLLM_MAX_NUM_SEQS
          value: "256"  # tune to GPU memory: 256 on A100-80G, 128 on A100-40G
        - name: VLLM_GPU_MEMORY_UTILIZATION
          value: "0.95"  # use 95% of GPU memory
        - name: VLLM_SWAP_SPACE
          value: "16"    # 16GB of CPU swap space

        # CUDA / NCCL settings
        - name: CUDA_VISIBLE_DEVICES
          value: "0,1,2,3"  # pin the process to specific GPUs
        - name: NCCL_DEBUG
          value: "INFO"
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"

        # Metrics / monitoring
        - name: VLLM_ENABLE_METRICS
          value: "true"
        - name: VLLM_METRICS_PORT
          value: "8001"

3.3 Batching and Dynamic Batching

Continuous Batching

vLLM's core advantage is that the batch is adjusted on the fly (a toy simulation follows the diagram):

Static batching:
Batch 1: [Req1(500 tokens), Req2(50 tokens), Req3(200 tokens)]
        |----500----|  shorter requests waste up to 450 tokens of computation

Continuous batching:
Time 0:  [Req1, Req2, Req3] start
Time 1:  [Req1, Req3, Req4] Req2 finishes, Req4 joins
Time 2:  [Req1, Req3, Req4, Req5, Req6] the batch grows dynamically
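
To make the idea concrete, here is a toy simulation of a continuous-batching loop. It is purely illustrative and not vLLM's actual scheduler: each running request produces one token per step, finished requests free their slot immediately, and waiting requests are admitted as soon as a slot opens, up to max_num_seqs.

# Toy continuous-batching simulation (illustrative only, not vLLM internals)
from collections import deque

def continuous_batching(requests, max_num_seqs=4):
    """requests: list of (request_id, tokens_to_generate)."""
    waiting = deque(requests)
    running = {}          # request_id -> tokens still to generate
    step = 0
    while waiting or running:
        # Admit new requests as soon as slots free up (no waiting for the whole batch)
        while waiting and len(running) < max_num_seqs:
            rid, tokens = waiting.popleft()
            running[rid] = tokens
        # One decode step: every running sequence produces one token
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]          # finished requests leave the batch immediately
        print(f"step {step}: running={sorted(running)}")
        step += 1

continuous_batching([("Req1", 5), ("Req2", 1), ("Req3", 2), ("Req4", 3), ("Req5", 2)])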

PagedAttention

# PagedAttention block accounting (simplified)
class PagedAttentionConfig:
    block_size: int = 16          # tokens per KV-cache block
    max_num_blocks_per_seq: int = 2048  # max blocks per sequence
    gpu_memory_utilization: float = 0.9

    def calculate_blocks(self, max_seq_len: int):
        """Number of KV-cache blocks needed for a sequence."""
        return (max_seq_len + self.block_size - 1) // self.block_size

# Example: a 4096-token sequence needs 256 blocks
config = PagedAttentionConfig()
blocks_needed = config.calculate_blocks(4096)  # 256 blocks

3.4 Request Scheduling and Load Balancing

# Envoy as the load balancer
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-config
data:
  envoy.yaml: |
    static_resources:
      clusters:
      - name: vllm_cluster
        type: STRICT_DNS
        lb_policy: LEAST_REQUEST  # least-request load balancing
        health_checks:
        - timeout: 5s
          interval: 10s
          unhealthy_threshold: 2
          healthy_threshold: 1
          http_health_check:
            path: /health
        load_assignment:
          cluster_name: vllm_cluster
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: vllm-service
                    port_value: 8000
        circuit_breakers:
          thresholds:
          - priority: DEFAULT
            max_requests: 10000
            max_pending_requests: 5000
            max_retries: 3

Adaptive rate limiting

# Adaptive rate limiting driven by GPU memory usage
import prometheus_client as prom
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Prometheus metrics
gpu_memory_usage = prom.Gauge('gpu_memory_usage_ratio', 'GPU memory usage ratio')
active_requests = prom.Gauge('active_inference_requests', 'Active inference requests')

MAX_CONCURRENT_REQUESTS = 256

@app.middleware("http")
async def adaptive_rate_limit(request, call_next):
    current_requests = active_requests._value.get()
    gpu_mem = gpu_memory_usage._value.get()

    # Adjust the concurrency cap based on GPU memory pressure
    if gpu_mem > 0.95:
        max_allowed = int(MAX_CONCURRENT_REQUESTS * 0.5)  # cut concurrency by 50%
    elif gpu_mem > 0.90:
        max_allowed = int(MAX_CONCURRENT_REQUESTS * 0.75)
    else:
        max_allowed = MAX_CONCURRENT_REQUESTS

    if current_requests >= max_allowed:
        raise HTTPException(status_code=429, detail="Too many requests")

    active_requests.inc()
    try:
        response = await call_next(request)
        return response
    finally:
        active_requests.dec()

4. Cost Optimization in Practice

4.1 Cost Breakdown

Our monthly cost structure (roughly 100 A100s):

| Cost item | Amount (10k CNY/month) | Share | Optimization headroom |
| --- | --- | --- | --- |
| GPU rental | 800 | 80% | 30% |
| Storage (NFS) | 50 | 5% | 10% |
| Network traffic | 30 | 3% | 5% |
| Ops headcount | 120 | 12% | - |
| Total | 1000 | 100% | 25% |

4.2 Improving GPU Utilization

Mixing Spot and On-Demand instances

# NodePool - Spot instances (schematic; the actual Karpenter NodePool schema under karpenter.sh/v1beta1 differs)
apiVersion: v1
kind: NodePool
metadata:
  name: gpu-spot-pool
spec:
  nodeTemplate:
    instanceType: GPU-A100-80G-SPOT  # Spot instance type
    taints:
    - key: scheduling.karpenter.sh/spot
      value: "true"
      effect: NoSchedule
  limits:
    resources:
      nvidia.com/gpu: 60  # 60% of GPU capacity on Spot
---
# NodePool - On-Demand instances
apiVersion: v1
kind: NodePool
metadata:
  name: gpu-ondemand-pool
spec:
  nodeTemplate:
    instanceType: GPU-A100-80G  # On-Demand instance type
  limits:
    resources:
      nvidia.com/gpu: 40  # 40% of GPU capacity On-Demand

Scheduling the Deployment onto Spot nodes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-spot
spec:
  replicas: 6
  template:
    spec:
      tolerations:
      - key: scheduling.karpenter.sh/spot
        operator: Exists
        effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: karpenter.sh/capacity-type
                operator: In
                values: ["spot"]

Cost savings: Spot instances cost 30-50% of the On-Demand price; with the mixed pool we save roughly 25% overall, as the estimate below sketches.
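
A rough sanity check of that figure, assuming Spot at about half the On-Demand price and a realized Spot share somewhat below the 60% capacity target (interruptions and fallback to On-Demand eat into it):

# Blended GPU cost estimate (illustrative; prices and realized Spot share are assumptions)
ondemand_price = 1.00          # normalized On-Demand price per GPU-hour
spot_price     = 0.50          # assume Spot at ~50% of On-Demand (doc range: 30-50%)
spot_share     = 0.50          # realized share of GPU-hours on Spot (target is 60%)

blended = spot_share * spot_price + (1 - spot_share) * ondemand_price
savings = 1 - blended
print(f"blended cost = {blended:.2f}x On-Demand, savings = {savings:.0%}")  # about 25%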

4.3 Model Quantization

| Method | Precision | Compression | Throughput change | Output quality | Best for |
| --- | --- | --- | --- | --- | --- |
| FP16 | half precision | 2x | baseline | 100% | general use |
| INT8 | 8-bit integer | 4x | +30% | 98-99% | general use |
| INT4 | 4-bit integer | 8x | +50% | 95-97% | non-critical workloads |
| AWQ | 4-bit weights | 7x | +45% | 98% | recommended |
| GPTQ | 4-bit weights | 7x | +40% | 97% | alternative |

AWQ quantization in practice

# AWQ quantization script
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/models/llama2-70b"
quant_path = "/models/llama2-70b-awq"

# Load the model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantization config
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

# Run quantization
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

Measured results (Llama2-70B); a rough arithmetic check of the memory numbers follows the list:

  • GPU memory footprint: 140GB → 40GB (71% reduction)
  • Model replicas per GPU: 1 → 3 (3x density)
  • Inference latency: roughly unchanged
  • Throughput: +45%
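
The memory numbers line up with simple weight-size arithmetic. A rough estimate that ignores KV cache and activations, and assumes roughly 15% overhead for quantization scales, zero-points, and unquantized layers:

# Rough weight-memory estimate for a 70B model (ignores KV cache, activations, runtime overhead)
params = 70e9

fp16_gb = params * 2 / 1e9          # 2 bytes per weight  -> ~140 GB
awq_gb  = params * 0.5 / 1e9        # 4-bit weights       -> ~35 GB
awq_gb += awq_gb * 0.15             # assumed ~15% overhead for scales/zero-points etc.

print(f"FP16 ~= {fp16_gb:.0f} GB, AWQ(4-bit) ~= {awq_gb:.0f} GB")   # ~140 GB vs ~40 GB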

4.4 Elastic Scaling

HPA + KEDA with custom metrics

# Namespace for the KEDA installation
apiVersion: v1
kind: Namespace
metadata:
  name: keda
---
# KEDA ScaledObject - scale on Prometheus metrics
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
  namespace: llm-inference
spec:
  scaleTargetRef:
    name: vllm-llama2-13b
  minReplicaCount: 4
  maxReplicaCount: 20
  pollingInterval: 15
  cooldownPeriod: 300
  triggers:
  # Scale on queue length
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_queue_length
      threshold: '50'
      query: |
        sum(vllm_request_queue_size{job="vllm"})
  # Scale on GPU utilization
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: gpu_utilization
      threshold: '80'
      query: |
        avg(DCGM_FI_DEV_GPU_UTIL{pod=~"vllm.*"})

Scheduled scaling

# CronJob - scale down during off-peak hours
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-night
spec:
  schedule: "0 2 * * *"  # 凌晨2点
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: scaler
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            - |
              kubectl scale deployment vllm-llama2-13b --replicas=2 -n llm-inference
---
# CronJob - scale up for the morning peak
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-morning
spec:
  schedule: "0 8 * * *"  # 早上8点
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: scaler
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            - |
              kubectl scale deployment vllm-llama2-13b --replicas=12 -n llm-inference

4.5 Model Sharing and Caching

Sharing models via a ReadOnlyMany PVC

# StorageClass - NFS backend
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-storage
provisioner: nfs.csi.k8s.io
parameters:
  server: nfs-server.default.svc.cluster.local
  share: /models
volumeBindingMode: Immediate
---
# PVC shared by all Pods
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-models-shared
spec:
  accessModes:
  - ReadOnlyMany  # read-only, mountable by many Pods
  storageClassName: nfs-storage
  resources:
    requests:
      storage: 2Ti

Pre-warming models with an initContainer

spec:
  initContainers:
  - name: model-downloader
    image: amazon/aws-cli
    command:
    - /bin/sh
    - -c
    - |
      # Download the model from S3 to local SSD (first start only)
      if [ ! -f /models/.downloaded ]; then
        aws s3 sync s3://llm-models/llama2-70b /models/llama2-70b
        touch /models/.downloaded
      fi
    volumeMounts:
    - name: local-ssd
      mountPath: /models
  containers:
  - name: vllm
    volumeMounts:
    - name: local-ssd
      mountPath: /models
      readOnly: true
  volumes:
  - name: local-ssd
    hostPath:
      path: /mnt/local-ssd/models  # node-local SSD
      type: DirectoryOrCreate

5. Stability Engineering

5.1 Resource Limits and OOM Protection

apiVersion: v1
kind: Pod
metadata:
  name: vllm-production
spec:
  containers:
  - name: vllm
    resources:
      requests:
        memory: "100Gi"
        nvidia.com/gpu: 2
        cpu: "16"
      limits:
        memory: "120Gi"  # 留20%buffer
        nvidia.com/gpu: 2
        cpu: "32"
    # Lower the likelihood of this container being OOM-killed.
    # Note: an OOM_SCORE_ADJ env var by itself does not change the kernel's oom_score_adj;
    # in practice the reliable lever is the Pod QoS class (set requests == limits for Guaranteed).
    securityContext:
      procMount: Default
    env:
    - name: OOM_SCORE_ADJ
      value: "-1000"

5.2 Health Checks

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: vllm
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 300  # model loading takes a while
      periodSeconds: 30
      timeoutSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 120
      periodSeconds: 10
      timeoutSeconds: 5
      successThreshold: 1
      failureThreshold: 3
    startupProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 60
      periodSeconds: 10
      failureThreshold: 30  # up to 5 more minutes of grace (30 x 10s)

5.3 PodDisruptionBudget

# Keep enough replicas available during voluntary disruptions (e.g. rolling updates)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
spec:
  minAvailable: 6  # keep at least 6 Pods running
  selector:
    matchLabels:
      app: vllm-llama2-70b

5.4 Monitoring and Alerting

# PrometheusRule - alerting rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-alerts
spec:
  groups:
  - name: vllm
    interval: 30s
    rules:
    # GPU utilization too low
    - alert: GPUUtilizationLow
      expr: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"vllm.*"}) < 30
      for: 10m
      annotations:
        summary: "GPU utilization is low ({{ $value }}%)"
        description: "Consider scaling in or adjusting the configuration"

    # GPU memory close to OOM
    - alert: GPUMemoryHigh
      expr: avg(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100) > 95
      for: 5m
      annotations:
        summary: "GPU memory usage is high ({{ $value }}%)"

    # Inference latency too high
    - alert: InferenceLatencyHigh
      expr: histogram_quantile(0.99, rate(vllm_request_duration_seconds_bucket[5m])) > 3
      for: 5m
      annotations:
        summary: "P99 inference latency is high ({{ $value }}s)"

    # Pods restarting too often
    - alert: PodRestartTooOften
      expr: rate(kube_pod_container_status_restarts_total{pod=~"vllm.*"}[1h]) > 0.1
      for: 10m
      annotations:
        summary: "Pods are restarting frequently"

6. Production Deployment Manifests

6.1 Full YAML Example

# Namespace
apiVersion: v1
kind: Namespace
metadata:
  name: llm-inference
---
# ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vllm-sa
  namespace: llm-inference
---
# PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: llm-high-priority
value: 1000000
globalDefault: false
description: "高优先级LLM推理服务"
---
# ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-config
  namespace: llm-inference
data:
  model_config.json: |
    {
      "model": "/models/llama2-70b-awq",
      "tensor_parallel_size": 4,
      "max_num_seqs": 256,
      "gpu_memory_utilization": 0.95
    }
---
# Service
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: llm-inference
spec:
  type: ClusterIP
  selector:
    app: vllm-llama2-70b
  ports:
  - name: http
    port: 8000
    targetPort: 8000
  - name: metrics
    port: 8001
    targetPort: 8001
---
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-70b
  namespace: llm-inference
spec:
  replicas: 8
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 1
  selector:
    matchLabels:
      app: vllm-llama2-70b
  template:
    metadata:
      labels:
        app: vllm-llama2-70b
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8001"
    spec:
      serviceAccountName: vllm-sa
      priorityClassName: llm-high-priority
      schedulerName: volcano
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: workload.type
                operator: In
                values: ["llm-inference"]
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["vllm-llama2-70b"]
              topologyKey: kubernetes.io/hostname
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      initContainers:
      - name: model-loader
        image: busybox
        command: ['sh', '-c', 'echo "Model ready"']
      containers:
      - name: vllm-server
        image: vllm/vllm-openai:v0.3.0
        command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        args:
        - --model=/models/llama2-70b-awq
        - --tensor-parallel-size=4
        - --max-num-seqs=256
        - --gpu-memory-utilization=0.95
        - --host=0.0.0.0
        - --port=8000
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0,1,2,3"
        - name: NCCL_DEBUG
          value: "WARN"
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 8001
          name: metrics
        resources:
          requests:
            nvidia.com/gpu: 4
            memory: "200Gi"
            cpu: "32"
          limits:
            nvidia.com/gpu: 4
            memory: "240Gi"
            cpu: "64"
        volumeMounts:
        - name: model-storage
          mountPath: /models
          readOnly: true
        - name: shm
          mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 300
          periodSeconds: 30
          timeoutSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: llm-models-shared
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 32Gi
---
# HorizontalPodAutoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama2-70b
  minReplicas: 4
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_queue_length
      target:
        type: AverageValue
        averageValue: "50"

6.2 Configuration Checklist

| Item | Setting | Recommended value | Notes |
| --- | --- | --- | --- |
| GPU resources | nvidia.com/gpu | 2-8 | depends on model size |
| Memory | resources.requests.memory | model size × 1.5 | leave enough headroom |
| Shared memory | /dev/shm | 16-32Gi | PyTorch multi-process communication |
| Concurrency | max_num_seqs | 128-512 | higher with more GPU memory |
| GPU memory fraction | gpu_memory_utilization | 0.90-0.95 | never set 1.0 |
| Health checks | initialDelaySeconds | 300s | model loading takes time |
| Priority | priorityClassName | high | protects against eviction |

The memory rule of thumb is worked through in the sketch below.
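
A sketch of that rule, where the bytes-per-parameter figures are assumptions consistent with the quantization numbers in section 4.3:

# Host-memory request suggestion following the "model size x 1.5" rule of thumb
def suggested_memory_request_gi(params_billion: float, bytes_per_param: float = 2.0,
                                headroom: float = 1.5) -> int:
    """bytes_per_param: 2.0 for FP16 weights, ~0.6 for 4-bit AWQ including overhead (assumed)."""
    model_gi = params_billion * 1e9 * bytes_per_param / 2**30
    return int(model_gi * headroom)

print(suggested_memory_request_gi(70))        # FP16 70B -> ~195Gi (close to the 200Gi request above)
print(suggested_memory_request_gi(70, 0.6))   # AWQ 70B  -> ~58Gi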

7. Results

7.1 Before and After

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Daily calls | 3M | 10M | +233% |
| Peak QPS | 1500 | 5000 | +233% |
| P99 latency | 850ms | 280ms | 67% ↓ |
| GPU utilization | 45% | 87% | 93% ↑ |
| Cost per million calls | ¥320 | ¥95 | 70% ↓ |
| Availability | 98.5% | 99.95% | - |

7.2 Where the Savings Come From

  • Spot instance mix: 25% savings
  • AWQ quantization: 3x capacity per GPU, 67% savings
  • Elastic scaling: 40% savings during off-peak hours
  • Overall cost: down 70%

8. Lessons Learned

8.1 GPU Memory Fragmentation

Problem: after running for a while, GPU memory utilization sits at only 60%, yet new requests still OOM.

Cause: fragmentation in PyTorch's CUDA memory allocator.

Fix:

# Tune the CUDA caching allocator to limit fragmentation
import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'

# vLLM's PagedAttention avoids KV-cache fragmentation on its own

8.2 NCCL Communication Timeouts

Problem: frequent NCCL timeout errors during multi-GPU inference.

Cause: misconfigured networking; NCCL picked the wrong network interface.

Fix:

# Pin NCCL to the right network interface
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1  # only if the nodes have no InfiniBand

8.3 Model Loading Timeouts

Problem: loading the model from NFS takes 5-10 minutes, so health checks fail.

Fix:

  1. Cache models on local SSD
  2. Increase the startupProbe failureThreshold (see the sizing sketch after this list)
  3. Bake small models into the container image
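
Point 2 is simple arithmetic: the startup grace period (initialDelaySeconds plus periodSeconds × failureThreshold) must exceed the slowest model load you actually observe. A sketch using the probe values from section 5.2 and an assumed 10-minute cold load from NFS:

# startupProbe budget check (probe values from section 5.2; worst_load_s is whatever you measure)
def startup_budget_s(initial_delay_s: int, period_s: int, failure_threshold: int) -> int:
    return initial_delay_s + period_s * failure_threshold

worst_load_s = 10 * 60                              # assumed: 10 minutes from cold NFS
budget = startup_budget_s(60, 10, 30)               # = 360s with the current settings
needed_threshold = -(-(worst_load_s - 60) // 10)    # ceil((600 - 60) / 10) = 54
print(budget, needed_threshold)                     # 360s is not enough; raise failureThreshold to ~54+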

8.4 CPU Bottlenecks

Problem: GPU utilization stuck around 50% while CPU is at 100%.

Cause: request preprocessing and tokenization consume a lot of CPU.

Fix:

resources:
  requests:
    cpu: "32"  # 增加CPU请求
  limits:
    cpu: "64"

9. Summary and Outlook

9.1 Key Takeaways

  1. GPU scheduling: MIG slices mixed with whole-GPU allocation, gang scheduling with Volcano
  2. Concurrency: vLLM + continuous batching + PagedAttention
  3. Cost: Spot instances + AWQ quantization + elastic scaling
  4. Stability: generous resource reservations + thorough monitoring and alerting

9.2 What's Next

  • Speculative decoding: faster long-form generation
  • FlashAttention-3: further improve attention compute efficiency
  • Cross-region disaster recovery: multi-cluster traffic scheduling
  • Online learning: RLHF with online feedback

9.3 References

  • vLLM documentation
  • Volcano scheduler
  • NVIDIA GPU Operator
  • AWQ quantization