
Deploying LLM Inference Services on Kubernetes: From Zero to 10 Million Calls a Day

1. Background and Challenges

1.1 Business Scale

Our LLM inference service has grown rapidly since launch:

  • Daily calls: 10M+
  • Peak QPS: 5,000+
  • Model sizes: 7B to 70B parameters
  • GPU cluster: 100+ A100/A800 cards

1.2 Core Challenges

| Area | Problem | Impact |
| --- | --- | --- |
| GPU scheduling | Resource fragmentation, low utilization | 30%+ cost waste |
| Concurrency | Request queuing, high response latency | Poor user experience |
| Cost control | GPU spend is 80%+ of total cost | ROI pressure |
| Stability | Frequent OOM and Pod evictions | Availability < 99% |

2. GPU Scheduling Architecture

2.1 Overall Architecture

2.2 GPU Resource Isolation

Comparison of options

| Approach | GPU utilization | Isolation | Complexity | Best for |
| --- | --- | --- | --- | --- |
| Whole-GPU allocation | 60-70% | Strong | Low | Large models (70B+) |
| MPS sharing | 75-85% | Medium | Medium | Mid-size models (7B-13B) |
| MIG slicing | 80-90% | Strong | High | Mixed workloads |
| Time-slicing | 85-95% | Weak | Medium | Dev/test environments |

We adopted a hybrid of MIG slicing plus whole-GPU allocation:

# GPU Operator config - enable MIG
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-operator-config
  namespace: gpu-operator
data:
  config.yaml: |
    migManager:
      enabled: true
      config:
        name: "default-mig-config"
        default: "all-balanced"  # 均衡切分策略
    devicePlugin:
      config:
        name: "mig-config"
        default: "all-1g.10gb"   # 默认1g.10gb切片

MIG configuration script

#!/bin/bash
# Configure A100 MIG instances

NODE_NAME="gpu-node-1"

# Enable MIG mode
kubectl exec -it nvidia-mig-manager-daemonset-xxxxx -n gpu-operator -- \
  nvidia-smi -i 0 -mig 1

# Create MIG GPU instances - seven 1g.10gb instances
kubectl exec -it nvidia-mig-manager-daemonset-xxxxx -n gpu-operator -- \
  nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb -C

# Create compute instances
for i in {0..6}; do
  kubectl exec -it nvidia-mig-manager-daemonset-xxxxx -n gpu-operator -- \
    nvidia-smi mig -cci -gi $i
done

# Verify the MIG configuration
kubectl exec -it nvidia-mig-manager-daemonset-xxxxx -n gpu-operator -- \
  nvidia-smi -L

2.3 Node Affinity and Taints

# GPU node group labels
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-prod-1
  labels:
    node-role.kubernetes.io/gpu: "true"
    gpu.nvidia.com/class: "A100-SXM4-80GB"
    gpu.nvidia.com/mig.capable: "true"
    workload.type: "llm-inference"  # workload type
    model.size: "70b"                # largest model size supported
---
# Taints to keep non-GPU workloads off these nodes
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-prod-1
spec:
  taints:
  - effect: NoSchedule
    key: nvidia.com/gpu
    value: "true"
  - effect: NoSchedule
    key: workload.type
    value: "llm-inference"

2.4 Scheduling Strategy

We use the Volcano scheduler for gang scheduling and queue management:

# Volcano Queue configuration
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: llm-inference-queue
spec:
  weight: 1
  capability:
    nvidia.com/gpu: 40  # the queue may use at most 40 GPUs
  reclaimable: true
---
# PodGroup - gang scheduling
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: vllm-workers
  namespace: llm-inference
spec:
  minMember: 4              # at least 4 Pods must be schedulable together
  queue: llm-inference-queue
  priorityClassName: high-priority
---
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-70b
  namespace: llm-inference
spec:
  replicas: 8
  template:
    metadata:
      annotations:
        scheduling.volcano.sh/group-name: "vllm-workers"
    spec:
      schedulerName: volcano
      priorityClassName: high-priority
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: model.size
                operator: In
                values: ["70b", "any"]
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["vllm-llama2-70b"]
              topologyKey: kubernetes.io/hostname
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: vllm-server
        image: vllm/vllm-openai:v0.3.0
        command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        args:
        - --model=/models/llama2-70b
        - --tensor-parallel-size=4  # tensor parallelism
        - --max-num-seqs=256        # max concurrent sequences
        - --gpu-memory-utilization=0.95
        resources:
          limits:
            nvidia.com/gpu: 4  # 4-way tensor parallel across 4 GPUs
          requests:
            nvidia.com/gpu: 4
            memory: "200Gi"
        volumeMounts:
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: llm-models-pvc
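
Once the Deployment is up, a quick smoke test of the OpenAI-compatible endpoint helps confirm the rollout before wiring up load balancing. A minimal sketch, assuming the in-cluster Service name vllm-service in the llm-inference namespace (defined in section 6.1) and the model path used above; adjust both to your environment.

# Minimal smoke test for the vLLM OpenAI-compatible API (illustrative; URL and model path are assumptions)
import requests

BASE_URL = "http://vllm-service.llm-inference:8000"  # assumed in-cluster Service address

payload = {
    "model": "/models/llama2-70b",   # must match the --model argument above
    "prompt": "Kubernetes is",
    "max_tokens": 64,
    "temperature": 0.8,
}

resp = requests.post(f"{BASE_URL}/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])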

3. Model Concurrency Optimization

3.1 Choosing an Inference Framework

Results of our own comparison (model: Llama2-13B, 512 input tokens, 256 output tokens); a minimal harness for this kind of measurement is sketched after the selection list:

| Framework | Throughput | P99 latency | GPU utilization | Memory | Notes |
| --- | --- | --- | --- | --- | --- |
| vLLM | 2400 req/s | 180ms | 92% | 18GB | PagedAttention, open source |
| TGI | 1800 req/s | 220ms | 85% | 20GB | Official HuggingFace server |
| Triton | 2100 req/s | 200ms | 88% | 19GB | Multi-framework support |
| TensorRT-LLM | 2600 req/s | 150ms | 95% | 16GB | Best performance, closed source |

Final choice:

  • Primary framework: vLLM (best price/performance)
  • Backup: Triton + TensorRT-LLM (for critical workloads)
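
This is an illustrative load-generation sketch, not our exact benchmark tooling; the endpoint address, model path, prompt construction, and concurrency level are all assumptions to adjust.

# Tiny latency/throughput harness against an OpenAI-compatible endpoint (illustrative only)
import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://vllm-service.llm-inference:8000/v1/completions"   # assumed in-cluster address
PROMPT = "word " * 512                                           # roughly 512 input tokens
N_REQUESTS, CONCURRENCY = 200, 32

def one_request(_):
    t0 = time.perf_counter()
    r = requests.post(URL, json={"model": "/models/llama2-13b",
                                 "prompt": PROMPT, "max_tokens": 256}, timeout=120)
    r.raise_for_status()
    return time.perf_counter() - t0

t_start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(one_request, range(N_REQUESTS)))
elapsed = time.perf_counter() - t_start

p99 = latencies[int(0.99 * (len(latencies) - 1))]
print(f"throughput = {N_REQUESTS / elapsed:.1f} req/s, p99 = {p99 * 1000:.0f} ms")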

3.2 Core vLLM Configuration

# vLLM engine setup
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama2-13b",
    tensor_parallel_size=2,           # tensor parallelism degree
    pipeline_parallel_size=1,         # pipeline parallelism degree
    max_num_seqs=256,                 # max concurrent sequences
    max_num_batched_tokens=16384,     # max tokens per batch
    gpu_memory_utilization=0.95,      # fraction of GPU memory to use
    swap_space=16,                    # CPU-GPU swap space (GB)
    enforce_eager=False,              # keep CUDA graphs enabled
    trust_remote_code=True,
    dtype="float16",
    quantization="awq",               # AWQ quantization
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=512,
    repetition_penalty=1.1,
)

Key vLLM tuning parameters (note: stock vLLM takes most of these as CLI flags such as --max-num-seqs; the VLLM_* environment variables below only take effect if your entrypoint maps them to those flags)

# Production vLLM Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-production
spec:
  template:
    spec:
      containers:
      - name: vllm
        env:
        # Core performance parameters
        - name: VLLM_MAX_NUM_SEQS
          value: "256"  # tune to GPU memory: 256 on A100-80G, 128 on A100-40G
        - name: VLLM_GPU_MEMORY_UTILIZATION
          value: "0.95"  # use 95% of GPU memory
        - name: VLLM_SWAP_SPACE
          value: "16"    # 16GB of CPU swap space

        # CUDA / NCCL settings
        - name: CUDA_VISIBLE_DEVICES
          value: "0,1,2,3"  # pin the process to specific GPUs
        - name: NCCL_DEBUG
          value: "INFO"
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"

        # Metrics / monitoring
        - name: VLLM_ENABLE_METRICS
          value: "true"
        - name: VLLM_METRICS_PORT
          value: "8001"

3.3 Batching and Dynamic Batching

Continuous Batching

vLLM's core advantage is that the batch is adjusted on the fly (a toy simulation follows the diagram):

Static batching:
Batch 1: [Req1(500 tokens), Req2(50 tokens), Req3(200 tokens)]
        |----500----|  shorter requests waste up to 450 tokens of computation

Continuous batching:
Time 0:  [Req1, Req2, Req3] start
Time 1:  [Req1, Req3, Req4] Req2 finishes, Req4 joins
Time 2:  [Req1, Req3, Req4, Req5, Req6] the batch grows dynamically
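
To make the idea concrete, here is a toy simulation of a continuous-batching loop. It is purely illustrative and not vLLM's actual scheduler: each running request produces one token per step, finished requests free their slot immediately, and waiting requests are admitted as soon as a slot opens, up to max_num_seqs.

# Toy continuous-batching simulation (illustrative only, not vLLM internals)
from collections import deque

def continuous_batching(requests, max_num_seqs=4):
    """requests: list of (request_id, tokens_to_generate)."""
    waiting = deque(requests)
    running = {}          # request_id -> tokens still to generate
    step = 0
    while waiting or running:
        # Admit new requests as soon as slots free up (no waiting for the whole batch)
        while waiting and len(running) < max_num_seqs:
            rid, tokens = waiting.popleft()
            running[rid] = tokens
        # One decode step: every running sequence produces one token
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]          # finished requests leave the batch immediately
        print(f"step {step}: running={sorted(running)}")
        step += 1

continuous_batching([("Req1", 5), ("Req2", 1), ("Req3", 2), ("Req4", 3), ("Req5", 2)])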

PagedAttention

# PagedAttention block accounting (simplified)
class PagedAttentionConfig:
    block_size: int = 16          # tokens per KV-cache block
    max_num_blocks_per_seq: int = 2048  # max blocks per sequence
    gpu_memory_utilization: float = 0.9

    def calculate_blocks(self, max_seq_len: int):
        """Number of KV-cache blocks needed for a sequence."""
        return (max_seq_len + self.block_size - 1) // self.block_size

# Example: a 4096-token sequence needs 256 blocks
config = PagedAttentionConfig()
blocks_needed = config.calculate_blocks(4096)  # 256 blocks

3.4 Request Scheduling and Load Balancing

# Envoy as the load balancer
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-config
data:
  envoy.yaml: |
    static_resources:
      clusters:
      - name: vllm_cluster
        type: STRICT_DNS
        lb_policy: LEAST_REQUEST  # least-request load balancing
        health_checks:
        - timeout: 5s
          interval: 10s
          unhealthy_threshold: 2
          healthy_threshold: 1
          http_health_check:
            path: /health
        load_assignment:
          cluster_name: vllm_cluster
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: vllm-service
                    port_value: 8000
        circuit_breakers:
          thresholds:
          - priority: DEFAULT
            max_requests: 10000
            max_pending_requests: 5000
            max_retries: 3

Adaptive rate limiting

# Adaptive rate limiting driven by GPU memory usage
import prometheus_client as prom
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Prometheus metrics
gpu_memory_usage = prom.Gauge('gpu_memory_usage_ratio', 'GPU memory usage ratio')
active_requests = prom.Gauge('active_inference_requests', 'Active inference requests')

MAX_CONCURRENT_REQUESTS = 256

@app.middleware("http")
async def adaptive_rate_limit(request, call_next):
    current_requests = active_requests._value.get()
    gpu_mem = gpu_memory_usage._value.get()

    # Adjust the concurrency cap based on GPU memory pressure
    if gpu_mem > 0.95:
        max_allowed = int(MAX_CONCURRENT_REQUESTS * 0.5)  # cut concurrency by 50%
    elif gpu_mem > 0.90:
        max_allowed = int(MAX_CONCURRENT_REQUESTS * 0.75)
    else:
        max_allowed = MAX_CONCURRENT_REQUESTS

    if current_requests >= max_allowed:
        raise HTTPException(status_code=429, detail="Too many requests")

    active_requests.inc()
    try:
        response = await call_next(request)
        return response
    finally:
        active_requests.dec()

4. Cost Optimization in Practice

4.1 Cost Breakdown

Our monthly cost structure (roughly 100 A100s):

| Cost item | Amount (10k CNY/month) | Share | Optimization headroom |
| --- | --- | --- | --- |
| GPU rental | 800 | 80% | 30% |
| Storage (NFS) | 50 | 5% | 10% |
| Network traffic | 30 | 3% | 5% |
| Ops headcount | 120 | 12% | - |
| Total | 1000 | 100% | 25% |

4.2 Improving GPU Utilization

Mixing Spot and On-Demand instances

# NodePool - Spot instances (schematic; the actual Karpenter NodePool schema under karpenter.sh/v1beta1 differs)
apiVersion: v1
kind: NodePool
metadata:
  name: gpu-spot-pool
spec:
  nodeTemplate:
    instanceType: GPU-A100-80G-SPOT  # Spot instance type
    taints:
    - key: scheduling.karpenter.sh/spot
      value: "true"
      effect: NoSchedule
  limits:
    resources:
      nvidia.com/gpu: 60  # 60% of GPU capacity on Spot
---
# NodePool - On-Demand instances
apiVersion: v1
kind: NodePool
metadata:
  name: gpu-ondemand-pool
spec:
  nodeTemplate:
    instanceType: GPU-A100-80G  # On-Demand instance type
  limits:
    resources:
      nvidia.com/gpu: 40  # 40% of GPU capacity On-Demand

Scheduling the Deployment onto Spot nodes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-spot
spec:
  replicas: 6
  template:
    spec:
      tolerations:
      - key: scheduling.karpenter.sh/spot
        operator: Exists
        effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: karpenter.sh/capacity-type
                operator: In
                values: ["spot"]

Cost savings: Spot instances cost 30-50% of the On-Demand price; with the mixed pool we save roughly 25% overall, as the estimate below sketches.
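
A rough sanity check of that figure, assuming Spot at about half the On-Demand price and a realized Spot share somewhat below the 60% capacity target (interruptions and fallback to On-Demand eat into it):

# Blended GPU cost estimate (illustrative; prices and realized Spot share are assumptions)
ondemand_price = 1.00          # normalized On-Demand price per GPU-hour
spot_price     = 0.50          # assume Spot at ~50% of On-Demand (doc range: 30-50%)
spot_share     = 0.50          # realized share of GPU-hours on Spot (target is 60%)

blended = spot_share * spot_price + (1 - spot_share) * ondemand_price
savings = 1 - blended
print(f"blended cost = {blended:.2f}x On-Demand, savings = {savings:.0%}")  # about 25%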

4.3 Model Quantization

| Method | Precision | Compression | Throughput change | Output quality | Best for |
| --- | --- | --- | --- | --- | --- |
| FP16 | half precision | 2x | baseline | 100% | general use |
| INT8 | 8-bit integer | 4x | +30% | 98-99% | general use |
| INT4 | 4-bit integer | 8x | +50% | 95-97% | non-critical workloads |
| AWQ | 4-bit weights | 7x | +45% | 98% | recommended |
| GPTQ | 4-bit weights | 7x | +40% | 97% | alternative |

AWQ quantization in practice

# AWQ quantization script
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/models/llama2-70b"
quant_path = "/models/llama2-70b-awq"

# Load the model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantization config
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

# Run quantization
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

Measured results (Llama2-70B); a rough arithmetic check of the memory numbers follows the list:

  • GPU memory footprint: 140GB → 40GB (71% reduction)
  • Model replicas per GPU: 1 → 3 (3x density)
  • Inference latency: roughly unchanged
  • Throughput: +45%
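
The memory numbers line up with simple weight-size arithmetic. A rough estimate that ignores KV cache and activations, and assumes roughly 15% overhead for quantization scales, zero-points, and unquantized layers:

# Rough weight-memory estimate for a 70B model (ignores KV cache, activations, runtime overhead)
params = 70e9

fp16_gb = params * 2 / 1e9          # 2 bytes per weight  -> ~140 GB
awq_gb  = params * 0.5 / 1e9        # 4-bit weights       -> ~35 GB
awq_gb += awq_gb * 0.15             # assumed ~15% overhead for scales/zero-points etc.

print(f"FP16 ~= {fp16_gb:.0f} GB, AWQ(4-bit) ~= {awq_gb:.0f} GB")   # ~140 GB vs ~40 GB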

4.4 Elastic Scaling

HPA + KEDA with custom metrics

# Namespace for the KEDA installation
apiVersion: v1
kind: Namespace
metadata:
  name: keda
---
# KEDA ScaledObject - scale on Prometheus metrics
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
  namespace: llm-inference
spec:
  scaleTargetRef:
    name: vllm-llama2-13b
  minReplicaCount: 4
  maxReplicaCount: 20
  pollingInterval: 15
  cooldownPeriod: 300
  triggers:
  # Scale on queue length
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_queue_length
      threshold: '50'
      query: |
        sum(vllm_request_queue_size{job="vllm"})
  # Scale on GPU utilization
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: gpu_utilization
      threshold: '80'
      query: |
        avg(DCGM_FI_DEV_GPU_UTIL{pod=~"vllm.*"})

Scheduled scaling

# CronJob - scale down during off-peak hours
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-night
spec:
  schedule: "0 2 * * *"  # 凌晨2点
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: scaler
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            - |
              kubectl scale deployment vllm-llama2-13b --replicas=2 -n llm-inference
---
# CronJob - scale up for the morning peak
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-morning
spec:
  schedule: "0 8 * * *"  # 早上8点
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: scaler
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            - |
              kubectl scale deployment vllm-llama2-13b --replicas=12 -n llm-inference

4.5 Model Sharing and Caching

Sharing models via a ReadOnlyMany PVC

# StorageClass - NFS backend
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-storage
provisioner: nfs.csi.k8s.io
parameters:
  server: nfs-server.default.svc.cluster.local
  share: /models
volumeBindingMode: Immediate
---
# PVC shared by all Pods
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-models-shared
spec:
  accessModes:
  - ReadOnlyMany  # read-only, mountable by many Pods
  storageClassName: nfs-storage
  resources:
    requests:
      storage: 2Ti

Pre-warming models with an initContainer

spec:
  initContainers:
  - name: model-downloader
    image: amazon/aws-cli
    command:
    - /bin/sh
    - -c
    - |
      # Download the model from S3 to local SSD (first start only)
      if [ ! -f /models/.downloaded ]; then
        aws s3 sync s3://llm-models/llama2-70b /models/llama2-70b
        touch /models/.downloaded
      fi
    volumeMounts:
    - name: local-ssd
      mountPath: /models
  containers:
  - name: vllm
    volumeMounts:
    - name: local-ssd
      mountPath: /models
      readOnly: true
  volumes:
  - name: local-ssd
    hostPath:
      path: /mnt/local-ssd/models  # node-local SSD
      type: DirectoryOrCreate

5. Stability Engineering

5.1 Resource Limits and OOM Protection

apiVersion: v1
kind: Pod
metadata:
  name: vllm-production
spec:
  containers:
  - name: vllm
    resources:
      requests:
        memory: "100Gi"
        nvidia.com/gpu: 2
        cpu: "16"
      limits:
        memory: "120Gi"  # 留20%buffer
        nvidia.com/gpu: 2
        cpu: "32"
    # Lower the likelihood of this container being OOM-killed.
    # Note: an OOM_SCORE_ADJ env var by itself does not change the kernel's oom_score_adj;
    # in practice the reliable lever is the Pod QoS class (set requests == limits for Guaranteed).
    securityContext:
      procMount: Default
    env:
    - name: OOM_SCORE_ADJ
      value: "-1000"

5.2 Health Checks

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: vllm
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 300  # model loading takes a while
      periodSeconds: 30
      timeoutSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 120
      periodSeconds: 10
      timeoutSeconds: 5
      successThreshold: 1
      failureThreshold: 3
    startupProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 60
      periodSeconds: 10
      failureThreshold: 30  # up to 5 more minutes of grace (30 x 10s)

5.3 PodDisruptionBudget

# Keep enough replicas available during voluntary disruptions (e.g. rolling updates)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
spec:
  minAvailable: 6  # keep at least 6 Pods running
  selector:
    matchLabels:
      app: vllm-llama2-70b

5.4 Monitoring and Alerting

# PrometheusRule - alerting rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-alerts
spec:
  groups:
  - name: vllm
    interval: 30s
    rules:
    # GPU utilization too low
    - alert: GPUUtilizationLow
      expr: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"vllm.*"}) < 30
      for: 10m
      annotations:
        summary: "GPU utilization is low ({{ $value }}%)"
        description: "Consider scaling in or adjusting the configuration"

    # GPU memory close to OOM
    - alert: GPUMemoryHigh
      expr: avg(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100) > 95
      for: 5m
      annotations:
        summary: "GPU memory usage is high ({{ $value }}%)"

    # Inference latency too high
    - alert: InferenceLatencyHigh
      expr: histogram_quantile(0.99, rate(vllm_request_duration_seconds_bucket[5m])) > 3
      for: 5m
      annotations:
        summary: "P99 inference latency is high ({{ $value }}s)"

    # Pods restarting too often
    - alert: PodRestartTooOften
      expr: rate(kube_pod_container_status_restarts_total{pod=~"vllm.*"}[1h]) > 0.1
      for: 10m
      annotations:
        summary: "Pods are restarting frequently"

6. Production Deployment Manifests

6.1 Full YAML Example

# Namespace
apiVersion: v1
kind: Namespace
metadata:
  name: llm-inference
---
# ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vllm-sa
  namespace: llm-inference
---
# PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: llm-high-priority
value: 1000000
globalDefault: false
description: "高优先级LLM推理服务"
---
# ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-config
  namespace: llm-inference
data:
  model_config.json: |
    {
      "model": "/models/llama2-70b-awq",
      "tensor_parallel_size": 4,
      "max_num_seqs": 256,
      "gpu_memory_utilization": 0.95
    }
---
# Service
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: llm-inference
spec:
  type: ClusterIP
  selector:
    app: vllm-llama2-70b
  ports:
  - name: http
    port: 8000
    targetPort: 8000
  - name: metrics
    port: 8001
    targetPort: 8001
---
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-70b
  namespace: llm-inference
spec:
  replicas: 8
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 1
  selector:
    matchLabels:
      app: vllm-llama2-70b
  template:
    metadata:
      labels:
        app: vllm-llama2-70b
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8001"
    spec:
      serviceAccountName: vllm-sa
      priorityClassName: llm-high-priority
      schedulerName: volcano
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: workload.type
                operator: In
                values: ["llm-inference"]
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["vllm-llama2-70b"]
              topologyKey: kubernetes.io/hostname
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      initContainers:
      - name: model-loader
        image: busybox
        command: ['sh', '-c', 'echo "Model ready"']
      containers:
      - name: vllm-server
        image: vllm/vllm-openai:v0.3.0
        command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        args:
        - --model=/models/llama2-70b-awq
        - --tensor-parallel-size=4
        - --max-num-seqs=256
        - --gpu-memory-utilization=0.95
        - --host=0.0.0.0
        - --port=8000
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0,1,2,3"
        - name: NCCL_DEBUG
          value: "WARN"
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 8001
          name: metrics
        resources:
          requests:
            nvidia.com/gpu: 4
            memory: "200Gi"
            cpu: "32"
          limits:
            nvidia.com/gpu: 4
            memory: "240Gi"
            cpu: "64"
        volumeMounts:
        - name: model-storage
          mountPath: /models
          readOnly: true
        - name: shm
          mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 300
          periodSeconds: 30
          timeoutSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: llm-models-shared
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 32Gi
---
# HorizontalPodAutoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama2-70b
  minReplicas: 4
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_queue_length
      target:
        type: AverageValue
        averageValue: "50"

6.2 Configuration Checklist

| Item | Setting | Recommended value | Notes |
| --- | --- | --- | --- |
| GPU resources | nvidia.com/gpu | 2-8 | depends on model size |
| Memory | resources.requests.memory | model size × 1.5 | leave enough headroom |
| Shared memory | /dev/shm | 16-32Gi | PyTorch multi-process communication |
| Concurrency | max_num_seqs | 128-512 | higher with more GPU memory |
| GPU memory fraction | gpu_memory_utilization | 0.90-0.95 | never set 1.0 |
| Health checks | initialDelaySeconds | 300s | model loading takes time |
| Priority | priorityClassName | high | protects against eviction |

The memory rule of thumb is worked through in the sketch below.
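
A sketch of that rule, where the bytes-per-parameter figures are assumptions consistent with the quantization numbers in section 4.3:

# Host-memory request suggestion following the "model size x 1.5" rule of thumb
def suggested_memory_request_gi(params_billion: float, bytes_per_param: float = 2.0,
                                headroom: float = 1.5) -> int:
    """bytes_per_param: 2.0 for FP16 weights, ~0.6 for 4-bit AWQ including overhead (assumed)."""
    model_gi = params_billion * 1e9 * bytes_per_param / 2**30
    return int(model_gi * headroom)

print(suggested_memory_request_gi(70))        # FP16 70B -> ~195Gi (close to the 200Gi request above)
print(suggested_memory_request_gi(70, 0.6))   # AWQ 70B  -> ~58Gi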

7. Results

7.1 Before and After

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Daily calls | 3M | 10M | +233% |
| Peak QPS | 1500 | 5000 | +233% |
| P99 latency | 850ms | 280ms | 67% ↓ |
| GPU utilization | 45% | 87% | 93% ↑ |
| Cost per million calls | ¥320 | ¥95 | 70% ↓ |
| Availability | 98.5% | 99.95% | - |

7.2 Where the Savings Come From

  • Spot instance mix: 25% savings
  • AWQ quantization: 3x capacity per GPU, 67% savings
  • Elastic scaling: 40% savings during off-peak hours
  • Overall cost: down 70%

8. Lessons Learned

8.1 GPU Memory Fragmentation

Problem: after running for a while, GPU memory utilization sits at only 60%, yet new requests still OOM.

Cause: fragmentation in PyTorch's CUDA memory allocator.

Fix:

# Tune the CUDA caching allocator to limit fragmentation
import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'

# vLLM's PagedAttention avoids KV-cache fragmentation on its own

8.2 NCCL Communication Timeouts

Problem: frequent NCCL timeout errors during multi-GPU inference.

Cause: misconfigured networking; NCCL picked the wrong network interface.

Fix:

# Pin NCCL to the right network interface
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1  # only if the nodes have no InfiniBand

8.3 Model Loading Timeouts

Problem: loading the model from NFS takes 5-10 minutes, so health checks fail.

Fix:

  1. Cache models on local SSD
  2. Increase the startupProbe failureThreshold (see the sizing sketch after this list)
  3. Bake small models into the container image
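
Point 2 is simple arithmetic: the startup grace period (initialDelaySeconds plus periodSeconds × failureThreshold) must exceed the slowest model load you actually observe. A sketch using the probe values from section 5.2 and an assumed 10-minute cold load from NFS:

# startupProbe budget check (probe values from section 5.2; worst_load_s is whatever you measure)
def startup_budget_s(initial_delay_s: int, period_s: int, failure_threshold: int) -> int:
    return initial_delay_s + period_s * failure_threshold

worst_load_s = 10 * 60                              # assumed: 10 minutes from cold NFS
budget = startup_budget_s(60, 10, 30)               # = 360s with the current settings
needed_threshold = -(-(worst_load_s - 60) // 10)    # ceil((600 - 60) / 10) = 54
print(budget, needed_threshold)                     # 360s is not enough; raise failureThreshold to ~54+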

8.4 CPU Bottlenecks

Problem: GPU utilization stuck around 50% while CPU is at 100%.

Cause: request preprocessing and tokenization consume a lot of CPU.

Fix:

resources:
  requests:
    cpu: "32"  # 增加CPU请求
  limits:
    cpu: "64"

9. Summary and Outlook

9.1 Key Takeaways

  1. GPU scheduling: MIG slices mixed with whole-GPU allocation, gang scheduling with Volcano
  2. Concurrency: vLLM + continuous batching + PagedAttention
  3. Cost: Spot instances + AWQ quantization + elastic scaling
  4. Stability: generous resource reservations + thorough monitoring and alerting

9.2 What's Next

  • Speculative decoding: faster long-form generation
  • FlashAttention-3: further improve attention compute efficiency
  • Cross-region disaster recovery: multi-cluster traffic scheduling
  • Online learning: RLHF with online feedback

9.3 References

  • vLLM documentation
  • Volcano scheduler
  • NVIDIA GPU Operator
  • AWQ quantization