在K8s上部署大模型推理服务:从0到日均千万调用
一、背景与挑战
1.1 业务规模
我们的大模型推理服务从上线至今经历了快速增长:
- 日调用量:1000万+ 次
- 峰值QPS:5000+
- 模型规模:7B-70B参数不等
- GPU集群:100+ 张A100/A800
1.2 核心挑战
| 挑战领域 | 具体问题 | 影响 |
|---|---|---|
| GPU调度 | GPU资源碎片化、利用率低 | 成本浪费30%+ |
| 并发处理 | 请求排队、响应延迟高 | 用户体验差 |
| 成本控制 | GPU成本占总成本80%+ | ROI压力大 |
| 稳定性 | OOM、Pod驱逐频繁 | 可用性 < 99% |
二、GPU调度架构设计
2.1 整体架构
整体链路为:请求经Envoy负载均衡进入vLLM推理服务;vLLM实例由Volcano批量调度到GPU节点,GPU资源由GPU Operator按MIG切片或整卡方式管理;Prometheus负责指标采集,KEDA/HPA据此做弹性伸缩。
2.2 GPU资源隔离方案
方案对比
| 方案 | GPU利用率 | 隔离性 | 复杂度 | 适用场景 |
|---|---|---|---|---|
| 整卡分配 | 60-70% | 强 | 低 | 大模型(70B+) |
| MPS共享 | 75-85% | 中 | 中 | 中等模型(7B-13B) |
| MIG切片 | 80-90% | 强 | 高 | 混合负载 |
| 时间片轮转 | 85-95% | 弱 | 中 | 开发测试环境 |
我们采用MIG切片 + 整卡分配的混合方案:
# GPU Operator 配置 - 启用MIG
apiVersion: v1
kind: ConfigMap
metadata:
name: gpu-operator-config
namespace: gpu-operator
data:
config.yaml: |
migManager:
enabled: true
config:
name: "default-mig-config"
default: "all-balanced" # 均衡切分策略
devicePlugin:
config:
name: "mig-config"
default: "all-1g.10gb" # 默认1g.10gb切片
MIG配置脚本
#!/bin/bash
# 配置A100 MIG实例
NODE_NAME="gpu-node-1"
# 启用MIG模式
kubectl exec -it nvidia-mig-manager-daemonset-xxxxx -n gpu-operator -- \
nvidia-smi -i 0 -mig 1
# 创建MIG实例 - 7个1g.10gb实例(-C 会同时创建对应的计算实例,无需再逐个执行 mig -cci)
kubectl exec -it nvidia-mig-manager-daemonset-xxxxx -n gpu-operator -- \
  nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb -C
# 验证MIG配置
kubectl exec -it nvidia-mig-manager-daemonset-xxxxx -n gpu-operator -- \
nvidia-smi -L
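MIG切片创建后,GPU Operator的device plugin会把切片作为扩展资源暴露给K8s。下面是一个示意Pod(假设MIG策略为mixed、切片资源名为nvidia.com/mig-1g.10gb,模型路径与参数仅为举例),量化后的小模型直接申请切片而不占整卡:
# 示意:申请单个1g.10gb MIG切片运行量化小模型
apiVersion: v1
kind: Pod
metadata:
  name: vllm-7b-awq-mig
  namespace: llm-inference
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: vllm-server
    image: vllm/vllm-openai:v0.3.0
    command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
    args:
    - --model=/models/llama2-7b-awq   # 假设的量化模型路径
    - --quantization=awq
    - --gpu-memory-utilization=0.9
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1     # 申请一个MIG切片而非整卡
    volumeMounts:
    - name: model-storage
      mountPath: /models
      readOnly: true
  volumes:
  - name: model-storage
    persistentVolumeClaim:
      claimName: llm-models-pvc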
2.3 节点亲和性与污点配置
# GPU节点分组标签
apiVersion: v1
kind: Node
metadata:
name: gpu-node-prod-1
labels:
node-role.kubernetes.io/gpu: "true"
gpu.nvidia.com/class: "A100-SXM4-80GB"
gpu.nvidia.com/mig.capable: "true"
workload.type: "llm-inference" # 工作负载类型
model.size: "70b" # 支持的模型大小
---
# 添加污点防止非GPU工作负载调度
apiVersion: v1
kind: Node
metadata:
name: gpu-node-prod-1
spec:
taints:
- effect: NoSchedule
key: nvidia.com/gpu
value: "true"
- effect: NoSchedule
key: workload.type
value: "llm-inference"
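上面的Node清单只是展示期望的标签与污点,实际操作中通常直接用kubectl命令给节点打标/打污点,例如:
#!/bin/bash
# 给GPU节点打上分组标签
kubectl label node gpu-node-prod-1 \
  node-role.kubernetes.io/gpu=true \
  gpu.nvidia.com/class=A100-SXM4-80GB \
  workload.type=llm-inference \
  model.size=70b --overwrite
# 添加污点,阻止非GPU工作负载调度到该节点
kubectl taint node gpu-node-prod-1 nvidia.com/gpu=true:NoSchedule --overwrite
kubectl taint node gpu-node-prod-1 workload.type=llm-inference:NoSchedule --overwrite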
2.4 智能调度策略
使用Volcano Scheduler实现批量调度和队列管理:
# Volcano Queue配置
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
name: llm-inference-queue
spec:
weight: 1
capability:
nvidia.com/gpu: 40 # 队列最多使用40张GPU
reclaimable: true
---
# PodGroup配置 - 批量调度
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
name: vllm-workers
namespace: llm-inference
spec:
minMember: 4 # 最少启动4个Pod
queue: llm-inference-queue
priorityClassName: high-priority
---
# Deployment配置
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-llama2-70b
namespace: llm-inference
spec:
replicas: 8
template:
metadata:
annotations:
scheduling.volcano.sh/group-name: "vllm-workers"
spec:
schedulerName: volcano
priorityClassName: high-priority
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: model.size
operator: In
values: ["70b", "any"]
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values: ["vllm-llama2-70b"]
topologyKey: kubernetes.io/hostname
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
containers:
- name: vllm-server
image: vllm/vllm-openai:v0.3.0
command:
- python3
- -m
- vllm.entrypoints.openai.api_server
args:
- --model=/models/llama2-70b
- --tensor-parallel-size=4 # 张量并行
- --max-num-seqs=256 # 最大并发序列
- --gpu-memory-utilization=0.95
resources:
limits:
nvidia.com/gpu: 4 # 4卡并行
requests:
nvidia.com/gpu: 4
memory: "200Gi"
volumeMounts:
- name: model-storage
mountPath: /models
volumes:
- name: model-storage
persistentVolumeClaim:
claimName: llm-models-pvc
三、模型并发优化
3.1 推理框架选型
经过实际测试对比(模型:Llama2-13B,输入512 tokens,输出256 tokens):
| 框架 | 吞吐量 | P99延迟 | GPU利用率 | 内存占用 | 特点 |
|---|---|---|---|---|---|
| vLLM | 2400 req/s | 180ms | 92% | 18GB | PagedAttention,开源 |
| TGI | 1800 req/s | 220ms | 85% | 20GB | HuggingFace官方 |
| Triton | 2100 req/s | 200ms | 88% | 19GB | 多框架支持 |
| TensorRT-LLM | 2600 req/s | 150ms | 95% | 16GB | 最优性能,依赖闭源TensorRT内核 |
最终选型:
- 主力框架:vLLM(性价比最高)
- 备用框架:Triton + TensorRT-LLM(关键业务)
3.2 vLLM核心配置
# vLLM配置文件
from vllm import LLM, SamplingParams
llm = LLM(
model="/models/llama2-13b",
tensor_parallel_size=2, # 张量并行度
pipeline_parallel_size=1, # 流水线并行度
max_num_seqs=256, # 最大并发序列数
max_num_batched_tokens=16384, # 批处理token数
gpu_memory_utilization=0.95, # GPU显存利用率
swap_space=16, # CPU-GPU交换空间(GB)
enforce_eager=False, # 启用CUDA graph
trust_remote_code=True,
dtype="float16",
quantization="awq", # AWQ量化
)
# 采样参数
sampling_params = SamplingParams(
temperature=0.8,
top_p=0.95,
max_tokens=512,
repetition_penalty=1.1,
)
vLLM配置优化要点
# 生产环境vLLM Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-production
spec:
template:
spec:
containers:
- name: vllm
env:
# 核心性能参数
- name: VLLM_MAX_NUM_SEQS
value: "256" # 根据GPU显存调整:A100-80G可设256,A100-40G设128
- name: VLLM_GPU_MEMORY_UTILIZATION
value: "0.95" # 显存利用率95%
- name: VLLM_SWAP_SPACE
value: "16" # 16GB交换空间
# CUDA优化
- name: CUDA_VISIBLE_DEVICES
value: "0,1,2,3" # 指定GPU设备
- name: NCCL_DEBUG
value: "INFO"
- name: NCCL_SOCKET_IFNAME
value: "eth0"
# 性能监控
- name: VLLM_ENABLE_METRICS
value: "true"
- name: VLLM_METRICS_PORT
value: "8001"
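vLLM暴露的是OpenAI兼容接口,服务起来后可以直接用openai客户端做一次调用验证。下面是一个最小示意(假设Service名为vllm-service、端口8000,模型名与启动参数--model保持一致):
# 示意:通过OpenAI兼容接口调用vLLM服务
from openai import OpenAI

client = OpenAI(
    base_url="http://vllm-service.llm-inference.svc.cluster.local:8000/v1",
    api_key="EMPTY",  # vLLM默认不校验api key
)

resp = client.completions.create(
    model="/models/llama2-13b",   # 与启动参数--model保持一致
    prompt="介绍一下Kubernetes的调度器",
    max_tokens=256,
    temperature=0.8,
)
print(resp.choices[0].text)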
3.3 批处理与动态批处理
Continuous Batching(连续批处理)
vLLM的核心优势 - 动态调整批大小:
传统批处理(Static Batching):
Batch 1: [Req1(500 tokens), Req2(50 tokens), Req3(200 tokens)]
整批必须等最长的Req1生成完500个token才能返回:Req2约有450个token步在空转,Req3约有300步,计算被白白浪费
连续批处理(Continuous Batching):
Time 0: [Req1, Req2, Req3] 开始
Time 1: [Req1, Req3, Req4] Req2完成即退出,空出的槽位立刻给Req4
Time 2: [Req1, Req3, Req4, Req5, Req6] 批大小随请求到达/完成动态变化
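用一个极简的数字例子可以感受两者的差距(纯示意计算,不代表真实调度器行为):
# 示意:静态批处理 vs 连续批处理的槽位步数对比
requests = [500, 50, 200]  # 三个请求各自需要生成的token数

# 静态批处理:整批按最长请求计算,每个槽位都要跑max步
static_steps = max(requests) * len(requests)      # 1500个槽位步

# 连续批处理:请求完成即释放槽位,总步数约等于所有请求的token数之和
continuous_steps = sum(requests)                  # 750个槽位步

print(f"静态批处理槽位步数: {static_steps}")
print(f"连续批处理槽位步数: {continuous_steps}")
print(f"浪费比例约: {1 - continuous_steps / static_steps:.0%}")  # 约50%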
PagedAttention机制
# PagedAttention配置
class PagedAttentionConfig:
    block_size: int = 16                    # KV Cache块大小(token数)
    max_num_blocks_per_seq: int = 2048      # 每序列最大块数
    gpu_memory_utilization: float = 0.9

    def calculate_blocks(self, max_seq_len: int) -> int:
        """计算指定序列长度所需的KV Cache块数(向上取整)"""
        return (max_seq_len + self.block_size - 1) // self.block_size

# 示例:4096 tokens序列需要256个块
config = PagedAttentionConfig()
blocks_needed = config.calculate_blocks(4096)  # 256块
3.4 请求调度与负载均衡
# 使用Envoy作为负载均衡器
apiVersion: v1
kind: ConfigMap
metadata:
name: envoy-config
data:
envoy.yaml: |
static_resources:
clusters:
- name: vllm_cluster
type: STRICT_DNS
lb_policy: LEAST_REQUEST # 最少请求负载均衡
health_checks:
- timeout: 5s
interval: 10s
unhealthy_threshold: 2
healthy_threshold: 1
http_health_check:
path: /health
load_assignment:
cluster_name: vllm_cluster
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: vllm-service
port_value: 8000
circuit_breakers:
thresholds:
- priority: DEFAULT
max_requests: 10000
max_pending_requests: 5000
max_retries: 3
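需要注意,LEAST_REQUEST只有在Envoy能看到各个Pod端点时才有意义;如果直接指向ClusterIP,DNS只会解析出一个虚拟IP。一种可选做法(示意,假设Service名为vllm-headless)是为vLLM额外建一个Headless Service,让Envoy通过DNS拿到所有Pod IP:
# Headless Service - 供Envoy解析出每个vLLM Pod的IP
apiVersion: v1
kind: Service
metadata:
  name: vllm-headless
  namespace: llm-inference
spec:
  clusterIP: None          # Headless:DNS直接返回Pod IP列表
  selector:
    app: vllm-llama2-70b
  ports:
  - name: http
    port: 8000
    targetPort: 8000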
自适应限流
# 基于GPU显存使用率的自适应限流
import prometheus_client as prom
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

# Prometheus指标
gpu_memory_usage = prom.Gauge('gpu_memory_usage_ratio', 'GPU memory usage ratio')
active_requests = prom.Gauge('active_inference_requests', 'Active inference requests')

MAX_CONCURRENT_REQUESTS = 256

@app.middleware("http")
async def adaptive_rate_limit(request: Request, call_next):
    # prometheus_client没有公开的读值接口,这里直接读Gauge的内部值
    current_requests = active_requests._value.get()
    gpu_mem = gpu_memory_usage._value.get()
    # 根据显存压力动态收紧并发上限
    if gpu_mem > 0.95:
        max_allowed = int(MAX_CONCURRENT_REQUESTS * 0.5)   # 收紧50%
    elif gpu_mem > 0.90:
        max_allowed = int(MAX_CONCURRENT_REQUESTS * 0.75)
    else:
        max_allowed = MAX_CONCURRENT_REQUESTS
    if current_requests >= max_allowed:
        # 在中间件里抛HTTPException不会被FastAPI异常处理器接住,直接返回429响应
        return JSONResponse(status_code=429, content={"detail": "Too many requests"})
    active_requests.inc()
    try:
        return await call_next(request)
    finally:
        active_requests.dec()
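上面的gpu_memory_usage只是定义了指标,还需要后台任务定期刷新。一个最小示意如下(沿用上面代码中的app与gpu_memory_usage,假设容器内可访问NVML且只监控0号卡):
# 示意:用pynvml定期刷新GPU显存使用率指标
import asyncio
import pynvml

async def update_gpu_memory_metric(interval_seconds: int = 5):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # 假设监控0号卡
    try:
        while True:
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            gpu_memory_usage.set(mem.used / mem.total)   # 0.0 ~ 1.0
            await asyncio.sleep(interval_seconds)
    finally:
        pynvml.nvmlShutdown()

@app.on_event("startup")
async def start_background_tasks():
    asyncio.create_task(update_gpu_memory_metric())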
四、成本优化实践
4.1 成本分析
我们的成本构成(月度100张A100):
| 成本项 | 金额(万元/月) | 占比 | 优化空间 |
|---|---|---|---|
| GPU租赁 | 800 | 80% | 30% |
| 存储(NFS) | 50 | 5% | 10% |
| 网络流量 | 30 | 3% | 5% |
| 人力运维 | 120 | 12% | - |
| 总计 | 1000 | 100% | 25% |
4.2 GPU利用率优化
Spot实例 + On-Demand混合部署
# NodePool配置 - Spot实例
apiVersion: karpenter.sh/v1beta1   # 示意:NodePool为Karpenter等弹性组件的CRD,并非K8s内置资源,字段以实际组件为准
kind: NodePool
metadata:
name: gpu-spot-pool
spec:
nodeTemplate:
instanceType: GPU-A100-80G-SPOT # Spot实例
taints:
- key: scheduling.karpenter.sh/spot
value: "true"
effect: NoSchedule
limits:
resources:
nvidia.com/gpu: 60 # 60%使用Spot
---
# NodePool配置 - On-Demand实例
apiVersion: karpenter.sh/v1beta1   # 同上,示意CRD
kind: NodePool
metadata:
name: gpu-ondemand-pool
spec:
nodeTemplate:
instanceType: GPU-A100-80G # On-Demand实例
limits:
resources:
nvidia.com/gpu: 40 # 40%使用On-Demand
Deployment使用Spot实例
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-spot
spec:
replicas: 6
template:
spec:
tolerations:
- key: scheduling.karpenter.sh/spot
operator: Exists
effect: NoSchedule
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot"]
成本节省:Spot实例价格是On-Demand的30-50%,混合使用节省25%成本。
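Spot实例可能被随时回收,建议在跑Spot的Deployment里配合较长的优雅退出时间和preStop钩子,给在途请求留出排空窗口。下面是一个示意片段(sleep时长为假设值,实际取决于云厂商的回收提前通知时间):
# 示意:Spot实例上的优雅排空配置
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120   # 给在途请求留排空时间
      containers:
      - name: vllm
        lifecycle:
          preStop:
            exec:
              # 先被摘除流量,再等待在途请求完成
              command: ["/bin/sh", "-c", "sleep 30"]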
4.3 模型量化
| 量化方法 | 精度 | 压缩比 | 吞吐量变化 | 推理质量 | 适用场景 |
|---|---|---|---|---|---|
| FP16 | 半精度 | 2x | 基准 | 100% | 通用 |
| INT8 | 8位整数 | 4x | +30% | 98-99% | 通用 |
| INT4 | 4位整数 | 8x | +50% | 95-97% | 非关键业务 |
| AWQ | 4位权重 | 7x | +45% | 98% | 推荐 |
| GPTQ | 4位权重 | 7x | +40% | 97% | 备选 |
AWQ量化实践
# AWQ量化脚本
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "/models/llama2-70b"
quant_path = "/models/llama2-70b-awq"
# 加载模型
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# 量化配置
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM"
}
# 执行量化
model.quantize(tokenizer, quant_config=quant_config)
# 保存量化模型
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
实际效果(Llama2-70B):
- 显存占用:140GB → 40GB(减少71%)
- 单卡支持模型:1个 → 3个(提升3倍)
- 推理延迟:基本持平
- 吞吐量:+45%
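量化产物可以直接交给vLLM加载,只需指定quantization="awq"。下面是一个示意(模型路径沿用上面脚本的quant_path,tensor_parallel_size为示意值):
# 示意:vLLM加载AWQ量化模型
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama2-70b-awq",   # 上一步save_quantized的输出目录
    quantization="awq",
    tensor_parallel_size=2,           # 量化后显存需求下降,并行度可相应降低(示意值)
    gpu_memory_utilization=0.95,
)
outputs = llm.generate(["你好,介绍一下你自己"], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)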
4.4 弹性伸缩
HPA + KEDA自定义指标
# KEDA命名空间(KEDA组件本身通常通过Helm安装到该命名空间)
apiVersion: v1
kind: Namespace
metadata:
name: keda
---
# KEDA ScaledObject - 基于Prometheus指标扩缩容
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: vllm-scaler
namespace: llm-inference
spec:
scaleTargetRef:
name: vllm-llama2-13b
minReplicaCount: 4
maxReplicaCount: 20
pollingInterval: 15
cooldownPeriod: 300
triggers:
# 基于队列长度
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: vllm_queue_length
threshold: '50'
query: |
sum(vllm_request_queue_size{job="vllm"})
# 基于GPU利用率
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: gpu_utilization
threshold: '80'
query: |
avg(DCGM_FI_DEV_GPU_UTIL{pod=~"vllm.*"})
定时扩缩容
# CronJob - 业务低峰期缩容
apiVersion: batch/v1
kind: CronJob
metadata:
name: scale-down-night
spec:
schedule: "0 2 * * *" # 凌晨2点
jobTemplate:
spec:
template:
spec:
containers:
- name: scaler
image: bitnami/kubectl
command:
- /bin/sh
- -c
- |
kubectl scale deployment vllm-llama2-13b --replicas=2 -n llm-inference
---
# CronJob - 业务高峰期扩容
apiVersion: batch/v1
kind: CronJob
metadata:
name: scale-up-morning
spec:
schedule: "0 8 * * *" # 早上8点
jobTemplate:
spec:
template:
spec:
containers:
- name: scaler
image: bitnami/kubectl
command:
- /bin/sh
- -c
- |
kubectl scale deployment vllm-llama2-13b --replicas=12 -n llm-inference
4.5 模型共享与缓存
ReadOnlyMany PVC共享模型
# StorageClass - NFS后端
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: nfs-storage
provisioner: nfs.csi.k8s.io
parameters:
server: nfs-server.default.svc.cluster.local
share: /models
volumeBindingMode: Immediate
---
# PVC - 所有Pod共享
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: llm-models-shared
spec:
accessModes:
- ReadOnlyMany # 只读多挂载
storageClassName: nfs-storage
resources:
requests:
storage: 2Ti
模型预热InitContainer
spec:
initContainers:
- name: model-downloader
image: amazon/aws-cli
command:
- /bin/sh
- -c
- |
# 从S3下载模型到本地SSD(首次启动)
if [ ! -f /models/.downloaded ]; then
aws s3 sync s3://llm-models/llama2-70b /models/llama2-70b
touch /models/.downloaded
fi
volumeMounts:
- name: local-ssd
mountPath: /models
containers:
- name: vllm
volumeMounts:
- name: local-ssd
mountPath: /models
readOnly: true
volumes:
- name: local-ssd
hostPath:
path: /mnt/local-ssd/models # 节点本地SSD
type: DirectoryOrCreate
五、稳定性保障
5.1 资源限制与OOM防护
apiVersion: v1
kind: Pod
metadata:
name: vllm-production
spec:
containers:
- name: vllm
resources:
requests:
memory: "100Gi"
nvidia.com/gpu: 2
cpu: "16"
limits:
memory: "120Gi" # 留20%buffer
nvidia.com/gpu: 2
cpu: "32"
      # 说明:容器的OOM评分(oom_score_adj)由kubelet根据Pod的QoS等级设置,无法通过环境变量调整;
      # 若要进一步降低被OOM kill的优先级,应将requests与limits设置为一致,使Pod获得Guaranteed QoS
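如果某些实例确实不容许被OOM kill,可以把requests和limits配成完全一致,让Pod落入Guaranteed QoS(示意片段,数值仅为举例):
# 示意:Guaranteed QoS - requests与limits完全一致
resources:
  requests:
    memory: "120Gi"
    cpu: "32"
    nvidia.com/gpu: 2
  limits:
    memory: "120Gi"
    cpu: "32"
    nvidia.com/gpu: 2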
5.2 健康检查
apiVersion: v1
kind: Pod
spec:
containers:
- name: vllm
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 300 # 模型加载需要时间
periodSeconds: 30
timeoutSeconds: 10
failureThreshold: 3
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
periodSeconds: 10
timeoutSeconds: 5
successThreshold: 1
failureThreshold: 3
startupProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
failureThreshold: 30 # 最多等待5分钟
5.3 PodDisruptionBudget
# 确保滚动更新时的服务可用性
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: vllm-pdb
spec:
minAvailable: 6 # 最少保留6个Pod
selector:
matchLabels:
app: vllm-llama2-70b
5.4 监控告警
# PrometheusRule - 告警规则
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: vllm-alerts
spec:
groups:
- name: vllm
interval: 30s
rules:
# GPU利用率过低
- alert: GPUUtilizationLow
expr: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"vllm.*"}) < 30
for: 10m
annotations:
summary: "GPU利用率过低 ({{ $value }}%)"
description: "可能需要缩容或调整配置"
# GPU显存OOM风险
- alert: GPUMemoryHigh
expr: avg(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100) > 95
for: 5m
annotations:
summary: "GPU显存使用率过高 ({{ $value }}%)"
# 推理延迟过高
- alert: InferenceLatencyHigh
expr: histogram_quantile(0.99, rate(vllm_request_duration_seconds_bucket[5m])) > 3
for: 5m
annotations:
summary: "P99推理延迟过高 ({{ $value }}s)"
# Pod重启频繁
- alert: PodRestartTooOften
expr: rate(kube_pod_container_status_restarts_total{pod=~"vllm.*"}[1h]) > 0.1
for: 10m
annotations:
summary: "Pod重启频繁"
六、生产级部署清单
6.1 完整YAML示例
# Namespace
apiVersion: v1
kind: Namespace
metadata:
name: llm-inference
---
# ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
name: vllm-sa
namespace: llm-inference
---
# PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: llm-high-priority
value: 1000000
globalDefault: false
description: "高优先级LLM推理服务"
---
# ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: vllm-config
namespace: llm-inference
data:
model_config.json: |
{
"model": "/models/llama2-70b-awq",
"tensor_parallel_size": 4,
"max_num_seqs": 256,
"gpu_memory_utilization": 0.95
}
---
# Service
apiVersion: v1
kind: Service
metadata:
name: vllm-service
namespace: llm-inference
spec:
type: ClusterIP
selector:
app: vllm-llama2-70b
ports:
- name: http
port: 8000
targetPort: 8000
- name: metrics
port: 8001
targetPort: 8001
---
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-llama2-70b
namespace: llm-inference
spec:
replicas: 8
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 2
maxUnavailable: 1
selector:
matchLabels:
app: vllm-llama2-70b
template:
metadata:
labels:
app: vllm-llama2-70b
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8001"
spec:
serviceAccountName: vllm-sa
priorityClassName: llm-high-priority
schedulerName: volcano
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: workload.type
operator: In
values: ["llm-inference"]
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values: ["vllm-llama2-70b"]
topologyKey: kubernetes.io/hostname
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
initContainers:
- name: model-loader
image: busybox
command: ['sh', '-c', 'echo "Model ready"']
containers:
- name: vllm-server
image: vllm/vllm-openai:v0.3.0
command:
- python3
- -m
- vllm.entrypoints.openai.api_server
args:
- --model=/models/llama2-70b-awq
- --tensor-parallel-size=4
- --max-num-seqs=256
- --gpu-memory-utilization=0.95
- --host=0.0.0.0
- --port=8000
env:
- name: CUDA_VISIBLE_DEVICES
value: "0,1,2,3"
- name: NCCL_DEBUG
value: "WARN"
ports:
- containerPort: 8000
name: http
- containerPort: 8001
name: metrics
resources:
requests:
nvidia.com/gpu: 4
memory: "200Gi"
cpu: "32"
limits:
nvidia.com/gpu: 4
memory: "240Gi"
cpu: "64"
volumeMounts:
- name: model-storage
mountPath: /models
readOnly: true
- name: shm
mountPath: /dev/shm
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 300
periodSeconds: 30
timeoutSeconds: 10
failureThreshold: 3
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
periodSeconds: 10
volumes:
- name: model-storage
persistentVolumeClaim:
claimName: llm-models-shared
- name: shm
emptyDir:
medium: Memory
sizeLimit: 32Gi
---
# HorizontalPodAutoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-hpa
namespace: llm-inference
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-llama2-70b
minReplicas: 4
maxReplicas: 20
metrics:
- type: Pods
pods:
metric:
name: vllm_queue_length
target:
type: AverageValue
averageValue: "50"
6.2 关键配置检查清单
| 检查项 | 配置项 | 推荐值 | 说明 |
|---|---|---|---|
| GPU资源 | nvidia.com/gpu | 2-8 | 根据模型大小 |
| 内存 | resources.requests.memory | 模型大小×1.5 | 留足buffer |
| 共享内存 | /dev/shm | 16-32Gi | PyTorch多进程通信 |
| 并发数 | max_num_seqs | 128-512 | 显存越大设越高 |
| 显存利用率 | gpu_memory_utilization | 0.90-0.95 | 不要设1.0 |
| 健康检查 | initialDelaySeconds | 300s | 模型加载需时间 |
| 优先级 | priorityClassName | high | 防止被驱逐 |
七、性能优化结果
7.1 优化前后对比
| 指标 | 优化前 | 优化后 | 提升 |
|---|---|---|---|
| 日调用量 | 300万 | 1000万+ | 233% |
| 峰值QPS | 1500 | 5000+ | 233% |
| P99延迟 | 850ms | 280ms | 67%↓ |
| GPU利用率 | 45% | 87% | 93%↑ |
| 成本/百万次调用 | ¥320 | ¥95 | 70%↓ |
| 可用性 | 98.5% | 99.95% | - |
7.2 成本节省明细
- Spot实例混合:节省25%
- AWQ量化:单卡容量提升3倍,节省67%
- 弹性伸缩:非高峰期节省40%
- 综合成本:降低70%
八、踩坑经验
8.1 GPU显存碎片化
问题:运行一段时间后,显存利用率只有60%但新请求仍然OOM。
原因:PyTorch的显存分配器产生碎片。
解决:
# 限制PyTorch CUDA分配器单个block的最大分割大小,缓解显存碎片
import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'
# vLLM的PagedAttention以固定大小的block管理KV Cache,从机制上避免KV Cache碎片
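排查时可以对比分配器已保留的显存与实际被tensor占用的显存,差值越大说明碎片(或缓存)越多。示意代码:
# 示意:观察PyTorch显存分配器的碎片情况
import torch

reserved = torch.cuda.memory_reserved()     # 分配器向驱动申请的总显存
allocated = torch.cuda.memory_allocated()   # 当前真正被tensor占用的显存
print(f"reserved={reserved/1e9:.1f}GB allocated={allocated/1e9:.1f}GB "
      f"gap={(reserved-allocated)/1e9:.1f}GB")
# 也可以直接打印详细报告
print(torch.cuda.memory_summary())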
8.2 NCCL通信超时
问题:多卡推理时频繁出现NCCL timeout错误。
原因:网络配置不当,NCCL使用了错误的网卡。
解决:
# 明确指定网络接口
export NCCL_SOCKET_IFNAME=eth0
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1 # 如果没有InfiniBand
8.3 模型加载超时
问题:模型从NFS加载耗时5-10分钟,导致健康检查失败。
解决:
- 使用本地SSD缓存模型
- 增加startupProbe的failureThreshold
- 使用镜像内置模型(针对小模型)
8.4 CPU瓶颈
问题:GPU利用率只有50%,但CPU已经100%。
原因:数据预处理、Tokenization占用大量CPU。
解决:
resources:
requests:
cpu: "32" # 增加CPU请求
limits:
cpu: "64"
九、总结与展望
9.1 核心要点
- GPU调度:MIG切片+整卡混合,Volcano批量调度
- 并发优化:vLLM + Continuous Batching + PagedAttention
- 成本优化:Spot实例 + AWQ量化 + 弹性伸缩
- 稳定性:充分的资源预留 + 完善的监控告警
9.2 下一步优化方向
- Speculative Decoding:推测解码加速长文本生成
- FlashAttention-3:进一步提升Attention计算效率
- 跨Region容灾:多集群流量调度
- 在线学习:RLHF在线反馈优化