# Chapter 4: Gray Releases and Blue-Green Deployments

## Release Strategy Overview

### Strategy Comparison
| Strategy | Characteristics | Pros | Cons | Best For |
|---|---|---|---|---|
| Blue-green deployment | Two identical environments, instant switch | Fast rollback | High cost (double the resources) | Major version updates |
| Gray release | Traffic ramped up gradually | Controllable risk | Long release cycle | Routine releases |
| Canary release | Validation on a small slice of traffic | Safest | Requires automation | Critical updates |
| Rolling update | Instances replaced one by one | High resource utilization | Complex rollback | Stateless services |
| A/B testing | Multiple versions in parallel | Data-driven | Complex to implement | Feature validation |
### Choosing a Strategy

**Blue-green deployment:**
- Major version updates
- Fast rollback required
- Ample resources available

**Gray release:**
- Routine version iteration
- Strict risk-control requirements
- User experience comes first

**Canary release:**
- Core service updates
- Zero tolerance for failures
- An automation platform is in place

**Rolling update:**
- Stateless applications
- Minor version updates
- Limited resources

**A/B testing:**
- New feature experiments
- Data comparison needed
- Product iteration
## Blue-Green Deployment

### How It Works

Two identical environments (blue and green) run side by side, and traffic is switched between them instantly.
```text
Initial state (blue serves users):
┌──────────────┐
│Load Balancer │
└──────────────┘
      ↓ 100%
┌──────────────┐   ┌──────────────┐
│  Blue (v1)   │   │  Green (v2)  │
│   Active     │   │  ✗ Standby   │
└──────────────┘   └──────────────┘

After the switch (green serves users):
┌──────────────┐
│Load Balancer │
└──────────────┘
              ↓ 100%
┌──────────────┐   ┌──────────────┐
│  Blue (v1)   │   │  Green (v2)  │
│  ✗ Standby   │   │   Active     │
└──────────────┘   └──────────────┘

Rollback (switch back to blue):
┌──────────────┐
│Load Balancer │
└──────────────┘
      ↓ 100%
┌──────────────┐   ┌──────────────┐
│  Blue (v1)   │   │  Green (v2)  │
│   Active     │   │  ✗ Standby   │
└──────────────┘   └──────────────┘
```
### Kubernetes Implementation

#### Option 1: Switching the Service Selector

```yaml
# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: myapp
        image: myapp:v1
        ports:
        - containerPort: 8080
---
# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: myapp
        image: myapp:v2  # new version
        ports:
        - containerPort: 8080
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # initially points at the blue environment
  ports:
  - port: 80
    targetPort: 8080
```
Switch procedure:

```bash
# 1. Deploy the blue environment (current version, v1)
kubectl apply -f blue-deployment.yaml

# 2. Deploy the green environment (new version, v2)
kubectl apply -f green-deployment.yaml

# 3. Verify the green environment
kubectl port-forward deployment/myapp-green 8080:8080
curl http://localhost:8080

# 4. Switch to green (change the Service selector)
kubectl patch service myapp -p '{"spec":{"selector":{"version":"green"}}}'

# 5. Verify the switch (the Service is named myapp, reachable in-cluster)
curl http://myapp

# 6. Roll back if anything goes wrong
kubectl patch service myapp -p '{"spec":{"selector":{"version":"blue"}}}'

# 7. Clean up the blue environment once green is confirmed stable
kubectl delete deployment myapp-blue
```
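The steps above are easy to wrap into a small helper. A minimal sketch (the script name is hypothetical; the `myapp` Service and `version` label come from the manifests above):

```bash
#!/bin/bash
# switch-service.sh (hypothetical helper) — usage: ./switch-service.sh blue|green
set -euo pipefail

TARGET="${1:?usage: $0 blue|green}"
SERVICE="myapp"

# Repoint the Service selector at the target environment
kubectl patch service "$SERVICE" \
  -p "{\"spec\":{\"selector\":{\"version\":\"$TARGET\"}}}"

# Confirm the Service now has endpoints from the target Deployment
kubectl get endpoints "$SERVICE" -o wide
echo "Switched $SERVICE to $TARGET"
```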
#### Option 2: Switching at the Ingress

```yaml
# blue-green-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-blue  # points at the blue environment
            port:
              number: 80
```
Switch:

```bash
# Switch to the green environment
kubectl patch ingress myapp-ingress --type='json' \
  -p='[{"op": "replace", "path": "/spec/rules/0/http/paths/0/backend/service/name", "value":"myapp-green"}]'
```
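Rolling back is the same JSON patch pointed back at the blue Service:

```bash
# Roll back to the blue environment
kubectl patch ingress myapp-ingress --type='json' \
  -p='[{"op": "replace", "path": "/spec/rules/0/http/paths/0/backend/service/name", "value":"myapp-blue"}]'
```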
### Istio Implementation

```yaml
# virtualservice-blue-green.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - myapp
  http:
  - route:
    - destination:
        host: myapp
        subset: blue
      weight: 100  # 100% of traffic to the blue environment
    - destination:
        host: myapp
        subset: green
      weight: 0
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp
  subsets:
  - name: blue
    labels:
      version: blue
  - name: green
    labels:
      version: green
```
Switch script:

```bash
#!/bin/bash
# blue-green-switch.sh

# Switch all traffic to the green environment
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - myapp
  http:
  - route:
    - destination:
        host: myapp
        subset: green
      weight: 100
    - destination:
        host: myapp
        subset: blue
      weight: 0
EOF

echo "Switched to GREEN environment"
```
## Gray Release

### How It Works

The share of traffic routed to the new version is increased step by step.

```text
Stage 1: 5% of traffic to v2
┌──────────┐
│  Users   │
└──────────┘
   ↓      ↓
  95%     5%
   ↓      ↓
   v1     v2

Stage 2: 50% of traffic to v2
┌──────────┐
│  Users   │
└──────────┘
   ↓      ↓
  50%    50%
   ↓      ↓
   v1     v2

Stage 3: 100% of traffic to v2
┌──────────┐
│  Users   │
└──────────┘
     ↓ 100%
     v2
```
### Istio Implementation

Full gray-release flow:

```yaml
# Stage 1: 5% of traffic to v2
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp-canary
spec:
  hosts:
  - myapp
  http:
  - route:
    - destination:
        host: myapp
        subset: v1
      weight: 95
    - destination:
        host: myapp
        subset: v2
      weight: 5  # 5% gray traffic
```
Automated gray-release script:

```bash
#!/bin/bash
# canary-rollout.sh
SERVICE_NAME="myapp"
STAGES=(5 10 25 50 75 100)

for weight in "${STAGES[@]}"; do
  echo "Canary weight: $weight%"

  # Update the traffic weights
  kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: $SERVICE_NAME
spec:
  hosts:
  - $SERVICE_NAME
  http:
  - route:
    - destination:
        host: $SERVICE_NAME
        subset: v1
      weight: $((100 - weight))
    - destination:
        host: $SERVICE_NAME
        subset: v2
      weight: $weight
EOF

  # Let the new weight soak while metrics accumulate
  echo "Waiting for metrics observation..."
  sleep 300  # wait 5 minutes

  # Check the v2 error rate as a ratio (5xx requests / all requests);
  # assumes the app exports http_requests_total to Prometheus
  ERROR_RATE=$(curl -sG "http://prometheus:9090/api/v1/query" \
    --data-urlencode "query=sum(rate(http_requests_total{status=~\"5..\",version=\"v2\"}[5m])) / sum(rate(http_requests_total{version=\"v2\"}[5m]))" \
    | jq -r '.data.result[0].value[1] // "0"')

  if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
    echo "Error rate too high ($ERROR_RATE), rolling back..."
    # Roll everything back to v1
    kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: $SERVICE_NAME
spec:
  hosts:
  - $SERVICE_NAME
  http:
  - route:
    - destination:
        host: $SERVICE_NAME
        subset: v1
      weight: 100
EOF
    exit 1
  fi

  echo "Stage $weight% completed successfully"
done

echo "Canary rollout completed!"
```
### Nginx Implementation

Cookie-based gray release: nginx's built-in `split_clients` assigns roughly 10% of clients to v2, and a version cookie keeps returning users pinned to the version they were given:

```nginx
# nginx.conf
upstream backend_v1 {
    server backend-v1:8080;
}
upstream backend_v2 {
    server backend-v2:8080;
}

# Assign ~10% of clients to v2, hashed on the client address
split_clients "${remote_addr}" $assigned_backend {
    10%     backend_v2;
    *       backend_v1;
}

# An existing version cookie overrides the hash (stickiness)
map $cookie_version $backend_version {
    "v2"        backend_v2;
    "v1"        backend_v1;
    default     $assigned_backend;
}

# Cookie value matching the chosen backend
map $backend_version $version_name {
    backend_v2  "v2";
    default     "v1";
}

server {
    listen 80;
    location / {
        # Persist the assignment so the user keeps seeing the same version
        add_header Set-Cookie "version=${version_name}; Path=/; Max-Age=86400";
        proxy_pass http://$backend_version;
    }
}
```
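A quick way to check the split from the command line (hypothetical hostname):

```bash
# First visit: see which version cookie is handed out
curl -sI http://myapp.example.com/ | grep -i set-cookie

# Replay the cookie to stay pinned to v2
curl -s -b "version=v2" http://myapp.example.com/
```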
## Canary Release

### How It Works

Canary release = a low-traffic gray release plus automated, metrics-driven decisions.

```text
┌──────────────────────────────────┐
│   Canary Deployment Pipeline     │
└──────────────────────────────────┘
                ↓
      ┌─────────────────┐
      │ 1. Deploy v2    │  deploy the canary instance
      │    (1 replica)  │
      └─────────────────┘
                ↓
      ┌─────────────────┐
      │ 2. Route 5%     │  5% of traffic to v2
      │    traffic      │
      └─────────────────┘
                ↓
      ┌─────────────────┐
      │ 3. Monitor      │  watched metrics:
      │    metrics      │  - error rate
      │                 │  - latency
      │                 │  - success rate
      └─────────────────┘
          ↓         ↓
      Success    Failure
          ↓         ↓
   ┌──────────┐ ┌──────────┐
   │ Scale v2 │ │ Rollback │
   │          │ │  to v1   │
   └──────────┘ └──────────┘
```
### Flagger (Automated Canary Releases)

Install Flagger:

```bash
# Install Flagger for Istio
kubectl apply -k github.com/fluxcd/flagger//kustomize/istio

# Verify
kubectl get pods -n istio-system | grep flagger
```
Canary configuration:

```yaml
# canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: default
spec:
  # Target Deployment
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  # Progress deadline
  progressDeadlineSeconds: 60
  # Service settings
  service:
    port: 80
    targetPort: 8080
  # Analysis (the metric checks)
  analysis:
    # Check interval
    interval: 1m
    # Number of failed checks before rollback
    threshold: 5
    # Maximum canary traffic weight
    maxWeight: 50
    # Weight increment per step
    stepWeight: 10
    # Built-in metrics (custom ones, e.g. an explicit error rate,
    # can be added through a Flagger MetricTemplate)
    metrics:
    # 1. Request success rate (%)
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    # 2. Request duration (ms)
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m
    # Webhooks (optional external validation / load generation)
    webhooks:
    - name: load-test
      url: http://flagger-loadtester/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary:80/"
```
How Flagger proceeds:

```text
1. Initial state:
   myapp-primary (v1): 100%

2. Deploy the new version:
   kubectl set image deployment/myapp myapp=myapp:v2

3. Flagger takes over automatically:
   Step 1: create myapp-canary (v2)
   Step 2: 10% of traffic → canary
   Step 3: check metrics (1 minute)
   Step 4: success → 20% of traffic
   Step 5: check metrics (1 minute)
   Step 6: success → 30% of traffic
   ...
   Step N: 100% of traffic → canary

4. Promotion:
   myapp-primary is updated to v2
   myapp-canary is scaled back down

5. On failure:
   traffic is routed back to myapp-primary (v1)
   myapp-canary is scaled back down
```
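The rollout can be followed through the Canary resource itself with ordinary kubectl reads:

```bash
# Watch the canary's phase and current weight as it progresses
kubectl get canary myapp -w

# Inspect events: weight changes, failed checks, promotion, rollback
kubectl describe canary myapp
```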
## Rolling Update

### Rolling Updates in Kubernetes

Default strategy:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # at most 1 Pod unavailable at a time
      maxSurge: 1        # at most 1 Pod above the desired count
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myapp:v2
        readinessProbe:  # gate traffic until the Pod is actually ready
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
```
Update flow:

```text
Start: 10 Pods (v1)
[v1] [v1] [v1] [v1] [v1] [v1] [v1] [v1] [v1] [v1]

Step 1: create 1 v2 Pod, remove 1 v1 Pod
[v1] [v1] [v1] [v1] [v1] [v1] [v1] [v1] [v1] [v2]

Step 2: create 1 v2 Pod, remove 1 v1 Pod
[v1] [v1] [v1] [v1] [v1] [v1] [v1] [v1] [v2] [v2]
...
End: 10 Pods (v2)
[v2] [v2] [v2] [v2] [v2] [v2] [v2] [v2] [v2] [v2]
```
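The flow starts as soon as the Pod template changes, typically through an image update:

```bash
# Trigger the rolling update by changing the image
kubectl set image deployment/myapp myapp=myapp:v2

# Record the reason so it appears in `kubectl rollout history`
kubectl annotate deployment/myapp kubernetes.io/change-cause="upgrade to v2"
```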
Controlling a rollout manually:

```bash
# Pause the rollout
kubectl rollout pause deployment/myapp

# Resume the rollout
kubectl rollout resume deployment/myapp

# Check rollout status
kubectl rollout status deployment/myapp

# View rollout history
kubectl rollout history deployment/myapp

# Roll back to the previous revision
kubectl rollout undo deployment/myapp

# Roll back to a specific revision
kubectl rollout undo deployment/myapp --to-revision=2
```
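A rolling update can still drop in-flight requests while Pods terminate. One common mitigation is to slow the rollout down and delay container shutdown; a sketch of the relevant Deployment fragment (the 10 s / 5 s values are illustrative assumptions):

```yaml
spec:
  minReadySeconds: 10      # count a Pod as available only after 10s of readiness
  template:
    spec:
      containers:
      - name: myapp
        image: myapp:v2
        lifecycle:
          preStop:
            exec:
              # Give load balancers time to deregister the Pod
              # before the container receives SIGTERM
              command: ["sh", "-c", "sleep 5"]
```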
## A/B Testing

### How It Works

Multiple versions run in parallel, and users are routed to a version based on their attributes.

```text
      ┌──────────────────┐
      │      Users       │
      └──────────────────┘
          ↓          ↓
      Group A     Group B
          ↓          ↓
      ┌───────┐  ┌───────┐
      │  v1   │  │  v2   │
      │ (red  │  │ (blue │
      │button)│  │button)│
      └───────┘  └───────┘
          ↓          ↓
    collect data  collect data
          ↓          ↓
      ┌──────────────────┐
      │  Data analysis   │
      │  - click-through │
      │  - conversion    │
      │  - retention     │
      └──────────────────┘
                ↓
      pick the winning version
```
### Istio Implementation

Header-based A/B test:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp-ab-test
spec:
  hosts:
  - myapp
  http:
  # Group A: even user IDs
  - match:
    - headers:
        user-id:
          regex: ".*[02468]$"
    route:
    - destination:
        host: myapp
        subset: version-a
  # Group B: odd user IDs
  - match:
    - headers:
        user-id:
          regex: ".*[13579]$"
    route:
    - destination:
        host: myapp
        subset: version-b
```
Cookie-based A/B test:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp-ab-test
spec:
  hosts:
  - myapp
  http:
  # Group A
  - match:
    - headers:
        cookie:
          regex: ".*ab_test=version_a.*"
    route:
    - destination:
        host: myapp
        subset: version-a
  # Group B
  - match:
    - headers:
        cookie:
          regex: ".*ab_test=version_b.*"
    route:
    - destination:
        host: myapp
        subset: version-b
  # Default: random 50/50 assignment
  - route:
    - destination:
        host: myapp
        subset: version-a
      weight: 50
    - destination:
        host: myapp
        subset: version-b
      weight: 50
```
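A quick sanity check of the three routes (hypothetical hostname):

```bash
# Pinned to group A via the cookie
curl -s -b "ab_test=version_a" http://myapp.example.com/

# Pinned to group B
curl -s -b "ab_test=version_b" http://myapp.example.com/

# No cookie: falls through to the 50/50 weighted route
curl -s http://myapp.example.com/
```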
## Worked Example

### End-to-End Gray Release

Scenario: upgrading an e-commerce order service from v1 to v2.

1. Preparation:

```bash
# Deploy v1 (the current version)
kubectl apply -f order-service-v1.yaml

# Deploy v2 (the new version, receiving no traffic yet)
kubectl apply -f order-service-v2.yaml

# Configure Istio routing (initially 100% of traffic to v1)
kubectl apply -f virtualservice-initial.yaml
```
2. Gray-release configuration:

```yaml
# canary-rollout.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
  - order-service
  http:
  - route:
    - destination:
        host: order-service
        subset: v1
      weight: 95
    - destination:
        host: order-service
        subset: v2
      weight: 5  # stage 1: 5% gray traffic
```
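The VirtualService routes by the subsets v1 and v2, which must be defined in a DestinationRule; a minimal sketch (the `version` label values are assumed to match the two Deployments):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-service
spec:
  host: order-service
  subsets:
  - name: v1
    labels:
      version: v1  # assumed Pod label on order-service-v1
  - name: v2
    labels:
      version: v2  # assumed Pod label on order-service-v2
```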
3. Monitoring script (one check per invocation, so the rollout script in step 4 can call it once a minute):

```bash
#!/bin/bash
# monitor-canary.sh — exits 0 when healthy; rolls back and exits 1 otherwise
SERVICE="order-service"
VERSION="v2"

# v2 error rate over the last minute (5xx requests / all requests)
ERROR_RATE=$(kubectl exec -n istio-system deploy/prometheus -c prometheus -- \
  wget -qO- "http://localhost:9090/api/v1/query?query=rate(istio_requests_total{destination_service=\"$SERVICE\",destination_version=\"$VERSION\",response_code=~\"5..\"}[1m])/rate(istio_requests_total{destination_service=\"$SERVICE\",destination_version=\"$VERSION\"}[1m])" \
  | jq -r '.data.result[0].value[1] // "0"')

# v2 P99 latency
P99_LATENCY=$(kubectl exec -n istio-system deploy/prometheus -c prometheus -- \
  wget -qO- "http://localhost:9090/api/v1/query?query=histogram_quantile(0.99,rate(istio_request_duration_milliseconds_bucket{destination_service=\"$SERVICE\",destination_version=\"$VERSION\"}[1m]))" \
  | jq -r '.data.result[0].value[1] // "0"')

echo "Error Rate: $ERROR_RATE, P99 Latency: ${P99_LATENCY}ms"

# Threshold checks: roll back on breach
if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
  echo "ERROR RATE TOO HIGH! Triggering rollback..."
  kubectl apply -f virtualservice-rollback.yaml
  exit 1
fi
if (( $(echo "$P99_LATENCY > 500" | bc -l) )); then
  echo "LATENCY TOO HIGH! Triggering rollback..."
  kubectl apply -f virtualservice-rollback.yaml
  exit 1
fi
```
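The script applies `virtualservice-rollback.yaml`, which is not shown above; its contents would simply put 100% of the traffic back on v1:

```yaml
# virtualservice-rollback.yaml (assumed contents)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
  - order-service
  http:
  - route:
    - destination:
        host: order-service
        subset: v1
      weight: 100
```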
4. Automated rollout script:

```bash
#!/bin/bash
# auto-canary-rollout.sh
SERVICE="order-service"
STAGES=(5 10 25 50 100)
MONITOR_DURATION=300  # monitor each stage for 5 minutes

for weight in "${STAGES[@]}"; do
  echo "========================================="
  echo "Canary Stage: $weight%"
  echo "========================================="

  # Update the traffic weights
  cat <<EOF | kubectl apply -f -
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: $SERVICE
spec:
  hosts:
  - $SERVICE
  http:
  - route:
    - destination:
        host: $SERVICE
        subset: v1
      weight: $((100 - weight))
    - destination:
        host: $SERVICE
        subset: v2
      weight: $weight
EOF

  echo "Monitoring for ${MONITOR_DURATION}s..."
  # One health check per minute; monitor-canary.sh rolls back and
  # exits non-zero on a threshold breach
  for i in $(seq 1 $((MONITOR_DURATION / 60))); do
    sleep 60
    ./monitor-canary.sh || exit 1
  done

  echo "Stage $weight% completed successfully!"
  echo ""
done

echo "🎉 Canary rollout completed successfully!"

# Clean up v1
kubectl delete deployment order-service-v1
```
## Interview Q&A

### What is the difference between blue-green deployment and gray release?

Answer:

| Dimension | Blue-Green Deployment | Gray Release |
|---|---|---|
| Traffic switch | Instant (0% → 100%) | Gradual (5% → 100%) |
| Resource cost | High (a duplicate environment) | Low (scaled out gradually) |
| Risk | Medium (all users affected at once) | Low (risk is exposed gradually) |
| Rollback speed | Fast (instant switch back) | Slower (weights must be adjusted) |
| Complexity | Low | High (needs monitoring and decision logic) |
| Best for | Major version updates | Routine iteration |
An analogy:

**Blue-green deployment = changing lanes**
- Jump from the left lane to the right lane in one move
- Needs two full lanes (higher cost)
- Jumping back is just as fast

**Gray release = merging gradually**
- Ease over from the left lane into the right lane
- Doesn't need two full lanes
- Safer, but it takes time
### How do you automate a gray release?

Answer:

Core building blocks:

1. **Traffic control:**
   - Istio VirtualService
   - Nginx upstream weights
   - Kubernetes Service

2. **Metrics to watch:**
   - Error rate (< 1%)
   - Latency (P99 < 500 ms)
   - Success rate (> 99%)

3. **Automated decisions:**
   - Metrics healthy → increase traffic
   - Metrics abnormal → roll back
   - All stages pass → finish the release

4. **Alerting:**
   - Slack/DingTalk notifications
   - Email
   - PagerDuty alerts
Implementation options:

**Option 1: Flagger (recommended)**

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  progressDeadlineSeconds: 60
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
    - name: request-duration
      thresholdRange:
        max: 500
```
**Option 2: A homegrown controller**

```python
# canary_controller.py
import time
import requests

class CanaryController:
    def __init__(self, service_name):
        self.service_name = service_name
        self.stages = [5, 10, 25, 50, 100]

    def _query_prometheus(self, query):
        # Let requests URL-encode the PromQL expression
        resp = requests.get('http://prometheus:9090/api/v1/query',
                            params={'query': query})
        result = resp.json()['data']['result']
        return float(result[0]['value'][1]) if result else 0.0

    def get_error_rate(self, version):
        # 5xx request rate divided by total request rate
        return self._query_prometheus(
            f'rate(http_requests_total{{version="{version}",status=~"5.."}}[1m])'
            f' / rate(http_requests_total{{version="{version}"}}[1m])')

    def get_latency(self, version):
        # P99 latency, converted to milliseconds
        return self._query_prometheus(
            f'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket'
            f'{{version="{version}"}}[1m]))') * 1000

    def update_traffic_weight(self, weight):
        # Update the Istio VirtualService (e.g., shell out to kubectl patch)
        pass

    def rollback(self):
        # Route everything back to the old version
        self.update_traffic_weight(0)
        self.notify("Rollback triggered!")

    def notify(self, message):
        # Send a notification to Slack
        requests.post('https://hooks.slack.com/...', json={'text': message})

    def run(self):
        for weight in self.stages:
            print(f"Canary stage: {weight}%")
            self.update_traffic_weight(weight)

            # Observe for 5 minutes before judging the stage
            time.sleep(300)

            error_rate = self.get_error_rate('v2')
            latency = self.get_latency('v2')
            if error_rate > 0.01:
                print(f"Error rate too high: {error_rate}")
                self.rollback()
                return False
            if latency > 500:
                print(f"Latency too high: {latency}ms")
                self.rollback()
                return False
            print(f"Stage {weight}% passed!")

        self.notify("Canary rollout completed!")
        return True

# Usage
controller = CanaryController('myapp')
controller.run()
```
### How do you choose which users go into a gray release?

Answer:

Common strategies:

**1. By percentage (random)**

```yaml
# 10% of users at random
spec:
  http:
  - route:
    - destination:
        subset: v1
      weight: 90
    - destination:
        subset: v2
      weight: 10
```

**2. By user ID**

```yaml
# users whose ID ends in 0
spec:
  http:
  - match:
    - headers:
        user-id:
          regex: ".*0$"
    route:
    - destination:
        subset: v2
```

**3. By region**

```yaml
# users in the Beijing region
spec:
  http:
  - match:
    - headers:
        x-region:
          exact: "beijing"
    route:
    - destination:
        subset: v2
```

**4. Allowlist (internal users)**

```yaml
# a dedicated test account
spec:
  http:
  - match:
    - headers:
        user-id:
          exact: "test-user-001"
    route:
    - destination:
        subset: v2
```

**5. Cookie (opted-in users)**

```yaml
spec:
  http:
  - match:
    - headers:
        cookie:
          regex: ".*beta_tester=true.*"
    route:
    - destination:
        subset: v2
```
Recommended progression:

- Stage 1: allowlist (internal testing)
- Stage 2: small random percentage (5%)
- Stage 3: a specific region (one with low traffic)
- Stage 4: full rollout (100%)
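These stages can live in one VirtualService, since Istio evaluates match rules top-down before falling through to the weighted default. A sketch combining the allowlist with the 5% random split (header name and subsets carried over from the examples above):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - myapp
  http:
  # Stage 1: allowlisted internal testers always get v2
  - match:
    - headers:
        user-id:
          exact: "test-user-001"
    route:
    - destination:
        host: myapp
        subset: v2
  # Stage 2: everyone else is split 95/5
  - route:
    - destination:
        host: myapp
        subset: v1
      weight: 95
    - destination:
        host: myapp
        subset: v2
      weight: 5
```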