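# 12 - Production Checklist

Kubernetes pre-launch checklist and best practices

## Learning objectives

After completing this module, you will be able to work with:

- The production go-live checklist
- High-availability architecture requirements
- Security baseline standards
- Monitoring and alerting conventions
- Backup and restore procedures
- Writing SRE runbooks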

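## 1. Pre-launch checklist

### 1.1 Control plane checklist

etcd configuration

Checklist:
□ etcd cluster has at least 3 members (an odd number)
□ etcd runs on SSD storage
□ etcd data directory is on a dedicated disk
□ Automatic backups are configured (at least every 6 hours)
□ Backups are stored off-site
□ The restore procedure is tested regularly (monthly)
□ Alerts are configured for the backend storage quota
□ TLS encryption is enabled
□ Defragmentation (defrag) runs on a schedule
□ etcd performance metrics are monitored (latency, IOPS)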

---
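# etcd configuration example (10.0.1.10 stands in for this node's IP)
apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: kube-system
spec:
  # Assumption: a static Pod on the host network, as kubeadm sets it up.
  hostNetwork: true
  containers:
  - name: etcd
    image: registry.k8s.io/etcd:3.5.9-0
    command:
    - etcd
    - --data-dir=/var/lib/etcd
    # Advertised URLs must be addresses that clients and peers can reach, never 0.0.0.0.
    - --advertise-client-urls=https://10.0.1.10:2379
    - --listen-client-urls=https://0.0.0.0:2379
    - --initial-advertise-peer-urls=https://10.0.1.10:2380
    - --listen-peer-urls=https://0.0.0.0:2380
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --client-cert-auth=true
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-client-cert-auth=true
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --quota-backend-bytes=8589934592  # 8 GiB
    volumeMounts:
    - name: etcd-data
      mountPath: /var/lib/etcd
    - name: etcd-certs
      mountPath: /etc/kubernetes/pki/etcd
      readOnly: true
  # The volumes referenced by volumeMounts must be declared, otherwise the Pod is rejected.
  volumes:
  - name: etcd-data
    hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
  - name: etcd-certs
    hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate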
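
The odd-member item in the checklist follows from quorum arithmetic: a cluster of n members needs floor(n/2) + 1 of them to agree, so a fourth member adds no fault tolerance over three. A minimal sketch of that arithmetic, where the function names are illustrative and not part of any etcd tool:

```shell
#!/bin/sh
# Quorum size for an etcd cluster of N members: floor(N/2) + 1.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster still has quorum.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print a small table for common cluster sizes.
for n in 1 2 3 4 5; do
  printf 'members=%s quorum=%s tolerated_failures=%s\n' \
    "$n" "$(etcd_quorum "$n")" "$(etcd_fault_tolerance "$n")"
done
```

Sizes 3 and 4 both tolerate one failure, which is why even member counts are usually avoided.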

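API Server configuration

Checklist:
□ At least 2 API Server replicas
□ Audit logging is enabled
□ Admission controllers are configured
□ RBAC is enabled
□ Request throttling is configured (API Priority and Fairness)
□ TLS is enabled
□ Resource limits are configured
□ Event TTL is set
□ Timeout parameters are configured
□ API Server metrics are monitored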

---
# API Server flags (excerpt: image, volumes, and probes are omitted)
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --advertise-address=10.0.1.10
    - --enable-admission-plugins=NodeRestriction,PodSecurity,ResourceQuota,LimitRanger
    - --audit-policy-file=/etc/kubernetes/audit-policy.yaml
    - --audit-log-path=/var/log/kubernetes/audit.log
    - --audit-log-maxage=30
    - --audit-log-maxbackup=10
    - --audit-log-maxsize=100
    - --enable-bootstrap-token-auth=true
    - --authorization-mode=Node,RBAC
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --enable-aggregator-routing=true
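    - --etcd-servers=https://127.0.0.1:2379
    # etcd above enforces client certificates, so the API server must present one (kubeadm default paths shown).
    - --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt
    - --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt
    - --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key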
    - --event-ttl=1h
    - --max-requests-inflight=400
    - --max-mutating-requests-inflight=200
    - --request-timeout=1m0s
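
The `--audit-policy-file` flag above points at a policy file that this section does not show. As a sketch only, the following writes a small `audit.k8s.io/v1` policy. The rule choices (metadata only for Secrets and ConfigMaps, full bodies for RBAC changes) and the `write_audit_policy` helper name are assumptions for illustration, not a recommended baseline.

```shell
#!/bin/sh
# Write an example audit policy to the given path (default: ./audit-policy.yaml).
write_audit_policy() {
  cat > "${1:-./audit-policy.yaml}" <<'EOF'
apiVersion: audit.k8s.io/v1
kind: Policy
# Skip the RequestReceived stage to reduce log volume.
omitStages:
  - "RequestReceived"
rules:
  # Never log Secret or ConfigMap bodies; record metadata only.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # Record full request and response bodies for RBAC changes.
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "rbac.authorization.k8s.io"
  # Everything else: metadata only.
  - level: Metadata
EOF
}
```

Rules are evaluated in order and the first match wins, so the Secret rule has to stay above the catch-all.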

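### 1.2 Node checklist

Checklist:
□ At least 3 worker nodes
□ Nodes are spread across availability zones
□ Node labels are set (zone, instance-type)
□ Node taints are set (for special workloads)
□ kubelet configuration is tuned
□ Container runtime is configured (containerd/CRI-O)
□ Kernel parameters are tuned (sysctl)
□ Disk monitoring and alerting are in place
□ Node autoscaling is configured
□ Node maintenance windows are defined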

---
# kubelet configuration
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true
authorization:
  mode: Webhook
cgroupDriver: systemd
clusterDNS:
  - 10.96.0.10
clusterDomain: cluster.local
containerLogMaxSize: 50Mi
containerLogMaxFiles: 5
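cpuManagerPolicy: static
# The static policy requires a non-zero CPU reservation, otherwise kubelet refuses to start.
# The values below are examples.
kubeReserved:
  cpu: 500m
  memory: 512Mi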
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
evictionSoft:
  memory.available: "500Mi"
  nodefs.available: "15%"
evictionSoftGracePeriod:
  memory.available: "1m30s"
  nodefs.available: "2m"
maxPods: 110
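
The hard eviction threshold also reduces what the scheduler can place on a node: allocatable memory is the node capacity minus the kubelet and system reservations minus the hard eviction threshold. A rough sketch of that subtraction, where the 16 GiB node size and the 512 Mi reservations are hypothetical values:

```shell
#!/bin/sh
# Allocatable memory (Mi) = capacity - kubeReserved - systemReserved - hard eviction threshold.
allocatable_mi() {
  capacity=$1 kube_reserved=$2 system_reserved=$3 eviction_hard=$4
  echo $(( capacity - kube_reserved - system_reserved - eviction_hard ))
}

# Example: a 16 GiB node (16384 Mi) with hypothetical 512 Mi reservations
# and the 200 Mi memory.available hard threshold from the configuration above.
allocatable_mi 16384 512 512 200
```

On a real node, compare the result with the `Allocatable` section of `kubectl describe node <node-name>`.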

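## 2. Security checklist

### 2.1 RBAC configuration

Checklist:
□ Automatic token mounting is disabled for the default ServiceAccount
□ Each application has a dedicated ServiceAccount
□ Roles/ClusterRoles follow the principle of least privilege
□ Permissions are reviewed regularly
□ Anonymous access is disabled
□ User authentication is configured (OIDC/LDAP)
□ Use of cluster-admin is restricted
□ RBAC changes are audited
□ RoleBinding is preferred over ClusterRoleBinding
□ Unused ServiceAccounts are cleaned up regularly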

---
# Least-privilege ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
  namespace: production
automountServiceAccountToken: false
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-role
  namespace: production
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["app-secret"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-rolebinding
  namespace: production
subjects:
- kind: ServiceAccount
  name: app-sa
  namespace: production
roleRef:
  kind: Role
  name: app-role
  apiGroup: rbac.authorization.k8s.io
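
One way to review what a binding really grants is `kubectl auth can-i` with impersonation. The wrapper below is a sketch: the `sa_can_i` name and the overridable `KUBECTL` variable are assumptions for illustration, and the caller needs permission to impersonate ServiceAccounts.

```shell
#!/bin/sh
# KUBECTL can be overridden (for example with a wrapper, or a stub in tests).
KUBECTL=${KUBECTL:-kubectl}

# Ask the API server whether a ServiceAccount may perform VERB on RESOURCE.
# Usage: sa_can_i <namespace> <serviceaccount> <verb> <resource>
sa_can_i() {
  "$KUBECTL" auth can-i "$3" "$4" \
    --as="system:serviceaccount:$1:$2" \
    --namespace "$1"
}

# Checks that match the Role above (run against a real cluster):
#   sa_can_i production app-sa get configmaps      # granted by app-role
#   sa_can_i production app-sa delete configmaps   # not granted
```

Running the same checks after every RBAC change is a cheap way to cover the periodic permission review item in the checklist.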

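### 2.2 Pod Security configuration

Checklist:
□ Pod Security Admission is enabled
□ Namespaces are labeled with a Pod Security level
□ Privileged containers are forbidden
□ hostPath mounts are restricted
□ hostNetwork/hostPID/hostIPC are forbidden
□ Containers must run as a non-root user
□ securityContext is configured
□ Capabilities are restricted
□ AppArmor/SELinux is enabled
□ Root filesystem is read-only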

---
# Namespace-level Pod Security configuration
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Hardened Pod configuration
apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
  namespace: production
spec:
  serviceAccountName: app-sa
  automountServiceAccountToken: false
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp:v1.0
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000
      capabilities:
        drop:
        - ALL
    volumeMounts:
    - name: tmp
      mountPath: /tmp
  volumes:
  - name: tmp
    emptyDir: {}

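### 2.3 Network security configuration

Checklist:
□ Every namespace has a NetworkPolicy
□ All ingress traffic is denied by default
□ Allow rules are written as an allowlist
□ Egress traffic is restricted
□ Namespace isolation is configured
□ A service mesh is used (optional)
□ TLS-encrypted communication is enabled
□ Ingress TLS is configured
□ Access to sensitive services is restricted
□ Network policies are reviewed regularly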

---
# Default-deny policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
---
# Allow specific traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: production
spec:
  podSelector:
    matchLabels:
      tier: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          tier: frontend
    ports:
    - protocol: TCP
      port: 8080
---
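# Restrict egress: allow only DNS lookups to kube-system
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          # Set automatically on every namespace; a plain "name" label is not.
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    # DNS falls back to TCP for large responses.
    - protocol: TCP
      port: 53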
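
The checklist asks for a policy in every namespace, while the manifests above cover only `production`. A small sketch of generating the same default-deny policy per namespace follows. The `render_default_deny` helper is hypothetical, and the commented loop assumes that skipping `kube-*` namespaces is appropriate for your cluster.

```shell
#!/bin/sh
# Print a default-deny NetworkPolicy manifest for the given namespace.
render_default_deny() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: $1
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF
}

# Hypothetical usage against a cluster (review the namespace list first):
#   for ns in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
#     case "$ns" in kube-*) continue ;; esac
#     render_default_deny "$ns" | kubectl apply -f -
#   done
```

Because the policy also denies egress, each namespace then needs its own DNS allow rule before workloads can resolve names.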

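## 3. Monitoring and alerting checklist

### 3.1 Monitoring metrics

Checklist:
□ Prometheus + Grafana are deployed
□ kube-state-metrics is configured
□ node-exporter is configured
□ API Server metrics are monitored
□ etcd metrics are monitored
□ Controller manager metrics are monitored
□ Scheduler metrics are monitored
□ Application metrics scraping is configured
□ Custom metrics are configured
□ A data retention policy is set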

---
# ServiceMonitor configuration
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: production
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

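### 3.2 Alerting rules

Checklist:
□ Cluster-level alerts are configured
□ Node-level alerts are configured
□ Pod-level alerts are configured
□ Application-level alerts are configured
□ SLO alerts are configured
□ Alert severities are defined (Critical/Warning)
□ Alert routing is configured
□ Alert inhibition rules are configured
□ Alert notifications are tested
□ Alert rules are reviewed regularly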

---
# Key alerting rules
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-alerts
  namespace: monitoring
data:
  alerts.yml: |
    groups:
    - name: critical-alerts
      interval: 30s
      rules:
      # API Server availability
      - alert: APIServerDown
        expr: up{job="kubernetes-apiservers"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API Server is down"
          description: "API Server has been down for more than 5 minutes."
      
      # etcd health
      - alert: EtcdClusterDown
        expr: up{job="etcd"} == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "etcd cluster is down"
      
      # Node NotReady
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is not ready"
      
      # PVC running out of space
      - alert: PVCAlmostFull
        expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} is almost full"
      
      # Pod restarting frequently
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"

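## 4. Backup and restore checklist

### 4.1 Backup strategy

Checklist:
□ etcd is backed up automatically every 6 hours
□ etcd backups are stored off-site
□ Application data is backed up with Velero
□ A backup retention policy is configured
□ The restore procedure is tested regularly
□ Backups are encrypted at rest
□ Backup job status is monitored
□ Alerts fire on backup failure
□ The restore procedure is documented
□ Disaster recovery drills are run regularly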
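
The etcd items in this list (scheduled snapshots plus a retention policy) can be sketched as a small script. Everything here is an assumption to adapt: the backup directory, the helper names, and the use of the kubeadm-style certificate paths from the etcd example earlier. Copying snapshots off-site and encrypting them are not covered.

```shell
#!/bin/sh
# Assumed locations; adjust to your cluster.
BACKUP_DIR=${BACKUP_DIR:-/var/backups/etcd}
PKI=${PKI:-/etc/kubernetes/pki/etcd}

# Snapshot file name with a sortable UTC timestamp.
snapshot_name() {
  echo "etcd-snapshot-$(date -u +%Y%m%dT%H%M%SZ).db"
}

# Take a snapshot from the local member (needs etcdctl v3 on the node).
backup_etcd() {
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert="$PKI/ca.crt" --cert="$PKI/server.crt" --key="$PKI/server.key" \
    snapshot save "$BACKUP_DIR/$(snapshot_name)"
}

# Keep only the newest KEEP snapshots in DIR; the names sort chronologically.
prune_snapshots() (
  dir=$1 keep=$2
  total=$(find "$dir" -maxdepth 1 -name 'etcd-snapshot-*.db' | wc -l)
  excess=$(( total - keep ))
  [ "$excess" -gt 0 ] || exit 0
  find "$dir" -maxdepth 1 -name 'etcd-snapshot-*.db' | sort | head -n "$excess" |
    while IFS= read -r f; do rm -f -- "$f"; done
)

# Example cron entry for the 6-hour interval (the script path is hypothetical):
#   0 */6 * * * /usr/local/bin/etcd-backup.sh
```

A script run from cron would call `backup_etcd` and then `prune_snapshots "$BACKUP_DIR" 28`, which keeps roughly one week of 6-hourly snapshots.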

---
# Velero backup schedule
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # every day at 02:00
  template:
    includedNamespaces:
    - production
    - staging
    excludedResources:
    - events
    - events.events.k8s.io
    ttl: 720h  # keep for 30 days
    snapshotVolumes: true
    storageLocation: default
    hooks:
      resources:
      - name: postgres-backup-hook
        includedNamespaces:
        - production
        labelSelector:
          matchLabels:
            app: postgres
        pre:
        - exec:
            container: postgres
            command:
            - /bin/bash
            - -c
            - "PGPASSWORD=$POSTGRES_PASSWORD pg_dump -U $POSTGRES_USER $POSTGRES_DB > /tmp/backup.sql"
            onError: Fail
            timeout: 3m

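## 5. Application checklist

### 5.1 Workload configuration

Checklist:
□ Reasonable requests and limits are set
□ HPA is configured (if needed)
□ PDB is configured
□ Health check probes are configured
□ Graceful shutdown is configured
□ Image pull policy is set
□ Resource quotas are configured
□ Anti-affinity is set (for high availability)
□ Topology spread constraints are configured
□ Init containers are used (if needed)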

---
# Production-grade Deployment configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: production
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
        version: v1.0
    spec:
      serviceAccountName: web-sa
      automountServiceAccountToken: false
      terminationGracePeriodSeconds: 60
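      affinity:
        podAntiAffinity:
          # Hard anti-affinity allows one replica per node; if the HPA maxReplicas exceeds the node count, use preferredDuringSchedulingIgnoredDuringExecution instead.
          requiredDuringSchedulingIgnoredDuringExecution: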
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web
            topologyKey: kubernetes.io/hostname
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: web
        image: myregistry.com/web:v1.0
        imagePullPolicy: IfNotPresent
        ports:
        - name: http
          containerPort: 8080
        env:
        - name: PORT
          value: "8080"
        - name: ENVIRONMENT
          value: "production"
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        startupProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 30
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          periodSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
          failureThreshold: 3
        lifecycle:
          preStop:
            exec:
              command:
              - sh
              - -c
              # Wait for load balancers to drop the Pod; kubelet sends SIGTERM itself once the hook returns.
              - "sleep 15"
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
          capabilities:
            drop:
            - ALL
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /app/cache
      volumes:
      - name: tmp
        emptyDir: {}
      - name: cache
        emptyDir: {}
---
# PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
---
# HorizontalPodAutoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 15
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max
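
To see how the targets above translate into replica counts, recall the documented HPA rule: desired replicas = ceil(current replicas × current metric / target metric), clamped to the configured bounds. The sketch below covers only that core formula for a single metric. It ignores the controller's tolerance band, the `behavior` policies, and the fact that with several metrics the largest result wins.

```shell
#!/bin/sh
# desired = ceil(current_replicas * current_metric / target_metric), clamped to [min, max].
hpa_desired_replicas() (
  current=$1 metric=$2 target=$3 min=$4 max=$5
  desired=$(( (current * metric + target - 1) / target ))
  [ "$desired" -lt "$min" ] && desired=$min
  [ "$desired" -gt "$max" ] && desired=$max
  echo "$desired"
)

# 3 replicas averaging 90% CPU against the 70% target, bounds 3..10 as in web-hpa.
hpa_desired_replicas 3 90 70 3 10
```

With these inputs the function prints 4, because ceil(3 × 90 / 70) = ceil(3.86) = 4.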

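## 6. SRE runbook templates

### 6.1 Deployment runbook

# Application deployment runbook

## Pre-deployment checks
- [ ] Code has passed all tests
- [ ] Code has passed code review
- [ ] Image has been built and pushed to the registry
- [ ] Configuration files have been updated
- [ ] Database migrations are ready (if needed)
- [ ] Relevant teams have been notified

## Deployment steps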

### 1. Back up the current state
```bash
# Back up the current Deployment
kubectl get deployment web-app -n production -o yaml > backup-deployment.yaml

# Record the current image version
kubectl get deployment web-app -n production -o jsonpath='{.spec.template.spec.containers[0].image}'
```

### 2. Update configuration
```bash
# Update the ConfigMap (if needed)
kubectl apply -f configmap.yaml

# Update the Secret (if needed)
kubectl apply -f secret.yaml
```

### 3. Run the deployment
```bash
# Apply the new version
kubectl apply -f deployment.yaml

# Watch the rollout progress
kubectl rollout status deployment/web-app -n production

# Watch Pod status live
kubectl get pods -n production -w -l app=web
```

### 4. Verify the deployment
```bash
# Check Pod status
kubectl get pods -n production -l app=web

# Check the Service endpoints
kubectl get endpoints web-service -n production

# Run a smoke test (-f makes curl fail on HTTP errors)
curl -fsS https://web.example.com/health
```

### 5. Rollback steps (if needed)
```bash
# Show the rollout history
kubectl rollout history deployment/web-app -n production

# Roll back to the previous revision
kubectl rollout undo deployment/web-app -n production

# Roll back to a specific revision
kubectl rollout undo deployment/web-app -n production --to-revision=2
```
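
Steps 3 and 5 are often chained so that a failed rollout is undone without waiting for a human. A minimal sketch of that idea follows. The `deploy_with_rollback` name, the default 300s timeout, and the overridable `KUBECTL` variable are assumptions, while the kubectl subcommands are the ones used in the steps above.

```shell
#!/bin/sh
# KUBECTL can be overridden (for example with a stub in tests).
KUBECTL=${KUBECTL:-kubectl}

# Apply a manifest, wait for the rollout, and undo it if the rollout fails.
# Usage: deploy_with_rollback <manifest> <deployment> <namespace> [timeout]
deploy_with_rollback() (
  manifest=$1 deployment=$2 namespace=$3 timeout=${4:-300s}
  "$KUBECTL" apply -f "$manifest" -n "$namespace" || exit 1
  if "$KUBECTL" rollout status "deployment/$deployment" -n "$namespace" --timeout="$timeout"; then
    echo "rollout of $deployment succeeded"
  else
    echo "rollout of $deployment failed; rolling back" >&2
    "$KUBECTL" rollout undo "deployment/$deployment" -n "$namespace"
    exit 1
  fi
)

# Example: deploy_with_rollback deployment.yaml web-app production 300s
```

The function returns non-zero after a rollback, so a CI job that calls it still fails visibly.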

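## Post-deployment checks

- [ ] All Pods are in the Running state
- [ ] Service endpoints are healthy
- [ ] Health checks pass
- [ ] Monitoring metrics are normal
- [ ] Logs show no unexpected errors
- [ ] Performance tests pass

## Rollback triggers

- Pods fail to start
- Health check failure rate > 10%
- Error rate > 1%
- P99 response time exceeds its threshold
- Business functionality is broken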
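
The two numeric triggers in this list can be checked mechanically once the counts are available. The sketch below assumes you already have the raw counts, for example from your metrics system. The function names are hypothetical, and integer arithmetic is used so that plain `sh` is enough.

```shell
#!/bin/sh
# Succeed when failed/total is strictly greater than LIMIT_PERCENT (integer math).
exceeds_percent() {
  failed=$1 total=$2 limit_percent=$3
  [ "$total" -gt 0 ] && [ $(( failed * 100 )) -gt $(( total * limit_percent )) ]
}

# Succeed (meaning: roll back) when either threshold from the list is crossed.
# $1=request errors  $2=requests  $3=failed health checks  $4=health checks
should_rollback() {
  exceeds_percent "$1" "$2" 1 || exceeds_percent "$3" "$4" 10
}

# Example: 20 errors in 1000 requests is 2%, above the 1% trigger.
if should_rollback 20 1000 0 50; then
  echo "rollback recommended"
fi
```

The latency and business-function triggers still need a human or a dedicated check; this only covers the two ratios.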


### 6.2 Incident response runbook

# Incident response runbook

## 1. Pod fails to start

### Diagnostic steps
```bash
# Check Pod status
kubectl get pods -n production

# Show details
kubectl describe pod <pod-name> -n production

# List events, oldest first
kubectl get events -n production --sort-by='.lastTimestamp'

# Show logs of the current and the previous container instance
kubectl logs <pod-name> -n production
kubectl logs <pod-name> -n production --previous
```

### Common problems and fixes

ImagePullBackOff

```bash
# Check the image name
kubectl get pod <pod-name> -n production -o jsonpath='{.spec.containers[*].image}'

# Verify that the image exists (run on the affected node)
crictl pull <image-name>

# Check imagePullSecrets
kubectl get secret <secret-name> -n production
```

CrashLoopBackOff

```bash
# Show logs of the crashed container
kubectl logs <pod-name> -c <container-name> -n production --previous

# Check resource limits
kubectl get pod <pod-name> -n production -o jsonpath='{.spec.containers[*].resources}'

# Attach an ephemeral debug container that shares the target container's process namespace
kubectl debug <pod-name> -n production -it --image=busybox --target=<container-name>
```

## 2. Service unreachable

### Diagnostic steps
```bash
# Check the Service
kubectl get svc <service-name> -n production
kubectl describe svc <service-name> -n production

# Check the Endpoints
kubectl get endpoints <service-name> -n production

# Test connectivity from a throwaway Pod
kubectl run test --rm -it --restart=Never --image=nicolaka/netshoot -- curl http://<service-name>.<namespace>.svc.cluster.local
```

### Fixes

- Verify that the Service selector matches the Pod labels
- Check the Pod readinessProbe status
- Check the NetworkPolicy configuration
- Verify DNS resolution

## 3. Node problems

### Diagnostic steps
```bash
# Check node status
kubectl get nodes
kubectl describe node <node-name>

# Check resource usage
kubectl top node <node-name>

# Follow the kubelet logs on the node
ssh <node-name>
journalctl -u kubelet -f
```

### Maintenance steps
```bash
# Mark the node unschedulable
kubectl cordon <node-name>

# Evict Pods
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Perform the maintenance

# Return the node to service
kubectl uncordon <node-name>
```

## Escalation contacts

- L1 Support: support@example.com
- L2 SRE Team: sre-oncall@example.com
- L3 Dev Team: dev-oncall@example.com


## 7. Performance baselines

### 7.1 Resource baselines

```yaml
# Small application (QPS < 100)
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
---
# Medium application (QPS 100-1000)
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi
---
# Large application (QPS > 1000)
resources:
  requests:
    cpu: 2000m
    memory: 4Gi
  limits:
    cpu: 4000m
    memory: 8Gi
```

### 7.2 SLO baselines

```yaml
# Service level objectives
SLO:
  availability: "99.9%"      # about 43.2 minutes of downtime per 30-day month
  latency_p50: "< 100ms"
  latency_p95: "< 500ms"
  latency_p99: "< 1000ms"
  error_rate: "< 0.1%"

# Alert thresholds
Alert Thresholds:
  availability: "< 99.5%"    # triggers an alert
  latency_p99: "> 2000ms"
  error_rate: "> 1%"
```
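
The downtime figure attached to the availability target is simple arithmetic: the allowed downtime is (100 − availability) percent of the window. A small sketch, assuming a 30-day window and using `awk` for the floating-point math (the function name is illustrative):

```shell
#!/bin/sh
# Allowed downtime in minutes for an availability target (percent) over a window (days).
downtime_budget_minutes() {
  awk -v a="$1" -v d="$2" 'BEGIN { printf "%.1f\n", (100 - a) / 100 * d * 24 * 60 }'
}

downtime_budget_minutes 99.9 30   # SLO target over a 30-day month
downtime_budget_minutes 99.5 30   # alert threshold over the same window
```

This prints 43.2 and 216.0: by the time availability has fallen to the 99.5% alert threshold, five times the monthly budget implied by the 99.9% target has already been spent.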

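## 8. Final checklist

Final pre-launch check

Control plane:
  □ etcd cluster is healthy (3+ members)
  □ API Server is highly available (2+ replicas)
  □ Controller manager is healthy
  □ Scheduler is healthy
  □ TLS is enabled on all components

Nodes:
  □ At least 3 worker nodes
  □ Nodes are spread across availability zones
  □ Resources are sufficient (CPU/memory/disk)
  □ Monitoring and log collection work

Network:
  □ CNI plugin is healthy
  □ CoreDNS is healthy
  □ NetworkPolicy is configured
  □ Ingress is configured correctly
  □ Certificates have sufficient remaining validity

Security:
  □ RBAC configuration is complete
  □ Pod Security is enabled
  □ NetworkPolicy denies by default
  □ Image scanning has been run
  □ Secrets are encrypted at rest

Applications:
  □ Resource requests are set sensibly
  □ Health checks are configured
  □ HPA is configured (if needed)
  □ PDB is configured
  □ Anti-affinity is set

Monitoring:
  □ Prometheus is running
  □ Grafana dashboards are configured
  □ Alert rules are configured
  □ Alert notifications work
  □ Log collection works

Backup:
  □ Automatic etcd backups are configured
  □ Velero backup schedules are configured
  □ Backups are stored off-site
  □ The restore procedure has been tested

Documentation:
  □ Architecture documentation is complete
  □ Runbooks are written
  □ Contact details are up to date
  □ SLA is defined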