09 - Cost and Capacity

A Deep Dive into Kubernetes FinOps and Resource Optimization

Learning Objectives

After working through this module, you will have a grasp of:

  • Kubernetes cost optimization strategies
  • Capacity planning methods
  • Fine-grained Requests/Limits tuning
  • Tiered node pools and Spot instances
  • Rebalancing resources with the Descheduler
  • FinOps best practices

1. FinOps Architecture Overview

The cost optimization stack

┌─────────────────────────────────────────────────────────────┐
│                FinOps Cost Optimization Stack                │
├─────────────────────────────────────────────────────────────┤
│  Resource layer                                              │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐            │
│  │ CPU/memory  │ │ Storage     │ │ Network     │            │
│  │ tuning      │ │ tuning      │ │ tuning      │            │
│  └─────────────┘ └─────────────┘ └─────────────┘            │
├─────────────────────────────────────────────────────────────┤
│  Node layer                                                  │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐            │
│  │ On-demand   │ │ Spot        │ │ Reserved    │            │
│  └─────────────┘ └─────────────┘ └─────────────┘            │
├─────────────────────────────────────────────────────────────┤
│  Scheduling layer                                            │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐            │
│  │ Bin Packing │ │ Descheduler │ │ Autoscaler  │            │
│  └─────────────┘ └─────────────┘ └─────────────┘            │
└─────────────────────────────────────────────────────────────┘

Key cost optimization metrics

| Metric               | Description                                | Target        |
|----------------------|--------------------------------------------|---------------|
| Resource utilization | Actual CPU/memory usage rate               | > 70%         |
| Unit cost            | Monthly cost per CPU core                  | Trending down |
| Waste rate           | Share of requested but unused resources    | < 20%         |
| Spot share           | Fraction of capacity on Spot instances     | > 50%         |
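To make these targets concrete, here is a tiny sketch (with made-up numbers) of how utilization and waste rate fall out of requests, actual usage, and allocatable capacity:

```python
# Hypothetical snapshot of one node pool; all numbers are illustrative.
requested_cores = 80.0     # sum of Pod CPU requests
used_cores = 58.0          # actual usage, e.g. from metrics-server
allocatable_cores = 100.0  # total allocatable CPU in the pool

# Utilization measures how much of the hardware is doing real work.
utilization = used_cores / allocatable_cores
# Waste rate measures how much of what was requested sits idle.
waste_rate = 1 - used_cores / requested_cores

print(f"utilization: {utilization:.1%}, waste rate: {waste_rate:.1%}")
```

In this example the pool misses both targets: utilization is below 70% and more than 20% of the requested capacity is idle.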

2. Resource Optimization Strategies

2.1 Requests/Limits tuning

Resource configuration principles

# Resource configuration strategies per QoS class

# Guaranteed - critical workloads
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  containers:
  - name: app
    image: app:latest
    resources:
      requests:
        cpu: "1000m"
        memory: "2Gi"
      limits:
        cpu: "1000m"    # requests == limits
        memory: "2Gi"

# Burstable - standard workloads
apiVersion: v1
kind: Pod
metadata:
  name: normal-app
spec:
  containers:
  - name: app
    image: app:latest
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "2000m"    # allow bursting
        memory: "4Gi"

# BestEffort - batch jobs
apiVersion: v1
kind: Pod
metadata:
  name: batch-job
spec:
  containers:
  - name: job
    image: job:latest
    # no resources configured
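The QoS class the kubelet assigns follows directly from how requests and limits are set. A simplified sketch of that rule (real Kubernetes also considers init containers and additional resource types):

```python
def qos_class(containers):
    """Simplified QoS classification: Guaranteed when every container
    has cpu and memory requests equal to limits; BestEffort when no
    container sets any requests or limits; Burstable otherwise."""
    any_set = False
    guaranteed = True
    for c in containers:
        req, lim = c.get("requests", {}), c.get("limits", {})
        if req or lim:
            any_set = True
        for res in ("cpu", "memory"):
            if req.get(res) is None or req.get(res) != lim.get(res):
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"

# The three Pods above map to the three classes:
print(qos_class([{"requests": {"cpu": "1000m", "memory": "2Gi"},
                  "limits":   {"cpu": "1000m", "memory": "2Gi"}}]))
print(qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                  "limits":   {"cpu": "2000m", "memory": "4Gi"}}]))
print(qos_class([{}]))
```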

2.2 Using the VPA Recommender

VPA in recommendation-only mode

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Off"  # recommend only, no automatic updates
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2000m
        memory: 4Gi
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits

Fetching VPA recommendations

# Show the recommendation
kubectl get vpa app-vpa -o jsonpath='{.status.recommendation}'

# Pretty-print the recommendation with jq
kubectl get vpa app-vpa -o json | jq '.status.recommendation.containerRecommendations'
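A quick way to act on a recommendation is to compare its `target` against the current request. The JSON below mimics the shape of a VPA status, but the numbers are invented:

```python
import json

# Invented VPA status fragment, shaped like .status.recommendation.
vpa_status = json.loads("""
{"containerRecommendations": [
  {"containerName": "app",
   "target":     {"cpu": "310m", "memory": "820Mi"},
   "lowerBound": {"cpu": "200m", "memory": "512Mi"},
   "upperBound": {"cpu": "600m", "memory": "1Gi"}}
]}
""")

current_cpu_request_m = 500  # the Deployment's current request, in millicores

for rec in vpa_status["containerRecommendations"]:
    target_m = int(rec["target"]["cpu"].rstrip("m"))
    drift = (current_cpu_request_m - target_m) / target_m
    print(f"{rec['containerName']}: request {current_cpu_request_m}m, "
          f"target {target_m}m, over-provisioned by {drift:.0%}")
```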

2.3 Constraining resources with LimitRange

apiVersion: v1
kind: LimitRange
metadata:
  name: resource-limits
  namespace: production
spec:
  limits:
  # Container-level limits
  - type: Container
    default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    max:
      cpu: 2000m
      memory: 2Gi
    min:
      cpu: 50m
      memory: 64Mi
  
  # Pod-level limits
  - type: Pod
    max:
      cpu: 4000m
      memory: 8Gi
    min:
      cpu: 100m
      memory: 128Mi
  
  # PVC limits
  - type: PersistentVolumeClaim
    max:
      storage: 100Gi
    min:
      storage: 1Gi
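LimitRange acts at admission time: missing requests/limits are filled in from `defaultRequest`/`default`, then the result is validated against `min`/`max`. A sketch of that flow for the CPU dimension (plain millicore integers instead of quantity strings, for brevity):

```python
def apply_limit_range(container, lr):
    """Fill in LimitRange defaults for CPU and validate against min/max.
    Simplified: one resource, millicores as plain ints."""
    req = dict(container.get("requests", {}))
    lim = dict(container.get("limits", {}))
    if "cpu" not in lim:
        lim["cpu"] = lr["default"]["cpu"]          # default limit
    if "cpu" not in req:
        req["cpu"] = lr["defaultRequest"]["cpu"]   # default request
    if not (lr["min"]["cpu"] <= req["cpu"] and lim["cpu"] <= lr["max"]["cpu"]):
        raise ValueError("cpu outside LimitRange bounds")
    return {"requests": req, "limits": lim}

# Mirrors the Container section of the LimitRange above, in millicores.
lr = {"default": {"cpu": 500}, "defaultRequest": {"cpu": 100},
      "min": {"cpu": 50}, "max": {"cpu": 2000}}
print(apply_limit_range({}, lr))
```

A container that sets nothing therefore ends up Burstable with a 100m request and 500m limit, rather than BestEffort.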

2.4 Quota management with ResourceQuota

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    # Compute resources
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi

    # Storage
    requests.storage: 100Gi
    persistentvolumeclaims: "10"

    # Object counts
    pods: "50"
    services: "20"
    secrets: "30"
    configmaps: "30"

    # LoadBalancer cap
    services.loadbalancers: "3"

    # NodePort cap
    services.nodeports: "5"

  # Scope
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["high", "medium"]
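The quota check at admission is additive: a new Pod is rejected if current usage plus its requests would exceed any hard limit. A minimal sketch of that rule (plain numbers instead of quantity strings; memory counted in Gi):

```python
def fits_quota(hard, used, request):
    """Would admitting `request` keep every tracked resource within
    the quota's hard limits? Simplified ResourceQuota admission."""
    return all(used.get(k, 0) + request.get(k, 0) <= hard[k] for k in hard)

# Numbers loosely mirror the quota above.
hard = {"requests.cpu": 20, "requests.memory": 40, "pods": 50}
used = {"requests.cpu": 18, "requests.memory": 31, "pods": 42}

print(fits_quota(hard, used, {"requests.cpu": 1, "requests.memory": 2, "pods": 1}))
print(fits_quota(hard, used, {"requests.cpu": 4, "requests.memory": 2, "pods": 1}))
```

The second Pod is refused because it would push `requests.cpu` to 22, over the hard limit of 20.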

3. Node Pool Optimization

3.1 Tiered node pool design

# High-performance pool - critical workloads
apiVersion: v1
kind: Node
metadata:
  name: high-perf-node-1
  labels:
    node-type: high-performance
    workload-type: critical
    cost-tier: premium
spec:
  taints:
  - key: critical-only
    value: "true"
    effect: NoSchedule

---
# General-purpose pool - standard workloads
apiVersion: v1
kind: Node
metadata:
  name: general-node-1
  labels:
    node-type: general
    workload-type: standard
    cost-tier: standard

---
# Spot pool - batch workloads
apiVersion: v1
kind: Node
metadata:
  name: spot-node-1
  labels:
    node-type: spot
    workload-type: batch
    cost-tier: spot
  annotations:
    cluster-autoscaler.kubernetes.io/scale-down-disabled: "false"
spec:
  taints:
  - key: spot-instance
    value: "true"
    effect: NoSchedule

3.2 Steering Pods to the right node pool

# Critical workload - high-performance nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical
  template:
    metadata:
      labels:
        app: critical
    spec:
      nodeSelector:
        node-type: high-performance
        cost-tier: premium
      tolerations:
      - key: critical-only
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: app
        image: critical-app:latest
        resources:
          requests:
            cpu: "2000m"
            memory: "4Gi"
          limits:
            cpu: "2000m"
            memory: "4Gi"

---
# Batch job - Spot nodes
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
spec:
  template:
    spec:
      nodeSelector:
        node-type: spot
        cost-tier: spot
      tolerations:
      - key: spot-instance
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: job
        image: batch-job:latest
        resources:
          requests:
            cpu: "1000m"
            memory: "2Gi"
      restartPolicy: OnFailure
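The payoff of steering batch work onto Spot can be estimated with a back-of-the-envelope calculation. Prices here are hypothetical; real Spot discounts vary by provider, region, and instance type:

```python
on_demand_price = 0.40  # $/hour per node (hypothetical)
spot_price = 0.12       # $/hour for the same node shape on Spot (hypothetical)

nodes = 30
batch_share = 0.6  # fraction of capacity that tolerates Spot interruptions

baseline = nodes * on_demand_price
mixed = nodes * ((1 - batch_share) * on_demand_price + batch_share * spot_price)
savings = 1 - mixed / baseline
print(f"hourly cost: ${baseline:.2f} -> ${mixed:.2f}, savings {savings:.0%}")
```

Under these assumptions, moving 60% of the fleet to Spot cuts the hourly bill by roughly 40%.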

3.3 Cluster Autoscaler configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --namespace=kube-system
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
        - --balance-similar-node-groups
        - --skip-nodes-with-system-pods=false
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m
        - --scale-down-utilization-threshold=0.5
        - --max-node-provision-time=15m
        - --expander=least-waste
        env:
        - name: AWS_REGION
          value: us-east-1
        resources:
          requests:
            cpu: 100m
            memory: 300Mi
          limits:
            cpu: 100m
            memory: 300Mi
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
rules:
- apiGroups: [""]
  resources: ["events", "endpoints"]
  verbs: ["create", "patch"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["pods/status"]
  verbs: ["update"]
- apiGroups: [""]
  resources: ["endpoints"]
  resourceNames: ["cluster-autoscaler"]
  verbs: ["get", "update"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["watch", "list", "get", "update"]
- apiGroups: [""]
  resources: ["pods", "services", "replicationcontrollers", "persistentvolumeclaims", "persistentvolumes"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["extensions"]
  resources: ["replicasets", "daemonsets"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["policy"]
  resources: ["poddisruptionbudgets"]
  verbs: ["watch", "list"]
- apiGroups: ["apps"]
  resources: ["statefulsets", "replicasets", "daemonsets"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["batch"]
  resources: ["jobs", "cronjobs"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["create"]
- apiGroups: ["coordination.k8s.io"]
  resourceNames: ["cluster-autoscaler"]
  resources: ["leases"]
  verbs: ["get", "update"]
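The `--scale-down-utilization-threshold=0.5` flag above means a node whose summed Pod requests fall below 50% of its allocatable capacity becomes a scale-down candidate. A simplified model of that test (the real autoscaler also checks PDBs, local storage, annotations, and whether the Pods can be rescheduled elsewhere):

```python
def scale_down_candidate(pod_requests_m, allocatable_m, threshold=0.5):
    """Return (is_candidate, utilization) for one node, where utilization
    is summed Pod CPU requests divided by allocatable CPU (millicores)."""
    utilization = sum(pod_requests_m) / allocatable_m
    return utilization < threshold, utilization

print(scale_down_candidate([500, 300, 400], 4000))  # lightly loaded node
print(scale_down_candidate([1500, 1200], 4000))     # busy node
```

Note the check uses requests, not live usage: over-requested Pods keep nodes alive even when they sit idle, which is why request tuning and scale-down work together.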

4. Rebalancing Resources with the Descheduler

4.1 Deploying the Descheduler

apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy
  namespace: kube-system
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      # Evict Pods violating inter-pod anti-affinity
      "RemovePodsViolatingInterPodAntiAffinity":
        enabled: true

      # Evict Pods violating node affinity
      "RemovePodsViolatingNodeAffinity":
        enabled: true
        params:
          nodeAffinityType:
          - "requiredDuringSchedulingIgnoredDuringExecution"

      # Evict Pods violating node taints
      "RemovePodsViolatingNodeTaints":
        enabled: true

      # Move Pods off overutilized nodes toward underutilized ones
      "LowNodeUtilization":
        enabled: true
        params:
          nodeResourceUtilizationThresholds:
            thresholds:
              cpu: 20
              memory: 20
              pods: 20
            targetThresholds:
              cpu: 50
              memory: 50
              pods: 50

      # Evict duplicate Pods
      "RemoveDuplicates":
        enabled: true

      # Evict Pods that restart too often
      "RemovePodsHavingTooManyRestarts":
        enabled: true
        params:
          podsHavingTooManyRestarts:
            podRestartThreshold: 100
            includingInitContainers: true

      # Pod lifetime management
      "PodLifeTime":
        enabled: true
        params:
          podLifeTime:
            maxPodLifeTimeSeconds: 86400  # 24 hours
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: descheduler
  namespace: kube-system
spec:
  schedule: "*/30 * * * *"  # run every 30 minutes
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: descheduler
          containers:
          - name: descheduler
            image: registry.k8s.io/descheduler/descheduler:v0.28.0
            command:
            - /bin/descheduler
            - --policy-config-file=/policy/policy.yaml
            - --descheduling-interval=5m
            - --v=3
            volumeMounts:
            - name: policy
              mountPath: /policy
            resources:
              requests:
                cpu: 100m
                memory: 128Mi
              limits:
                cpu: 500m
                memory: 256Mi
          restartPolicy: Never
          volumes:
          - name: policy
            configMap:
              name: descheduler-policy
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: descheduler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: descheduler
rules:
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "update"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list", "delete"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
- apiGroups: ["apps"]
  resources: ["replicasets", "statefulsets"]
  verbs: ["get"]
- apiGroups: ["policy"]
  resources: ["poddisruptionbudgets"]
  verbs: ["get"]
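The thresholds in the LowNodeUtilization policy split nodes into three groups: a node below every `thresholds` value is underutilized, a node above any `targetThresholds` value is overutilized (its Pods become eviction candidates), and everything in between is left alone. A simplified sketch of that classification:

```python
def classify_nodes(nodes, thresholds, targets):
    """Split nodes into (underutilized, overutilized) by usage percentages.
    Simplified LowNodeUtilization logic; the real strategy also counts
    pods and respects per-node eviction limits."""
    under, over = [], []
    for name, usage in nodes.items():
        if all(usage[r] < thresholds[r] for r in thresholds):
            under.append(name)
        elif any(usage[r] > targets[r] for r in targets):
            over.append(name)
    return under, over

# Percent usage per node (illustrative), matching the policy's thresholds.
nodes = {"node-a": {"cpu": 10, "memory": 15},
         "node-b": {"cpu": 72, "memory": 60},
         "node-c": {"cpu": 35, "memory": 40}}
print(classify_nodes(nodes, {"cpu": 20, "memory": 20}, {"cpu": 50, "memory": 50}))
```

Here node-b's Pods would be eviction candidates, node-a is a target for the rescheduled load, and node-c is untouched.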

5. Image and Network Optimization

5.1 Image optimization

Multi-stage builds

# Bad - ships the build toolchain in the final image
FROM golang:1.21
WORKDIR /app
COPY . .
RUN go build -o app
CMD ["./app"]

# Good - multi-stage build keeps the runtime image small
FROM golang:1.21 AS builder
WORKDIR /app
COPY . .
RUN go build -o app

FROM alpine:3.18
WORKDIR /app
COPY --from=builder /app/app .
CMD ["./app"]

Layer optimization

# Bad - every RUN creates a new layer
FROM ubuntu:22.04
RUN apt-get update
RUN apt-get install -y curl
RUN apt-get install -y vim
RUN apt-get clean

# Good - merge commands to reduce layers
FROM ubuntu:22.04
RUN apt-get update && \
    apt-get install -y curl vim && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

5.2 Image pull policy

apiVersion: v1
kind: Pod
metadata:
  name: optimized-pod
spec:
  containers:
  - name: app
    image: myregistry.com/app:v1.0.0
    imagePullPolicy: IfNotPresent  # skip the pull when the image is cached locally
  
  # credentials for a private registry
  imagePullSecrets:
  - name: registry-secret

5.3 Network optimization

eBPF acceleration (Cilium)

apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  enable-ipv4: "true"
  enable-ipv6: "false"
  enable-bpf-masquerade: "true"
  enable-endpoint-routes: "true"
  enable-health-checking: "true"
  enable-policy: "default"
  enable-l7-proxy: "true"
  tunnel: "disabled"
  ipam: "kubernetes"
  kube-proxy-replacement: "strict"
  enable-host-reachable-services: "true"

6. Cost Monitoring and Analysis

6.1 Deploying Kubecost

apiVersion: v1
kind: Namespace
metadata:
  name: kubecost
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubecost
  namespace: kubecost
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kubecost
  template:
    metadata:
      labels:
        app: kubecost
    spec:
      containers:
      - name: kubecost
        image: gcr.io/kubecost1/cost-model:latest
        ports:
        - containerPort: 9090
        env:
        - name: PROMETHEUS_SERVER_ENDPOINT
          value: "http://prometheus:9090"
        - name: CLOUD_PROVIDER_API_KEY
          valueFrom:
            secretKeyRef:
              name: kubecost-secrets
              key: cloud-api-key
        resources:
          requests:
            cpu: 200m
            memory: 512Mi
          limits:
            cpu: 500m
            memory: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: kubecost
  namespace: kubecost
spec:
  selector:
    app: kubecost
  ports:
  - port: 9090
    targetPort: 9090
  type: LoadBalancer

6.2 Cost analysis queries

# Forward the Kubecost API to localhost (run in a separate shell)
kubectl port-forward -n kubecost svc/kubecost 9090:9090

# Cost by namespace (quote the URL so the shell does not treat "&" as a background operator)
curl "http://localhost:9090/model/allocation?window=7d&aggregate=namespace"

# Cost by label
curl "http://localhost:9090/model/allocation?window=7d&aggregate=label:app"

# Cost by node
curl "http://localhost:9090/model/allocation?window=7d&aggregate=node"

# Export a cost report
curl "http://localhost:9090/model/costDataModel?timeWindow=month" > cost-report.json
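Once the allocation endpoint returns JSON, a short script can aggregate and rank the spend. The payload below is made up but loosely follows the shape of an allocation result aggregated by namespace:

```python
import json

# Made-up payload; a real Kubecost response carries many more fields.
resp = json.loads("""
{"data": [{
  "team-a":      {"totalCost": 412.50},
  "team-b":      {"totalCost": 128.75},
  "kube-system": {"totalCost": 61.10}
}]}
""")

costs = {ns: item["totalCost"] for ns, item in resp["data"][0].items()}
total = sum(costs.values())
for ns, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
    print(f"{ns:<12} ${cost:>7.2f} ({cost / total:.0%})")
```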

7. Command Cheat Sheet

Inspecting resources

# Node resource usage
kubectl top nodes

# Pod resource usage
kubectl top pods -A

# Resource quotas
kubectl get resourcequota -A

# Limit ranges
kubectl get limitrange -A

# VPA recommendations
kubectl get vpa -A

# Node labels
kubectl get nodes --show-labels

Cost optimization commands

# Find PVCs that are not Bound
kubectl get pvc -A | grep -v Bound

# Find unschedulable (Pending) Pods
kubectl get pods -A --field-selector=status.phase=Pending

# Rank nodes by CPU usage to spot underutilized ones
kubectl top nodes --sort-by=cpu

# Find long-running Pods
kubectl get pods -A --sort-by=.metadata.creationTimestamp

# Tally CPU requests per Pod
kubectl get pods -A -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.requests.cpu}{"\n"}{end}'
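The jsonpath output above mixes quantity formats such as `500m`, `1`, and `0.5`. Summing them needs a small parser; this sketch covers only the common decimal and milli forms, not every suffix the Kubernetes quantity grammar allows:

```python
def cpu_to_millicores(q):
    """Parse a CPU quantity string: '500m' -> 500, '2' -> 2000, '0.5' -> 500."""
    q = q.strip()
    if q.endswith("m"):
        return int(q[:-1])
    return int(float(q) * 1000)

requests = ["500m", "1", "250m", "2", "0.5"]  # sample values, not from a cluster
total_m = sum(cpu_to_millicores(q) for q in requests)
print(f"total requested: {total_m}m ({total_m / 1000:.2f} cores)")
```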

Descheduler commands

# Trigger a Descheduler run manually
kubectl create job --from=cronjob/descheduler -n kube-system descheduler-manual

# Descheduler logs
kubectl logs -n kube-system -l job-name=descheduler-manual

# Evicted Pods
kubectl get events -A | grep Evicted

8. Core Interview Q&A

Q1: How do you improve resource utilization in Kubernetes?

Key points:

  • Use VPA to obtain resource recommendations
  • Configure sensible Requests/Limits
  • Rebalance load with the Descheduler
  • Enable the Cluster Autoscaler
  • Mix Spot and on-demand instances
  • Monitor resource usage trends

Q2: What are the best practices for Spot instances?

Key points:

  • Use them only for stateless, fault-tolerant workloads
  • Configure Pod priority and preemption
  • Spread risk across multiple Spot pools
  • Implement graceful shutdown
  • Watch for Spot interruption notices
  • Be ready to fall back to on-demand instances

Q3: How do you reduce the cost of a Kubernetes cluster?

Key points:

  • Set resource requests accurately
  • Use Spot instances
  • Enable the Cluster Autoscaler
  • Shrink image sizes
  • Tier your node pools
  • Regularly clean up unused resources

Q4: What does the Descheduler do?

Key points:

  • Rebalances load across nodes
  • Evicts Pods that violate placement policies
  • Improves resource utilization
  • Corrects scheduling drift as the cluster changes
  • Periodically cleans up unhealthy Pods

Q5: How do you monitor and analyze Kubernetes costs?

Key points:

  • Analyze costs with Kubecost
  • Allocate costs by namespace/label
  • Track resource waste
  • Set cost budgets and alerts
  • Generate regular cost reports
  • Optimize resource allocation accordingly

9. Troubleshooting

Common cost problems

1. Over-allocated resources

# Find Pods with high requests
kubectl get pods -A -o=custom-columns=NAME:.metadata.name,NAMESPACE:.metadata.namespace,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory

# Compare against actual usage
kubectl top pods -A

# Check VPA recommendations
kubectl get vpa -A -o yaml | grep -A 10 recommendation

2. Frequent Spot interruptions

# Node events
kubectl get events --field-selector involvedObject.kind=Node

# Failed/evicted Pods
kubectl get pods -A --field-selector=status.phase=Failed

# Node taints
kubectl describe nodes | grep Taints

3. Cluster Autoscaler not scaling

# CA logs
kubectl logs -n kube-system -l app=cluster-autoscaler

# CA status
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

# Node group configuration
kubectl describe nodes | grep autoscaler

10. Best Practices

Cost optimization recommendations

  1. Resource planning

    • Set Requests from observed usage
    • Avoid over-reserving resources
    • Review resource configuration regularly
    • Apply VPA recommendations
  2. Node optimization

    • Mix multiple instance types
    • Target a Spot share above 50%
    • Enable the Cluster Autoscaler
    • Tune scale-down policies sensibly
  3. Workload optimization

    • Separate critical from non-critical workloads
    • Run batch jobs on Spot
    • Keep critical workloads on on-demand instances
    • Protect critical Pods with PDBs
  4. Monitoring and analysis

    • Deploy cost monitoring tooling
    • Set cost budgets
    • Produce regular reports
    • Iterate on optimizations

FinOps adoption recommendations

  1. Culture

    • Build cost awareness
    • Make teams accountable for their spend
    • Hold regular cost reviews
    • Align incentives with optimization
  2. Process

    • Gate resource requests with approval
    • Manage cost budgets
    • Alert on cost anomalies
    • Follow up on optimization suggestions
  3. Tooling

    • Integrate cost checks into CI/CD
    • Automate cleanup of unused resources
    • Visualize costs
    • Add predictive analysis

11. Summary

Having worked through this module, you now know:

  • Kubernetes cost optimization strategies
  • Capacity planning methods
  • Fine-grained Requests/Limits tuning
  • Tiered node pools and Spot instances
  • Rebalancing resources with the Descheduler
  • Cost monitoring and analysis
  • FinOps best practices

Next step: continue with 10-故障排查 (Troubleshooting) for a deeper look at diagnosing and resolving common Kubernetes problems.
