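09 - Cost and Capacity
A Complete Guide to Kubernetes FinOps and Resource Optimization
Learning Objectives
After completing this module, you will be able to work with:
- Kubernetes cost optimization strategies
- Resource capacity planning methods
- Fine-grained tuning of Requests/Limits
- Node pool tiering and Spot instances
- Resource rebalancing with Descheduler
- FinOps best practices
1. FinOps Architecture Overview
Cost optimization stack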
┌─────────────────────────────────────────────────────────────┐
│               FinOps Cost Optimization Stack                │
├─────────────────────────────────────────────────────────────┤
│  Resource layer                                             │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐        │
│  │ CPU/memory  │   │   Storage   │   │   Network   │        │
│  │ optimization│   │ optimization│   │ optimization│        │
│  └─────────────┘   └─────────────┘   └─────────────┘        │
├─────────────────────────────────────────────────────────────┤
│  Node layer                                                 │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐        │
│  │  On-demand  │   │    Spot     │   │  Reserved   │        │
│  │  instances  │   │  instances  │   │  instances  │        │
│  └─────────────┘   └─────────────┘   └─────────────┘        │
├─────────────────────────────────────────────────────────────┤
│  Scheduling layer                                           │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐        │
│  │ Bin Packing │   │ Descheduler │   │ Autoscaler  │        │
│  └─────────────┘   └─────────────┘   └─────────────┘        │
└─────────────────────────────────────────────────────────────┘
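Key cost optimization metrics
Metric | Description | Target |
---|---|---|
Resource utilization | Actual CPU/memory usage as a share of capacity | > 70% |
Unit cost | Cost per core per month | Decreasing over time |
Waste rate | Share of resources that go unused | < 20% |
Spot share | Share of instances that run as Spot | > 50% |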
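The table does not say how each metric is calculated, so the sketch below picks one plausible set of definitions: utilization as used cores over allocatable cores, waste rate as the requested-but-unused share of requests, Spot share by node count, and unit cost as monthly cost per allocatable core. These definitions, the `ClusterSnapshot` type, and the sample numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    allocatable_cores: float  # CPU cores the nodes can allocate
    requested_cores: float    # sum of pod CPU requests
    used_cores: float         # average CPU actually consumed
    spot_nodes: int
    total_nodes: int
    monthly_cost: float       # monthly compute bill, any currency


def cost_metrics(s: ClusterSnapshot) -> dict:
    """Compute the four table metrics under the assumed definitions."""
    return {
        "utilization": s.used_cores / s.allocatable_cores,
        "waste_rate": (s.requested_cores - s.used_cores) / s.requested_cores,
        "spot_share": s.spot_nodes / s.total_nodes,
        "cost_per_core": s.monthly_cost / s.allocatable_cores,
    }


def targets_met(m: dict) -> dict:
    """Compare each metric with the target value from the table."""
    return {
        "utilization": m["utilization"] > 0.70,
        "waste_rate": m["waste_rate"] < 0.20,
        "spot_share": m["spot_share"] > 0.50,
    }


# Hypothetical snapshot, not measured data.
snapshot = ClusterSnapshot(200, 180, 150, spot_nodes=12, total_nodes=20, monthly_cost=6000.0)
metrics = cost_metrics(snapshot)
print(metrics)
print(targets_met(metrics))
```

Tracking the same four numbers month over month is what makes the "decreasing over time" target for unit cost checkable.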
2. Resource Optimization Strategies
2.1 Tuning Requests/Limits
Resource configuration principles
# Resource configuration strategy for each QoS class
# Guaranteed - critical workloads
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  containers:
  - name: app
    image: app:latest
    resources:
      requests:
        cpu: "1000m"
        memory: "2Gi"
      limits:
        cpu: "1000m"    # requests == limits
        memory: "2Gi"
---
# Burstable - general workloads
apiVersion: v1
kind: Pod
metadata:
  name: normal-app
spec:
  containers:
  - name: app
    image: app:latest
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "2000m"    # bursting above the request is allowed
        memory: "4Gi"
---
# BestEffort - batch jobs
apiVersion: v1
kind: Pod
metadata:
  name: batch-job
spec:
  containers:
  - name: job
    image: job:latest
    # no resources section; these pods are evicted first under node pressure
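The way the three manifests above map to QoS classes can be sketched as a small function. This is a simplified model: it compares quantity strings literally (so "1000m" and "1" would count as different), and the function name `qos_class` is made up for this example.

```python
def qos_class(containers: list) -> str:
    """Classify a pod from the `resources` dicts of its containers."""
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        if requests or limits:
            any_set = True
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            # An omitted request defaults to the limit.
            request = requests.get(name, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


critical = [{"resources": {"requests": {"cpu": "1000m", "memory": "2Gi"},
                           "limits": {"cpu": "1000m", "memory": "2Gi"}}}]
normal = [{"resources": {"requests": {"cpu": "500m", "memory": "1Gi"},
                         "limits": {"cpu": "2000m", "memory": "4Gi"}}}]
batch = [{"name": "job"}]

for pod in (critical, normal, batch):
    print(qos_class(pod))
```

A pod is Guaranteed only when every container sets CPU and memory limits equal to its requests, so a single sidecar without limits moves the whole pod to Burstable.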
2.2 Using the VPA Recommender
VPA configuration in recommendation-only mode
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Off"    # recommend only, never update pods automatically
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2000m
        memory: 4Gi
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits
Reading VPA recommendations
# Show the VPA recommendation (the object lives in the production namespace)
kubectl get vpa app-vpa -n production -o jsonpath='{.status.recommendation}'
# Pretty-print the per-container recommendations
kubectl get vpa app-vpa -n production -o jsonpath='{.status.recommendation.containerRecommendations[*]}' | jq .
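Once a recommendation is available, a natural next step is to compare its `target` with the requests currently configured. The sketch below does that for a hand-written sample; the sample values, the `current` requests, and the helper names are hypothetical, and the quantity parsers cover only the suffixes used here.

```python
import json

MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_cpu(quantity: str) -> float:
    """Return CPU cores from '250m' or '2'."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Return bytes from '1Gi', '512Mi', or a plain byte count."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-2]) * factor
    return int(quantity)


def reclaimable(recommendations: list, current_requests: dict) -> dict:
    """Per container: current request minus the VPA target."""
    result = {}
    for rec in recommendations:
        name = rec["containerName"]
        target, current = rec["target"], current_requests[name]
        result[name] = {
            "cpu_cores": parse_cpu(current["cpu"]) - parse_cpu(target["cpu"]),
            "memory_bytes": parse_memory(current["memory"]) - parse_memory(target["memory"]),
        }
    return result


# Hypothetical containerRecommendations, written as a JSON array.
sample = json.loads(
    '[{"containerName": "app", "target": {"cpu": "250m", "memory": "262144000"}}]'
)
current = {"app": {"cpu": "500m", "memory": "1Gi"}}
print(reclaimable(sample, current))
```

A positive value means the request could be lowered toward the target; a negative value means the container is requesting less than the VPA suggests.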
2.3 Resource Constraints with LimitRange
apiVersion: v1
kind: LimitRange
metadata:
  name: resource-limits
  namespace: production
spec:
  limits:
  # Container-level constraints
  - type: Container
    default:            # limits applied when a container sets none
      cpu: 500m
      memory: 512Mi
    defaultRequest:     # requests applied when a container sets none
      cpu: 100m
      memory: 128Mi
    max:
      cpu: 2000m
      memory: 2Gi
    min:
      cpu: 50m
      memory: 64Mi
  # Pod-level constraints
  - type: Pod
    max:
      cpu: 4000m
      memory: 8Gi
    min:
      cpu: 100m
      memory: 128Mi
  # PVC constraints
  - type: PersistentVolumeClaim
    max:
      storage: 100Gi
    min:
      storage: 1Gi
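To see how the Container-level entries interact, the following sketch applies the defaults and then checks the min/max bounds, using the numbers from the LimitRange above (CPU in millicores, memory in MiB). It is a simplified model of the admission behavior, not the actual implementation, and the function names are invented for the example.

```python
LIMIT_RANGE = {
    "default": {"cpu": 500, "memory": 512},
    "defaultRequest": {"cpu": 100, "memory": 128},
    "max": {"cpu": 2000, "memory": 2048},
    "min": {"cpu": 50, "memory": 64},
}


def apply_defaults(resources: dict) -> tuple:
    """Fill in missing requests and limits the way the defaults would."""
    user_requests = resources.get("requests", {})
    user_limits = resources.get("limits", {})
    requests, limits = {}, {}
    for name in ("cpu", "memory"):
        limits[name] = user_limits.get(name, LIMIT_RANGE["default"][name])
        if name in user_requests:
            requests[name] = user_requests[name]
        elif name in user_limits:
            # A container that sets only a limit gets request == limit.
            requests[name] = user_limits[name]
        else:
            requests[name] = LIMIT_RANGE["defaultRequest"][name]
    return requests, limits


def violations(requests: dict, limits: dict) -> list:
    """List every bound that the effective values break."""
    problems = []
    for name in ("cpu", "memory"):
        if requests[name] < LIMIT_RANGE["min"][name]:
            problems.append(f"{name} request below min")
        if limits[name] > LIMIT_RANGE["max"][name]:
            problems.append(f"{name} limit above max")
        if requests[name] > limits[name]:
            problems.append(f"{name} request exceeds limit")
    return problems


for spec in ({}, {"requests": {"cpu": 1000}}, {"limits": {"cpu": 4000}}):
    effective = apply_defaults(spec)
    print(spec, effective, violations(*effective))
```

The second case is the one that tends to surprise people: a request higher than the default limit is rejected unless the container also sets its own limit.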
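2.4 Quota Management with ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    # Compute resources
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    # Storage resources
    requests.storage: 100Gi
    persistentvolumeclaims: "10"
    # Object counts
    pods: "50"
    services: "20"
    secrets: "30"
    configmaps: "30"
    # LoadBalancer limit
    services.loadbalancers: "3"
    # NodePort limit
    services.nodeports: "5"
---
# Scoped quota: a PriorityClass scope can only track pods and compute
# resources, so it is kept in a separate object (values are examples)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota-priority
  namespace: team-a
spec:
  hard:
    pods: "50"
    requests.cpu: "20"
    requests.memory: 40Gi
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["high", "medium"]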
3. Node Pool Optimization Strategies
3.1 Tiered Node Pool Design
# Node objects are shown to illustrate labels and taints; in practice these
# are usually set through the node pool (node group) configuration
# High-performance node pool - critical workloads
apiVersion: v1
kind: Node
metadata:
  name: high-perf-node-1
  labels:
    node-type: high-performance
    workload-type: critical
    cost-tier: premium
spec:
  taints:
  - key: critical-only
    value: "true"
    effect: NoSchedule
---
# General-purpose node pool - general workloads
apiVersion: v1
kind: Node
metadata:
  name: general-node-1
  labels:
    node-type: general
    workload-type: standard
    cost-tier: standard
---
# Spot node pool - batch jobs
apiVersion: v1
kind: Node
metadata:
  name: spot-node-1
  labels:
    node-type: spot
    workload-type: batch
    cost-tier: spot
  annotations:
    cluster-autoscaler.kubernetes.io/scale-down-disabled: "false"
spec:
  taints:
  - key: spot-instance
    value: "true"
    effect: NoSchedule
3.2 Selecting a Node Pool for Pods
# Critical workload - high-performance nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical
  template:
    metadata:
      labels:
        app: critical
    spec:
      nodeSelector:
        node-type: high-performance
        cost-tier: premium
      tolerations:
      - key: critical-only
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: app
        image: critical-app:latest
        resources:
          requests:
            cpu: "2000m"
            memory: "4Gi"
          limits:
            cpu: "2000m"
            memory: "4Gi"
---
# Batch job - Spot nodes
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
spec:
  template:
    spec:
      nodeSelector:
        node-type: spot
        cost-tier: spot
      tolerations:
      - key: spot-instance
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: job
        image: batch-job:latest
        resources:
          requests:
            cpu: "1000m"
            memory: "2Gi"
      restartPolicy: OnFailure    # retry in place if a Spot interruption kills the pod
3.3 Cluster Autoscaler Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --namespace=kube-system
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
        - --balance-similar-node-groups
        - --skip-nodes-with-system-pods=false
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m
        - --scale-down-utilization-threshold=0.5
        - --max-node-provision-time=15m
        - --expander=least-waste
        env:
        - name: AWS_REGION
          value: us-east-1
        resources:
          requests:
            cpu: 100m
            memory: 300Mi
          limits:
            cpu: 100m
            memory: 300Mi
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler
  namespace: kube-system
---
# Abridged ClusterRole; writing the status ConfigMap read in the
# troubleshooting section also needs ConfigMap permissions in kube-system
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
rules:
- apiGroups: [""]
  resources: ["events", "endpoints"]
  verbs: ["create", "patch"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["pods/status"]
  verbs: ["update"]
- apiGroups: [""]
  resources: ["endpoints"]
  resourceNames: ["cluster-autoscaler"]
  verbs: ["get", "update"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["watch", "list", "get", "update"]
- apiGroups: [""]
  resources: ["pods", "services", "replicationcontrollers", "persistentvolumeclaims", "persistentvolumes"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["extensions"]
  resources: ["replicasets", "daemonsets"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["policy"]
  resources: ["poddisruptionbudgets"]
  verbs: ["watch", "list"]
- apiGroups: ["apps"]
  resources: ["statefulsets", "replicasets", "daemonsets"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["batch"]
  resources: ["jobs", "cronjobs"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["create"]
- apiGroups: ["coordination.k8s.io"]
  resourceNames: ["cluster-autoscaler"]
  resources: ["leases"]
  verbs: ["get", "update"]
---
# Bind the ClusterRole to the ServiceAccount; without this the rules have no effect
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
- kind: ServiceAccount
  name: cluster-autoscaler
  namespace: kube-system
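The two scale-down flags above, `--scale-down-utilization-threshold=0.5` and `--scale-down-unneeded-time=10m`, can be read as a simple filter over nodes. The sketch below models utilization as the larger of the CPU and memory request ratios; it deliberately ignores everything else Cluster Autoscaler checks (PodDisruptionBudgets, pods that cannot be moved, node group minimums), and the `Node` type and sample nodes are hypothetical.

```python
from dataclasses import dataclass

UTILIZATION_THRESHOLD = 0.5  # --scale-down-utilization-threshold
UNNEEDED_MINUTES = 10        # --scale-down-unneeded-time


@dataclass
class Node:
    name: str
    cpu_allocatable_m: int   # millicores
    mem_allocatable_mi: int  # MiB
    cpu_requested_m: int     # sum of pod CPU requests
    mem_requested_mi: int    # sum of pod memory requests
    unneeded_minutes: int    # how long the node has been below the threshold


def utilization(node: Node) -> float:
    """Request-based utilization: the larger of the CPU and memory ratios."""
    return max(
        node.cpu_requested_m / node.cpu_allocatable_m,
        node.mem_requested_mi / node.mem_allocatable_mi,
    )


def scale_down_candidates(nodes: list) -> list:
    """Names of nodes that pass both the utilization and the time filter."""
    return [
        n.name
        for n in nodes
        if utilization(n) < UTILIZATION_THRESHOLD
        and n.unneeded_minutes >= UNNEEDED_MINUTES
    ]


nodes = [
    Node("node-a", 4000, 16384, 1000, 4096, unneeded_minutes=12),
    Node("node-b", 4000, 16384, 1000, 12288, unneeded_minutes=30),
    Node("node-c", 4000, 16384, 1000, 4096, unneeded_minutes=3),
]
print(scale_down_candidates(nodes))
```

Note that the calculation uses requests, not live usage, so inflated requests keep nodes above the threshold and block scale-down even when the node is mostly idle.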
4. Resource Rebalancing with Descheduler
4.1 Deploying Descheduler
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy
  namespace: kube-system
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      # Evict pods that violate inter-pod anti-affinity
      "RemovePodsViolatingInterPodAntiAffinity":
        enabled: true
      # Evict pods that violate node affinity
      "RemovePodsViolatingNodeAffinity":
        enabled: true
        params:
          nodeAffinityType:
          - "requiredDuringSchedulingIgnoredDuringExecution"
      # Evict pods that violate node taints
      "RemovePodsViolatingNodeTaints":
        enabled: true
      # Move pods off overutilized nodes when underutilized nodes exist
      "LowNodeUtilization":
        enabled: true
        params:
          nodeResourceUtilizationThresholds:
            thresholds:          # below all of these: underutilized
              cpu: 20
              memory: 20
              pods: 20
            targetThresholds:    # above any of these: overutilized
              cpu: 50
              memory: 50
              pods: 50
      # Evict duplicate pods of the same owner on one node
      "RemoveDuplicates":
        enabled: true
      # Evict pods that have restarted too many times
      "RemovePodsHavingTooManyRestarts":
        enabled: true
        params:
          podsHavingTooManyRestarts:
            podRestartThreshold: 100
            includingInitContainers: true
      # Pod lifetime management: evicts every eligible pod older than the limit
      "PodLifeTime":
        enabled: true
        params:
          podLifeTime:
            maxPodLifeTimeSeconds: 86400    # 24 hours
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: descheduler
  namespace: kube-system
spec:
  schedule: "*/30 * * * *"    # run every 30 minutes
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: descheduler
          containers:
          - name: descheduler
            image: registry.k8s.io/descheduler/descheduler:v0.28.0
            command:
            - /bin/descheduler
            - --policy-config-file=/policy/policy.yaml
            # --descheduling-interval is omitted on purpose: it keeps the
            # process looping, so the Job started by the CronJob would never finish
            - --v=3
            volumeMounts:
            - name: policy
              mountPath: /policy
            resources:
              requests:
                cpu: 100m
                memory: 128Mi
              limits:
                cpu: 500m
                memory: 256Mi
          restartPolicy: Never
          volumes:
          - name: policy
            configMap:
              name: descheduler-policy
---
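apiVersion: v1
kind: ServiceAccount
metadata:
  name: descheduler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: descheduler
rules:
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "update"]
- apiGroups: [""]
  resources: ["nodes", "namespaces"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list", "delete"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
- apiGroups: ["scheduling.k8s.io"]
  resources: ["priorityclasses"]
  verbs: ["get", "watch", "list"]
- apiGroups: ["apps"]
  resources: ["replicasets", "statefulsets"]
  verbs: ["get"]
- apiGroups: ["policy"]
  resources: ["poddisruptionbudgets"]
  verbs: ["get"]
---
# Bind the ClusterRole to the ServiceAccount; without this the rules have no effect
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: descheduler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: descheduler
subjects:
- kind: ServiceAccount
  name: descheduler
  namespace: kube-system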
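The `thresholds` and `targetThresholds` in the LowNodeUtilization strategy split nodes into three groups. A minimal sketch of that classification, using the percentages from the policy above, could look like this; the `classify` function and the sample usage figures are illustrative only, and the eviction logic itself is not modeled.

```python
THRESHOLDS = {"cpu": 20, "memory": 20, "pods": 20}         # thresholds
TARGET_THRESHOLDS = {"cpu": 50, "memory": 50, "pods": 50}  # targetThresholds


def classify(usage: dict) -> str:
    """Classify a node from its request-based usage percentages."""
    # Underutilized: below the threshold for every resource.
    if all(usage[r] < THRESHOLDS[r] for r in THRESHOLDS):
        return "underutilized"
    # Overutilized: above the target threshold for at least one resource.
    if any(usage[r] > TARGET_THRESHOLDS[r] for r in TARGET_THRESHOLDS):
        return "overutilized"
    return "appropriately utilized"


# Hypothetical nodes with usage as a percentage of allocatable capacity.
sample_nodes = {
    "node-a": {"cpu": 10, "memory": 15, "pods": 5},
    "node-b": {"cpu": 70, "memory": 30, "pods": 20},
    "node-c": {"cpu": 30, "memory": 40, "pods": 25},
}
for name, usage in sample_nodes.items():
    print(name, classify(usage))
```

Pods are evicted from overutilized nodes only when at least one underutilized node exists to receive them, so a cluster with no node in the first group sees no evictions from this strategy.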
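5. Image and Network Optimization
5.1 Image Optimization Strategies
Multi-stage builds
# Bad practice - the final image still contains the build toolchain
FROM golang:1.21
WORKDIR /app
COPY . .
RUN go build -o app
CMD ["./app"]
# Good practice - multi-stage build
FROM golang:1.21 AS builder
WORKDIR /app
COPY . .
# Build a static binary so it runs on Alpine (musl) without glibc
RUN CGO_ENABLED=0 go build -o app
FROM alpine:3.18
WORKDIR /app
# Copy only the compiled binary from the builder stage
COPY --from=builder /app/app .
CMD ["./app"]
Image layer optimization
# Bad - every RUN creates a layer, and the apt cache stays in the earlier layers
FROM ubuntu:22.04
RUN apt-get update
RUN apt-get install -y curl
RUN apt-get install -y vim
RUN apt-get clean
# Good - combine the commands into one layer and clean up in the same step
FROM ubuntu:22.04
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl vim && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*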
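5.2 Image Pull Policy
apiVersion: v1
kind: Pod
metadata:
  name: optimized-pod
spec:
  containers:
  - name: app
    image: myregistry.com/app:v1.0.0    # pinned tag, so the cached image is reused
    imagePullPolicy: IfNotPresent       # skip the pull when the image is already on the node
  # Credentials for the private registry
  imagePullSecrets:
  - name: registry-secret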
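5.3 Network Optimization Configuration
eBPF acceleration (Cilium)
# Key names and values follow older Cilium releases; newer releases rename
# some of them (for example tunnel and kube-proxy-replacement), so check the
# configuration reference for the version you run
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  enable-ipv4: "true"
  enable-ipv6: "false"
  enable-bpf-masquerade: "true"
  enable-endpoint-routes: "true"
  enable-health-checking: "true"
  enable-policy: "default"
  enable-l7-proxy: "true"
  tunnel: "disabled"
  ipam: "kubernetes"
  kube-proxy-replacement: "strict"
  enable-host-reachable-services: "true"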
6. Cost Monitoring and Analysis
6.1 Deploying Kubecost
# Simplified single-container manifest for illustration; production installs
# normally use the official Helm chart
apiVersion: v1
kind: Namespace
metadata:
  name: kubecost
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubecost
  namespace: kubecost
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kubecost
  template:
    metadata:
      labels:
        app: kubecost
    spec:
      containers:
      - name: kubecost
        image: gcr.io/kubecost1/cost-model:latest
        ports:
        - containerPort: 9090
        env:
        - name: PROMETHEUS_SERVER_ENDPOINT
          value: "http://prometheus:9090"
        - name: CLOUD_PROVIDER_API_KEY
          valueFrom:
            secretKeyRef:
              name: kubecost-secrets
              key: cloud-api-key
        resources:
          requests:
            cpu: 200m
            memory: 512Mi
          limits:
            cpu: 500m
            memory: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: kubecost
  namespace: kubecost
spec:
  selector:
    app: kubecost
  ports:
  - port: 9090
    targetPort: 9090
  type: LoadBalancer
6.2 Cost Analysis Queries
# Cost by namespace (port-forward blocks, so run it in a separate terminal)
kubectl port-forward -n kubecost svc/kubecost 9090:9090
# URLs are quoted so the shell does not treat & as "run in background"
curl "http://localhost:9090/model/allocation?window=7d&aggregate=namespace"
# Cost by label
curl "http://localhost:9090/model/allocation?window=7d&aggregate=label:app"
# Cost by node
curl "http://localhost:9090/model/allocation?window=7d&aggregate=node"
# Export a cost report
curl "http://localhost:9090/model/costDataModel?timeWindow=month" > cost-report.json
7. Command Cheat Sheet
Resource inspection commands
# Show node resource usage
kubectl top nodes
# Show Pod resource usage
kubectl top pods -A
# Show resource quotas
kubectl get resourcequota -A
# Show limit ranges
kubectl get limitrange -A
# Show VPA recommendations
kubectl get vpa -A
# Show node labels
kubectl get nodes --show-labels
Cost optimization commands
# Find PVCs that are not Bound (Pending or Lost)
kubectl get pvc -A | grep -v Bound
# Find Pods that have not been scheduled
kubectl get pods -A --field-selector=status.phase=Pending
# Sort nodes by CPU usage to spot lightly used ones
kubectl top nodes --sort-by=cpu
# List Pods by creation time, oldest first
kubectl get pods -A --sort-by=.metadata.creationTimestamp
# List the CPU requests of every Pod
kubectl get pods -A -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.requests.cpu}{"\n"}{end}'
Descheduler commands
# Run Descheduler manually
kubectl create job --from=cronjob/descheduler -n kube-system descheduler-manual
# Show the Descheduler logs
kubectl logs -n kube-system -l job-name=descheduler-manual
# Show evicted Pods
kubectl get events -A | grep Evicted
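8. Core Interview Questions
Q1: How do you improve resource utilization in Kubernetes?
Key points:
- Use VPA to obtain resource recommendations
- Set reasonable Requests/Limits
- Use Descheduler to rebalance load
- Enable Cluster Autoscaler
- Mix Spot and on-demand instances
- Monitor resource usage trends
Q2: What are the best practices for Spot instances?
Key points:
- Use them only for stateless, fault-tolerant workloads
- Configure Pod priority and preemption
- Spread across multiple Spot pools to diversify interruption risk
- Implement graceful shutdown
- Watch for Spot interruption notices
- Be ready to fall back to on-demand instances
Q3: How do you reduce the cost of a Kubernetes cluster?
Key points:
- Set resource requests precisely
- Use Spot instances
- Enable Cluster Autoscaler
- Reduce image size
- Use tiered node pools
- Clean up unused resources regularly
Q4: What does Descheduler do?
Key points:
- Rebalances load across nodes
- Evicts Pods that violate policies
- Improves resource utilization
- Handles drift, where Pods no longer fit their node after labels or taints change
- Periodically cleans up misbehaving Pods
Q5: How do you monitor and analyze Kubernetes costs?
Key points:
- Use Kubecost to analyze costs
- Allocate costs by namespace or label
- Monitor resource waste
- Set cost budgets and alerts
- Generate cost reports regularly
- Adjust resource allocation based on the findings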
9. Troubleshooting
Common cost problems
1. Over-allocated resources
# List the Requests of every Pod to find ones set too high
kubectl get pods -A -o=custom-columns='NAME:.metadata.name,NAMESPACE:.metadata.namespace,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'
# Compare with actual usage
kubectl top pods -A
# Show VPA recommendations
kubectl get vpa -A -o yaml | grep -A 10 recommendation
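Comparing the two outputs by eye gets tedious on a large cluster. Assuming the requests and the `kubectl top` figures have already been collected into a list of records (the field names and sample values here are hypothetical), a short script can rank the Pods by how much CPU they could give back:

```python
def overprovisioned(pods: list, max_ratio: float = 0.3) -> list:
    """Return (name, reclaimable millicores) pairs, largest first.

    A Pod is flagged when its usage is below `max_ratio` of its request.
    """
    flagged = []
    for pod in pods:
        ratio = pod["cpu_used_m"] / pod["cpu_request_m"]
        if ratio < max_ratio:
            flagged.append((pod["name"], pod["cpu_request_m"] - pod["cpu_used_m"]))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical data: requests from the custom-columns query, usage from kubectl top.
pods = [
    {"name": "web", "cpu_request_m": 2000, "cpu_used_m": 300},
    {"name": "api", "cpu_request_m": 500, "cpu_used_m": 400},
    {"name": "batch", "cpu_request_m": 1000, "cpu_used_m": 250},
]
print(overprovisioned(pods))
```

The 30% cutoff is an arbitrary starting point; a point-in-time reading from `kubectl top` can miss peaks, so the flagged Pods are candidates for review rather than values to apply directly.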
2. Frequent Spot instance interruptions
# Show node events
kubectl get events --field-selector involvedObject.kind=Node
# Show failed Pods, which include evicted ones
kubectl get pods -A --field-selector=status.phase=Failed
# Show node taints
kubectl describe nodes | grep Taints
3. Cluster Autoscaler is not working
# Show the CA logs
kubectl logs -n kube-system -l app=cluster-autoscaler
# Show the CA status
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
# Check the node group configuration
kubectl describe nodes | grep autoscaler
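10. Best Practices
Cost optimization recommendations
Resource planning
- Set Requests based on actual usage
- Avoid over-reserving resources
- Review resource configuration regularly
- Apply VPA recommendations
Node optimization
- Mix multiple instance types
- Keep the Spot instance share above 50%
- Enable Cluster Autoscaler
- Set a reasonable scale-down policy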
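To get a feel for what a Spot share above 50% means for the bill, the blended node cost can be worked out directly. The prices below are invented round numbers, not real cloud prices, and the function names are only for this sketch.

```python
def blended_hourly_cost(nodes: int, spot_share: float,
                        on_demand_price: float, spot_price: float) -> float:
    """Hourly cost of a pool where `spot_share` of the nodes run as Spot."""
    spot_nodes = nodes * spot_share
    return spot_nodes * spot_price + (nodes - spot_nodes) * on_demand_price


def savings_vs_on_demand(nodes: int, spot_share: float,
                         on_demand_price: float, spot_price: float) -> float:
    """Fraction saved compared with running every node on demand."""
    baseline = nodes * on_demand_price
    blended = blended_hourly_cost(nodes, spot_share, on_demand_price, spot_price)
    return 1 - blended / baseline


# Hypothetical prices per node-hour.
cost = blended_hourly_cost(20, 0.6, on_demand_price=0.10, spot_price=0.03)
saved = savings_vs_on_demand(20, 0.6, on_demand_price=0.10, spot_price=0.03)
print(f"blended: {cost:.2f}/h, saved: {saved:.0%}")
```

With these assumed prices, 12 Spot nodes and 8 on-demand nodes cost 1.16 per hour instead of 2.00. The calculation leaves out the cost of interruptions, such as rerun batch work and spare capacity kept for fallback.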
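Workload optimization
- Separate critical from non-critical workloads
- Run batch jobs on Spot
- Run critical workloads on on-demand instances
- Protect critical Pods with a PDB
Monitoring and analysis
- Deploy a cost monitoring tool
- Set cost budgets
- Generate reports regularly
- Keep optimizing and adjusting
FinOps implementation recommendations
Culture
- Build cost awareness
- Assign cost responsibility to teams
- Hold regular cost reviews
- Reward optimization work
Process
- Review and approve resource requests
- Manage cost budgets
- Alert on abnormal costs
- Follow up on optimization recommendations
Tooling
- Add cost checks to CI/CD
- Automate resource cleanup
- Visualize costs
- Use predictive analysis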
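11. Summary
In this module you have covered:
- Kubernetes cost optimization strategies
- Resource capacity planning methods
- Fine-grained tuning of Requests/Limits
- Node pool tiering and Spot instances
- Resource rebalancing with Descheduler
- Cost monitoring and analysis
- FinOps best practices
Suggested next step: continue with 10 - Troubleshooting to study how to diagnose and resolve common Kubernetes problems.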