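10-Troubleshooting

A complete guide to diagnosing and resolving common Kubernetes problems

Learning Objectives

After completing this module, you will be able to:

Diagnose the various abnormal Pod states
Troubleshoot network failures
Handle node failures step by step
Diagnose storage problems
Use the debugging tools
Approach troubleshooting systematically

1. Troubleshooting Framework

Troubleshooting workflow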

graph TD
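    A[Problem detected] --> B{Problem type}
    B -->|Pod problem| C[Pod diagnosis]
    B -->|Network problem| D[Network diagnosis]
    B -->|Node problem| E[Node diagnosis]
    B -->|Storage problem| F[Storage diagnosis]
    C --> G[Check logs]
    D --> G
    E --> G
    F --> G
    G --> H[Analyze root cause]
    H --> I[Apply fix]
    I --> J[Verify result]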

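Fault categories

Category	Common problems	Priority
Pod faults	Pending, CrashLoopBackOff, ImagePullBackOff	P0/P1
Network faults	DNS resolution failure, unreachable Service	P0/P1
Node faults	NotReady, insufficient resources	P0
Storage faults	PVC cannot bind, mount failure	P1/P2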
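
As a rough sketch of the triage this guide follows for Pod faults, the helper below maps a value from the STATUS column of `kubectl get pods` to the first checks covered in section 2. The function name `pod_next_step` and the wording of its output are illustrative assumptions, not part of any Kubernetes tooling.

```shell
# Hypothetical helper: map a Pod STATUS value to the first checks
# suggested in section 2 of this guide.
pod_next_step() {
  case "$1" in
    Pending)                       echo "describe pod, then events, node resources, PVC" ;;
    CrashLoopBackOff)              echo "logs --previous, then describe pod" ;;
    ImagePullBackOff|ErrImagePull) echo "describe pod, then image name and imagePullSecrets" ;;
    Evicted)                       echo "describe pod, then node conditions" ;;
    *)                             echo "describe pod" ;;
  esac
}
```

For example, `pod_next_step CrashLoopBackOff` prints `logs --previous, then describe pod`. Any status the function does not recognize falls back to `describe pod`, which is the common first step in every subsection.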

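2. Pod Troubleshooting

2.1 Pod Stuck in Pending

Common causes

Insufficient resources (CPU/memory)
Conflicting scheduling constraints (affinity/taints)
Unbound PVC
Image pull failure (the Pod phase stays Pending, although kubectl usually reports ErrImagePull or ImagePullBackOff)

Troubleshooting steps

# 1. Inspect the Pod details
kubectl describe pod <pod-name> -n <namespace>

# 2. Review events, oldest first
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# 3. Check node resources (kubectl top requires metrics-server)
kubectl top nodes
kubectl describe nodes

# 4. Check scheduling constraints
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 20 affinity
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 tolerations

# 5. Check PVC status
kubectl get pvc -n <namespace>

Example fix

# Problem: insufficient resources
# Fix: lower the resource requests or add nodes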

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        cpu: "100m"      # lower the resource request
        memory: "128Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"

2.2 CrashLoopBackOff

Troubleshooting steps

# 1. View container logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous  # logs from the previous (crashed) container instance

# 2. View logs of a specific container in a multi-container Pod
kubectl logs <pod-name> -c <container-name> -n <namespace>

# 3. View Pod events
kubectl describe pod <pod-name> -n <namespace>

# 4. Check the startup command
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 command

# 5. Attach an ephemeral debug container with kubectl debug
kubectl debug <pod-name> -n <namespace> -it --image=busybox --target=<container-name>

Common problems and fixes

Problem 1: the application fails to start

# Problem: startupProbe times out
# Fix: give the startupProbe more time

apiVersion: v1
kind: Pod
metadata:
  name: slow-start-app
spec:
  containers:
  - name: app
    image: slow-app:latest
    startupProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30    # longer initial delay
      periodSeconds: 10
      failureThreshold: 30        # higher failure threshold (30 x 10 s = 5 minutes of probing after the initial delay)
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      periodSeconds: 10
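
The startup budget in the manifest above can be checked with simple arithmetic. The sketch below assumes the usual approximation that a container gets about initialDelaySeconds + periodSeconds x failureThreshold seconds before a failing startupProbe triggers a restart; it ignores timeoutSeconds and probe scheduling jitter. The function name `startup_budget` is made up for this example.

```shell
# Approximate upper bound, in seconds, before a failing startupProbe
# causes the container to be restarted.
# Arguments: initialDelaySeconds periodSeconds failureThreshold
startup_budget() {
  echo $(( $1 + $2 * $3 ))
}

# Values from the slow-start-app manifest: 30 + 10 * 30
startup_budget 30 10 30   # prints 330
```

If the application can take longer than this to start, raise failureThreshold or periodSeconds rather than relying on the livenessProbe, which only begins once the startupProbe has succeeded.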

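Problem 2: configuration errors

# Check environment variables
kubectl exec <pod-name> -n <namespace> -- env

# Check the configuration file (example path)
kubectl exec <pod-name> -n <namespace> -- cat /etc/config/app.conf

# Check the mounted volume (example path)
kubectl exec <pod-name> -n <namespace> -- ls -la /mnt/data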

2.3 ImagePullBackOff

Troubleshooting steps

# 1. View the detailed error message
kubectl describe pod <pod-name> -n <namespace>

# 2. Check the image name
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'

# 3. Check the image pull secrets
kubectl get secrets -n <namespace>
kubectl describe secret <secret-name> -n <namespace>

# 4. Pull the image manually on the node
ssh <node-name>
crictl pull <image-name>

Fixes

# Fix 1: add imagePullSecrets
apiVersion: v1
kind: Pod
metadata:
  name: private-registry-pod
spec:
  imagePullSecrets:
  - name: registry-credentials
  containers:
  - name: app
    image: private.registry.com/myapp:v1.0

---
# Create the image pull secret
apiVersion: v1
kind: Secret
metadata:
  name: registry-credentials
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded-docker-config>

# Command to create the secret
# kubectl create secret docker-registry registry-credentials \
#   --docker-server=private.registry.com \
#   --docker-username=myuser \
#   --docker-password=mypassword \
#   --docker-email=myemail@example.com

2.4 Evicted

Troubleshooting steps

# 1. List evicted Pods
kubectl get pods -A | grep Evicted

# 2. Find the eviction reason
kubectl describe pod <evicted-pod> -n <namespace>

# 3. Check node resource pressure
kubectl describe nodes | grep -A 5 "Conditions:"

# 4. Check the DiskPressure condition on every node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'

Cleaning up evicted Pods

# Delete all Evicted Pods (-L 1 runs one delete per Pod so each gets its own -n flag)
kubectl get pods -A | grep Evicted | awk '{print $2 " -n " $1}' | xargs -L 1 kubectl delete pod

# Or delete every Failed Pod, namespace by namespace
# (note: this also removes Failed Pods that were not evicted)
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  kubectl delete pods -n "$ns" --field-selector=status.phase=Failed
done
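
A plain `grep Evicted` also matches Pods whose name happens to contain the word. As a minimal sketch, the filter below matches on the STATUS column only. It assumes the default column order of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE); the function name `list_evicted` is hypothetical.

```shell
# Read `kubectl get pods -A` output on stdin and print "<namespace> <pod>"
# for every row whose STATUS column (4th field) is exactly "Evicted".
# NR > 1 skips the header line.
list_evicted() {
  awk 'NR > 1 && $4 == "Evicted" { print $1, $2 }'
}
```

Usage on a live cluster would be `kubectl get pods -A | list_evicted`. Reviewing that list before piping it into a delete command is a cheap safeguard.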

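3. Network Troubleshooting

3.1 DNS Resolution Failure

Troubleshooting steps

# 1. Check CoreDNS status
kubectl get pods -n kube-system -l k8s-app=kube-dns

# 2. View CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# 3. Test DNS resolution from a throwaway Pod
#    (busybox:1.28 is pinned because nslookup in some later busybox images is known to misbehave)
kubectl run -it --rm debug --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default

# 4. Check the DNS configuration
kubectl get configmap coredns -n kube-system -o yaml

# 5. Test resolution from inside the affected Pod
kubectl exec -it <pod-name> -n <namespace> -- nslookup kubernetes.default
kubectl exec -it <pod-name> -n <namespace> -- cat /etc/resolv.conf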
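
When reading a Pod's /etc/resolv.conf, the main thing to confirm is which nameserver it points at. The sketch below extracts the nameserver addresses from a resolv.conf supplied on stdin. The function name `resolv_nameservers` is an assumption for this example, and the comparison suggested afterwards assumes the cluster DNS Service is named kube-dns in kube-system, as in kubeadm-style clusters.

```shell
# Read a resolv.conf on stdin and print each nameserver address on its own line.
resolv_nameservers() {
  awk '$1 == "nameserver" { print $2 }'
}
```

On a cluster, pipe the output of `cat /etc/resolv.conf` from the affected Pod into the function and compare the result with `kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'`. With the default ClusterFirst DNS policy the two normally match, unless a node-local DNS cache is in use.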

Solution

# CoreDNS ConfigMap configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }

3.2 Unreachable Service

Troubleshooting steps

# 1. Check the Service
kubectl get svc <service-name> -n <namespace>
kubectl describe svc <service-name> -n <namespace>

# 2. Check the Endpoints (an empty list means no ready Pod matches the selector)
kubectl get endpoints <service-name> -n <namespace>
kubectl get endpointslices -n <namespace> -l kubernetes.io/service-name=<service-name>

# 3. Check Pod labels
kubectl get pods -n <namespace> --show-labels

# 4. Test the connection to the Service
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- bash
# Run inside the container
curl http://<service-name>.<namespace>.svc.cluster.local

# 5. Check kube-proxy
kubectl get pods -n kube-system -l k8s-app=kube-proxy
kubectl logs -n kube-system -l k8s-app=kube-proxy
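
A frequent reason for empty Endpoints is a Service selector that no Pod's labels satisfy. The sketch below shows the matching rule itself: every key=value pair in the selector must appear among the Pod's labels. It works on comma-separated strings such as those shown in the SELECTOR column of `kubectl get svc -o wide` and the LABELS column of `kubectl get pods --show-labels`. The function name `selector_matches` is hypothetical, and the sketch assumes the selector has at least one pair.

```shell
# Return 0 if every key=value pair of the selector ($1) is present in the
# Pod labels ($2). Both arguments are comma-separated, e.g. "app=web,tier=fe".
selector_matches() {
  selector=$1
  labels=",$2,"          # pad with commas so whole pairs can be matched
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;                        # this pair is present, keep going
      *) IFS=$old_ifs; return 1 ;;           # one missing pair means no match
    esac
  done
  IFS=$old_ifs
  return 0
}

# Example: extra labels on the Pod are fine, a missing one is not.
selector_matches "app=web" "app=web,pod-template-hash=abc" && echo "selected"
selector_matches "app=web,tier=fe" "app=web" || echo "not selected"
```

On a live cluster the same check is usually quicker as `kubectl get pods -n <namespace> -l <selector>`; the function is only meant to make the rule explicit.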

iptables/IPVS debugging

# Run these on a node, as root
# List iptables rules
iptables -t nat -L KUBE-SERVICES -n | grep <service-name>

# List IPVS rules
ipvsadm -ln | grep <service-ip>

# Inspect the conntrack table
conntrack -L | grep <service-ip>

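3.3 NetworkPolicy Problems

Troubleshooting steps

# 1. View the NetworkPolicy objects
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <policy-name> -n <namespace>

# 2. Test connectivity from the namespace under test
kubectl run test-source -n <namespace> --image=busybox --rm -it -- sh
# Inside the container: wget -O- http://target-service

# 3. Check the CNI plugin (this example assumes Calico)
kubectl get pods -n kube-system | grep calico
kubectl logs -n kube-system <calico-pod>

# 4. Verify which Pods the policy selects
kubectl get pods -n <namespace> --show-labels

NetworkPolicy debugging example

# Temporarily allow all traffic for testing (remove it afterwards)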
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-temporarily
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - {}
  egress:
  - {}

4. Node Troubleshooting

4.1 Node NotReady

Troubleshooting steps

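# 1. View node status
kubectl get nodes
kubectl describe node <node-name>

# 2. Check the node Conditions
kubectl get node <node-name> -o jsonpath='{.status.conditions[*]}'

# 3. Check kubelet status
ssh <node-name>
systemctl status kubelet
journalctl -u kubelet -f

# 4. Check the container runtime
systemctl status containerd
crictl ps
crictl info

# 5. Check the network plugin (this example assumes Calico)
kubectl get pods -n kube-system | grep calico

Common problems and fixes

Problem 1: kubelet fails to start

# View the kubelet logs
journalctl -u kubelet -n 100

# Check the configuration
cat /var/lib/kubelet/config.yaml

# Restart kubelet
systemctl restart kubelet

Problem 2: disk pressure

# Check disk usage
df -h

# Remove unused images (use the command that matches the node's runtime)
crictl rmi --prune       # CRI runtimes such as containerd
docker system prune -a   # Docker only

# Clean up logs (double-check the paths and retention before deleting)
find /var/log -name "*.log" -mtime +7 -delete
journalctl --vacuum-time=3d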

4.2 Node Resource Exhaustion

Troubleshooting steps

# 1. View resource usage
kubectl top nodes
kubectl top pods -A --sort-by=cpu
kubectl top pods -A --sort-by=memory

# 2. View the resources already allocated on the node
kubectl describe node <node-name> | grep -A 10 "Allocated resources"

# 3. Identify the biggest resource consumers (requests and limits per Pod)
kubectl get pods -A -o=custom-columns='NAME:.metadata.name,NAMESPACE:.metadata.namespace,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,CPU_LIM:.spec.containers[*].resources.limits.cpu,MEM_LIM:.spec.containers[*].resources.limits.memory'

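4.3 Node Maintenance Procedure

Safe maintenance steps

# 1. Mark the node unschedulable
kubectl cordon <node-name>

# 2. Evict the Pods
#    (--force also deletes Pods that no controller manages; they are not recreated)
kubectl drain <node-name> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --grace-period=300

# 3. Perform the maintenance on the node
ssh <node-name>
# Run the maintenance tasks (upgrade, repair, etc.)

# 4. Return the node to service
kubectl uncordon <node-name>

# 5. Verify the node status
kubectl get node <node-name>
kubectl describe node <node-name>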

5. Storage Troubleshooting

5.1 PVC Cannot Bind

Troubleshooting steps

# 1. Check the PVC status
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

# 2. List available PVs
kubectl get pv

# 3. Check the StorageClass
kubectl get storageclass
kubectl describe storageclass <sc-name>

# 4. Check the CSI driver
kubectl get pods -n kube-system | grep csi

# 5. Inspect the PV details
kubectl describe pv <pv-name>

Example fix

# Problem: no matching PV is found
# Fix: create a matching PV or use dynamic provisioning

apiVersion: v1
kind: PersistentVolume
metadata:
  name: manual-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: manual  # must match the PVC's storageClassName
  hostPath:
    path: /mnt/data

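5.2 Volume Mount Failure

Troubleshooting steps

# 1. View Pod events
kubectl describe pod <pod-name> -n <namespace>

# 2. Check whether the PVC is bound
kubectl get pvc -n <namespace>

# 3. View the kubelet logs
ssh <node-name>
journalctl -u kubelet | grep -i "mount\|volume"

# 4. Check the mount points on the node
df -h
mount | grep <volume-name>

# 5. View the CSI driver logs
#    (app=csi-driver is a placeholder; use your driver's actual label)
kubectl logs -n kube-system -l app=csi-driver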

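6. Using the Debugging Tools

6.1 kubectl debug

# Add an ephemeral debug container that targets the process namespace of an existing container
kubectl debug <pod-name> -it --image=busybox --target=<container-name>

# Debug a copy of the Pod, overriding the container's command with a shell
kubectl debug <pod-name> -it --copy-to=<debug-pod-name> --container=<container-name> -- sh

# Add an ephemeral container with a networking toolbox image
kubectl debug <pod-name> -it --image=nicolaka/netshoot

# Debug a node (the node's filesystem is mounted at /host)
kubectl debug node/<node-name> -it --image=ubuntu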

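6.2 Container Runtime Debugging with crictl

# List containers, including stopped ones
crictl ps -a

# View container logs
crictl logs <container-id>

# Inspect a container
crictl inspect <container-id>

# Run a command in a container
crictl exec -it <container-id> sh

# List images
crictl images

# Pull an image
crictl pull <image-name>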

6.3 Network Debugging Tools

# Start a netshoot Pod
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- bash

# Run network diagnostics inside the container
ping <target-ip>
curl <service-url>
nslookup <domain>
dig <domain>
traceroute <target-ip>
tcpdump -i any port 80

6.4 etcd Debugging

# Check etcd health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# The commands below need the same --endpoints/--cacert/--cert/--key flags when etcd uses TLS

# List etcd members
ETCDCTL_API=3 etcdctl member list

# View etcd status
ETCDCTL_API=3 etcdctl endpoint status --write-out=table

# List alarms
ETCDCTL_API=3 etcdctl alarm list

7. Command Cheat Sheet

Pod diagnosis commands

# View Pod status
kubectl get pods -A
kubectl describe pod <pod-name> -n <namespace>

# View logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl logs <pod-name> -c <container-name> -n <namespace>

# Open a shell in the container (fall back to /bin/sh if the image has no bash)
kubectl exec -it <pod-name> -n <namespace> -- /bin/bash

# View events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Network diagnosis commands

# DNS test
kubectl run -it --rm debug --image=busybox -- nslookup kubernetes.default

# Service test
kubectl run -it --rm debug --image=nicolaka/netshoot -- curl <service-name>

# View Endpoints
kubectl get endpoints -n <namespace>
kubectl get endpointslices -n <namespace>

# View NetworkPolicy objects
kubectl get networkpolicy -A

Node diagnosis commands

# View node status
kubectl get nodes
kubectl describe node <node-name>

# Node maintenance
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets
kubectl uncordon <node-name>

# Resource usage
kubectl top nodes
kubectl top pods -A

Storage diagnosis commands

# View storage resources
kubectl get pv,pvc,sc -A

# View details
kubectl describe pvc <pvc-name> -n <namespace>
kubectl describe pv <pv-name>

# Check CSI (app=csi-driver is a placeholder label)
kubectl get pods -n kube-system | grep csi
kubectl logs -n kube-system -l app=csi-driver

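8. Core Interview Q&A

Q1: A Pod stays in Pending. How do you troubleshoot it?

Key points:

Read the events in kubectl describe pod
Check whether the nodes have enough resources
Check scheduling constraints (affinity/taint)
Verify that the PVC is bound
Check the image pull status

Q2: How do you diagnose a Service that cannot be reached?

Key points:

Check that the Service selector matches the Pod labels
Verify that the Endpoints are correct
Test the Pod's readinessProbe
Check the NetworkPolicy rules
Verify that kube-proxy is working

Q3: How do you handle a Node in NotReady?

Key points:

Check the kubelet service status
Review the node Conditions
Check the container runtime
Verify the network plugin
Check system resources (disk/memory)

Q4: How do you debug a Pod with kubectl debug?

Key points:

Create a temporary debug container
Debug a copy of the Pod
Use ephemeral containers
Debug node problems
Use a purpose-built debugging image

Q5: What are the common causes of CrashLoopBackOff, and how do you fix them?

Key points:

Application startup failure (configuration/dependencies)
Misconfigured probes
Resource limits set too low
Wrong command or arguments
Read the logs to pin down the root cause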

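9. Failure Case Studies

Case 1: Pods Fail to Start at Scale

Problem: 100 newly deployed Pods all remain Pending

Investigation:

# 1. Check events
kubectl get events -A --sort-by='.lastTimestamp' | grep FailedScheduling

# 2. The events show insufficient resources
# Output: 0/5 nodes are available: 5 Insufficient cpu.

# 3. Check node resources. The scheduler counts CPU requests, not live usage,
#    so look at "Allocated resources" as well as kubectl top.
kubectl top nodes
kubectl describe nodes | grep -A 10 "Allocated resources"
# Finding: CPU usage is above 90% on every node

# 4. Look for waste: compare each Pod's CPU request with its actual usage
kubectl get pods -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu'
kubectl top pods -A --sort-by=cpu

Solutions:

Lower the Pods' CPU requests
Enable Cluster Autoscaler
Add nodes manually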
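
CPU quantities appear in two notations, whole or fractional cores (2, 0.5) and millicores (250m), so comparing requests against usage is easier after converting everything to millicores. The helper below is a minimal sketch of that conversion; the name `cpu_to_millicores` is made up for this example, and it only handles these two notations.

```shell
# Convert a Kubernetes CPU quantity to millicores.
# "250m" -> 250, "2" -> 2000, "0.5" -> 500
cpu_to_millicores() {
  awk -v v="$1" 'BEGIN {
    if (v ~ /m$/) { sub(/m$/, "", v); printf "%d\n", v }   # already millicores
    else          { printf "%d\n", v * 1000 + 0.5 }        # cores, rounded
  }'
}
```

For instance, a Pod that requests `2` CPUs (2000m) but shows around 200m in `kubectl top pods` is a candidate for a lower request.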

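Case 2: Intermittent Network Timeouts

Problem: requests to a Service occasionally time out

Investigation:

# 1. Check the kube-proxy mode
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode

# 2. Check the conntrack table on the node
conntrack -L | wc -l
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
# Finding: the entry count is close to the limit

# 3. Raise the conntrack limit (sysctl -w does not persist across reboots)
sysctl -w net.netfilter.nf_conntrack_max=524288

Solutions:

Increase the conntrack table size
Switch to IPVS mode
Tune how long the application keeps connections alive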
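
To judge how close the conntrack table is to its limit, the entry count can be expressed as a percentage of the maximum. The sketch below uses integer arithmetic only; the function name `conntrack_pct` is hypothetical, and the sample numbers are illustrative rather than measurements from this case.

```shell
# Print conntrack table utilisation as an integer percentage (rounded down).
# Arguments: current entry count, configured maximum
conntrack_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Illustrative values: 500000 entries against a limit of 524288
conntrack_pct 500000 524288   # prints 95
```

On a node, the two inputs can be read from /proc/sys/net/netfilter/nf_conntrack_count and /proc/sys/net/netfilter/nf_conntrack_max, provided the conntrack kernel module is loaded.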

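10. Best Practices

Preventing failures

Monitoring and alerting
- Deploy a complete monitoring stack
- Alert on key metrics
- Review logs regularly
- Set up an on-call rotation
Resource management
- Set sensible Requests/Limits
- Configure ResourceQuota
- Protect critical services with PDBs
- Clean up unused resources regularly
High availability design
- Run multiple replicas
- Spread workloads across availability zones
- Configure health checks
- Prepare a disaster recovery plan
Documentation
- Record common problems
- Write Runbooks
- Keep documentation up to date
- Share lessons learned

Troubleshooting more efficiently

Tooling
- Get fluent with kubectl
- Install debugging tools
- Keep common scripts ready
- Configure shortcut commands
Knowledge
- Understand the core components
- Learn networking fundamentals
- Master debugging techniques
- Catalogue failure patterns
Collaboration
- Establish an escalation path
- Communicate transparently
- Record the troubleshooting process
- Hold regular retrospectives

11. Summary

In this module you have learned:

How to diagnose the various abnormal Pod states
How to troubleshoot network failures
How to handle node failures
How to diagnose storage problems
How to use the debugging tools effectively
How to approach troubleshooting systematically
How to work through real-world cases

Next step: continue with 11-Operations Tools for a deeper look at the Kubernetes operations toolchain and best practices.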