Part 8: Command Quick Reference and YAML Template Library

A quick-lookup command reference and a collection of reusable configuration templates.

Command Quick Reference

Kubernetes Resource Operations

Basic Operations

Command	Purpose	Example
`kubectl get pods`	List Pods	`kubectl get pods -A -o wide`
`kubectl describe pod <pod>`	Show Pod details and events	`kubectl describe pod my-pod`
`kubectl logs <pod>`	View logs	`kubectl logs -f my-pod -c container`
`kubectl exec -it <pod> -- sh`	Open a shell in a container	`kubectl exec -it my-pod -- sh`
`kubectl delete pod <pod>`	Delete a Pod	`kubectl delete pod my-pod`
`kubectl apply -f <file>`	Apply a manifest	`kubectl apply -f deployment.yaml`
`kubectl get events`	List events	`kubectl get events --sort-by=.lastTimestamp`

Resource Queries

Command	Purpose	Example
`kubectl get all`	List common workload resources (does not include ConfigMaps, Secrets, PVCs, or Ingresses)	`kubectl get all -n default`
`kubectl get svc`	List Services	`kubectl get svc -o wide`
`kubectl get deploy`	List Deployments	`kubectl get deploy`
`kubectl get sts`	List StatefulSets	`kubectl get sts`
`kubectl get ds`	List DaemonSets	`kubectl get ds`
`kubectl get pv`	List PVs	`kubectl get pv`
`kubectl get pvc`	List PVCs	`kubectl get pvc`
`kubectl get nodes`	List nodes	`kubectl get nodes -o wide`

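Resource Monitoring

Command	Purpose	Example
`kubectl top pods`	Pod resource usage (requires metrics-server)	`kubectl top pods -A`
`kubectl top nodes`	Node resource usage (requires metrics-server)	`kubectl top nodes`
`kubectl get componentstatuses`	Control-plane component status (deprecated since v1.19)	`kubectl get cs`
`kubectl cluster-info`	Cluster information	`kubectl cluster-info`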

Network Diagnostic Commands

Basic Network Tests

Command	Purpose	Example
`ping <ip>`	Test connectivity	`ping 10.244.1.2`
`traceroute <ip>`	Show the route path	`traceroute 10.244.1.2`
`nslookup <host>`	DNS resolution	`nslookup kubernetes.default`
`dig <host>`	Detailed DNS query	`dig kubernetes.default.svc.cluster.local`
`curl <url>`	HTTP request	`curl http://service`
`telnet <ip> <port>`	Port test	`telnet 10.244.1.2 80`

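Network State Queries

Command	Purpose	Example
`ip addr show`	Show network interfaces	`ip addr show`
`ip route show`	Show routes	`ip route show`
`ss -tnlp`	Listening TCP sockets and their processes	`ss -tnlp`
`netstat -tulpn`	Listening TCP/UDP ports (legacy net-tools)	`netstat -tulpn`
`tcpdump -i any`	Packet capture	`tcpdump -i any port 80`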

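CNI-Specific Commands

CNI	Command	Purpose
Flannel	`ip link show flannel.1`	Show the VXLAN interface
Calico	`calicoctl node status`	Show BGP peer status
Cilium	`cilium status`	Show Cilium status
Cilium	`hubble observe`	Observe traffic flows (Hubble CLI; requires Hubble to be enabled)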

Storage Management Commands

Command	Purpose	Example
`kubectl get pv`	List PVs	`kubectl get pv`
`kubectl get pvc`	List PVCs	`kubectl get pvc`
`kubectl get sc`	List StorageClasses	`kubectl get sc`
`df -h`	Show disk usage	`df -h`
`lsblk`	Show block devices	`lsblk -f`
`mount`	Show mounts	`mount | grep data`
`iostat -x 1`	I/O monitoring	`iostat -x 1`

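Performance Analysis Commands

Command	Purpose	Example
`top`	System monitoring	`top`
`htop`	Enhanced monitoring	`htop`
`vmstat 1`	Virtual memory and run-queue statistics	`vmstat 1`
`iperf3 -c <server>`	Bandwidth test	`iperf3 -c server`
`fio`	I/O benchmark	`fio --name=test --rw=randread --size=1G`
`perf top`	Live CPU profiling	`perf top`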

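Targeted Troubleshooting Commands

1. Network Troubleshooting Commands

Level	Command	Purpose	Tips
Pod connectivity	`kubectl exec -it <pod> -- ping 8.8.8.8`	Check whether the Pod can reach the internet	Add -c 5 to limit the count and watch the packet-loss rate
Service resolution	`kubectl exec -it <pod> -- nslookup kubernetes.default`	Check that DNS works	If it fails, check CoreDNS
Pod-to-Pod traffic	`kubectl exec -it pod1 -- curl <pod2-ip>:8080`	Test Pod → Pod communication	Pod names do not resolve in cluster DNS, so use the Pod IP or a Service name; if it fails, check NetworkPolicy
Routing table	`ip route`	Show host routes	Useful for VXLAN or CNI path problems
Interface info	`ip a`	Show IP addresses and link state	Check that Pod interfaces and node interfaces correspond
MTU check	`ip link show`	Show MTU settings	Keep the MTU consistent across nodes; VXLAN on a 1500-byte underlay typically uses 1450, Calico IP-in-IP uses 1480
Packet capture	`tcpdump -i eth0 port 80 -w dump.pcap`	Capture TCP traffic	Open in Wireshark to inspect the three-way handshake
Path tracing	`mtr 10.0.0.8`	Analyze packet loss and latency per hop	Easier to read than traceroute
Socket statistics	`ss -tnlp`	Show listening TCP sockets and owning processes	`ss -s` prints summary counts; `ss -ti` shows per-connection RTT and retransmissions
NAT forwarding	`iptables -t nat -L -n`	Inspect SNAT / DNAT rules	Confirms that kube-proxy rules are in place (iptables mode)
Network throughput	`iperf3 -s` / `iperf3 -c <server>`	Bandwidth test	Suited to measuring Pod ↔ Node bandwidth
CoreDNS status	`kubectl get pods -n kube-system -l k8s-app=kube-dns`	Check that the DNS Pods are running	"SERVFAIL" is a common error in the logs

Tips:

Ping first, then curl, then tcpdump
Watch MTU / routes / DNS
If cross-node traffic fails → check the CNI
If no Pod can reach the internet → check NAT/iptables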
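
The MTU guidance in the table above comes down to subtracting the encapsulation overhead from the underlay MTU. A minimal sketch of that arithmetic follows; the function name `overlay_mtu` is hypothetical, and the overhead values assume an IPv4 underlay.

```shell
#!/bin/sh
# overlay_mtu UNDERLAY_MTU ENCAP
# Prints the largest Pod-interface MTU that still fits in the underlay
# after encapsulation overhead (IPv4 underlay assumed).
overlay_mtu() {
  underlay=$1
  case "$2" in
    vxlan) overhead=50 ;;  # outer IPv4 20 + UDP 8 + VXLAN 8 + inner Ethernet 14
    ipip)  overhead=20 ;;  # one extra IPv4 header
    none)  overhead=0 ;;
    *) echo "unknown encapsulation: $2" >&2; return 1 ;;
  esac
  echo $((underlay - overhead))
}

# Example: compare against the value reported by `ip link show` on the node.
overlay_mtu 1500 vxlan
```

If the CNI interface reports a larger MTU than this value, large packets may be dropped or fragmented on cross-node paths, which often shows up as "ping works, large transfers hang".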

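2. Storage Troubleshooting Commands

Level	Command	Purpose	Tips
PVC status	`kubectl get pvc`	Check for Pending claims	A claim stuck in Pending usually means no matching PV exists or dynamic provisioning failed
Detailed description	`kubectl describe pvc <name>`	Show binding events	Confirm that storageClassName is correct
PV check	`kubectl get pv`	Show volume status and capacity	STATUS should be Bound for volumes in use
StorageClass	`kubectl get storageclass`	Show which class is the default	Without a default class, PVCs that omit storageClassName stay Pending
Mount points	`mount | grep nfs`	Confirm the NFS mount
Capacity	`df -h`	Show disk usage	Check whether /var/lib/containerd is full
Block devices	`lsblk`	Show disk-to-mount mapping	A quick way to see what is mounted where
I/O performance	`iostat -x 1`	Check I/O latency	Sustained await above roughly 50 ms suggests a bottleneck
I/O stress test	`fio --name=test --rw=randwrite --bs=4k --size=1G --numjobs=4`	Measure disk IOPS	Random writes expose bottlenecks better than sequential writes
Open files	`lsof | grep deleted`	Find deleted files that still hold space
Filesystem type	`df -Th`	Confirm overlay/nfs/ext4	overlayfs copies a whole file up on first write, which can amplify writes

Tips:

PVC Pending → check the StorageClass
Pod stuck in ContainerCreating → check the volume mount events
High I/O latency → iostat, fio
High overlay usage → clean up container layers
Space held by large files → lsof | grep deleted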
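
To act on the "PVC Pending" tip across many claims, the Pending ones can be filtered out of the listing first. The sketch below assumes the default column layout of `kubectl get pvc --no-headers` (NAME in column 1, STATUS in column 2); the function name `pending_pvcs` is hypothetical.

```shell
#!/bin/sh
# Reads `kubectl get pvc --no-headers` output on stdin and prints the
# names of claims whose STATUS column is Pending.
pending_pvcs() {
  awk '$2 == "Pending" { print $1 }'
}

# Usage against a live cluster (commented out so the sketch runs offline):
#   kubectl get pvc --no-headers | pending_pvcs |
#     while read -r name; do kubectl describe pvc "$name"; done
```

With `-A` the namespace becomes column 1, so the field numbers would need to shift by one.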

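3. Large File and Disk Troubleshooting Commands

Category	Command	Description	Tips
Usage by directory	`du -sh /* | sort -hr | head -20`	Find the largest directories
Find large files	`find / -type f -size +500M`	Find files larger than 500M	Combine with -mtime to filter by modification time
Inspect overlayfs	`du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/*`	Check space used by container layers	A frequent cause of "disk full"
Deleted but not released	`lsof | grep deleted`	Find deleted files that still hold space
System log cleanup	`journalctl --vacuum-time=3d`	Delete journal entries older than 3 days	Keeps journal disk usage under control
Docker image cleanup	`docker system prune -af`	Remove unused images and caches (-a removes every image not used by a container)	Or nerdctl system prune
Empty container logs	`find /var/lib/docker/containers -name "*.log" -exec truncate -s 0 {} \;`	Empty the log files immediately	Can be scheduled; log rotation is the longer-term fix
Check disk partitions	`df -h` / `df -i`	Show capacity and inode usage	When inodes run out, new files cannot be created even if free space remains
Live disk I/O	`iotop -oPa`	Show processes with high I/O in real time	Answers "who is writing to disk"
Locate by PID	`ps -ef | grep <pid>`	Identify the process that owns a large file

Tips:

Three steps for reclaiming container-layer space: (1) clear logs; (2) clear overlay layers; (3) clear images
Inode exhaustion is harder to spot than a full disk
Do not simply rm a large file that a process still holds open, because the space is not released → use truncate -s 0 or restart the process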
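
Because capacity and inode exhaustion look different but are checked the same way, one filter can serve both. The sketch below reads `df -P` or `df -Pi` output from stdin and prints mount points at or above a threshold; the function name `over_threshold` is hypothetical, and mount points containing spaces are not handled.

```shell
#!/bin/sh
# over_threshold LIMIT
# Reads POSIX-format df output on stdin (`df -P` for space, `df -Pi` for
# inodes) and prints "mountpoint usage%" for entries at or above LIMIT.
over_threshold() {
  awk -v limit="$1" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)
    # Skip pseudo filesystems that report "-" instead of a percentage.
    if (pct ~ /^[0-9]+$/ && pct + 0 >= limit) print $6, $5
  }'
}

# Usage on a node (commented out so the sketch has no side effects):
#   df -P  | over_threshold 85
#   df -Pi | over_threshold 85
```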

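4. Linux Network Kernel Tuning Commands (System Level)

Parameter	Command	Purpose	Recommended value
TCP backlog	`sysctl -w net.core.somaxconn=65535`	Raise the accept queue limit	65535
File descriptors	`ulimit -n 1048576`	Raise the max open files for the current shell (persist via limits.conf or systemd LimitNOFILE)	1M
TIME_WAIT reuse	`sysctl -w net.ipv4.tcp_tw_reuse=1`	Reuse TIME_WAIT sockets for new outbound connections	1
SYN backlog	`sysctl -w net.ipv4.tcp_max_syn_backlog=16384`	Raise the half-open connection queue	16384
Ephemeral ports	`sysctl -w net.ipv4.ip_local_port_range="1024 65535"`	Widen the local port range	The default (32768 60999) is usually enough; widen only on port exhaustion
Send buffer	`sysctl -w net.core.wmem_max=8388608`	Raise the maximum send buffer	8MB
Receive buffer	`sysctl -w net.core.rmem_max=8388608`	Raise the maximum receive buffer	8MB
TCP keepalive	`sysctl -w net.ipv4.tcp_keepalive_time=60`	Send the first keepalive probe after 60 s idle (default 7200 s)	60s

Tip: changes made with `sysctl -w` are lost on reboot. To make them permanent, add the settings to /etc/sysctl.conf and then load the file with:

sysctl -p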
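
As an alternative to editing /etc/sysctl.conf directly, the same settings can live in a drop-in file. The sketch below writes a few values from the tuning table to a path given by the caller; the function name and the file name `99-k8s-net.conf` are hypothetical, and the values should be adjusted per workload.

```shell
#!/bin/sh
# write_sysctl_dropin PATH
# Writes a small set of network tunables to PATH in sysctl.conf syntax.
write_sysctl_dropin() {
  cat > "$1" <<'EOF'
# Values taken from the tuning table; adjust per workload.
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 16384
net.ipv4.tcp_tw_reuse = 1
EOF
}

# Usage as root (commented out so the sketch does not touch the system):
#   write_sysctl_dropin /etc/sysctl.d/99-k8s-net.conf
#   sysctl --system    # reloads every sysctl configuration file
```

Keeping cluster tuning in its own file makes it easy to diff, roll back, or ship through configuration management.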

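5. Container and Node Performance Troubleshooting Commands

Command	Purpose	Notes
`kubectl top pod`	Show Pod resource usage	Requires metrics-server
`kubectl top node`	Show node resource usage	CPU/memory utilization
`kubectl describe pod`	Show Pod status	Check whether "Last State" is OOMKilled
`kubectl logs pod -n ns`	View logs	Supports --tail and --since
`kubectl get events`	Show recent events	Helps with Pending and CrashLoop cases
`ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head`	Show the processes using the most CPU
`vmstat 1`	Show overall system state	r is the run-queue length
`top / htop`	Real-time performance view	htop supports a tree view
`perf top`	Hot kernel and user functions	For investigating kernel bottlenecks
`dmesg -T | tail -20`	Show the most recent kernel messages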

6. Script Template: One-Shot Network + Storage + Disk Diagnostics

#!/bin/bash
# Collects basic node, network, storage, and disk information in one pass.
echo "====[ Node Info ]===="
hostnamectl | grep "Operating System"
uptime
echo "====[ Network Test ]===="
ip a | grep inet
ping -c 3 8.8.8.8
echo "====[ Storage Usage ]===="
df -h | grep -v tmpfs
du -sh /var/lib/containerd/* 2>/dev/null
echo "====[ Deleted Files ]===="
lsof 2>/dev/null | grep deleted
echo "====[ I/O Summary ]===="
iostat -x 1 3
echo "====[ Top Disk Usage ]===="
du -sh /* 2>/dev/null | sort -hr | head -10

Save it as /usr/local/bin/k8s-diagnose.sh and run:

bash /usr/local/bin/k8s-diagnose.sh > diagnose.log
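
The diagnostic script depends on tools that minimal node images may not ship (iostat, for example, comes from the sysstat package). A small pre-flight check, sketched below with the hypothetical helper `missing_tools`, can report what is absent before the script runs.

```shell
#!/bin/sh
# missing_tools TOOL...
# Prints the name of every listed command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Check the commands used by the diagnostic script.
missing=$(missing_tools hostnamectl ip ping df du lsof iostat)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing >&2
fi
```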

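7. Summary Map

K8s troubleshooting commands at a glance
├── Network
│   ├── ping / curl / tcpdump / iperf
│   ├── ss / iptables / mtr / nslookup
│   └── sysctl network tuning
├── Storage
│   ├── df / iostat / fio / lsblk / mount
│   ├── kubectl get pvc/pv/sc
│   └── overlayfs and deleted-file handling
├── Disk
│   ├── du / find / lsof / journalctl
│   ├── docker prune / truncate / iotop
│   └── inode exhaustion checks
└── Monitoring
    ├── kubectl top / metrics-server
    ├── Prometheus / Grafana
    └── Alertmanager alerts

You now have:

Troubleshooting commands for common network, storage, and disk failures
Debugging techniques across the kernel, container, and Kubernetes layers
Practical scripts and command combinations for quick diagnosis in production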

YAML Template Library

1. Deployment + Service Template

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  labels:
    app: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:alpine
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            memory: 256Mi
        livenessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 80
  type: ClusterIP
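
After applying a template like this one, it helps to wait for the rollout instead of assuming success. The wrapper below is a minimal sketch; the function name `deploy_and_wait` and the 120s default timeout are assumptions, while `kubectl apply -f` and `kubectl rollout status --timeout` are standard kubectl commands.

```shell
#!/bin/sh
# deploy_and_wait MANIFEST DEPLOYMENT [TIMEOUT]
# Applies a manifest, then blocks until the Deployment finishes rolling
# out or the timeout expires. Returns non-zero on failure.
deploy_and_wait() {
  manifest=$1
  deployment=$2
  timeout=${3:-120s}
  kubectl apply -f "$manifest" &&
    kubectl rollout status "deployment/$deployment" --timeout="$timeout"
}

# Usage with the template above saved as deployment.yaml:
#   deploy_and_wait deployment.yaml web-app
```

A non-zero exit here is a natural trigger for `kubectl describe pod` and `kubectl get events`.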

2. StatefulSet + PVC Template

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql  # must match an existing headless Service (clusterIP: None)
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - name: mysql
        image: mysql:8.0
        ports:
        - containerPort: 3306
        env:
        - name: MYSQL_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysql-secret
              key: password
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            memory: 2Gi
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 20Gi

3. DaemonSet Template

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      containers:
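      - name: node-exporter
        image: prom/node-exporter
        # Point node-exporter at the host paths mounted below.
        args:
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        ports:
        - containerPort: 9100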
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            memory: 256Mi
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys

4. Job Template

apiVersion: batch/v1
kind: Job
metadata:
  name: data-migration
spec:
  template:
    spec:
      containers:
      - name: migrate
        image: migrate:latest
        command: ["sh", "-c"]
        args:
        - |
          echo "Starting migration..."
          # data migration logic goes here
          echo "Migration completed"
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            memory: 1Gi
      restartPolicy: OnFailure
  backoffLimit: 3
  completions: 1
  parallelism: 1

5. CronJob Template

apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup
spec:
  schedule: "0 2 * * *"  # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: backup:latest
            command: ["sh", "-c"]
            args:
            - |
              echo "Starting backup..."
              # backup logic goes here
              echo "Backup completed"
            volumeMounts:
            - name: backup-storage
              mountPath: /backup
            resources:
              requests:
                cpu: 500m
                memory: 512Mi
          volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: backup-pvc
          restartPolicy: OnFailure
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1

6. ConfigMap Template

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  app.conf: |
    # application settings
    server_port=8080
    log_level=info
    database_host=mysql
    database_port=3306
  
  nginx.conf: |
    server {
      listen 80;
      server_name _;
      location / {
        proxy_pass http://backend:8080;
      }
    }

7. Secret Template

apiVersion: v1
kind: Secret
metadata:
  name: app-secret
type: Opaque
stringData:
  database-user: admin
  database-password: "MySecurePassword123!"
  api-key: "sk-xxxxxxxxxxxxx"
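
The template uses `stringData`, which accepts plain text. If the `data` field is used instead, every value must be base64-encoded. The helper below is a small sketch of that encoding step; the name `b64_value` is hypothetical.

```shell
#!/bin/sh
# b64_value STRING
# Prints STRING base64-encoded for use under a Secret's `data:` field.
# printf is used instead of echo so that no trailing newline is encoded.
b64_value() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

b64_value admin
```

Base64 is an encoding, not encryption, so manifests containing real values should stay out of version control.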

8. Ingress Template

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  annotations:
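    # rewrite-target: / would rewrite every matched request path to "/",
    # including /api/...; enable it only with a capture-group path.
    # nginx.ingress.kubernetes.io/rewrite-target: /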
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - example.com
    secretName: example-tls
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-service
            port:
              number: 80
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 8080

9. NetworkPolicy Template

# Allow ingress from specific Pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
---
# Deny all ingress traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress
---
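# Allow egress to a specific service
# Note: once Egress is restricted, DNS lookups are blocked too unless a
# separate rule allows port 53 to the cluster DNS Pods.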
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-to-db
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 3306

10. HorizontalPodAutoscaler Template

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15

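Troubleshooting Decision Trees

Pod Startup Failure Flow

Pod fails to start
    ↓
1. kubectl get pod <pod> -o wide
    ↓
    ├─ Pending
    │   ├─ Insufficient resources → kubectl describe node
    │   ├─ PVC not bound → kubectl get pvc
    │   └─ Scheduling failure → kubectl describe pod
    │
    ├─ ImagePullBackOff
    │   ├─ Image does not exist → check the image name
    │   ├─ Private registry auth → check imagePullSecrets
    │   └─ Network problem → test node connectivity
    │
    ├─ CrashLoopBackOff
    │   ├─ Application error → kubectl logs <pod>
    │   ├─ Configuration error → kubectl describe pod
    │   └─ Health check failure → check the probe configuration
    │
    └─ Error
        ├─ OOMKilled → raise the memory limit
        ├─ Non-zero exit code → check the logs
        └─ Mount failure → check the volume configuration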

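
The first branch of this tree can be turned into a lookup that suggests the next command for a given Pod status. This is an illustrative sketch: the function name `next_step` is hypothetical, `ErrImagePull` is added alongside ImagePullBackOff, and the suggestions cover only the first check on each branch.

```shell
#!/bin/sh
# next_step STATUS POD
# Prints the first command worth running for a Pod in the given status.
next_step() {
  status=$1
  pod=$2
  case "$status" in
    Pending)
      echo "kubectl describe pod $pod" ;;          # scheduling and PVC events
    ImagePullBackOff|ErrImagePull)
      echo "kubectl get pod $pod -o jsonpath='{.spec.containers[*].image}'" ;;
    CrashLoopBackOff)
      echo "kubectl logs $pod --previous" ;;       # logs of the crashed run
    Error|OOMKilled)
      echo "kubectl describe pod $pod" ;;          # Last State and exit code
    *)
      echo "kubectl get pod $pod -o wide" ;;
  esac
}

next_step CrashLoopBackOff my-pod
```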
Network Connectivity Failure Flow

Network unreachable
    ↓
1. Confirm Pod status
    kubectl get pods -o wide
    ↓
2. Test Pod IP connectivity
    kubectl exec <pod-a> -- ping <pod-b-ip>
    ↓
    ├─ Reachable → check Service/DNS
    │   ├─ Service missing → kubectl get svc
    │   ├─ Endpoints empty → kubectl get ep
    │   └─ DNS resolution fails → kubectl exec <pod> -- nslookup <svc>
    │
    └─ Unreachable → check the network layer
        ├─ Check the CNI → kubectl get pods -n kube-system
        ├─ Check routes → ip route show
        ├─ Check the firewall → iptables -L -n
        └─ Check NetworkPolicy → kubectl get networkpolicy

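Storage Mount Failure Flow

Storage mount fails
    ↓
1. kubectl describe pod <pod>
    ↓
    ├─ PVC does not exist
    │   └─ kubectl get pvc
    │
    ├─ PV not bound
    │   ├─ kubectl get pv
    │   ├─ Check that capacity matches
    │   ├─ Check the AccessMode
    │   └─ Check the StorageClass
    │
    ├─ Mount timeout
    │   ├─ Check the CSI driver → kubectl get pods -n kube-system
    │   ├─ Check node storage → df -h
    │   └─ Check the storage backend → e.g. the NFS server
    │
    └─ Permission error
        └─ Check fsGroup/runAsUser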

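Performance Problem Flow

Performance problem
    ↓
1. Identify the symptom
    ├─ Slow responses
    ├─ High CPU
    ├─ High memory
    └─ Slow I/O
    ↓
2. Collect metrics
    ├─ kubectl top pod
    ├─ kubectl top node
    └─ Application metrics (Prometheus)
    ↓
3. Analyze layer by layer
    ├─ Application layer
    │   ├─ Log analysis → kubectl logs
    │   ├─ Profiling → pprof/trace
    │   └─ Code optimization
    │
    ├─ Container layer
    │   ├─ CPU throttling → cat /sys/fs/cgroup/cpu.stat (cgroup v2; v1 uses /sys/fs/cgroup/cpu/cpu.stat)
    │   ├─ Memory OOM → kubectl describe pod
    │   └─ Adjust resource limits
    │
    ├─ Node layer
    │   ├─ System monitoring → top/htop
    │   ├─ I/O monitoring → iostat
    │   └─ Network monitoring → sar -n DEV
    │
    └─ Cluster layer
        ├─ Scheduling latency → kubectl get events
        ├─ API Server → kubectl get --raw /metrics
        └─ etcd performance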
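
For the CPU throttling check in the container layer, the raw counters in cpu.stat are easier to judge as a ratio. The sketch below computes the share of CFS periods in which the container was throttled, using the `nr_periods` and `nr_throttled` fields; the function name `throttle_pct` is hypothetical, and what counts as "too high" depends on the workload.

```shell
#!/bin/sh
# Reads cpu.stat content on stdin and prints the percentage of scheduler
# periods that were throttled (integer, truncated). Prints 0 when the
# file has no period data, for example when no CPU limit is set.
throttle_pct() {
  awk '$1 == "nr_periods"   { periods = $2 }
       $1 == "nr_throttled" { throttled = $2 }
       END {
         if (periods > 0) printf "%d\n", throttled * 100 / periods
         else print 0
       }'
}

# Usage against a running Pod (commented out so the sketch runs offline):
#   kubectl exec <pod> -- cat /sys/fs/cgroup/cpu.stat | throttle_pct
```

The counters are cumulative since the cgroup was created, so comparing two samples taken a minute apart gives a better picture of current throttling than a single reading.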

Quick Reference

Suggested Resource Limits

Application type	CPU Request	CPU Limit	Memory Request	Memory Limit
Web frontend	100m	-	128Mi	256Mi
API service	500m	-	512Mi	1Gi
Database	1	2	2Gi	2Gi
Batch job	500m	-	1Gi	2Gi
Cache service	500m	-	1Gi	1Gi

Health Check Configuration

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  successThreshold: 1
  failureThreshold: 3

Common Labels

metadata:
  labels:
    app: myapp                    # application name
    version: v1.0.0               # version
    env: production               # environment
    tier: frontend                # tier
    component: api                # component
    managed-by: helm              # management tool

Common Annotations

metadata:
  annotations:
    # Ingress
    nginx.ingress.kubernetes.io/rewrite-target: /
    cert-manager.io/cluster-issuer: letsencrypt
    
    # Service
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    
    # Pod
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"