Troubleshooting and Diagnostics

Life-saving tools and methodology for when production breaks

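I. Troubleshooting Methodology

1.1 Approach

When something goes wrong, work through the following steps in order:

1. Confirm the symptoms
   └── What is failing? When did it start? How wide is the impact?

2. Gather information
   ├── System resources: CPU, memory, disk, network
   ├── Logs: system logs, application logs
   └── Change history: what changed recently?

3. Locate the problem
   ├── Is it a resource bottleneck?
   ├── Is it a configuration error?
   ├── Is it a code defect?
   └── Is it an external dependency?

4. Fix the problem
   ├── Stopgap: restart the service, scale out, degrade gracefully
   └── Root fix: patch the code, tune the configuration

5. Run a postmortem
   └── Root cause, resolution timeline, preventive measures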

1.2 Quick Checklist

# 1. System load
uptime
top -bn1 | head -20

# 2. Memory usage
free -h
vmstat 1 5

# 3. Disk usage
df -h
iostat -x 1 3

# 4. Network state
netstat -tunlp
ss -s

# 5. Recent logs
tail -n 100 /var/log/syslog      # Debian/Ubuntu; use /var/log/messages on CentOS/RHEL
journalctl --since "10 minutes ago"

# 6. Process state
ps aux --sort=-%cpu | head
ps aux --sort=-%mem | head

# 7. Recent logins
last -10

# 8. Kernel events
dmesg | tail -50

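II. System Diagnostics

2.1 dmesg - Kernel Messages

dmesg prints the kernel ring buffer, which makes it the first place to look for hardware and driver problems:

# Show all messages
dmesg
dmesg | less

# Most recent messages
dmesg | tail -50
dmesg -T | tail -50              # human-readable timestamps

# Filter by level
dmesg --level=err                # errors
dmesg --level=warn               # warnings
dmesg --level=err,warn           # errors and warnings

# Print and then clear the buffer (requires root; -C clears without printing)
sudo dmesg -c

# Follow in real time
dmesg -w                         # similar to tail -f
dmesg -wT                        # with timestamps

# Search for common problem keywords
dmesg | grep -i error
dmesg | grep -i fail
dmesg | grep -i "out of memory"
dmesg | grep -i "killed process"  # OOM Killer
dmesg | grep -i "I/O error"       # disk errors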
Common problem examples

# 1. OOM (Out of Memory)
dmesg | grep -i "out of memory"
dmesg | grep -i "oom"

# 2. Disk errors
dmesg | grep -i "I/O error"
dmesg | grep -i "EXT4-fs error"

# 3. Network interface problems (replace eth0 with the actual interface name)
dmesg | grep -i eth0
dmesg | grep -i "link down"

# 4. USB devices
dmesg | grep -i usb

# 5. CPU problems
dmesg | grep -i "CPU"
dmesg | grep -i "temperature"

2.2 System Log Analysis

# System log on Ubuntu/Debian
tail -f /var/log/syslog
grep -i error /var/log/syslog

# System log on CentOS/RHEL
tail -f /var/log/messages

# Authentication log (failed SSH logins and so on)
tail /var/log/auth.log           # Ubuntu
tail /var/log/secure             # CentOS
grep "Failed password" /var/log/auth.log

# Using journalctl
journalctl -p err -b             # errors since the current boot
journalctl -p err --since today
journalctl -u nginx -p err
journalctl --since "2025-01-20 10:00" --until "2025-01-20 11:00"

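2.3 System Resource Overview

# All-in-one view
htop                             # recommended
top

# System summary
vmstat 1 10
# Reading the output:
# r: run queue length (consistently above the CPU count means the CPUs are saturated)
# b: processes blocked in uninterruptible sleep (usually waiting on I/O)
# si/so: swap in/out (sustained non-zero values indicate memory pressure)
# bi/bo: block device I/O (blocks in / blocks out)
# us/sy/id/wa: CPU time in user mode / kernel mode / idle / waiting on I/O

# Resource summary script
echo "=== CPU ==="
top -bn1 | head -5
echo -e "\n=== Memory ==="
free -h
echo -e "\n=== Disk ==="
df -h | grep -v tmpfs
echo -e "\n=== Load ==="
uptime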
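
The vmstat thresholds described above can be checked mechanically. The following is a minimal sketch, assuming the usual procps vmstat layout (a header row that begins with "r  b"); the function name check_vmstat is made up for illustration and is not part of any tool.

```shell
# check_vmstat NCPU: read vmstat output on stdin and flag rows that cross
# the thresholds above (run queue > CPU count, or any swap traffic).
check_vmstat() {
    awk -v ncpu="$1" '
        $1 == "r" && $2 == "b" {             # header row: remember column positions
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        !("r" in col) || $1 !~ /^[0-9]+$/ { next }   # skip banner and non-data lines
        {
            r = $(col["r"]); si = $(col["si"]); so = $(col["so"])
            if (r > ncpu + 0) printf "run queue %d exceeds %d CPUs\n", r, ncpu
            if (si + so > 0)  printf "swapping: si=%d so=%d\n", si, so
        }'
}

# Typical use on a live system:
#   vmstat 1 5 | check_vmstat "$(nproc)"
```

Keep in mind that the first data row vmstat prints is an average since boot, so a warning on that row alone is weaker evidence than warnings on the later samples.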

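III. CPU Diagnostics

3.1 CPU Utilization

# Show CPU utilization
top -bn1 | head -20
mpstat 1 5                       # sample every second, 5 times
mpstat -P ALL 1 5                # per CPU

# Processes using the most CPU
ps aux --sort=-%cpu | head -10

# Live view
top
# Press P to sort by CPU
# Press 1 to show each CPU separately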

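3.2 CPU Load

# Show the load averages
uptime
cat /proc/loadavg

# Interpreting load
# The three numbers are the 1-, 5- and 15-minute load averages
# Load = processes running + waiting to run (on Linux, tasks in uninterruptible I/O wait are counted too)
# 1 core: a load of 1.0 means fully utilized
# 4 cores: a load of 4.0 means fully utilized
# Load persistently above the core count → overloaded

# Show the number of CPU cores
nproc
grep -c processor /proc/cpuinfo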
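
To make the "load versus core count" rule concrete, the sketch below divides a load average by the number of cores and labels the result. The helper name load_status and the "ok"/"overloaded" labels are illustrative choices, not output of any standard tool.

```shell
# load_status LOAD NCPU: print load per core and whether it exceeds 1.0.
load_status() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN {
        ratio = load / ncpu
        printf "%.2f %s\n", ratio, (ratio > 1 ? "overloaded" : "ok")
    }'
}

# Typical use on a live system (1-minute load average):
#   load_status "$(cut -d " " -f1 /proc/loadavg)" "$(nproc)"
```

For example, `load_status 6.00 4` reports a per-core load of 1.50, which by the rule above counts as overloaded.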

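3.3 Troubleshooting CPU Problems

# 1. Find the process using the most CPU
top -c                           # show full command lines

# 2. Inspect the process
ps -ef | grep <pid>
cat /proc/<pid>/cmdline          # arguments are NUL-separated

# 3. Inspect the threads of the process
top -H -p <pid>
ps -T -p <pid>

# 4. See what the process is doing (both usually need root and slow the target down)
strace -p <pid>                  # system calls
perf top -p <pid>                # hot functions

# 5. Generate a flame graph
perf record -g -p <pid> -- sleep 30   # sample for 30 seconds
perf script > out.perf
# Feed out.perf to the FlameGraph tools (stackcollapse-perf.pl, then flamegraph.pl) to render the graph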

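IV. Memory Diagnostics

4.1 Memory Usage

# Show memory usage
free -h
# Important: read the "available" column, not "free"
# buff/cache is mostly reclaimable

# Detailed memory information
cat /proc/meminfo
grep -E "MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree" /proc/meminfo

# Memory usage over time
vmstat 1 10
# Watch si/so (swap in/out); sustained non-zero values mean memory is tight

# Processes using the most memory
ps aux --sort=-%mem | head -10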

4.2 Tracking Down Memory Leaks

# Show per-process memory usage
ps aux --sort=-%mem | head -10
top -o %MEM

# Show memory details for one process
pmap -x <pid>
cat /proc/<pid>/status | grep -E "VmPeak|VmRSS|VmSwap"

# Watch memory usage over time
while true; do
    ps -o pid,rss,cmd -p <pid>
    sleep 60
done

# Check with valgrind (the program has to be started under valgrind)
valgrind --leak-check=full ./program
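
The sampling loop above prints RSS values but leaves the trend to the reader. As a rough sketch, the samples can be logged with a timestamp and reduced to a growth rate. The log format ("epoch_seconds rss_kB" per line), the file name rss.log, the PID 1234, and the function name rss_growth_per_hour are all assumptions made for this example.

```shell
# rss_growth_per_hour: read "epoch_seconds rss_kB" samples on stdin and
# print the average RSS growth in kB per hour between first and last sample.
rss_growth_per_hour() {
    awk 'NR == 1 { t0 = $1; r0 = $2 }
         { t1 = $1; r1 = $2 }
         END {
             if (NR < 2 || t1 == t0) { print "need at least two samples"; exit 1 }
             printf "%.0f\n", (r1 - r0) * 3600 / (t1 - t0)
         }'
}

# Collecting samples for a hypothetical PID 1234, one per minute:
#   while sleep 60; do echo "$(date +%s) $(ps -o rss= -p 1234)"; done >> rss.log
#   rss_growth_per_hour < rss.log
```

A steadily positive rate over many hours of stable traffic is a hint of a leak, not proof; caches that are still warming up produce the same shape at first.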

4.3 Investigating the OOM Killer

# Find OOM events in the logs
dmesg | grep -i "out of memory"
dmesg | grep -i "killed process"
journalctl -k | grep -i "oom"

# Show a process's OOM score
cat /proc/<pid>/oom_score
cat /proc/<pid>/oom_score_adj

# Adjust a process's OOM score (range -1000 to 1000)
echo -1000 | sudo tee /proc/<pid>/oom_score_adj  # exempt from the OOM killer
echo 1000 | sudo tee /proc/<pid>/oom_score_adj   # killed first
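
To see which processes the kernel would currently pick first, the per-process oom_score files can be ranked. This is a small sketch; the function name top_oom and its optional arguments (an alternative proc root, mainly for testing, and a row count) are assumptions of this example.

```shell
# top_oom [PROC_ROOT] [COUNT]: list "score pid command" for the processes
# with the highest oom_score, highest first.
top_oom() {
    root=${1:-/proc}
    for d in "$root"/[0-9]*; do
        [ -r "$d/oom_score" ] || continue
        # A process may exit mid-scan, so read errors are silenced.
        printf '%s %s %s\n' \
            "$(cat "$d/oom_score" 2>/dev/null)" \
            "${d##*/}" \
            "$(cat "$d/comm" 2>/dev/null)"
    done | sort -rn | head -n "${2:-5}"
}

# Typical use on a live system:
#   top_oom            # five most likely OOM victims
#   top_oom /proc 10   # ten most likely
```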

4.4 Swap Analysis

# Show swap usage
free -h
swapon --show
cat /proc/swaps

# Show which processes are using swap (sorted by the VmSwap value in kB)
for pid in /proc/[0-9]*; do
    awk '/VmSwap/{if($2>0)print "'$pid':",$0}' $pid/status 2>/dev/null
done | sort -k3 -nr | head

# Or use the smem tool
sudo apt install smem
smem -s swap -r | head

# What to do when swap usage is too high
# 1. Add physical memory
# 2. Reduce the application's memory footprint
# 3. Lower swappiness
sudo sysctl vm.swappiness=10

V. Disk Diagnostics

5.1 Disk Space

# Show disk usage
df -h
df -h | grep -v tmpfs

# Show inode usage
df -i

# Find large directories
du -sh /*
du -sh /var/*
du -h --max-depth=1 / 2>/dev/null | sort -hr | head

# Find large files
find / -type f -size +100M 2>/dev/null | head
find / -type f -size +1G 2>/dev/null

# Use ncdu (recommended)
sudo apt install ncdu
ncdu /

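5.2 Disk I/O

# Show disk I/O
iostat -x 1 5
# Key metrics:
# %util: how busy the device is (above 70% may indicate a bottleneck; less telling on SSD/NVMe devices that serve requests in parallel)
# await: average I/O latency in ms (above 100 ms is high; newer sysstat reports r_await and w_await instead)
# r/s, w/s: reads and writes per second
# rkB/s, wkB/s: kB read and written per second

# Show processes doing the most I/O
sudo iotop -o
sudo iotop -P                    # show processes instead of threads

# Show I/O statistics for one process
cat /proc/<pid>/io

# Disk throughput test (writes a file of about 1 GB in the current directory)
# Write throughput
dd if=/dev/zero of=testfile bs=1M count=1000 oflag=direct
# Read throughput
dd if=testfile of=/dev/null bs=1M count=1000 iflag=direct
rm testfile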
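
The %util threshold from the iostat notes above can be applied automatically. The sketch below locates the %util column from the header row instead of hard-coding a position, because the column set differs between sysstat versions. The function name busy_disks and the default limit of 70 are assumptions taken from the rule of thumb above.

```shell
# busy_disks [LIMIT]: read `iostat -x` output on stdin and print
# "device util" for every device row whose %util exceeds LIMIT (default 70).
busy_disks() {
    awk -v limit="${1:-70}" '
        $1 ~ /^Device/ {                      # header row: find the %util column
            for (i = 1; i <= NF; i++) if ($i == "%util") u = i
            next
        }
        u && NF >= u && $u + 0 > limit + 0 { printf "%s %.1f\n", $1, $u }'
}

# Typical use on a live system:
#   iostat -x 1 3 | busy_disks 70
```

As with vmstat, the first report from iostat covers the time since boot, so the later reports are the ones that reflect current load.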

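5.3 Detecting Disk Failures

# Check disk health (SMART)
sudo apt install smartmontools
sudo smartctl -H /dev/sda        # overall health
sudo smartctl -a /dev/sda        # full details

# Check the filesystem
sudo fsck -n /dev/sda1           # check only, do not repair
# Note: run fsck on an unmounted filesystem

# Look for disk errors in the kernel log
dmesg | grep -i "I/O error"
dmesg | grep -i sda

# Scan for bad blocks
sudo badblocks -v /dev/sda       # warning: this is slow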

5.4 Freeing Disk Space

# Clean package manager caches
sudo apt clean                   # Debian/Ubuntu
sudo apt autoremove
sudo yum clean all               # CentOS/RHEL

# Trim the journal
sudo journalctl --vacuum-time=7d
sudo journalctl --vacuum-size=500M

# Remove old kernels
sudo apt autoremove --purge      # Ubuntu
package-cleanup --oldkernels     # CentOS

# Find and delete old rotated logs
sudo find /var/log -name "*.gz" -mtime +30 -delete
sudo find /var/log -name "*.log.*" -mtime +30 -delete

# Empty a large log without deleting the file (a plain "> file" redirect does not run under sudo)
sudo truncate -s 0 /var/log/large.log

VI. Network Diagnostics

6.1 Connectivity

# Basic reachability
ping -c 4 target.com

# Route tracing
traceroute target.com
tracepath target.com
mtr target.com                   # a better tool

# DNS resolution
dig target.com
nslookup target.com
host target.com

# Debugging DNS resolution
dig @8.8.8.8 target.com          # query Google DNS
dig target.com +trace            # trace the resolution path

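6.2 Ports and Connections

# Show listening ports
netstat -tunlp
ss -tunlp

# Look at a specific port
netstat -anp | grep :80
ss -anp | grep :80
lsof -i :80

# Count connections by state
netstat -ant | awk 'NR>2 {print $6}' | sort | uniq -c
ss -s

# What the states mean
# ESTABLISHED: the connection is open
# TIME_WAIT: this side closed first and is waiting out the timeout (many of these usually means lots of short-lived connections)
# CLOSE_WAIT: the peer has closed and the local application has not yet closed its socket (many of these usually means the program is not closing connections properly)

# List connections in a specific state
ss -ant state time-wait
ss -ant state close-wait

# Count the IPs connected to a specific port (IPv4 only, since cut splits on ":")
netstat -ant | grep :80 | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head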
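
The same per-state count can be produced from ss output, which is useful on hosts where netstat is not installed. A minimal sketch, assuming the default `ss -ant` layout with the state in the first column and one header line; the function name count_states is illustrative.

```shell
# count_states: read `ss -ant` output on stdin and print "count state",
# most frequent state first.
count_states() {
    awk 'NR > 1 { n[$1]++ }                  # NR > 1 skips the header line
         END { for (s in n) print n[s], s }' | sort -rn
}

# Typical use on a live system:
#   ss -ant | count_states
```

Note that ss abbreviates state names (ESTAB, TIME-WAIT, CLOSE-WAIT) compared with the netstat spellings used in the notes above.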

6.3 Network Traffic

# Watch traffic
iftop -i eth0
nload eth0

# Show protocol statistics
netstat -s
ss -s

# Show per-interface statistics
ip -s link
cat /proc/net/dev

# Show interface errors
ifconfig eth0 | grep -E "errors|dropped|overruns"

6.4 Packet Capture

# tcpdump basics
sudo tcpdump -i eth0
sudo tcpdump -i eth0 port 80
sudo tcpdump -i eth0 host 192.168.1.100

# Save a capture
sudo tcpdump -i eth0 -w capture.pcap
sudo tcpdump -i eth0 -c 100 -w capture.pcap  # capture 100 packets

# Read a capture
tcpdump -r capture.pcap
tcpdump -r capture.pcap -A                   # print payload as ASCII

# Inspect HTTP requests (plaintext HTTP only; -l keeps output line-buffered for grep)
sudo tcpdump -i eth0 -l -A -s 0 'port 80 and tcp' | grep -E "GET|POST|Host:|HTTP/"

# Watch TCP connection setup and teardown (SYN and FIN packets)
sudo tcpdump -i eth0 'tcp[tcpflags] & (tcp-syn|tcp-fin) != 0'

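6.5 Common Network Problems

# 1. Cannot connect to a remote service
ping target.com                  # check the network path
telnet target.com 80             # check the port
curl -v target.com               # check HTTP

# 2. Slow DNS resolution
time dig target.com              # measure resolution time
dig @8.8.8.8 target.com          # try a different resolver

# 3. Connection timeouts
traceroute target.com            # check the route
mtr target.com                   # check for packet loss

# 4. Connection refused
# Check that the target port is actually listening
# Check the firewall rules
sudo iptables -L -n
sudo ufw status

# 5. Large numbers of TIME_WAIT sockets
# Tune kernel parameters
sudo sysctl -w net.ipv4.tcp_tw_reuse=1      # allow reuse of TIME_WAIT sockets for new outbound connections
sudo sysctl -w net.ipv4.tcp_fin_timeout=30  # shortens FIN_WAIT_2, not TIME_WAIT itself

# 6. Bandwidth test
iperf3 -s                        # server side
iperf3 -c server_ip              # client side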

VII. Service Diagnostics

7.1 Process State

# Show service status
systemctl status nginx
systemctl status mysql

# Show service logs
journalctl -u nginx
journalctl -u nginx -f           # follow
journalctl -u nginx --since "1 hour ago"

# Find the processes
ps aux | grep nginx
pgrep -a nginx

# Inspect a process
ps -fp <pid>
cat /proc/<pid>/cmdline
cat /proc/<pid>/environ

# Show the files a process has open
lsof -p <pid>
ls -la /proc/<pid>/fd

# Show a process's network connections (-a ANDs the two lsof filters)
lsof -a -i -p <pid>
ss -tunp | grep "pid=<pid>,"

7.2 Application Log Analysis

# Follow application logs
tail -f /var/log/nginx/error.log
tail -f /var/log/mysql/error.log

# Search for errors
grep -i error /var/log/app/app.log
grep -i exception /var/log/app/app.log

# Count errors
grep -c "ERROR" /var/log/app/app.log
grep "ERROR" /var/log/app/app.log | wc -l

# Filter by time window (string comparison, so keep the window within one day)
awk '$4 >= "[20/Jan/2025:10:00:00" && $4 <= "[20/Jan/2025:11:00:00"' access.log

# Analyzing access logs (field numbers assume the common/combined log format)
# Requests per client IP
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head

# Requests per HTTP status code
awk '{print $9}' access.log | sort | uniq -c | sort -rn

# Requests per path
awk '{print $7}' access.log | sort | uniq -c | sort -rn | head

# Requests per second (busiest seconds first)
awk '{print $4}' access.log | cut -d: -f2-4 | uniq -c | sort -rn | head
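
During an incident the single most useful number from an access log is often the share of server errors. The sketch below builds on the status-code count above and assumes the same combined log format, where the status is field 9; the function name error_rate is made up for this example.

```shell
# error_rate: read access-log lines on stdin and print
# "5xx_count/total percentage" for responses with a 5xx status.
error_rate() {
    awk '{ total++; if ($9 ~ /^5[0-9][0-9]$/) bad++ }
         END {
             if (total) printf "%d/%d %.1f%%\n", bad, total, bad * 100 / total
         }'
}

# Typical use: error rate over the last 1000 requests
#   tail -n 1000 access.log | error_rate
```

Running it over successive `tail -n` windows gives a quick sense of whether the error rate is rising, flat, or recovering.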

7.3 Database Diagnostics

# MySQL
mysql -e "SHOW PROCESSLIST"
mysql -e "SHOW STATUS LIKE '%connect%'"
mysql -e "SHOW STATUS LIKE '%thread%'"
mysql -e "SHOW ENGINE INNODB STATUS\G"

# Slow query settings
mysql -e "SHOW VARIABLES LIKE 'slow_query%'"
mysql -e "SHOW VARIABLES LIKE 'long_query%'"

# PostgreSQL
psql -c "SELECT * FROM pg_stat_activity"
psql -c "SELECT * FROM pg_stat_database"

# Redis
redis-cli info
redis-cli info clients
redis-cli info memory
redis-cli slowlog get 10

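VIII. End-to-End Scenarios

8.1 The Server Is Slow

# 1. Check system load
uptime
top -bn1 | head -20

# 2. Check memory
free -h
vmstat 1 5

# 3. Check disk I/O
iostat -x 1 5

# 4. Check the network (connection count and socket summary)
netstat -ant | wc -l
ss -s

# 5. Check application processes
ps aux --sort=-%cpu | head
ps aux --sort=-%mem | head

# 6. Check the logs
tail -n 100 /var/log/app/app.log | grep -i error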

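8.2 A Service Will Not Start

# 1. Check the service status
systemctl status myapp

# 2. Read the detailed logs
journalctl -u myapp -n 100

# 3. Check the configuration
# Configuration syntax checks
nginx -t
apache2ctl configtest

# 4. Check whether the port is already in use
netstat -tunlp | grep :80
lsof -i :80

# 5. Check permissions
ls -la /path/to/app
ls -la /var/log/app

# 6. Check dependencies
ldd /path/to/binary            # shared library dependencies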

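8.3 The Disk Is Full

# 1. Check disk usage
df -h

# 2. Find large directories
du -h --max-depth=1 / 2>/dev/null | sort -hr | head

# 3. Find large files
find / -type f -size +100M 2>/dev/null

# 4. Find files that were deleted but are still held open (their space is not freed yet)
lsof | grep deleted              # or: sudo lsof +L1

# 5. Clean up
# Trim the journal
sudo journalctl --vacuum-time=3d
# Clean the package cache
sudo apt clean
# Delete old rotated logs
find /var/log -name "*.gz" -mtime +30 -delete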
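
When df reports a full disk but du cannot account for the space, step 4 is usually the explanation. Where lsof is not installed, the same information can be read from the fd symlinks under /proc. This is a sketch under that assumption; the function name deleted_open_files and its optional proc-root argument (used only to make it testable) are hypothetical.

```shell
# deleted_open_files [PROC_ROOT]: print "pid path (deleted)" for every open
# file descriptor whose target has been unlinked. Run as root to see all
# processes; unreadable entries are skipped silently.
deleted_open_files() {
    root=${1:-/proc}
    for fd in "$root"/[0-9]*/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *' (deleted)')
                rel=${fd#"$root"/}            # e.g. 1234/fd/7
                printf '%s %s\n' "${rel%%/*}" "$target"
                ;;
        esac
    done
}

# Typical use on a live system:
#   sudo sh -c '. ./deleted-open-files.sh; deleted_open_files'
```

The space is released once the owning process closes the descriptor, which in practice usually means restarting or reloading that service.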

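8.4 Network Connectivity Problems

# 1. Check the local network configuration
ip addr
ip route

# 2. Check DNS
cat /etc/resolv.conf
dig google.com

# 3. Check reachability (by IP first, then by name, to separate routing from DNS)
ping 8.8.8.8
ping google.com

# 4. Check the route
traceroute google.com

# 5. Check the firewall
sudo iptables -L -n
sudo ufw status

# 6. Check the port
telnet target 80
nc -zv target 80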

IX. Example Monitoring Scripts

9.1 System Health Check Script

#!/bin/bash
# system-health-check.sh

echo "===== System Health Check ====="
echo "Time: $(date)"
echo

echo "=== System Info ==="
uname -a
echo

echo "=== Uptime ==="
uptime
echo

echo "=== CPU Usage (Top 5) ==="
ps aux --sort=-%cpu | head -6
echo

echo "=== Memory Usage ==="
free -h
echo

echo "=== Memory Usage (Top 5) ==="
ps aux --sort=-%mem | head -6
echo

echo "=== Disk Usage ==="
df -h | grep -v tmpfs
echo

echo "=== Network Connections ==="
ss -s
echo

echo "=== Recent Errors ==="
journalctl -p err --since "1 hour ago" | tail -20
echo

echo "=== Failed Services ==="
systemctl list-units --state=failed

9.2 Disk Monitoring Script

#!/bin/bash
# disk-monitor.sh

THRESHOLD=80
EMAIL="admin@example.com"

# df -P keeps every filesystem on a single line so the columns parse reliably
df -P | grep -vE '^Filesystem|tmpfs' | while read -r _ _ _ _ usage partition; do
    usage=${usage%\%}            # strip the trailing % sign

    if [ "$usage" -ge "$THRESHOLD" ]; then
        echo "Warning: $partition is ${usage}% full"
        # Send an email
        # echo "Disk space warning: $partition is ${usage}% full" | mail -s "Disk Alert" "$EMAIL"
    fi
done

9.3 Service Monitoring Script

#!/bin/bash
# service-monitor.sh

SERVICES="nginx mysql redis"

for service in $SERVICES; do
    if ! systemctl is-active --quiet "$service"; then
        echo "$(date): $service is not running, attempting restart..."
        systemctl restart "$service"

        if systemctl is-active --quiet "$service"; then
            echo "$(date): $service restarted successfully"
        else
            echo "$(date): Failed to restart $service"
            # Send an alert
        fi
    fi
done

X. Troubleshooting Tools Quick Reference

Problem	Common commands
High CPU	`top`, `ps aux --sort=-%cpu`, `perf top`
High memory	`free -h`, `ps aux --sort=-%mem`, `vmstat`
Disk full	`df -h`, `du -sh`, `ncdu`
High disk I/O	`iostat -x`, `iotop`, `pidstat -d`
Network problems	`ping`, `traceroute`, `mtr`, `netstat`, `ss`
Process problems	`ps`, `top`, `strace`, `lsof`
Service problems	`systemctl status`, `journalctl -u`
System logs	`dmesg`, `journalctl`, `/var/log/`
Packet capture	`tcpdump`, `wireshark`, `tshark`

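Summary

Troubleshooting is an essential skill for backend developers. Keep these points in mind:

Stay calm: gather information first, then draw conclusions
Go from simple to complex: rule out the most common problems first
Read the logs: they are the most important source of information
Monitor first: with monitoring in place, problems are found sooner
Write it down: notes feed the postmortem and build up shared knowledge

Core tools:

System: top, vmstat, dmesg, journalctl
Disk: df, du, iostat, iotop
Network: netstat, ss, tcpdump, ping, traceroute
Processes: ps, strace, lsof

Run failure drills regularly to stay fluent with these tools, so that when a real incident hits you can handle it calmly.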