05-大规模集群网络架构

概述

大规模AI训练集群(万卡级别)的网络架构设计是AI Infra架构师的核心技能。本章深入讲解超大规模集群的网络设计原则、真实案例分析、以及故障容错机制。

大规模集群网络挑战

规模带来的问题

┌─────────────────────────────────────────────────────────────────────────┐
│                    大规模集群网络挑战                                    │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. 带宽需求爆炸                                                         │
│  ═══════════════                                                         │
│                                                                          │
│     单卡显存: 80GB (A100)                                                │
│     梯度大小 (175B模型): ~700GB FP32 / ~350GB FP16                       │
│     AllReduce带宽需求: 2 * 梯度大小                                      │
│                                                                          │
│     10000 卡集群:                                                        │
│     - 模型参数: 175B * 4B = 700GB                                        │
│     - 每step通信量: 1.4TB (Ring AllReduce)                              │
│     - 目标step时间: 1秒                                                  │
│     - 需要带宽: 1.4TB/s ≈ 11.2 Tbps!                                    │
│                                                                          │
│  2. 延迟敏感性                                                           │
│  ═══════════════                                                         │
│                                                                          │
│     同步训练要求:                                                        │
│     - 所有GPU同步等待AllReduce完成                                       │
│     - 任何落后者拖慢整体                                                 │
│     - 10000卡时,尾延迟放大效应严重                                       │
│                                                                          │
│     延迟来源:                                                            │
│     - 网络传输延迟: 2-10 μs                                              │
│     - 协议处理延迟: 1-2 μs                                               │
│     - 软件栈延迟: 10-100 μs                                              │
│     - 排队延迟: 0 - 数ms (拥塞时)                                        │
│                                                                          │
│  3. 故障率增加                                                           │
│  ═══════════════                                                         │
│                                                                          │
│     假设单组件年故障率:                                                  │
│     - GPU: 2%                                                            │
│     - NIC: 0.5%                                                          │
│     - 网线: 0.1%                                                         │
│     - 交换机: 0.5%                                                       │
│                                                                          │
│     10000卡集群:                                                         │
│     - 预期GPU故障: 200个/年 ≈ 每1.8天1个                                 │
│     - 预期NIC故障: 50个/年                                               │
│     - 训练作业可能持续数周到数月                                         │
│     - 必须能容忍故障并恢复                                               │
│                                                                          │
│  4. 拥塞与热点                                                           │
│  ═══════════════                                                         │
│                                                                          │
│     AllReduce流量模式:                                                   │
│     - 所有节点同时发送                                                   │
│     - 形成 incast (多对一) 模式                                          │
│     - 交换机缓冲区压力大                                                 │
│                                                                          │
│     热点形成:                                                            │
│     - 某些交换机承载更多流量                                             │
│     - Fat-Tree上层链路更容易成为瓶颈                                     │
│     - 单点故障影响面大                                                   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

网络设计原则

┌─────────────────────────────────────────────────────────────────────────┐
│                    大规模集群网络设计原则                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  原则1: 非阻塞 (Non-blocking)                                           │
│  ═══════════════════════════                                             │
│                                                                          │
│  • 定义: 任意通信模式下都能达到线速                                      │
│  • 实现: 1:1 收敛比 (无超售)                                             │
│  • 权衡: 成本高, 可适度超售 (如1.2:1)                                    │
│                                                                          │
│  原则2: 多路径 (Multi-path)                                             │
│  ═════════════════════════                                               │
│                                                                          │
│  • 定义: 任意两点间有多条等价路径                                        │
│  • 好处: 负载均衡, 故障时有备份路径                                      │
│  • 实现: ECMP, 自适应路由                                                │
│                                                                          │
│  原则3: 模块化 (Modularity)                                             │
│  ═══════════════════════════                                             │
│                                                                          │
│  • 定义: 网络由相同单元组成, 可独立扩展                                  │
│  • 好处: 简化管理, 故障隔离, 渐进扩容                                    │
│  • 实现: Pod/Block 架构                                                  │
│                                                                          │
│  原则4: 可观测 (Observable)                                             │
│  ═════════════════════════                                               │
│                                                                          │
│  • 定义: 实时了解网络状态                                                │
│  • 指标: 带宽利用率, 延迟, 丢包, 错误                                    │
│  • 工具: 遥测系统, 流量分析                                              │
│                                                                          │
│  原则5: 快速恢复 (Fast Recovery)                                        │
│  ════════════════════════════════                                        │
│                                                                          │
│  • 故障检测: 毫秒级                                                      │
│  • 路径切换: 亚秒级                                                      │
│  • 作业恢复: 分钟级                                                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

业界集群架构案例

NVIDIA SuperPOD 架构

┌─────────────────────────────────────────────────────────────────────────┐
│                    NVIDIA DGX SuperPOD 架构                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  基础单元: DGX A100 服务器                                               │
│  ═══════════════════════════                                             │
│                                                                          │
│   ┌─────────────────────────────────────────────────────────────┐       │
│   │                    DGX A100                                  │       │
│   │                                                              │       │
│   │  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
│   │  │GPU0 │ │GPU1 │ │GPU2 │ │GPU3 │ │GPU4 │ │GPU5 │ │GPU6 │ │GPU7 │ │
│   │  │ A100│ │ A100│ │ A100│ │ A100│ │ A100│ │ A100│ │ A100│ │ A100│ │
│   │  └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ │
│   │     │       │       │       │       │       │       │       │    │
│   │     └───────┴───────┴───────┴───NVSwitch───┴───────┴───────┴────┘│
│   │                        (6x NVSwitch, 4.8TB/s bisection)          │
│   │                                                                   │
│   │  网络接口:                                                        │
│   │  • 8x 200Gbps HDR InfiniBand (1.6Tbps total)                     │
│   │  • 2x 100GbE (管理/存储)                                          │
│   └──────────────────────────────────────────────────────────────────┘
│                                                                          │
│  SuperPOD 构建块 (Scalable Unit - SU)                                   │
│  ═════════════════════════════════════                                   │
│                                                                          │
│   1 SU = 20 DGX A100 = 160 GPU                                          │
│                                                                          │
│         ┌───────────────────────────────────────────────────┐           │
│         │                    InfiniBand Leaf Switches        │           │
│         │              (20x Quantum QM8790, 200G x 40 ports) │           │
│         └─────────────────────────┬─────────────────────────┘           │
│                                   │                                      │
│    ┌──────────────────────────────┼──────────────────────────────┐      │
│    │  │  │  │  │  │  │  │  │  │  │  │  │  │  │  │  │  │  │  │  │  │      │
│    ▼  ▼  ▼  ▼  ▼  ▼  ▼  ▼  ▼  ▼  ▼  ▼  ▼  ▼  ▼  ▼  ▼  ▼  ▼  ▼  ▼      │
│   DGX DGX DGX DGX DGX DGX DGX DGX DGX DGX DGX DGX DGX DGX DGX DGX ...  │
│    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  ...  │
│                                                                          │
│   每个DGX的8个200G连接到8个不同的Leaf交换机 (Rail-Optimized)             │
│                                                                          │
│  完整 SuperPOD (140 SU)                                                 │
│  ══════════════════════                                                  │
│                                                                          │
│   • 140 SU x 20 DGX = 2800 DGX A100 = 22400 GPU                         │
│   • 网络: 3层 Fat-Tree + Rail-Optimized                                 │
│   • 计算性能: ~700 PFLOPS FP16                                          │
│   • 总网络带宽: 44.8 Tbps                                               │
│                                                                          │
│         ┌─────────────────────────────────────────────────────┐         │
│         │                    Spine Switches                    │         │
│         │           (Quantum 2 QM9700, 64x 400G ports)         │         │
│         └───────────────────────────┬─────────────────────────┘         │
│                                     │                                    │
│      ┌──────────────┬───────────────┼───────────────┬──────────────┐    │
│      │              │               │               │              │    │
│      ▼              ▼               ▼               ▼              ▼    │
│   ┌──────┐      ┌──────┐       ┌──────┐       ┌──────┐       ┌──────┐  │
│   │ SU 0 │      │ SU 1 │       │ SU 2 │  ...  │SU 138│       │SU 139│  │
│   │160GPU│      │160GPU│       │160GPU│       │160GPU│       │160GPU│  │
│   └──────┘      └──────┘       └──────┘       └──────┘       └──────┘  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Meta RSC (Research SuperCluster)

┌─────────────────────────────────────────────────────────────────────────┐
│                    Meta RSC 网络架构                                     │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  规模: 16,000 A100 GPUs                                                 │
│  峰值性能: ~5 EFLOPS                                                    │
│                                                                          │
│  网络特点:                                                               │
│  ═════════                                                               │
│                                                                          │
│  1. 双网络架构:                                                          │
│     • 前端网络: 以太网, 用于存储和管理                                   │
│     • 后端网络: InfiniBand, 用于GPU通信                                  │
│                                                                          │
│  2. InfiniBand 拓扑:                                                     │
│     • 3层 Fat-Tree                                                       │
│     • 200Gbps HDR                                                        │
│     • 非阻塞设计                                                         │
│                                                                          │
│  3. 存储网络:                                                            │
│     • 高速存储: NVMe over Fabric                                        │
│     • 分布式文件系统: 数据集存储                                         │
│     • Checkpoint 存储: 快速保存/恢复                                     │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────┐        │
│  │                        网络架构                              │        │
│  │                                                              │        │
│  │   ┌─────────────────────────────────────────────────────┐   │        │
│  │   │              Spine Layer (IB Core)                   │   │        │
│  │   │          QM8790 200G Switches                        │   │        │
│  │   └───────────────────────┬─────────────────────────────┘   │        │
│  │                           │                                  │        │
│  │   ┌───────────────────────┼─────────────────────────────┐   │        │
│  │   │                       │                              │   │        │
│  │   │   ┌───────────────────┴───────────────────┐         │   │        │
│  │   │   │         Aggregation Layer             │         │   │        │
│  │   │   └───────────────────┬───────────────────┘         │   │        │
│  │   │                       │                              │   │        │
│  │   │   ┌───────────────────┴───────────────────┐         │   │        │
│  │   │   │           Leaf Layer                   │         │   │        │
│  │   │   │        (Rail-Optimized)               │         │   │        │
│  │   │   └───────────────────┬───────────────────┘         │   │        │
│  │   │                       │                              │   │        │
│  │   │   ┌─────┬─────┬─────┬─┴───┬─────┬─────┬─────┐      │   │        │
│  │   │   │     │     │     │     │     │     │     │      │   │        │
│  │   │   ▼     ▼     ▼     ▼     ▼     ▼     ▼     ▼      │   │        │
│  │   │  GPU   GPU   GPU   GPU   GPU   GPU   GPU   GPU     │   │        │
│  │   │  Pods  Pods  Pods  Pods  Pods  Pods  Pods  Pods    │   │        │
│  │   │                                                     │   │        │
│  │   │  每 Pod = 多个 Grand Teton (Meta GPU Server)        │   │        │
│  │   │  Grand Teton = 8x A100 + 8x 200G IB                │   │        │
│  │   └─────────────────────────────────────────────────────┘   │        │
│  │                                                              │        │
│  │   ┌─────────────────────────────────────────────────────┐   │        │
│  │   │              Ethernet Storage Network                │   │        │
│  │   │                                                      │   │        │
│  │   │    Storage Spine → Storage Leaf → Storage Servers   │   │        │
│  │   │                                                      │   │        │
│  │   │    • 高速 NVMe 存储                                  │   │        │
│  │   │    • 分布式文件系统                                  │   │        │
│  │   │    • 数据集缓存                                      │   │        │
│  │   └─────────────────────────────────────────────────────┘   │        │
│  └─────────────────────────────────────────────────────────────┘        │
│                                                                          │
│  性能指标:                                                               │
│  ─────────                                                               │
│  • 单卡网络带宽: 200Gbps                                                 │
│  • 集群总带宽: 3.2Tbps                                                   │
│  • AllReduce效率: >90% (大消息)                                          │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Google TPU Pod 网络

┌─────────────────────────────────────────────────────────────────────────┐
│                    Google TPU Pod 网络架构                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  TPU v4 Pod: 4096 TPU Chips                                             │
│                                                                          │
│  独特设计: 3D Torus 拓扑                                                 │
│  ══════════════════════                                                  │
│                                                                          │
│  传统方案 (Fat-Tree):                                                    │
│  • 需要大量交换机                                                        │
│  • 布线复杂                                                              │
│  • 成本高                                                                │
│                                                                          │
│  TPU 方案 (3D Torus):                                                    │
│  • 直接连接, 无需交换机                                                  │
│  • 每个TPU有6个高速链接                                                  │
│  • 规则拓扑, 布线简单                                                    │
│                                                                          │
│     3D Torus 示意 (简化为 2D):                                           │
│                                                                          │
│     TPU──TPU──TPU──TPU──┐                                               │
│      │    │    │    │   │                                               │
│     TPU──TPU──TPU──TPU──┤                                               │
│      │    │    │    │   │  环形连接                                      │
│     TPU──TPU──TPU──TPU──┤                                               │
│      │    │    │    │   │                                               │
│     TPU──TPU──TPU──TPU──┘                                               │
│      │    │    │    │                                                   │
│      └────┴────┴────┘                                                   │
│                                                                          │
│  实际TPU v4 Pod:                                                         │
│  • 4096 chips 排列成 4x4x256 3D torus                                   │
│  • 每个chip 6个 ICI (Inter-Chip Interconnect) 链路                      │
│  • 单链路带宽: 4.8 Tbps (aggregate)                                     │
│                                                                          │
│  3D Torus 优势:                                                          │
│  ────────────                                                            │
│  1. 无需交换机: 成本降低, 故障点减少                                     │
│  2. 规则拓扑: 通信模式可预测                                             │
│  3. AllReduce优化: 可以利用拓扑结构                                      │
│  4. 扩展性: 增加维度即可扩展                                             │
│                                                                          │
│  3D Torus 挑战:                                                          │
│  ────────────                                                            │
│  1. 路由复杂: 非最短路径可能更优                                         │
│  2. 故障处理: 需要绕路                                                   │
│  3. 布线: 物理布局约束                                                   │
│                                                                          │
│  TPU Pod AllReduce:                                                      │
│  ─────────────────                                                       │
│  • 利用3D Torus结构                                                      │
│  • 分维度进行Reduce                                                      │
│  • X方向 → Y方向 → Z方向                                                 │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

网络可靠性设计

故障域划分

┌─────────────────────────────────────────────────────────────────────────┐
│                        故障域 (Failure Domain) 设计                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  故障域层次:                                                             │
│  ═══════════                                                             │
│                                                                          │
│  Level 1: GPU                                                            │
│  • 影响: 单卡                                                            │
│  • 恢复: 重启进程, 从checkpoint恢复                                      │
│  • MTBF: ~50,000小时                                                     │
│                                                                          │
│  Level 2: Server                                                         │
│  • 影响: 8卡 (典型配置)                                                  │
│  • 恢复: 跳过该节点或使用备用                                            │
│  • 原因: 电源、主板、NVSwitch故障                                        │
│                                                                          │
│  Level 3: Rack (机架)                                                    │
│  • 影响: 4-8台服务器                                                     │
│  • 恢复: 隔离机架, 重调度                                                │
│  • 原因: ToR交换机、PDU故障                                              │
│                                                                          │
│  Level 4: Leaf交换机                                                     │
│  • 影响: 连接到该Leaf的所有服务器                                        │
│  • 恢复: 路由切换到冗余路径                                              │
│  • 设计: 每服务器连接多个Leaf                                            │
│                                                                          │
│  Level 5: Spine交换机                                                    │
│  • 影响: 跨Leaf通信性能下降                                              │
│  • 恢复: ECMP自动分流到其他Spine                                         │
│  • 设计: 多Spine冗余                                                     │
│                                                                          │
│  Level 6: 数据中心/AZ                                                    │
│  • 影响: 整个集群                                                        │
│  • 恢复: 跨DC备份                                                        │
│  • 原因: 电力、冷却、自然灾害                                            │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────┐        │
│  │                    故障域隔离设计                            │        │
│  │                                                              │        │
│  │                     ┌──────────┐                            │        │
│  │                     │  Spine   │                            │        │
│  │                     │ (冗余)   │                            │        │
│  │                     └────┬─────┘                            │        │
│  │              ┌───────────┼───────────┐                      │        │
│  │              │           │           │                      │        │
│  │         ┌────▼────┐ ┌────▼────┐ ┌────▼────┐                │        │
│  │         │ Leaf A  │ │ Leaf B  │ │ Leaf C  │                │        │
│  │         └────┬────┘ └────┬────┘ └────┬────┘                │        │
│  │              │           │           │                      │        │
│  │     ┌────────┼───┐   ┌───┼────────┐  │                     │        │
│  │     │        │   │   │   │        │  │                     │        │
│  │   ┌─▼─┐   ┌──▼─┐ │   │ ┌─▼──┐  ┌──▼─┐                     │        │
│  │   │S1 │   │ S2 │ └───┼─│ S3 │  │ S4 │                     │        │
│  │   └───┘   └────┘     │ └────┘  └────┘                     │        │
│  │                      │                                      │        │
│  │   每个Server连接多个Leaf, 任一Leaf故障不影响Server           │        │
│  │                                                              │        │
│  └─────────────────────────────────────────────────────────────┘        │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

故障检测与恢复

# 故障检测与恢复机制实现

import torch
import torch.distributed as dist
import time
import threading
from enum import Enum

class NodeState(Enum):
    HEALTHY = "healthy"
    SUSPECT = "suspect"
    FAILED = "failed"
    RECOVERING = "recovering"

class FailureDetector:
    """
    分布式故障检测器
    使用心跳机制检测节点故障
    """

    def __init__(self, rank, world_size, timeout_sec=10):
        self.rank = rank
        self.world_size = world_size
        self.timeout = timeout_sec
        self.last_heartbeat = {}  # rank -> timestamp
        self.node_states = {}     # rank -> NodeState

        # 初始化状态
        for r in range(world_size):
            self.last_heartbeat[r] = time.time()
            self.node_states[r] = NodeState.HEALTHY

        # 启动心跳线程
        self.running = True
        self.heartbeat_thread = threading.Thread(target=self._heartbeat_loop)
        self.heartbeat_thread.start()

    def _heartbeat_loop(self):
        """心跳发送循环"""
        while self.running:
            # 广播心跳
            heartbeat = torch.tensor([self.rank, time.time()])
            dist.all_gather_object(None, heartbeat)  # 实际使用更高效的方式

            # 检查超时
            current_time = time.time()
            for r in range(self.world_size):
                if r == self.rank:
                    continue

                elapsed = current_time - self.last_heartbeat[r]

                if elapsed > self.timeout * 2:
                    self.node_states[r] = NodeState.FAILED
                elif elapsed > self.timeout:
                    self.node_states[r] = NodeState.SUSPECT

            time.sleep(1)  # 1秒发送一次心跳

    def get_failed_ranks(self):
        """获取失败的rank列表"""
        return [r for r, state in self.node_states.items()
                if state == NodeState.FAILED]

    def shutdown(self):
        self.running = False
        self.heartbeat_thread.join()


class ElasticTrainer:
    """
    弹性训练器: 支持节点故障恢复
    """

    def __init__(self, model, optimizer, checkpoint_dir):
        self.model = model
        self.optimizer = optimizer
        self.checkpoint_dir = checkpoint_dir
        self.failure_detector = FailureDetector(
            dist.get_rank(), dist.get_world_size()
        )

        # 注册故障回调
        self._register_failure_handler()

    def _register_failure_handler(self):
        """注册NCCL故障处理"""
        # PyTorch 2.0+ 支持
        # torch.distributed.register_error_handler(self._handle_nccl_error)
        pass

    def _handle_nccl_error(self, error):
        """处理NCCL通信错误"""
        print(f"Rank {dist.get_rank()}: NCCL error detected: {error}")

        # 1. 保存当前checkpoint
        self._save_emergency_checkpoint()

        # 2. 通知其他节点
        self._broadcast_failure()

        # 3. 尝试恢复或退出
        if self._can_recover():
            self._recover()
        else:
            self._graceful_exit()

    def _save_emergency_checkpoint(self):
        """紧急保存checkpoint"""
        checkpoint = {
            'model': self.model.state_dict(),
            'optimizer': self.optimizer.state_dict(),
            'step': self.current_step,
            'timestamp': time.time(),
        }
        path = f"{self.checkpoint_dir}/emergency_rank{dist.get_rank()}.pt"
        torch.save(checkpoint, path)

    def _can_recover(self):
        """检查是否可以恢复"""
        failed_ranks = self.failure_detector.get_failed_ranks()

        # 如果失败节点太多, 无法恢复
        if len(failed_ranks) > self.max_failures:
            return False

        # 检查是否有足够的健康节点
        healthy_count = dist.get_world_size() - len(failed_ranks)
        return healthy_count >= self.min_world_size

    def _recover(self):
        """执行恢复流程"""
        # 1. 重建通信组 (排除失败节点)
        failed_ranks = self.failure_detector.get_failed_ranks()
        healthy_ranks = [r for r in range(dist.get_world_size())
                        if r not in failed_ranks]

        new_group = dist.new_group(healthy_ranks)

        # 2. 加载最新checkpoint
        checkpoint = self._load_latest_checkpoint()
        self.model.load_state_dict(checkpoint['model'])
        self.optimizer.load_state_dict(checkpoint['optimizer'])

        # 3. 调整学习率 (可选, 因为world_size变化)
        self._adjust_learning_rate(len(healthy_ranks))

        # 4. 继续训练
        print(f"Recovery complete. New world size: {len(healthy_ranks)}")

    def train_step(self, batch):
        """带故障处理的训练步骤"""
        try:
            # 正常训练逻辑
            output = self.model(batch)
            loss = self.criterion(output, batch['label'])
            loss.backward()

            # AllReduce梯度
            for param in self.model.parameters():
                if param.grad is not None:
                    dist.all_reduce(param.grad)

            self.optimizer.step()
            self.optimizer.zero_grad()

            self.current_step += 1

            # 定期checkpoint
            if self.current_step % self.checkpoint_interval == 0:
                self._save_checkpoint()

        except RuntimeError as e:
            if "NCCL" in str(e):
                self._handle_nccl_error(e)
            else:
                raise


class CheckpointManager:
    """
    Checkpoint管理器: 支持快速保存和恢复
    """

    def __init__(self, save_dir, max_checkpoints=5):
        self.save_dir = save_dir
        self.max_checkpoints = max_checkpoints
        self.checkpoints = []  # (step, path)

    def save(self, model, optimizer, step, async_save=True):
        """保存checkpoint"""
        checkpoint = {
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'step': step,
        }

        path = f"{self.save_dir}/checkpoint_{step}.pt"

        if async_save:
            # 异步保存, 不阻塞训练
            threading.Thread(
                target=self._async_save,
                args=(checkpoint, path)
            ).start()
        else:
            torch.save(checkpoint, path)

        self.checkpoints.append((step, path))
        self._cleanup_old_checkpoints()

    def _async_save(self, checkpoint, path):
        """异步保存实现"""
        # 先保存到临时文件
        temp_path = path + ".tmp"
        torch.save(checkpoint, temp_path)

        # 原子重命名
        import os
        os.rename(temp_path, path)

    def _cleanup_old_checkpoints(self):
        """清理旧的checkpoint"""
        while len(self.checkpoints) > self.max_checkpoints:
            _, old_path = self.checkpoints.pop(0)
            import os
            if os.path.exists(old_path):
                os.remove(old_path)

    def load_latest(self):
        """加载最新的checkpoint"""
        if not self.checkpoints:
            return None

        _, path = self.checkpoints[-1]
        return torch.load(path)

网络冗余设计

┌─────────────────────────────────────────────────────────────────────────┐
│                        网络冗余设计                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. 链路冗余                                                             │
│  ═══════════                                                             │
│                                                                          │
│     单链路:                     双链路 (推荐):                           │
│                                                                          │
│     ┌─────┐                    ┌─────┐                                  │
│     │ GPU │                    │ GPU │                                  │
│     └──┬──┘                    └─┬─┬─┘                                  │
│        │                         │ │                                     │
│     ┌──▼──┐                    ┌─▼─▼─┐                                  │
│     │ NIC │                    │ NIC │ (支持bonding)                    │
│     └──┬──┘                    └─┬─┬─┘                                  │
│        │                         │ │                                     │
│     ┌──▼──┐                   ┌──▼─┼──┐                                 │
│     │Switch                   │Switch│Switch│                           │
│     └─────┘                   └──────┴──────┘                           │
│                                                                          │
│     单点故障                   链路故障不影响                            │
│                                                                          │
│  2. 交换机冗余                                                           │
│  ═══════════════                                                         │
│                                                                          │
│     MLAG (Multi-Chassis Link Aggregation):                               │
│                                                                          │
│         ┌──────────────────────────┐                                    │
│         │      Virtual Switch      │                                    │
│         │  ┌─────────┬─────────┐  │                                    │
│         │  │ Switch1 │ Switch2 │  │                                    │
│         │  │ (Active)│(Standby)│  │                                    │
│         │  └────┬────┴────┬────┘  │                                    │
│         └───────┼─────────┼───────┘                                    │
│                 │         │                                              │
│         ┌───────┴────┬────┴───────┐                                    │
│         │            │            │                                     │
│         ▼            ▼            ▼                                     │
│     Server 1    Server 2    Server 3                                    │
│                                                                          │
│     两个物理交换机呈现为一个逻辑交换机                                   │
│     任一交换机故障,另一个接管                                            │
│                                                                          │
│  3. 路径冗余 (ECMP)                                                      │
│  ═════════════════                                                       │
│                                                                          │
│     ┌─────────────────────────────────────────────────┐                 │
│     │                   Spine Layer                    │                 │
│     │   Spine1    Spine2    Spine3    Spine4          │                 │
│     │     ▲         ▲         ▲         ▲             │                 │
│     │     │         │         │         │             │                 │
│     │     └────┬────┴────┬────┴────┬────┘             │                 │
│     │          │         │         │                  │                 │
│     └──────────┼─────────┼─────────┼──────────────────┘                 │
│                │         │         │                                     │
│         ┌──────▼─────────▼─────────▼──────┐                             │
│         │            Leaf Switch          │                             │
│         └─────────────────────────────────┘                             │
│                                                                          │
│     ECMP (Equal-Cost Multi-Path):                                        │
│     • 多条等价路径                                                       │
│     • 负载均衡流量                                                       │
│     • 任一Spine故障, 流量自动切换到其他Spine                             │
│                                                                          │
│  4. 双网络架构                                                           │
│  ═══════════════                                                         │
│                                                                          │
│     ┌─────────────────────────────────────────────────┐                 │
│     │                  GPU Server                      │                 │
│     │                                                  │                 │
│     │   ┌───────────────────────────────────────┐    │                 │
│     │   │               GPUs                     │    │                 │
│     │   └───────────────┬───────────────────────┘    │                 │
│     │                   │                             │                 │
│     │         ┌─────────┴─────────┐                  │                 │
│     │         │                   │                  │                 │
│     │    ┌────▼────┐        ┌────▼────┐             │                 │
│     │    │Primary  │        │Backup   │             │                 │
│     │    │Network  │        │Network  │             │                 │
│     │    │(IB HDR) │        │(IB HDR) │             │                 │
│     │    └────┬────┘        └────┬────┘             │                 │
│     └─────────┼─────────────────┼───────────────────┘                 │
│               │                 │                                       │
│               ▼                 ▼                                       │
│         IB Fabric 1       IB Fabric 2                                  │
│         (Primary)         (Standby)                                    │
│                                                                          │
│     完全独立的两套网络:                                                  │
│     • 正常时使用Primary                                                  │
│     • Primary故障时切换到Backup                                          │
│     • 可以同时使用 (bonding) 增加带宽                                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

网络监控与诊断

监控指标体系

# 网络监控指标采集系统

from dataclasses import dataclass
from typing import Dict, List
import time
import subprocess
import re

@dataclass
class NetworkMetrics:
    """网络监控指标"""
    timestamp: float

    # 带宽指标
    tx_bytes: int           # 发送字节数
    rx_bytes: int           # 接收字节数
    tx_rate_gbps: float     # 发送速率 (Gbps)
    rx_rate_gbps: float     # 接收速率 (Gbps)

    # 延迟指标
    latency_avg_us: float   # 平均延迟 (微秒)
    latency_p99_us: float   # P99延迟 (微秒)
    latency_max_us: float   # 最大延迟 (微秒)

    # 错误指标
    tx_errors: int          # 发送错误
    rx_errors: int          # 接收错误
    retransmits: int        # 重传次数
    link_downed: int        # 链路断开次数

    # 拥塞指标
    tx_discards: int        # 发送丢弃
    rx_discards: int        # 接收丢弃
    ecn_marks: int          # ECN标记数

    # 硬件指标
    symbol_errors: int      # 符号错误
    link_recovery: int      # 链路恢复次数


class IBMonitor:
    """InfiniBand 网络监控"""

    def __init__(self, device: str = "mlx5_0"):
        self.device = device
        self.port = 1
        self.prev_metrics = None
        self.prev_time = None

    def collect_metrics(self) -> NetworkMetrics:
        """采集IB指标"""
        current_time = time.time()

        # 读取计数器
        counters = self._read_counters()

        # 计算速率
        if self.prev_metrics:
            dt = current_time - self.prev_time
            tx_rate = (counters['tx_bytes'] - self.prev_metrics.tx_bytes) * 8 / dt / 1e9
            rx_rate = (counters['rx_bytes'] - self.prev_metrics.rx_bytes) * 8 / dt / 1e9
        else:
            tx_rate = rx_rate = 0

        metrics = NetworkMetrics(
            timestamp=current_time,
            tx_bytes=counters['tx_bytes'],
            rx_bytes=counters['rx_bytes'],
            tx_rate_gbps=tx_rate,
            rx_rate_gbps=rx_rate,
            latency_avg_us=self._measure_latency()['avg'],
            latency_p99_us=self._measure_latency()['p99'],
            latency_max_us=self._measure_latency()['max'],
            tx_errors=counters.get('tx_errors', 0),
            rx_errors=counters.get('rx_errors', 0),
            retransmits=counters.get('retransmits', 0),
            link_downed=counters.get('link_downed', 0),
            tx_discards=counters.get('tx_discards', 0),
            rx_discards=counters.get('rx_discards', 0),
            ecn_marks=counters.get('ecn_marks', 0),
            symbol_errors=counters.get('symbol_errors', 0),
            link_recovery=counters.get('link_recovery', 0),
        )

        self.prev_metrics = metrics
        self.prev_time = current_time

        return metrics

    def _read_counters(self) -> Dict:
        """读取IB计数器"""
        counters = {}

        # 使用 perfquery 获取计数器
        try:
            output = subprocess.check_output(
                ['perfquery', '-x', '-d', self.device, '-P', str(self.port)],
                text=True
            )

            # 解析输出
            for line in output.split('\n'):
                if ':' in line:
                    key, value = line.split(':', 1)
                    key = key.strip().lower().replace(' ', '_')
                    try:
                        counters[key] = int(value.strip().split()[0])
                    except:
                        pass

            # 转换为标准名称
            counters['tx_bytes'] = counters.get('xmitdata', 0) * 4
            counters['rx_bytes'] = counters.get('rcvdata', 0) * 4
            counters['tx_errors'] = counters.get('xmiterrors', 0)
            counters['rx_errors'] = counters.get('rcverrors', 0)

        except Exception as e:
            print(f"Error reading counters: {e}")

        return counters

    def _measure_latency(self) -> Dict:
        """测量延迟 (简化实现)"""
        # 实际应该使用 ib_write_lat 或类似工具
        return {
            'avg': 2.0,   # 微秒
            'p99': 5.0,
            'max': 10.0,
        }

    def check_health(self) -> Dict:
        """健康检查"""
        metrics = self.collect_metrics()
        issues = []

        # 检查错误率
        if metrics.tx_errors > 0 or metrics.rx_errors > 0:
            issues.append(f"Errors detected: TX={metrics.tx_errors}, RX={metrics.rx_errors}")

        # 检查链路状态
        if metrics.link_downed > 0:
            issues.append(f"Link downed {metrics.link_downed} times")

        # 检查延迟
        if metrics.latency_p99_us > 10:
            issues.append(f"High latency: P99={metrics.latency_p99_us}us")

        # 检查丢包
        if metrics.tx_discards > 0 or metrics.rx_discards > 0:
            issues.append(f"Discards: TX={metrics.tx_discards}, RX={metrics.rx_discards}")

        return {
            'healthy': len(issues) == 0,
            'issues': issues,
            'metrics': metrics,
        }


class NCCLMonitor:
    """NCCL 通信监控"""

    def __init__(self):
        self.comm_times = []  # 记录通信时间
        self.comm_sizes = []  # 记录通信大小

    def record_allreduce(self, size_bytes: int, time_ms: float):
        """记录AllReduce通信"""
        self.comm_times.append(time_ms)
        self.comm_sizes.append(size_bytes)

    def get_statistics(self) -> Dict:
        """获取通信统计"""
        if not self.comm_times:
            return {}

        import numpy as np
        times = np.array(self.comm_times)
        sizes = np.array(self.comm_sizes)

        # 计算带宽 (考虑AllReduce的2x通信量)
        bandwidths = 2 * sizes / times / 1e6  # GB/s

        return {
            'total_comms': len(self.comm_times),
            'total_bytes': sum(self.comm_sizes),
            'avg_time_ms': np.mean(times),
            'p99_time_ms': np.percentile(times, 99),
            'avg_bandwidth_gbps': np.mean(bandwidths) * 8,
            'min_bandwidth_gbps': np.min(bandwidths) * 8,
        }

    def detect_stragglers(self, world_size: int, times: List[float]) -> List[int]:
        """检测落后者"""
        import numpy as np
        median_time = np.median(times)
        threshold = median_time * 1.5  # 比中位数慢50%以上

        stragglers = [i for i, t in enumerate(times) if t > threshold]
        return stragglers


# 监控仪表板
class NetworkDashboard:
    """网络监控仪表板"""

    def __init__(self, cluster_size: int):
        self.cluster_size = cluster_size
        self.ib_monitors = {}
        self.nccl_monitor = NCCLMonitor()

    def create_summary(self) -> str:
        """生成监控摘要"""
        summary = []
        summary.append("=" * 60)
        summary.append("Network Monitoring Dashboard")
        summary.append("=" * 60)

        # 集群概览
        healthy_nodes = sum(1 for m in self.ib_monitors.values()
                          if m.check_health()['healthy'])
        summary.append(f"\nCluster Health: {healthy_nodes}/{self.cluster_size} nodes healthy")

        # 带宽统计
        nccl_stats = self.nccl_monitor.get_statistics()
        if nccl_stats:
            summary.append(f"\nNCCL Statistics:")
            summary.append(f"  Avg Bandwidth: {nccl_stats['avg_bandwidth_gbps']:.1f} Gbps")
            summary.append(f"  Min Bandwidth: {nccl_stats['min_bandwidth_gbps']:.1f} Gbps")
            summary.append(f"  Avg Time: {nccl_stats['avg_time_ms']:.2f} ms")
            summary.append(f"  P99 Time: {nccl_stats['p99_time_ms']:.2f} ms")

        # 问题节点
        issues = []
        for rank, monitor in self.ib_monitors.items():
            health = monitor.check_health()
            if not health['healthy']:
                issues.append(f"  Rank {rank}: {', '.join(health['issues'])}")

        if issues:
            summary.append(f"\nIssues Detected:")
            summary.extend(issues)

        summary.append("=" * 60)
        return "\n".join(summary)

诊断命令速查

# ============================================================
# InfiniBand 诊断命令
# ============================================================

# 1. 查看IB设备状态
ibstatus
# 输出: 端口状态 (Active/Down), 速率 (200Gbps), 链路层 (IB/Eth)

# 2. 查看设备详情
ibv_devinfo -d mlx5_0 -v

# 3. 查看端口计数器
perfquery -x

# 4. 子网管理器状态
sminfo

# 5. 拓扑发现
ibnetdiscover

# 6. 诊断网络
ibdiagnet -r

# 7. 带宽测试
ib_write_bw -d mlx5_0 -s 1048576 -n 1000  # Server
ib_write_bw -d mlx5_0 -s 1048576 -n 1000 <server_ip>  # Client

# 8. 延迟测试
ib_write_lat -d mlx5_0 -s 8 -n 10000  # Server
ib_write_lat -d mlx5_0 -s 8 -n 10000 <server_ip>  # Client

# ============================================================
# NCCL 诊断
# ============================================================

# 1. 启用调试日志
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# 2. 显示拓扑信息
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=GRAPH

# 3. NCCL测试
cd nccl-tests
./all_reduce_perf -b 1M -e 1G -f 2 -g 8

# 4. 指定算法测试
NCCL_ALGO=Ring ./all_reduce_perf -b 256M -e 256M -g 8
NCCL_ALGO=Tree ./all_reduce_perf -b 256M -e 256M -g 8

# ============================================================
# PCIe 诊断
# ============================================================

# 1. GPU拓扑
nvidia-smi topo -m

# 2. PCIe树
lspci -tv

# 3. PCIe链路状态
lspci -vv | grep -E "LnkSta:|Speed"

# 4. GPU PCIe带宽
nvidia-smi dmon -s ut

# ============================================================
# 系统级诊断
# ============================================================

# 1. NUMA拓扑
numactl -H

# 2. CPU亲和性
taskset -p <pid>

# 3. 中断亲和
cat /proc/interrupts | grep mlx5

# 4. 网络统计
netstat -s
ss -s

总结

大规模集群网络设计清单

┌─────────────────────────────────────────────────────────────────────────┐
│                    网络架构设计清单                                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  □ 拓扑选择                                                              │
│    □ 集群规模评估                                                        │
│    □ Fat-Tree / Rail-Optimized / Dragonfly 选型                         │
│    □ 收敛比设计 (推荐1:1)                                                │
│                                                                          │
│  □ 硬件选型                                                              │
│    □ 交换机: IB NDR 400G / HDR 200G                                     │
│    □ 网卡: ConnectX-7 / ConnectX-6                                      │
│    □ 线缆: AOC / DAC / 光纤                                              │
│                                                                          │
│  □ 服务器网络设计                                                        │
│    □ GPU-NIC 亲和性 (同一PCIe switch)                                   │
│    □ 多NIC配置 (每GPU一个NIC)                                           │
│    □ NVLink/NVSwitch 配置                                               │
│                                                                          │
│  □ 可靠性设计                                                            │
│    □ 链路冗余                                                            │
│    □ 交换机冗余 (MLAG)                                                  │
│    □ 多路径 (ECMP)                                                      │
│    □ 双网络架构 (可选)                                                  │
│                                                                          │
│  □ 故障处理                                                              │
│    □ 故障检测机制                                                        │
│    □ 自动切换策略                                                        │
│    □ Checkpoint策略                                                      │
│    □ 弹性训练支持                                                        │
│                                                                          │
│  □ 监控告警                                                              │
│    □ 带宽/延迟监控                                                       │
│    □ 错误/丢包监控                                                       │
│    □ 健康检查                                                            │
│    □ 告警阈值设置                                                        │
│                                                                          │
│  □ 性能调优                                                              │
│    □ NCCL参数优化                                                        │
│    □ IB参数优化                                                          │
│    □ NUMA/亲和性优化                                                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

面试高频问题

万卡集群的网络架构如何设计?
Fat-Tree和Rail-Optimized的优缺点?
如何处理大规模训练中的节点故障?
ECMP负载均衡的原理和局限?
如何监控和诊断分布式训练的网络问题?
GPUDirect RDMA如何减少通信延迟?
训练过程中遇到NCCL超时如何排查?