深度学习核心

1. 神经网络基础

1.1 前向传播 (Forward Propagation)

前向传播是神经网络计算输出的过程,数据从输入层逐层传递到输出层。

单个神经元计算:

z = w·x + b
a = σ(z)

多层网络:

z[1] = W[1]·x + b[1]
a[1] = σ(z[1])

z[2] = W[2]·a[1] + b[2]
a[2] = σ(z[2])
...

PyTorch实现:

import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # 前向传播
        z1 = self.fc1(x)           # 线性变换
        a1 = self.relu(z1)         # 激活函数
        z2 = self.fc2(a1)          # 输出层
        return z2

# 使用示例
model = SimpleNN(input_size=10, hidden_size=20, output_size=2)
x = torch.randn(32, 10)  # batch_size=32
output = model(x)
print(f"输出形状: {output.shape}")  # [32, 2]

1.2 反向传播 (Backpropagation)

反向传播通过链式法则计算梯度,从输出层反向传递到输入层。

核心公式(链式法则):

∂L/∂w = ∂L/∂a × ∂a/∂z × ∂z/∂w

详细推导:

对于最后一层:

∂L/∂z[L] = a[L] - y  (Sigmoid/Softmax 输出配合交叉熵损失时的简化结果)
∂L/∂W[L] = (∂L/∂z[L]) · a[L-1]ᵀ
∂L/∂b[L] = ∂L/∂z[L]

对于中间层:

∂L/∂z[l] = (W[l+1]ᵀ · ∂L/∂z[l+1]) ⊙ σ'(z[l])
∂L/∂W[l] = (∂L/∂z[l]) · a[l-1]ᵀ
∂L/∂b[l] = ∂L/∂z[l]

手动实现示例:

import numpy as np

class NeuralNetwork:
    def __init__(self, layer_sizes):
        self.weights = []
        self.biases = []

        for i in range(len(layer_sizes) - 1):
            w = np.random.randn(layer_sizes[i+1], layer_sizes[i]) * 0.01
            b = np.zeros((layer_sizes[i+1], 1))
            self.weights.append(w)
            self.biases.append(b)

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def sigmoid_derivative(self, z):
        s = self.sigmoid(z)
        return s * (1 - s)

    def forward(self, X):
        self.z_values = []
        self.activations = [X]

        for w, b in zip(self.weights, self.biases):
            z = np.dot(w, self.activations[-1]) + b
            a = self.sigmoid(z)
            self.z_values.append(z)
            self.activations.append(a)

        return self.activations[-1]

    def backward(self, X, y, learning_rate=0.01):
        m = X.shape[1]

        # 输出层梯度
        dz = self.activations[-1] - y

        # 反向传播
        for i in reversed(range(len(self.weights))):
            dw = (1/m) * np.dot(dz, self.activations[i].T)
            db = (1/m) * np.sum(dz, axis=1, keepdims=True)

            if i > 0:
                dz = np.dot(self.weights[i].T, dz) * self.sigmoid_derivative(self.z_values[i-1])

            # 更新参数
            self.weights[i] -= learning_rate * dw
            self.biases[i] -= learning_rate * db
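
下面给出一个最小的训练示意(以异或问题为例;层结构、学习率与迭代次数均为示例假设,收敛情况依随机初始化而定):

# 使用示例: 用上面的 NeuralNetwork 学习 XOR
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]])        # [特征数, 样本数] = [2, 4]
y = np.array([[0, 1, 1, 0]])        # [输出维度, 样本数] = [1, 4]

net = NeuralNetwork([2, 8, 1])      # 2 -> 8 -> 1 的全连接网络
for epoch in range(20000):
    net.forward(X)                  # 前向传播,缓存各层 z 与激活值
    net.backward(X, y, learning_rate=0.5)

print(net.forward(X))               # 多数情况下会逐渐逼近 [[0, 1, 1, 0]],若未收敛可增大迭代次数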

1.3 梯度下降

梯度下降是优化神经网络参数的核心算法。

参数更新公式:

w := w - α·∂L/∂w

批量梯度下降(BGD):

for epoch in range(num_epochs):
    # 使用全部数据计算梯度
    gradients = compute_gradients(X_train, y_train)
    weights -= learning_rate * gradients

随机梯度下降(SGD):

for epoch in range(num_epochs):
    for i in range(n_samples):
        # 每次使用一个样本
        gradients = compute_gradients(X_train[i], y_train[i])
        weights -= learning_rate * gradients

小批量梯度下降(Mini-batch GD):

for epoch in range(num_epochs):
    for batch in get_batches(X_train, y_train, batch_size):
        X_batch, y_batch = batch
        gradients = compute_gradients(X_batch, y_batch)
        weights -= learning_rate * gradients
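
上面三段代码中的 compute_gradients、get_batches 只是示意性的占位函数;下面给出一个可直接运行的小批量梯度下降最小示例(以线性回归 + MSE 为例,数据与超参数均为假设取值):

import numpy as np

def get_batches(X, y, batch_size):
    """随机打乱后按 batch_size 切分数据(示例实现)"""
    indices = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = indices[start:start + batch_size]
        yield X[idx], y[idx]

# 构造线性回归数据: y ≈ X·w + 噪声
X_train = np.random.randn(1000, 3)
true_w = np.array([2.0, -1.0, 0.5])
y_train = X_train @ true_w + 0.1 * np.random.randn(1000)

weights = np.zeros(3)
learning_rate, num_epochs, batch_size = 0.1, 20, 32

for epoch in range(num_epochs):
    for X_batch, y_batch in get_batches(X_train, y_train, batch_size):
        pred = X_batch @ weights
        gradients = 2 * X_batch.T @ (pred - y_batch) / len(X_batch)  # MSE 梯度
        weights -= learning_rate * gradients

print(weights)  # 应接近 true_w = [2.0, -1.0, 0.5]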

1.4 激活函数

激活函数引入非线性,使神经网络能够学习复杂模式。

| 激活函数 | 公式 | 导数 | 优点 | 缺点 |
|---|---|---|---|---|
| Sigmoid | σ(x) = 1/(1+e^(-x)) | σ'(x) = σ(x)(1-σ(x)) | 输出(0,1),易理解 | 梯度消失,计算慢 |
| Tanh | tanh(x) = (e^x-e^(-x))/(e^x+e^(-x)) | 1-tanh²(x) | 输出(-1,1),零中心 | 梯度消失 |
| ReLU | max(0,x) | x>0: 1, x≤0: 0 | 计算快,缓解梯度消失 | Dead ReLU问题 |
| Leaky ReLU | max(0.01x, x) | x>0: 1, x≤0: 0.01 | 解决Dead ReLU | 需调参 |
| GELU | x·Φ(x) | 复杂 | Transformer常用 | 计算复杂 |

PyTorch实现:

import torch.nn as nn

# 不同激活函数对比
activations = {
    'ReLU': nn.ReLU(),
    'Sigmoid': nn.Sigmoid(),
    'Tanh': nn.Tanh(),
    'LeakyReLU': nn.LeakyReLU(0.01),
    'GELU': nn.GELU()
}

x = torch.linspace(-5, 5, 100)
for name, activation in activations.items():
    y = activation(x)
    print(f"{name}: min={y.min():.2f}, max={y.max():.2f}")

2. 优化器

2.1 SGD (Stochastic Gradient Descent)

基础的随机梯度下降:

w := w - α·∇L(w)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

2.2 Momentum

引入动量,加速收敛:

v := β·v + ∇L(w)
w := w - α·v

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

2.3 AdaGrad

自适应调整每个参数的学习率:

G := G + (∇L(w))²
w := w - α/(√G + ε) · ∇L(w)

问题: 学习率单调递减,可能过早停止学习。
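
PyTorch 中可直接使用内置的 torch.optim.Adagrad;下面再给出按上式手写的更新示意(w 与梯度序列均为演示用的假设数据):

import numpy as np

w = np.array([1.0, 2.0])
G = np.zeros_like(w)                       # 累积平方梯度
lr, eps = 0.1, 1e-8

for grad in [np.array([0.5, -1.0]), np.array([0.3, -0.8])]:  # 假设的两步梯度
    G += grad ** 2                         # 累积历史梯度平方
    w -= lr * grad / (np.sqrt(G) + eps)    # 累积越大,该参数的有效学习率越小

# PyTorch 内置实现
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)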

2.4 Adam (推荐)

结合Momentum和RMSprop的优点:

# 一阶矩估计(Momentum)
m := β₁·m + (1-β₁)·∇L(w)

# 二阶矩估计(RMSprop)
v := β₂·v + (1-β₂)·(∇L(w))²

# 偏差修正
m̂ := m / (1-β₁ᵗ)
v̂ := v / (1-β₂ᵗ)

# 参数更新
w := w - α·m̂ / (√v̂ + ε)

PyTorch实现:

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,           # 学习率
    betas=(0.9, 0.999), # β₁, β₂
    eps=1e-8,           # 防止除零
    weight_decay=0.01   # L2正则化
)

优化器对比:

# 完整训练示例(num_epochs、dataloader、criterion 假设已在外部定义)
import torch.optim as optim

def train_with_optimizer(model, optimizer_name):
    if optimizer_name == 'SGD':
        optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    elif optimizer_name == 'Adam':
        optimizer = optim.Adam(model.parameters(), lr=0.001)
    elif optimizer_name == 'AdamW':
        optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
    else:
        raise ValueError(f"未知优化器: {optimizer_name}")

    for epoch in range(num_epochs):
        for batch_x, batch_y in dataloader:
            optimizer.zero_grad()       # 清空梯度
            output = model(batch_x)     # 前向传播
            loss = criterion(output, batch_y)
            loss.backward()             # 反向传播
            optimizer.step()            # 更新参数

3. CNN (卷积神经网络)

3.1 卷积层

卷积操作提取局部特征:

计算公式:

输出尺寸 = ⌊(输入尺寸 - 卷积核尺寸 + 2×填充) / 步长⌋ + 1  (除法向下取整)
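
代入两组常见配置(数值仅作演示):

3×3 卷积、padding=1、stride=1,输入 32×32: ⌊(32 - 3 + 2×1)/1⌋ + 1 = 32,尺寸不变
3×3 卷积、padding=1、stride=2,输入 32×32: ⌊(32 - 3 + 2×1)/2⌋ + 1 = 16,尺寸减半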

PyTorch实现:

class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        # 卷积层: in_channels, out_channels, kernel_size
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(2, 2)  # 池化层
        self.fc = nn.Linear(64 * 8 * 8, 10)

    def forward(self, x):
        # x: [batch, 3, 32, 32]
        x = self.pool(torch.relu(self.conv1(x)))  # [batch, 32, 16, 16]
        x = self.pool(torch.relu(self.conv2(x)))  # [batch, 64, 8, 8]
        x = x.view(x.size(0), -1)                 # 展平
        x = self.fc(x)                            # 全连接层
        return x

3.2 池化层

降低特征维度,保持重要特征:

  • 最大池化(Max Pooling): 取窗口内最大值
  • 平均池化(Average Pooling): 取窗口内平均值

# 最大池化
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

# 平均池化
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

# 自适应池化(输出固定尺寸)
adaptive_pool = nn.AdaptiveAvgPool2d((1, 1))

3.3 经典CNN网络

LeNet-5 (1998)

最早的CNN网络,用于手写数字识别:

class LeNet5(nn.Module):
    def __init__(self):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16*4*4, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.max_pool2d(x, 2)
        x = torch.relu(self.conv2(x))
        x = torch.max_pool2d(x, 2)
        x = x.view(-1, 16*4*4)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x
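
形状检查示例(此实现假设输入为 28×28 的单通道图像,如 MNIST,否则 16*4*4 的全连接维度需相应调整):

model = LeNet5()
x = torch.randn(64, 1, 28, 28)      # [batch, 1, 28, 28]
output = model(x)
print(f"输出形状: {output.shape}")   # [64, 10]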

AlexNet (2012)

ImageNet冠军,深度学习崛起标志:

  • 8层网络(5卷积+3全连接)
  • 使用ReLU激活函数
  • Dropout防止过拟合
  • 数据增强

VGG (2014)

使用小卷积核(3×3)堆叠:

class VGG16(nn.Module):
    def __init__(self, num_classes=1000):
        super(VGG16, self).__init__()
        self.features = nn.Sequential(
            # Block 1
            nn.Conv2d(3, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),

            # Block 2
            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            # ... 更多层
        )

ResNet (2015)

引入残差连接,解决梯度消失:

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super(ResidualBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride, 1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, 1, 1)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # 跳跃连接
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        residual = x
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(residual)  # 残差连接
        out = torch.relu(out)
        return out
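
使用示例(通道数与 stride 为示例取值,演示跳跃连接分支同步完成下采样与升维):

block = ResidualBlock(in_channels=64, out_channels=128, stride=2)
x = torch.randn(8, 64, 32, 32)
out = block(x)
print(f"输出形状: {out.shape}")  # [8, 128, 16, 16]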

4. RNN/LSTM/GRU

4.1 RNN (循环神经网络)

处理序列数据,具有记忆能力:

前向传播:

h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b_h)
y_t = W_hy·h_t + b_y
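
用 PyTorch 内置 nn.RNN 的最小示例(各维度均为示例取值):

rnn = nn.RNN(input_size=10, hidden_size=32, num_layers=1, batch_first=True)
x = torch.randn(16, 50, 10)       # [batch, seq_len, input_size]
output, h_n = rnn(x)
print(output.shape, h_n.shape)    # [16, 50, 32], [1, 16, 32]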

问题:

  • 梯度消失: 长序列时梯度趋近0
  • 梯度爆炸: 梯度指数增长

4.2 LSTM (长短期记忆网络)

通过门控机制解决长期依赖问题:

三个门:

遗忘门: f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
输入门: i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
输出门: o_t = σ(W_o·[h_{t-1}, x_t] + b_o)

候选值: C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C)
细胞状态: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
隐藏状态: h_t = o_t ⊙ tanh(C_t)

PyTorch实现:

class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(LSTMModel, self).__init__()
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.2
        )
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x: [batch, seq_len, input_size]
        lstm_out, (h_n, c_n) = self.lstm(x)
        # lstm_out: [batch, seq_len, hidden_size]
        # 取最后一个时间步的输出
        output = self.fc(lstm_out[:, -1, :])
        return output

# 使用示例
model = LSTMModel(input_size=10, hidden_size=64, num_layers=2, output_size=5)
x = torch.randn(32, 20, 10)  # [batch, seq_len, features]
output = model(x)
print(f"输出形状: {output.shape}")  # [32, 5]

4.3 GRU (门控循环单元)

LSTM的简化版本,参数更少:

两个门:

重置门: r_t = σ(W_r·[h_{t-1}, x_t] + b_r)
更新门: z_t = σ(W_z·[h_{t-1}, x_t] + b_z)

候选隐藏状态: h̃_t = tanh(W·[r_t ⊙ h_{t-1}, x_t] + b)
隐藏状态: h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
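
PyTorch 内置 nn.GRU 的用法与 nn.LSTM 基本一致,只是没有细胞状态(维度为示例取值):

gru = nn.GRU(input_size=10, hidden_size=64, num_layers=2, batch_first=True)
x = torch.randn(32, 20, 10)       # [batch, seq_len, input_size]
output, h_n = gru(x)              # 只返回隐藏状态 h_n,没有 c_n
print(output.shape, h_n.shape)    # [32, 20, 64], [2, 32, 64]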

LSTM vs GRU对比:

| 特性 | LSTM | GRU |
|---|---|---|
| 门数量 | 3个(遗忘/输入/输出) | 2个(重置/更新) |
| 参数量 | 更多 | 更少 |
| 计算速度 | 较慢 | 较快 |
| 表达能力 | 更强 | 稍弱 |
| 适用场景 | 复杂长序列 | 简单序列,资源受限 |

5. Transformer

5.1 Self-Attention

Attention机制的核心公式:

Attention(Q, K, V) = softmax(QKᵀ / √d_k)·V

  • Q(Query): 查询向量
  • K(Key): 键向量
  • V(Value): 值向量
  • d_k: Key维度(用于缩放)

PyTorch实现:

import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, embed_dim):
        super(SelfAttention, self).__init__()
        self.embed_dim = embed_dim
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        # x: [batch, seq_len, embed_dim]
        Q = self.query(x)  # [batch, seq_len, embed_dim]
        K = self.key(x)
        V = self.value(x)

        # 计算注意力分数
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.embed_dim ** 0.5)
        # scores: [batch, seq_len, seq_len]

        # Softmax归一化
        attention_weights = F.softmax(scores, dim=-1)

        # 加权求和
        output = torch.matmul(attention_weights, V)
        return output, attention_weights

5.2 Multi-Head Attention

并行多个Attention,捕捉不同子空间的信息:

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert embed_dim % num_heads == 0

        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        self.qkv_proj = nn.Linear(embed_dim, embed_dim * 3)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        batch_size, seq_len, embed_dim = x.shape

        # 线性投影并分割成多头
        qkv = self.qkv_proj(x)  # [batch, seq_len, embed_dim*3]
        qkv = qkv.reshape(batch_size, seq_len, self.num_heads, 3*self.head_dim)
        qkv = qkv.permute(0, 2, 1, 3)  # [batch, num_heads, seq_len, 3*head_dim]

        q, k, v = qkv.chunk(3, dim=-1)

        # 缩放点积注意力
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attention = F.softmax(scores, dim=-1)
        out = torch.matmul(attention, v)

        # 合并多头
        out = out.permute(0, 2, 1, 3).reshape(batch_size, seq_len, embed_dim)
        out = self.out_proj(out)
        return out
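
使用示例(embed_dim 与 num_heads 为示例取值;多头注意力的输出形状与输入一致):

mha = MultiHeadAttention(embed_dim=512, num_heads=8)
x = torch.randn(32, 100, 512)    # [batch, seq_len, embed_dim]
out = mha(x)
print(out.shape)                 # [32, 100, 512]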

5.3 位置编码

Transformer无法感知序列顺序,需要添加位置信息:

正弦位置编码:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             (-math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # [1, max_len, d_model]

        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: [batch, seq_len, d_model]
        x = x + self.pe[:, :x.size(1), :]
        return x
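
使用示例(通常加在词嵌入之后、送入编码器之前):

pos_enc = PositionalEncoding(d_model=512)
x = torch.randn(32, 100, 512)    # 词嵌入: [batch, seq_len, d_model]
x = pos_enc(x)                   # 加上位置信息,形状不变
print(x.shape)                   # [32, 100, 512]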

5.4 Transformer vs RNN/CNN

| 特性 | Transformer | RNN/LSTM | CNN |
|---|---|---|---|
| 并行化 | 高(可并行) | 低(顺序处理) | 高 |
| 长距离依赖 | 直接建模 | 梯度消失 | 需多层堆叠 |
| 计算复杂度 | O(n²·d) | O(n·d²) | O(n·d²·k) |
| 内存消耗 | 高 | 低 | 中 |
| 适用场景 | NLP、多模态 | 时序数据 | 图像、局部特征 |

6. 高频面试题

面试题1: 解释梯度消失和梯度爆炸,如何解决?

答案:

见第1章面试题4的详细解答。补充Transformer如何解决:

Transformer的优势:

  1. 无梯度消失: 每个位置直接通过Self-Attention连接,梯度路径最短
  2. 残差连接: 每个子层都有跳跃连接
  3. 层归一化: Layer Normalization稳定训练

代码示例:

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4*embed_dim),
            nn.ReLU(),
            nn.Linear(4*embed_dim, embed_dim)
        )
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # 残差连接 + LayerNorm
        x = x + self.attention(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x

面试题2: BatchNorm和LayerNorm的区别?

答案:

Batch Normalization:

  • 沿batch维度归一化
  • 公式: x̂ = (x - μ_batch) / √(σ²_batch + ε)
  • 训练和推理行为不同(需保存running mean/var)
  • 适用于CNN,batch size较大时

Layer Normalization:

  • 沿特征维度归一化
  • 公式: x̂ = (x - μ_layer) / √(σ²_layer + ε)
  • 训练和推理行为一致
  • 适用于RNN/Transformer,batch size较小时

对比表格:

| 特性 | BatchNorm | LayerNorm |
|---|---|---|
| 归一化维度 | Batch维度 | Feature维度 |
| 适用模型 | CNN | RNN/Transformer |
| Batch依赖 | 是 | 否 |
| 训练/推理 | 不同 | 相同 |
| 小batch | 不稳定 | 稳定 |

代码示例:

# BatchNorm (CNN)
bn = nn.BatchNorm2d(num_features=64)
x = torch.randn(32, 64, 28, 28)  # [batch, channels, H, W]
out = bn(x)  # 沿batch和空间维度归一化

# LayerNorm (Transformer)
ln = nn.LayerNorm(normalized_shape=512)
x = torch.randn(32, 100, 512)  # [batch, seq_len, embed_dim]
out = ln(x)  # 沿embed_dim归一化

面试题3: 1×1卷积的作用?

答案:

1×1卷积虽然无法提取空间特征,但有重要作用:

1. 降维/升维

# 降维: 减少计算量
conv1x1 = nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1)
# 输入: [batch, 256, H, W] → 输出: [batch, 64, H, W]

2. 增加非线性

nn.Sequential(
    nn.Conv2d(64, 64, 1),
    nn.ReLU(),  # 增加非线性变换
)

3. 跨通道信息融合

4. Inception模块中的应用:

class InceptionModule(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        # 1x1卷积降维
        self.branch1x1 = nn.Conv2d(in_channels, 64, 1)

        self.branch3x3 = nn.Sequential(
            nn.Conv2d(in_channels, 96, 1),  # 降维
            nn.Conv2d(96, 128, 3, padding=1)
        )

        self.branch5x5 = nn.Sequential(
            nn.Conv2d(in_channels, 16, 1),  # 降维
            nn.Conv2d(16, 32, 5, padding=2)
        )

    def forward(self, x):
        # 各分支并行计算,空间尺寸不变,最后沿通道维拼接
        return torch.cat([
            self.branch1x1(x),
            self.branch3x3(x),
            self.branch5x5(x),
        ], dim=1)

面试题4: Attention机制的时间复杂度和优化方法?

答案:

标准Attention复杂度:

时间: O(n²·d)  (n: 序列长度, d: 维度)
空间: O(n²)    (存储attention矩阵)

当序列很长时(如n=10000),复杂度过高。

优化方法:

1. Sparse Attention (稀疏注意力)

只关注局部窗口或特定模式
复杂度: O(n·√n·d) 或 O(n·log(n)·d)

2. Linformer

将K、V投影到低维
复杂度: O(n·k·d)  (k << n)

3. Performer (线性Attention)

使用核方法近似softmax
复杂度: O(n·d²)

4. Flash Attention

# PyTorch 2.0+内置
from torch.nn.functional import scaled_dot_product_attention

output = scaled_dot_product_attention(
    query, key, value,
    attn_mask=None,
    dropout_p=0.0,
    is_causal=False
)
# 通过IO优化和kernel融合加速

面试题5: 如何选择CNN、RNN还是Transformer?

答案:

决策树:

数据类型?
├── 图像
│   ├── 局部特征重要 → CNN (ResNet, EfficientNet)
│   ├── 全局关系重要 → Vision Transformer (ViT)
│   └── 混合 → ConvNeXt, Swin Transformer
│
├── 文本/序列
│   ├── 短序列(<100) → LSTM/GRU
│   ├── 长序列 → Transformer
│   ├── 流式处理 → RNN/LSTM
│   └── 离线处理 → Transformer (BERT, GPT)
│
├── 时序数据
│   ├── 单变量 → LSTM
│   ├── 多变量 → Transformer
│   └── 需要预测未来 → Informer, Autoformer
│
└── 多模态
    └── Transformer (CLIP, Flamingo)

选择建议:

| 场景 | 推荐模型 | 理由 |
|---|---|---|
| 图像分类 | CNN/ViT | CNN效率高,ViT大数据下效果好 |
| 目标检测 | CNN (YOLO, Faster R-CNN) | 局部特征重要 |
| 机器翻译 | Transformer | 并行化,长距离依赖 |
| 语音识别 | Transformer (Whisper) | 端到端,效果最好 |
| 时序预测 | LSTM/Transformer | LSTM简单,Transformer效果好 |
| 视频理解 | 3D CNN/Video Transformer | 时空特征建模 |

代码示例(混合模型):

class HybridModel(nn.Module):
    """CNN提取特征 + Transformer建模关系"""
    def __init__(self, num_classes):
        super().__init__()
        # CNN backbone
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            # ...(假设后续还有若干卷积层,最终输出通道数等于 d_model=512)
        )
        # Transformer encoder
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=512, nhead=8),
            num_layers=6
        )
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):
        # CNN提取局部特征
        features = self.cnn(x)  # [batch, channels, H, W]

        # 展平为序列
        features = features.flatten(2).permute(0, 2, 1)  # [batch, H*W, channels]

        # Transformer建模全局关系
        features = self.transformer(features)

        # 分类
        output = self.fc(features.mean(dim=1))
        return output