深度学习核心
1. 神经网络基础
1.1 前向传播 (Forward Propagation)
前向传播是神经网络计算输出的过程,数据从输入层逐层传递到输出层。
单个神经元计算:
z = w·x + b
a = σ(z)
多层网络:
z[1] = W[1]·x + b[1]
a[1] = σ(z[1])
z[2] = W[2]·a[1] + b[2]
a[2] = σ(z[2])
...
PyTorch实现:
import torch
import torch.nn as nn
class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # 前向传播
        z1 = self.fc1(x)    # 线性变换
        a1 = self.relu(z1)  # 激活函数
        z2 = self.fc2(a1)   # 输出层
        return z2
# 使用示例
model = SimpleNN(input_size=10, hidden_size=20, output_size=2)
x = torch.randn(32, 10) # batch_size=32
output = model(x)
print(f"输出形状: {output.shape}") # [32, 2]
1.2 反向传播 (Backpropagation)
反向传播通过链式法则计算梯度,从输出层反向传递到输入层。
核心公式(链式法则):
∂L/∂w = ∂L/∂a × ∂a/∂z × ∂z/∂w
详细推导:
对于最后一层:
∂L/∂z[L] = a[L] - y (输出层为sigmoid/softmax且使用交叉熵损失时)
∂L/∂W[L] = (∂L/∂z[L]) · a[L-1]ᵀ
∂L/∂b[L] = ∂L/∂z[L]
对于中间层:
∂L/∂z[l] = (W[l+1]ᵀ · ∂L/∂z[l+1]) ⊙ σ'(z[l])
∂L/∂W[l] = (∂L/∂z[l]) · a[l-1]ᵀ
∂L/∂b[l] = ∂L/∂z[l]
手动实现示例:
import numpy as np
class NeuralNetwork:
    def __init__(self, layer_sizes):
        self.weights = []
        self.biases = []
        for i in range(len(layer_sizes) - 1):
            w = np.random.randn(layer_sizes[i+1], layer_sizes[i]) * 0.01
            b = np.zeros((layer_sizes[i+1], 1))
            self.weights.append(w)
            self.biases.append(b)

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def sigmoid_derivative(self, z):
        s = self.sigmoid(z)
        return s * (1 - s)

    def forward(self, X):
        self.z_values = []
        self.activations = [X]
        for w, b in zip(self.weights, self.biases):
            z = np.dot(w, self.activations[-1]) + b
            a = self.sigmoid(z)
            self.z_values.append(z)
            self.activations.append(a)
        return self.activations[-1]

    def backward(self, X, y, learning_rate=0.01):
        m = X.shape[1]
        # 输出层梯度(sigmoid输出 + 交叉熵损失时 dz = a - y)
        dz = self.activations[-1] - y
        # 反向传播
        for i in reversed(range(len(self.weights))):
            dw = (1/m) * np.dot(dz, self.activations[i].T)
            db = (1/m) * np.sum(dz, axis=1, keepdims=True)
            if i > 0:
                dz = np.dot(self.weights[i].T, dz) * self.sigmoid_derivative(self.z_values[i-1])
            # 更新参数
            self.weights[i] -= learning_rate * dw
            self.biases[i] -= learning_rate * db
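可以用随机生成的玩具数据做一个简易训练草稿(以下 X、y、net 等名称仅为演示假设,沿用上面类的列向量约定 [特征维, 样本数],损失按 sigmoid 输出的二元交叉熵计算):
np.random.seed(0)
X = np.random.randn(2, 200)                      # [特征维, 样本数]
y = (X[0:1, :] + X[1:2, :] > 0).astype(float)    # 线性可分的0/1标签, 形状 [1, 200]

net = NeuralNetwork([2, 8, 1])
for epoch in range(2000):
    y_hat = net.forward(X)
    loss = -np.mean(y * np.log(y_hat + 1e-8) + (1 - y) * np.log(1 - y_hat + 1e-8))
    net.backward(X, y, learning_rate=0.5)
    if epoch % 500 == 0:
        print(f"epoch={epoch}, loss={loss:.4f}")   # 损失应逐渐下降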
1.3 梯度下降
梯度下降是优化神经网络参数的核心算法。
参数更新公式:
w := w - α·∂L/∂w
批量梯度下降(BGD):
for epoch in range(num_epochs):
    # 使用全部数据计算梯度
    gradients = compute_gradients(X_train, y_train)
    weights -= learning_rate * gradients
随机梯度下降(SGD):
for epoch in range(num_epochs):
    for i in range(n_samples):
        # 每次使用一个样本
        gradients = compute_gradients(X_train[i], y_train[i])
        weights -= learning_rate * gradients
小批量梯度下降(Mini-batch GD):
for epoch in range(num_epochs):
    for batch in get_batches(X_train, y_train, batch_size):
        X_batch, y_batch = batch
        gradients = compute_gradients(X_batch, y_batch)
        weights -= learning_rate * gradients
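上面三段是示意性伪代码(compute_gradients、get_batches 未定义)。下面是一个可以直接运行的最小草稿(假设:用 NumPy 在线性回归 + MSE 损失上演示小批量梯度下降,数据随机生成,X_train、true_w 等名称均为演示用):
import numpy as np

np.random.seed(0)
X_train = np.random.randn(1000, 3)                 # 1000个样本, 3个特征
true_w = np.array([2.0, -1.0, 0.5])
y_train = X_train @ true_w + 0.1 * np.random.randn(1000)

weights = np.zeros(3)
learning_rate, batch_size, num_epochs = 0.1, 32, 20

for epoch in range(num_epochs):
    perm = np.random.permutation(len(X_train))      # 每个epoch打乱样本顺序
    for start in range(0, len(X_train), batch_size):
        idx = perm[start:start + batch_size]
        X_batch, y_batch = X_train[idx], y_train[idx]
        # MSE损失对权重的梯度: 2/m * Xᵀ(Xw - y)
        grad = 2 / len(idx) * X_batch.T @ (X_batch @ weights - y_batch)
        weights -= learning_rate * grad

print("估计的权重:", weights)   # 应接近 [2.0, -1.0, 0.5]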
1.4 激活函数
激活函数引入非线性,使神经网络能够学习复杂模式。
| 激活函数 | 公式 | 导数 | 优点 | 缺点 |
|---|---|---|---|---|
| Sigmoid | σ(x)=1/(1+e^(-x)) | σ'(x)=σ(x)(1-σ(x)) | 输出(0,1),易理解 | 梯度消失,计算慢 |
| Tanh | tanh(x)=(e^x-e^(-x))/(e^x+e^(-x)) | 1-tanh²(x) | 输出(-1,1),零中心 | 梯度消失 |
| ReLU | max(0,x) | x>0:1, x≤0:0 | 计算快,缓解梯度消失 | Dead ReLU问题 |
| Leaky ReLU | max(0.01x,x) | x>0:1, x≤0:0.01 | 解决Dead ReLU | 需调参 |
| GELU | x·Φ(x) | Φ(x)+x·φ(x) | Transformer常用 | 计算复杂 |
PyTorch实现:
import torch
import torch.nn as nn

# 不同激活函数对比
activations = {
    'ReLU': nn.ReLU(),
    'Sigmoid': nn.Sigmoid(),
    'Tanh': nn.Tanh(),
    'LeakyReLU': nn.LeakyReLU(0.01),
    'GELU': nn.GELU()
}
x = torch.linspace(-5, 5, 100)
for name, activation in activations.items():
    y = activation(x)
    print(f"{name}: min={y.min():.2f}, max={y.max():.2f}")
2. 优化器
2.1 SGD (Stochastic Gradient Descent)
基础的随机梯度下降:
w := w - α·∇L(w)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
2.2 Momentum
引入动量,加速收敛:
v := β·v + ∇L(w)
w := w - α·v
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
2.3 AdaGrad
自适应调整每个参数的学习率:
G := G + (∇L(w))²
w := w - α/(√G + ε) · ∇L(w)
问题: 学习率单调递减,可能过早停止学习。
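PyTorch 中对应的写法如下(与 2.1/2.2 的单行示例保持一致;RMSprop 用梯度平方的指数移动平均替代 AdaGrad 的累加,从而缓解上述学习率过早衰减的问题):
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
# RMSprop: 用指数移动平均代替累加
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99)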
2.4 Adam (推荐)
结合Momentum和RMSprop的优点:
# 一阶矩估计(Momentum)
m := β₁·m + (1-β₁)·∇L(w)
# 二阶矩估计(RMSprop)
v := β₂·v + (1-β₂)·(∇L(w))²
# 偏差修正
m̂ := m / (1-β₁ᵗ)
v̂ := v / (1-β₂ᵗ)
# 参数更新
w := w - α·m̂ / (√v̂ + ε)
PyTorch实现:
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,             # 学习率
    betas=(0.9, 0.999),   # β₁, β₂
    eps=1e-8,             # 防止除零
    weight_decay=0.01     # L2正则化
)
优化器对比:
# 完整训练示例
import torch.optim as optim

def train_with_optimizer(model, optimizer_name):
    if optimizer_name == 'SGD':
        optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    elif optimizer_name == 'Adam':
        optimizer = optim.Adam(model.parameters(), lr=0.001)
    elif optimizer_name == 'AdamW':
        optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
    else:
        raise ValueError(f"未知优化器: {optimizer_name}")

    # num_epochs、dataloader、criterion 需在调用前定义
    for epoch in range(num_epochs):
        for batch_x, batch_y in dataloader:
            optimizer.zero_grad()               # 清空梯度
            output = model(batch_x)             # 前向传播
            loss = criterion(output, batch_y)
            loss.backward()                     # 反向传播
            optimizer.step()                    # 更新参数
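下面补充一个可运行的调用草稿(假设:随机生成的分类数据和交叉熵损失,模型沿用 1.1 节的 SimpleNN;train_with_optimizer 中引用的 num_epochs、dataloader、criterion 需像这样先在外部定义):
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(1000, 10)
y = torch.randint(0, 2, (1000,))
dataloader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)
criterion = nn.CrossEntropyLoss()
num_epochs = 5

model = SimpleNN(input_size=10, hidden_size=20, output_size=2)
train_with_optimizer(model, 'Adam')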
3. CNN (卷积神经网络)
3.1 卷积层
卷积操作提取局部特征:
计算公式:
输出尺寸 = (输入尺寸 - 卷积核尺寸 + 2×填充) / 步长 + 1
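例如,输入 32×32、卷积核 3×3、填充 1、步长 1 时,输出尺寸 = (32 - 3 + 2×1)/1 + 1 = 32,尺寸不变;若步长改为 2,则输出为 ⌊(32 - 3 + 2)/2⌋ + 1 = 16。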
PyTorch实现:
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        # 卷积层: in_channels, out_channels, kernel_size
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(2, 2)  # 池化层
        self.fc = nn.Linear(64 * 8 * 8, 10)

    def forward(self, x):
        # x: [batch, 3, 32, 32]
        x = self.pool(torch.relu(self.conv1(x)))  # [batch, 32, 16, 16]
        x = self.pool(torch.relu(self.conv2(x)))  # [batch, 64, 8, 8]
        x = x.view(x.size(0), -1)  # 展平
        x = self.fc(x)  # 全连接层
        return x
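使用示例(假设输入为 32×32 的 RGB 图像,与上面 fc 层的 64*8*8 对应):
model = CNN()
x = torch.randn(8, 3, 32, 32)       # batch_size=8
output = model(x)
print(f"输出形状: {output.shape}")  # [8, 10]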
3.2 池化层
降低特征维度,保持重要特征:
- 最大池化(Max Pooling): 取窗口内最大值
- 平均池化(Average Pooling): 取窗口内平均值
# 最大池化
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
# 平均池化
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
# 自适应池化(输出固定尺寸)
adaptive_pool = nn.AdaptiveAvgPool2d((1, 1))
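三种池化的输出形状示例(随机输入,仅作演示):
x = torch.randn(1, 64, 32, 32)
print(max_pool(x).shape)       # [1, 64, 16, 16]
print(avg_pool(x).shape)       # [1, 64, 16, 16]
print(adaptive_pool(x).shape)  # [1, 64, 1, 1]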
3.3 经典CNN网络
LeNet-5 (1998)
早期最具代表性的CNN之一,用于手写数字识别:
class LeNet5(nn.Module):
    def __init__(self):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16*4*4, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.max_pool2d(x, 2)
        x = torch.relu(self.conv2(x))
        x = torch.max_pool2d(x, 2)
        x = x.view(-1, 16*4*4)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x
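使用示例(假设输入为 MNIST 风格的 28×28 灰度图,与上面 fc1 的 16*4*4 对应):
model = LeNet5()
x = torch.randn(1, 1, 28, 28)
print(model(x).shape)  # [1, 10]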
AlexNet (2012)
ImageNet冠军,深度学习崛起标志:
- 8层网络(5卷积+3全连接)
- 使用ReLU激活函数
- Dropout防止过拟合
- 数据增强
VGG (2014)
使用小卷积核(3×3)堆叠:
class VGG16(nn.Module):
    def __init__(self, num_classes=1000):
        super(VGG16, self).__init__()
        self.features = nn.Sequential(
            # Block 1
            nn.Conv2d(3, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            # Block 2
            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            # ... 更多层
        )
ResNet (2015)
引入残差连接,缓解深层网络的退化与梯度消失问题:
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super(ResidualBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride, 1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, 1, 1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # 跳跃连接
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        residual = x
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(residual)  # 残差连接
        out = torch.relu(out)
        return out
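使用示例(stride=2 时由 1×1 卷积的 shortcut 对齐形状):
block = ResidualBlock(in_channels=64, out_channels=128, stride=2)
x = torch.randn(1, 64, 32, 32)
print(block(x).shape)  # [1, 128, 16, 16]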
4. RNN/LSTM/GRU
4.1 RNN (循环神经网络)
处理序列数据,具有记忆能力:
前向传播:
h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b_h)
y_t = W_hy·h_t + b_y
问题:
- 梯度消失: 长序列时梯度趋近0
- 梯度爆炸: 梯度指数增长
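PyTorch 提供了 nn.RNN,下面是一个最小示例(形状仅作示意):
rnn = nn.RNN(input_size=10, hidden_size=32, num_layers=1, batch_first=True)
x = torch.randn(4, 15, 10)        # [batch, seq_len, input_size]
output, h_n = rnn(x)
print(output.shape, h_n.shape)    # [4, 15, 32], [1, 4, 32]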
4.2 LSTM (长短期记忆网络)
通过门控机制解决长期依赖问题:
三个门:
遗忘门: f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
输入门: i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
输出门: o_t = σ(W_o·[h_{t-1}, x_t] + b_o)
候选值: C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C)
细胞状态: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
隐藏状态: h_t = o_t ⊙ tanh(C_t)
PyTorch实现:
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super(LSTMModel, self).__init__()
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.2
        )
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x: [batch, seq_len, input_size]
        lstm_out, (h_n, c_n) = self.lstm(x)
        # lstm_out: [batch, seq_len, hidden_size]
        # 取最后一个时间步的输出
        output = self.fc(lstm_out[:, -1, :])
        return output
# 使用示例
model = LSTMModel(input_size=10, hidden_size=64, num_layers=2, output_size=5)
x = torch.randn(32, 20, 10) # [batch, seq_len, features]
output = model(x)
print(f"输出形状: {output.shape}") # [32, 5]
4.3 GRU (门控循环单元)
LSTM的简化版本,参数更少:
两个门:
重置门: r_t = σ(W_r·[h_{t-1}, x_t] + b_r)
更新门: z_t = σ(W_z·[h_{t-1}, x_t] + b_z)
候选隐藏状态: h̃_t = tanh(W·[r_t ⊙ h_{t-1}, x_t] + b)
隐藏状态: h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
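PyTorch 中 nn.GRU 的用法与 LSTM 基本一致(最小示例,形状仅作示意):
gru = nn.GRU(input_size=10, hidden_size=64, num_layers=2, batch_first=True)
x = torch.randn(32, 20, 10)       # [batch, seq_len, input_size]
output, h_n = gru(x)
print(output.shape, h_n.shape)    # [32, 20, 64], [2, 32, 64]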
LSTM vs GRU对比:
| 特性 | LSTM | GRU |
|---|---|---|
| 门数量 | 3个(遗忘/输入/输出) | 2个(重置/更新) |
| 参数量 | 更多 | 更少 |
| 计算速度 | 较慢 | 较快 |
| 表达能力 | 更强 | 稍弱 |
| 适用场景 | 复杂长序列 | 简单序列,资源受限 |
5. Transformer
5.1 Self-Attention
Attention机制的核心公式:
Attention(Q, K, V) = softmax(QK^T / √d_k)V
- Q(Query): 查询向量
- K(Key): 键向量
- V(Value): 值向量
- d_k: Key维度(用于缩放)
PyTorch实现:
import torch.nn.functional as F
class SelfAttention(nn.Module):
    def __init__(self, embed_dim):
        super(SelfAttention, self).__init__()
        self.embed_dim = embed_dim
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        # x: [batch, seq_len, embed_dim]
        Q = self.query(x)  # [batch, seq_len, embed_dim]
        K = self.key(x)
        V = self.value(x)
        # 计算注意力分数
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.embed_dim ** 0.5)
        # scores: [batch, seq_len, seq_len]
        # Softmax归一化
        attention_weights = F.softmax(scores, dim=-1)
        # 加权求和
        output = torch.matmul(attention_weights, V)
        return output, attention_weights
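使用示例(随机输入,仅演示形状):
attn = SelfAttention(embed_dim=64)
x = torch.randn(2, 10, 64)              # [batch, seq_len, embed_dim]
output, weights = attn(x)
print(output.shape, weights.shape)      # [2, 10, 64], [2, 10, 10]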
5.2 Multi-Head Attention
并行多个Attention,捕捉不同子空间的信息:
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert embed_dim % num_heads == 0
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv_proj = nn.Linear(embed_dim, embed_dim * 3)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        batch_size, seq_len, embed_dim = x.shape
        # 线性投影并分割成多头
        qkv = self.qkv_proj(x)  # [batch, seq_len, embed_dim*3]
        qkv = qkv.reshape(batch_size, seq_len, self.num_heads, 3*self.head_dim)
        qkv = qkv.permute(0, 2, 1, 3)  # [batch, num_heads, seq_len, 3*head_dim]
        q, k, v = qkv.chunk(3, dim=-1)
        # 缩放点积注意力
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attention = F.softmax(scores, dim=-1)
        out = torch.matmul(attention, v)
        # 合并多头
        out = out.permute(0, 2, 1, 3).reshape(batch_size, seq_len, embed_dim)
        out = self.out_proj(out)
        return out
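使用示例(随机输入,仅演示形状):
mha = MultiHeadAttention(embed_dim=64, num_heads=8)
x = torch.randn(2, 10, 64)     # [batch, seq_len, embed_dim]
print(mha(x).shape)            # [2, 10, 64]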
5.3 位置编码
Transformer无法感知序列顺序,需要添加位置信息:
正弦位置编码:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # [1, max_len, d_model]
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: [batch, seq_len, d_model]
        x = x + self.pe[:, :x.size(1), :]
        return x
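使用示例(仅演示形状):
pos_enc = PositionalEncoding(d_model=512)
x = torch.zeros(2, 50, 512)    # [batch, seq_len, d_model]
print(pos_enc(x).shape)        # [2, 50, 512]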
5.4 Transformer vs RNN/CNN
| 特性 | Transformer | RNN/LSTM | CNN |
|---|---|---|---|
| 并行化 | 高(可并行) | 低(顺序处理) | 高 |
| 长距离依赖 | 直接建模 | 梯度消失 | 需多层堆叠 |
| 计算复杂度 | O(n²·d) | O(n·d²) | O(n·d²·k) |
| 内存消耗 | 高 | 低 | 中 |
| 适用场景 | NLP、多模态 | 时序数据 | 图像、局部特征 |
6. 高频面试题
面试题1: 解释梯度消失和梯度爆炸,如何解决?
答案:
见第1章面试题4的详细解答。补充Transformer如何解决:
Transformer的优势:
- 无梯度消失: 任意两个位置都通过Self-Attention直接相连,梯度传播路径为常数长度
- 残差连接: 每个子层都有跳跃连接
- 层归一化: Layer Normalization稳定训练
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4*embed_dim),
            nn.ReLU(),
            nn.Linear(4*embed_dim, embed_dim)
        )
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # 残差连接 + LayerNorm(这里是Pre-LN写法: 先归一化再进入子层)
        x = x + self.attention(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x
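使用示例(复用 5.2 节实现的 MultiHeadAttention,随机输入仅演示形状):
block = TransformerBlock(embed_dim=64, num_heads=8)
x = torch.randn(2, 10, 64)
print(block(x).shape)  # [2, 10, 64]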
面试题2: BatchNorm和LayerNorm的区别?
答案:
Batch Normalization:
- 沿batch维度归一化
- 公式: x̂ = (x - μ_batch) / √(σ²_batch + ε)
- 训练和推理行为不同(需保存running mean/var)
- 适用于CNN,batch size较大时
Layer Normalization:
- 沿特征维度归一化
- 公式: x̂ = (x - μ_layer) / √(σ²_layer + ε)
- 训练和推理行为一致
- 适用于RNN/Transformer,batch size较小时
对比表格:
| 特性 | BatchNorm | LayerNorm |
|---|---|---|
| 归一化维度 | Batch维度 | Feature维度 |
| 适用模型 | CNN | RNN/Transformer |
| Batch依赖 | 是 | 否 |
| 训练/推理 | 不同 | 相同 |
| 小batch | 不稳定 | 稳定 |
代码示例:
# BatchNorm (CNN)
bn = nn.BatchNorm2d(num_features=64)
x = torch.randn(32, 64, 28, 28) # [batch, channels, H, W]
out = bn(x) # 沿batch和空间维度归一化
# LayerNorm (Transformer)
ln = nn.LayerNorm(normalized_shape=512)
x = torch.randn(32, 100, 512) # [batch, seq_len, embed_dim]
out = ln(x) # 沿embed_dim归一化
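也可以用一个小实验验证两者的归一化维度(仅作演示,打印值应接近0):
x = torch.randn(32, 100, 512)
ln_out = nn.LayerNorm(512)(x)
print(ln_out.mean(dim=-1)[0, :3])       # 每个token沿特征维的均值≈0

x_img = torch.randn(32, 64, 28, 28)
bn_out = nn.BatchNorm2d(64)(x_img)
print(bn_out.mean(dim=(0, 2, 3))[:3])   # 每个通道沿batch+空间维的均值≈0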
面试题3: 1×1卷积的作用?
答案:
1×1卷积虽然无法提取空间特征,但有重要作用:
1. 降维/升维
# 降维: 减少计算量
conv1x1 = nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1)
# 输入: [batch, 256, H, W] → 输出: [batch, 64, H, W]
2. 增加非线性
nn.Sequential(
    nn.Conv2d(64, 64, 1),
    nn.ReLU(),  # 增加非线性变换
)
3. 跨通道信息融合
4. Inception模块中的应用:
class InceptionModule(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        # 1x1卷积降维
        self.branch1x1 = nn.Conv2d(in_channels, 64, 1)
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(in_channels, 96, 1),  # 降维
            nn.Conv2d(96, 128, 3, padding=1)
        )
        self.branch5x5 = nn.Sequential(
            nn.Conv2d(in_channels, 16, 1),  # 降维
            nn.Conv2d(16, 32, 5, padding=2)
        )

    def forward(self, x):
        # 各分支并行计算后沿通道维拼接
        return torch.cat(
            [self.branch1x1(x), self.branch3x3(x), self.branch5x5(x)], dim=1)
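使用示例(基于上面补全的 forward,各分支输出沿通道维拼接成 64+128+32=224 个通道):
inception = InceptionModule(in_channels=256)
x = torch.randn(1, 256, 32, 32)
print(inception(x).shape)  # [1, 224, 32, 32]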
面试题4: Attention机制的时间复杂度和优化方法?
答案:
标准Attention复杂度:
时间: O(n²·d) (n: 序列长度, d: 维度)
空间: O(n²) (存储attention矩阵)
当序列很长时(如n=10000),复杂度过高。
优化方法:
1. Sparse Attention (稀疏注意力)
只关注局部窗口或特定模式
复杂度: O(n·√n·d) 或 O(n·log(n)·d)
2. Linformer
将K、V投影到低维
复杂度: O(n·k·d) (k << n)
3. Performer (线性Attention)
使用核方法近似softmax
复杂度: O(n·d²)
4. Flash Attention
# PyTorch 2.0+内置
from torch.nn.functional import scaled_dot_product_attention
output = scaled_dot_product_attention(
    query, key, value,
    attn_mask=None,
    dropout_p=0.0,
    is_causal=False
)
# 通过IO优化和kernel融合加速
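一个带具体张量形状的调用草稿(形状为演示假设:batch=2、8个头、序列长128、head_dim=64):
query = torch.randn(2, 8, 128, 64)   # [batch, num_heads, seq_len, head_dim]
key = torch.randn(2, 8, 128, 64)
value = torch.randn(2, 8, 128, 64)
out = scaled_dot_product_attention(query, key, value, is_causal=False)
print(out.shape)  # [2, 8, 128, 64]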
面试题5: 如何选择CNN、RNN还是Transformer?
答案:
决策树:
数据类型?
├── 图像
│ ├── 局部特征重要 → CNN (ResNet, EfficientNet)
│ ├── 全局关系重要 → Vision Transformer (ViT)
│ └── 混合 → ConvNeXt, Swin Transformer
│
├── 文本/序列
│ ├── 短序列(<100) → LSTM/GRU
│ ├── 长序列 → Transformer
│ ├── 流式处理 → RNN/LSTM
│ └── 离线处理 → Transformer (BERT, GPT)
│
├── 时序数据
│ ├── 单变量 → LSTM
│ ├── 多变量 → Transformer
│ └── 需要预测未来 → Informer, Autoformer
│
└── 多模态
└── Transformer (CLIP, Flamingo)
选择建议:
| 场景 | 推荐模型 | 理由 |
|---|---|---|
| 图像分类 | CNN/ViT | CNN效率高,ViT大数据下效果好 |
| 目标检测 | CNN (YOLO, Faster R-CNN) | 局部特征重要 |
| 机器翻译 | Transformer | 并行化,长距离依赖 |
| 语音识别 | Transformer (Whisper) | 端到端,效果最好 |
| 时序预测 | LSTM/Transformer | LSTM简单,Transformer效果好 |
| 视频理解 | 3D CNN/Video Transformer | 时空特征建模 |
代码示例(混合模型):
class HybridModel(nn.Module):
    """CNN提取特征 + Transformer建模关系"""
    def __init__(self, num_classes=10):  # num_classes 以参数形式传入(默认值仅作占位)
        super().__init__()
        # CNN backbone
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            # ... 更多卷积层(最终输出通道数需与下面的 d_model=512 一致)
        )
        # Transformer encoder
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
            num_layers=6
        )
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):
        # CNN提取局部特征
        features = self.cnn(x)  # [batch, channels, H, W]
        # 展平为序列
        features = features.flatten(2).permute(0, 2, 1)  # [batch, H*W, channels]
        # Transformer建模全局关系
        features = self.transformer(features)
        # 分类
        output = self.fc(features.mean(dim=1))
        return output