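1.3 Deep Learning Fundamentals

Deep learning is a subfield of machine learning that uses multi-layer neural networks to learn representations of complex data such as images, text, and speech. The Transformer architecture is the technical foundation of modern large language models (GPT, BERT, LLaMA) and is the main focus of this section.

I. Neural Network Principles

1. Neurons and Network Structure

A Single Neuron

A biological neuron receives signals, processes them, and decides whether to fire. An artificial neuron mimics this process:

input x₁ ──→ ×w₁ ──┐
input x₂ ──→ ×w₂ ──┼──→ Σ(wᵢxᵢ) + b ──→ activation(z) ──→ output y
input x₃ ──→ ×w₃ ──┘

- Input: $x_1, x_2, \dots, x_n$
- Weight: $w_1, w_2, \dots, w_n$, how much each input contributes
- Bias: $b$, which shifts the activation threshold
- Linear transformation: $z = \sum_{i=1}^{n} w_i x_i + b$
- Activation function: $y = \text{activation}(z)$, which introduces non-linearity

Activation Functions

Activation functions are what allow a neural network to fit complex non-linear relationships. Without them, a stack of layers is equivalent to a single linear (affine) transformation, because a composition of linear maps is itself linear.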
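
The collapse of stacked linear layers can be checked numerically. The sketch below is only an illustration: the shapes and the random weights are arbitrary assumptions, and float64 is used to keep rounding error small.

```python
import torch

torch.manual_seed(0)
dt = torch.float64
x = torch.randn(5, 4, dtype=dt)                      # 5 samples, 4 features

# Two stacked linear layers with NO activation in between
W1, b1 = torch.randn(4, 8, dtype=dt), torch.randn(8, dtype=dt)
W2, b2 = torch.randn(8, 3, dtype=dt), torch.randn(3, dtype=dt)
two_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping written as ONE linear layer: W = W1 W2, b = b1 W2 + b2
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b
collapsed = torch.allclose(two_layers, one_layer)

# With a ReLU in between, the single-layer equivalent no longer matches
with_relu = torch.relu(x @ W1 + b1) @ W2 + b2
differs = not torch.allclose(with_relu, one_layer)

print(collapsed, differs)
```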

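Sigmoid

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

- Output range: (0, 1)
- Pros: smooth and differentiable; the output can be read as a probability
- Cons: vanishing gradients (its derivative never exceeds 0.25, so the early layers of a deep network barely update), the exponential is relatively costly to compute, and the output is not zero-centered
- Typical use: output layer for binary classification

Tanh

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

- Output range: (-1, 1)
- Pros: zero-centered; gradients are somewhat larger than Sigmoid's
- Cons: still suffers from vanishing gradients
- Typical use: inside LSTM cells

ReLU (Rectified Linear Unit)

$$f(x) = \max(0, x)$$

- Output range: [0, +∞)
- Pros: cheap to compute, mitigates vanishing gradients, produces sparse activations
- Cons: the "dying ReLU" problem (the gradient is 0 for negative inputs, so a neuron whose input stays negative stops updating and usually does not recover)
- Typical use: the default choice for hidden layers

Leaky ReLU

$$f(x) = \max(0.01x, x)$$

Addresses the dying-ReLU problem by keeping a small gradient on the negative side.

GELU (Gaussian Error Linear Unit)

$$\text{GELU}(x) = x \cdot \Phi(x)$$

Here $\Phi$ is the cumulative distribution function of the standard normal distribution. GELU is a smooth variant of ReLU and is common in Transformer models (both GPT and BERT use it).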

python

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np

x = torch.linspace(-5, 5, 200)

fig, axes = plt.subplots(1, 5, figsize=(25, 4))
activations = {
    "Sigmoid": torch.sigmoid(x),
    "Tanh": torch.tanh(x),
    "ReLU": F.relu(x),
    "Leaky ReLU": F.leaky_relu(x, 0.01),
    "GELU": F.gelu(x),
}
for ax, (name, y) in zip(axes, activations.items()):
    ax.plot(x.numpy(), y.numpy())
    ax.set_title(name)
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
    ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

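Multi-Layer Network Structure

Input          Hidden 1       Hidden 2      Output
layer          layer          layer         layer
 ○─────────────○──────────────○──────────────○
 ○─────────────○──────────────○──────────────○
 ○─────────────○──────────────○
                ○──────────────○

- Input layer: receives the raw data (e.g. image pixels, text vectors)
- Hidden layers: extract and transform features layer by layer
- Output layer: produces the final prediction (class probabilities, regression values)
- Fully connected (dense) layer: every neuron is connected to all neurons in the previous layer

Deeper networks can learn more abstract features. In image models, for example, shallow layers tend to learn edges and textures, middle layers local shapes, and deep layers high-level semantics.

2. Forward Propagation

Data flows from the input layer to the output layer one layer at a time; each layer applies a linear transformation followed by a non-linear activation.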

python

import torch
import torch.nn as nn

# Hand-written forward pass of a two-layer neural network
def forward(X, W1, b1, W2, b2):
    # Layer 1: linear transformation + ReLU activation
    z1 = X @ W1 + b1
    a1 = torch.relu(z1)

    # Layer 2 (output layer): linear transformation + Softmax
    z2 = a1 @ W2 + b2
    output = torch.softmax(z2, dim=-1)  # class probabilities
    return output

# Example
X = torch.randn(4, 10)    # 4 samples, 10-dim input
W1 = torch.randn(10, 64)  # layer 1: 10 → 64
b1 = torch.zeros(64)
W2 = torch.randn(64, 3)   # layer 2: 64 → 3 (3 classes)
b2 = torch.zeros(3)

probs = forward(X, W1, b1, W2, b2)
print(f"Output shape: {probs.shape}")   # torch.Size([4, 3])
print(f"Row sums: {probs.sum(dim=1)}")  # each row sums to 1

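3. Loss Functions

A loss function measures the gap between the model's predictions and the ground truth; it is the objective that training minimizes.

Classification

Cross-Entropy Loss

$$L = -\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{C} y_{ic} \log(\hat{y}_{ic})$$

Here $y_{ic}$ is 1 if sample $i$ belongs to class $c$ and 0 otherwise (one-hot label), and $\hat{y}_{ic}$ is the predicted probability of class $c$.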

python

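import torch
import torch.nn as nn

# Binary classification
bce_loss = nn.BCELoss()              # expects probabilities, i.e. inputs already passed through Sigmoid
bce_with_logits = nn.BCEWithLogitsLoss()  # applies Sigmoid internally; numerically more stable

# Multi-class classification (the most common case)
ce_loss = nn.CrossEntropyLoss()  # combines LogSoftmax and NLLLoss; pass raw logits

# Example
logits = torch.randn(4, 3)     # logits for 4 samples and 3 classes
labels = torch.tensor([0, 1, 2, 1])  # true class indices
loss = ce_loss(logits, labels)
print(f"Cross-entropy loss: {loss.item():.4f}")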
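
To connect the cross-entropy formula with the library call, the following sketch evaluates the double sum by hand and compares it with `F.cross_entropy`. The logits are random placeholders; the variable names are chosen for this illustration only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)                 # 4 samples, 3 classes
labels = torch.tensor([0, 1, 2, 1])

log_probs = torch.log_softmax(logits, dim=1)           # log(y_hat_ic)
one_hot = F.one_hot(labels, num_classes=3).float()     # y_ic

# L = -(1/n) * sum_i sum_c y_ic * log(y_hat_ic)
manual = -(one_hot * log_probs).sum(dim=1).mean()
builtin = F.cross_entropy(logits, labels)

print(manual.item(), builtin.item())
```

Because the labels are one-hot, the inner sum simply picks out the log-probability of the true class for each sample.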

Regression

python

mse_loss = nn.MSELoss()   # mean squared error: L = (1/n)Σ(y - ŷ)²
mae_loss = nn.L1Loss()    # mean absolute error: L = (1/n)Σ|y - ŷ|

y_pred = torch.tensor([2.5, 0.0, 1.0])
y_true = torch.tensor([3.0, -0.5, 1.5])

print(f"MSE: {mse_loss(y_pred, y_true).item():.4f}")
print(f"MAE: {mae_loss(y_pred, y_true).item():.4f}")

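4. Backpropagation

Backpropagation is the core algorithm for training neural networks: it uses the chain rule to compute the gradient of every parameter, layer by layer, from the output layer back to the input layer.

Core idea:

- Forward pass: compute the prediction and the loss
- Compute the gradient at the output layer
- Propagate the gradient backward layer by layer (chain rule)
- Update the parameters with the gradients

Chain rule:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \times \frac{\partial y}{\partial z} \times \frac{\partial z}{\partial w}$$

Gradient descent update: $w_{new} = w_{old} - \eta \cdot \frac{\partial L}{\partial w}$ (where $\eta$ is the learning rate)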

python

import torch

# PyTorch autograd example
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x[0] ** 2 + 3 * x[1]  # y = x₁² + 3x₂

y.backward()  # backpropagation: gradients are computed automatically

print(f"x = {x.data}")
print(f"dy/dx₁ = 2*x₁ = {x.grad[0]:.1f}")  # 4.0
print(f"dy/dx₂ = 3    = {x.grad[1]:.1f}")   # 3.0
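
The update rule $w_{new} = w_{old} - \eta \cdot \partial L / \partial w$ can be written out as a small loop. This is a minimal sketch on a toy loss $L(w) = (w - 3)^2$; the loss, the learning rate, and the step count are illustrative assumptions, not values from the text.

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
eta = 0.1                      # learning rate (assumed value)

for step in range(100):
    loss = (w - 3.0) ** 2      # toy loss with its minimum at w = 3
    loss.backward()            # computes dL/dw = 2 * (w - 3)
    with torch.no_grad():
        w -= eta * w.grad      # w_new = w_old - eta * dL/dw
    w.grad.zero_()             # gradients accumulate, so reset them each step

print(w.item())                # close to 3.0
```

In practice an optimizer object performs the update and the reset, but the arithmetic is the same.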

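Vanishing and Exploding Gradients

These are the two most common problems when training deep networks:

Vanishing Gradient

Symptom: the early layers barely update, and the problem gets worse as the network gets deeper
Cause: the derivative of Sigmoid is at most 0.25 and the derivative of Tanh is at most 1 (and well below 1 away from zero), so the product across many layers shrinks toward 0
Remedies:
- Use ReLU/GELU activations
- Residual (skip) connections
- Batch Normalization / Layer Normalization
- Suitable weight initialization (He init / Xavier init)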
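
The shrinking product of derivatives can be observed directly. The sketch below is a deliberate simplification: it chains activations with no weights in between, so it isolates the effect of the activation derivative only. The helper name `input_grad` and the starting value 0.5 are assumptions made for this example.

```python
import torch

def input_grad(depth, activation):
    """Gradient of the output w.r.t. the input through `depth` chained activations."""
    x = torch.tensor(0.5, requires_grad=True)
    h = x
    for _ in range(depth):
        h = activation(h)
    h.backward()
    return x.grad.item()

sig_grads = {d: input_grad(d, torch.sigmoid) for d in (1, 5, 20)}
relu_grads = {d: input_grad(d, torch.relu) for d in (1, 5, 20)}

print(sig_grads)    # shrinks rapidly with depth (each factor is at most 0.25)
print(relu_grads)   # stays at 1.0 for a positive input
```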

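Exploding Gradient

Symptom: parameter updates become too large and training diverges (the loss turns into NaN)
Remedies:
- Gradient clipping: cap the maximum norm of the gradients
- Suitable weight initialization
- A smaller learning rate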

python

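import torch
import torch.nn as nn

# Gradient clipping
model = nn.Linear(10, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
inputs, targets = torch.randn(8, 10), torch.randn(8, 5)  # dummy batch (illustrative)
loss = nn.functional.mse_loss(model(inputs), targets)

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale so the total gradient norm is at most 1.0
optimizer.step()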

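5. Optimization Algorithms

Gradient Descent Variants

| Method | Data used per update | Pros | Cons / notes |
| --- | --- | --- | --- |
| Batch GD | The entire training set | Stable updates | Slow; high memory use |
| Stochastic GD (SGD) | 1 sample | Fast updates; the noise can help escape poor local minima | Unstable; oscillates |
| Mini-batch GD | A small batch (e.g. 32/64/128) | Balances speed and stability | The default choice in practice |

Advanced Optimizers

SGD + Momentum

Momentum accumulates past gradient directions, which speeds up convergence and damps oscillation:

$$v_t = \beta v_{t-1} + \nabla L$$

$$w = w - \eta v_t$$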

python

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
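
The two momentum equations can be applied by hand and compared against `torch.optim.SGD`. This is a sketch on a toy loss $L(w) = \sum w^2$ (gradient $2w$); the loss and the hyperparameters are assumptions for illustration, and the match relies on PyTorch's default `dampening=0`.

```python
import torch

torch.manual_seed(0)
beta, eta = 0.9, 0.1
w_manual = torch.randn(3)
w_torch = w_manual.clone().requires_grad_(True)
opt = torch.optim.SGD([w_torch], lr=eta, momentum=beta)

v = torch.zeros(3)
for _ in range(5):
    # Manual update following the two equations
    grad = 2 * w_manual               # gradient of sum(w^2)
    v = beta * v + grad               # v_t = beta * v_{t-1} + grad
    w_manual = w_manual - eta * v     # w = w - eta * v_t

    # Same step via the built-in optimizer
    opt.zero_grad()
    (w_torch ** 2).sum().backward()
    opt.step()

print(w_manual, w_torch.detach())
```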

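Adam (Adaptive Moment Estimation)

Adam combines Momentum with RMSprop and adapts the learning rate of each parameter individually. It is one of the most widely used optimizers in deep learning.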

python

optimizer = torch.optim.Adam(
    model.parameters(),
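    lr=0.001,          # learning rate
    betas=(0.9, 0.999),  # β₁: decay for the first moment (momentum), β₂: decay for the second moment
    eps=1e-8,            # numerical stability term
    weight_decay=0       # L2 penalty added to the gradient (not decoupled)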
)

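AdamW

An improved version of Adam that decouples weight decay from the gradient-based update instead of folding it into the gradient. It is the standard choice for training Transformer models.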

python

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-5,
    weight_decay=0.01  # decoupled weight decay
)

Learning Rate Schedules

Adjusting the learning rate dynamically during training can noticeably improve results:

python

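import torch
from torch.optim.lr_scheduler import (
    StepLR, CosineAnnealingLR, OneCycleLR, LinearLR, SequentialLR
)

model = torch.nn.Linear(10, 2)  # placeholder model so the snippet is self-contained
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Pick ONE scheduler; each assignment below overwrites the previous one.

# Step decay: multiply the learning rate by 0.1 every 30 scheduler steps
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# Cosine annealing: the learning rate follows a cosine curve from high to low
scheduler = CosineAnnealingLR(optimizer, T_max=100)

# Warmup + cosine annealing (common for Transformer training).
# total_iters, T_max and milestones count calls to scheduler.step(); values this large
# are normally meant as batches, so this scheduler is stepped once per batch.
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=500)
cosine = CosineAnnealingLR(optimizer, T_max=10000)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[500])

# Usage in a training loop (placeholder sizes)
num_epochs, batches_per_epoch = 2, 5
for epoch in range(num_epochs):
    for _ in range(batches_per_epoch):
        # ... forward pass, loss.backward() ...
        optimizer.step()
        scheduler.step()  # per-batch schedule (warmup + cosine)
    # For a per-epoch schedule such as StepLR, call scheduler.step() here instead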

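6. Regularization Techniques

Dropout

During training, a random fraction of activations is set to zero; at test time all units are used. PyTorch rescales the surviving activations by $1/(1-p)$ during training so that the expected output stays the same in both modes.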

python

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.dropout = nn.Dropout(p=0.5)  # 50% drop probability
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.dropout(x)    # randomly zeroes activations in training mode
        x = self.fc2(x)
        return x

model = Net()
model.train()  # training mode: Dropout is active
model.eval()   # evaluation mode: Dropout is disabled and all units are used
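
The difference between training and evaluation mode can be seen on a vector of ones. A minimal sketch, assuming `p=0.5` and an arbitrary vector length:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(10000)

drop.train()
y_train = drop(x)     # each entry is either 0 or 1 / (1 - p) = 2
drop.eval()
y_eval = drop(x)      # identity in evaluation mode

print(y_train.unique(), y_train.mean().item())
print(torch.equal(y_eval, x))
```

The training-mode mean stays close to 1 because the kept entries are scaled up, which is why no extra rescaling is needed at inference time.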

Batch Normalization

Standardizes a layer's outputs using statistics computed across the batch dimension.

python

class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)  # BN comes after the convolution
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

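Pros: faster convergence, mitigates vanishing gradients, tolerates larger learning rates
Typical use: after convolutional or fully connected layers in CNNs

Layer Normalization

Normalizes across the feature dimensions of each sample individually, without relying on the batch dimension.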

python

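import torch
import torch.nn as nn

# Layer Normalization: standard in Transformers
d_model = 512                       # feature dimension (example value)
layer_norm = nn.LayerNorm(d_model)

# How it is used inside a Transformer
x = torch.randn(2, 10, d_model)                # (batch, seq_len, d_model)
self_attention = nn.Linear(d_model, d_model)   # stand-in for a real attention sub-layer
x = x + self_attention(x)    # residual connection
x = layer_norm(x)            # Layer Norm

Difference from BN: BN normalizes along the batch dimension, LN along the feature dimension
LN does not depend on the batch size, which makes it a better fit for sequence models and Transformers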
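
The two normalization axes can be made concrete on a small `(batch, features)` tensor. This sketch disables the learnable affine parameters so that only the normalization itself is visible; the tensor sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 16) * 3 + 5          # (batch, features), shifted and scaled

bn = nn.BatchNorm1d(16, affine=False)   # modules start in training mode
ln = nn.LayerNorm(16, elementwise_affine=False)
bn_out, ln_out = bn(x), ln(x)

bn_col_mean = bn_out.mean(dim=0)        # ~0 per feature (computed across the batch)
ln_row_mean = ln_out.mean(dim=1)        # ~0 per sample (computed across features)

# LayerNorm by hand: each row uses its own mean and (biased) variance
row_mean = x.mean(dim=1, keepdim=True)
row_var = x.var(dim=1, unbiased=False, keepdim=True)
manual_ln = (x - row_mean) / torch.sqrt(row_var + ln.eps)

print(torch.allclose(manual_ln, ln_out, atol=1e-5))
```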

Data Augmentation

Expands the training data through transformations to improve the model's ability to generalize.

python

from torchvision import transforms

# Image data augmentation
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),             # random horizontal flip
    transforms.RandomRotation(15),                 # random rotation within ±15°
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random crop, resized to 224×224
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # color jitter
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Ideas for text data augmentation
# - Synonym replacement: "good" → "excellent"
# - Back-translation: Chinese → English → Chinese
# - Random deletion / swap / insertion

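II. CNN: Convolutional Neural Networks

CNNs are the classic, efficient architecture for image processing. Although Vision Transformers (ViT) are on the rise, the core ideas of CNNs (local connectivity, weight sharing) remain widely used.

1. Why CNNs Are Needed

Problems with processing images in a fully connected network:

- Parameter explosion: a 224×224×3 image has 150,528 input values, so connecting it to just 1,000 neurons already takes about 150 million parameters
- Spatial structure is ignored: flattening the image into a 1-D vector discards the spatial relationships between pixels
- Overfitting: there are too many parameters for the available training data

CNN properties: local connectivity (each unit sees only a local region), weight sharing (one kernel scans the whole image), and translation equivariance (a feature is detected wherever it appears; pooling adds a degree of translation invariance).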
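
The parameter counts behind this comparison are simple arithmetic. The sketch below uses two hypothetical helper functions and checks their formulas against small PyTorch layers; the convolution configuration (64 kernels of size 3×3) is an assumed example, not a value from the text.

```python
import torch.nn as nn

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv2d_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return c_out * c_in * k * k + c_out

fc_params = linear_params(224 * 224 * 3, 1000)   # flattened image -> 1000 units
conv_params = conv2d_params(3, 64, 3)            # 64 shared 3x3 kernels over RGB

# Sanity-check the formulas against small real layers
def count(module):
    return sum(p.numel() for p in module.parameters())

formulas_ok = (count(nn.Linear(12, 5)) == linear_params(12, 5)
               and count(nn.Conv2d(3, 64, 3)) == conv2d_params(3, 64, 3))

print(f"{fc_params:,} vs {conv_params:,}")   # 150,529,000 vs 1,792
```

Weight sharing is what makes the convolution's count independent of the image size.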

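2. Convolutional Layer

A kernel slides over the image; at each position it multiplies element-wise and sums the result, extracting local features (edges, textures, shapes).

Key parameters:

| Parameter | Meaning | Typical values |
| --- | --- | --- |
| kernel_size | Kernel size | 3×3, 5×5 |
| stride | Step size of each move | 1, 2 |
| padding | Border padding | 0 ("valid"); 1 ("same" for a 3×3 kernel with stride 1) |
| out_channels | Number of kernels (output channels) | 32, 64, 128, 256 |

Output size:

$$\text{output\_size} = \left\lfloor \frac{\text{input\_size} - \text{kernel\_size} + 2 \times \text{padding}}{\text{stride}} \right\rfloor + 1$$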

python

import torch.nn as nn

# Convolutional layer example
conv = nn.Conv2d(
    in_channels=3,     # input channels (RGB = 3)
    out_channels=32,   # output channels (32 kernels)
    kernel_size=3,     # 3×3 kernel
    stride=1,
    padding=1          # "same" padding for a 3×3 kernel
)

x = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)
out = conv(x)
print(f"Input: {x.shape}")     # (1, 3, 224, 224)
print(f"Output: {out.shape}")  # (1, 32, 224, 224)
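
The output-size formula can be wrapped in a small helper and compared with what `nn.Conv2d` actually produces. The function name `conv_output_size` and the test configurations are assumptions made for this sketch; it covers square inputs without dilation only.

```python
import torch
import torch.nn as nn

def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """floor((input - kernel + 2 * padding) / stride) + 1"""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# (input_size, kernel_size, stride, padding)
cases = [(224, 3, 1, 1), (224, 3, 2, 1), (32, 5, 1, 0), (224, 7, 2, 3)]
matches = []
for size, k, s, p in cases:
    conv = nn.Conv2d(1, 1, kernel_size=k, stride=s, padding=p)
    out = conv(torch.zeros(1, 1, size, size))
    matches.append(out.shape[-1] == conv_output_size(size, k, s, p))

print(matches)
```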

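3. Pooling Layer

Pooling shrinks the feature maps, which reduces computation (and the parameter count of later layers) and makes features more robust to small shifts. Pooling layers themselves have no learnable parameters.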

python

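# Max pooling (the most common): takes the maximum within each window
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
# input (1, 32, 224, 224) → output (1, 32, 112, 112)

# Average pooling: takes the mean within each window
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

# Global average pooling: reduces each channel to a single value
global_avg_pool = nn.AdaptiveAvgPool2d(1)
# input (1, 512, 7, 7) → output (1, 512, 1, 1)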

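4. Classic CNN Architectures

| Architecture | Year | Layers | Key innovation | Significance |
| --- | --- | --- | --- | --- |
| LeNet-5 | 1998 | 5 | One of the earliest CNNs | Handwritten digit recognition |
| AlexNet | 2012 | 8 | ReLU, Dropout, GPU training | Kicked off the deep learning era |
| VGG | 2014 | 16/19 | Stacks of small (3×3) kernels | Simple, clean structure |
| ResNet | 2015 | 50/101/152 | Residual (skip) connections | Addresses the degradation problem in very deep networks |
| Inception | 2014 | 22 | Parallel multi-scale convolutions | Fewer parameters |
| MobileNet | 2017 | - | Depthwise separable convolutions | Lightweight; suited to mobile devices |

ResNet Residual Connections (One of the Most Important Innovations)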

python

class ResidualBlock(nn.Module):
    """Residual block: y = F(x) + x"""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x                              # keep the input
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + residual                      # residual connection: add the input back
        return F.relu(out)

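Residual connections are key to understanding the Transformer: every Transformer sub-layer uses a residual connection together with Layer Norm.

5. CNN in Practice: CIFAR-10 Image Classification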

python

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# --- Data preparation ---
transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

train_set = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_train)
test_set = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform_test)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
test_loader = DataLoader(test_set, batch_size=128, shuffle=False, num_workers=2)

# --- CNN model ---
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 3 → 32
            nn.Conv2d(3, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),              # 32×32 → 16×16

            # Block 2: 32 → 64
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),              # 16×16 → 8×8

            # Block 3: 64 → 128
            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),      # 8×8 → 1×1
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# --- Training ---
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleCNN().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

for epoch in range(20):
    model.train()
    total_loss = 0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        outputs = model(images)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    # Evaluation
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()

    print(f"Epoch {epoch+1:2d} | Loss: {total_loss/len(train_loader):.4f} | "
          f"Accuracy: {100.*correct/total:.2f}%")

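III. RNN / LSTM: Sequence Models

1. Recurrent Neural Network (RNN)

An RNN processes sequential data (text, time series, speech) and carries "memory" forward through a hidden state.

x₁ → [RNN] → h₁ → y₁
       ↓
x₂ → [RNN] → h₂ → y₂    (h₂ depends on h₁)
       ↓
x₃ → [RNN] → h₃ → y₃    (h₃ depends on h₂)

Hidden state update:

$$h_t = \tanh(W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_t + b)$$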
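
The recurrence can be unrolled by hand and compared with `nn.RNN`. In this sketch the weights are read from a freshly initialized `nn.RNN`; note that PyTorch keeps two bias vectors, which are summed here to play the role of the single $b$ in the formula. The sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=6, batch_first=True)  # tanh by default
x = torch.randn(2, 5, 4)                 # (batch, seq_len, input_size)

W_xh, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0
b = rnn.bias_ih_l0 + rnn.bias_hh_l0

h = torch.zeros(2, 6)                    # initial hidden state h_0
for t in range(x.size(1)):               # strictly sequential: step t needs h from step t-1
    h = torch.tanh(h @ W_hh.T + x[:, t] @ W_xh.T + b)

_, h_n = rnn(x)                          # h_n: (num_layers, batch, hidden)
print(torch.allclose(h, h_n[0], atol=1e-5))
```

The loop makes the sequential dependency explicit: no time step can start before the previous one has finished.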

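Problems with RNNs:

- Sequential processing: step t cannot be computed until step t-1 is done, so time steps cannot be parallelized
- Vanishing gradients: in long sequences, information from early steps barely reaches later ones
- Long-term dependencies are hard: the network struggles to remember information from dozens of steps back

2. LSTM (Long Short-Term Memory)

LSTM introduces gating and a cell state (a channel for long-term memory), which greatly mitigates the vanishing-gradient problem of plain RNNs.

The three gates:

- Forget gate: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$, decides how much old information to discard
- Input gate: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$, decides how much new information to store
- Output gate: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$, decides how much information to output

Cell state update:

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

$$h_t = o_t \odot \tanh(C_t)$$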
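
One LSTM step can be reproduced from these equations and checked against `nn.LSTMCell`. This is a sketch under two assumptions worth naming: PyTorch stores the four gate blocks stacked in the order input, forget, candidate, output, and it splits $W \cdot [h_{t-1}, x_t]$ into separate input and hidden matrices, which is mathematically the same thing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=4, hidden_size=3)
x_t = torch.randn(2, 4)
h_prev, c_prev = torch.randn(2, 3), torch.randn(2, 3)

# Pre-activations for all four gates at once
gates = (x_t @ cell.weight_ih.T + cell.bias_ih
         + h_prev @ cell.weight_hh.T + cell.bias_hh)
i, f, g, o = gates.chunk(4, dim=1)

i_t, f_t, o_t = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
c_tilde = torch.tanh(g)                  # candidate cell state

c_t = f_t * c_prev + i_t * c_tilde       # C_t = f_t * C_{t-1} + i_t * C~_t
h_t = o_t * torch.tanh(c_t)              # h_t = o_t * tanh(C_t)

h_ref, c_ref = cell(x_t, (h_prev, c_prev))
print(torch.allclose(h_t, h_ref, atol=1e-5), torch.allclose(c_t, c_ref, atol=1e-5))
```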

python

import torch
import torch.nn as nn

# LSTM for sentiment analysis
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=2,
            batch_first=True,
            bidirectional=True,    # bidirectional LSTM
            dropout=0.3
        )
        self.fc = nn.Linear(hidden_dim * 2, num_classes)  # *2 because of the two directions

    def forward(self, x):
        # x: (batch, seq_len), a sequence of token indices
        embedded = self.embedding(x)           # (batch, seq_len, embed_dim)
        output, (h_n, c_n) = self.lstm(embedded)  # h_n: (num_layers * 2, batch, hidden)
        # Concatenate the final hidden states of the last layer's forward and backward
        # directions. (output[:, -1, :] would give the backward direction only its first step.)
        last_hidden = torch.cat([h_n[-2], h_n[-1]], dim=1)  # (batch, hidden*2)
        logits = self.fc(last_hidden)          # (batch, num_classes)
        return logits

model = SentimentLSTM(vocab_size=10000, embed_dim=128, hidden_dim=256, num_classes=2)

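3. GRU (Gated Recurrent Unit)

A simplified version of LSTM with only two gates (reset gate and update gate). It has fewer parameters and performs close to LSTM.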

python

# The GRU interface is almost the same as LSTM's
gru = nn.GRU(input_size=128, hidden_size=256, num_layers=2, batch_first=True)

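4. Limitations of RNNs

- Sequential processing: time steps cannot be computed in parallel, so GPUs are underused
- Long sequences still suffer from information decay
- Transformers have largely replaced RNNs in NLP

Understanding the ideas behind RNN/LSTM helps explain why the Transformer was revolutionary, but they are now used directly far less often in modern AI application development.

IV. Transformer Architecture

This is the most important part. The Transformer is the technical foundation of GPT, BERT, LLaMA, Claude, and nearly all other modern large models.

1. Why the Transformer Was Revolutionary

In 2017, Google's paper "Attention Is All You Need" introduced the Transformer and fundamentally changed NLP.

| Aspect | RNN/LSTM | Transformer |
| --- | --- | --- |
| Computation | Sequential | Parallel across positions |
| Long-range dependencies | Signal decays with distance | Any two positions are connected in a single attention step |
| Training speed | Slow (cannot parallelize over time steps) | Fast (fully parallel during training) |
| Scalability | Limited | Improves with scale (scaling laws) |

2. Self-Attention

Core idea: for every position in the sequence, compute how much it should attend to every other position, then combine information with those dynamic weights.

Intuition: when processing the sentence "The animal didn't cross the street because it was too tired", self-attention at "it" can attend to "animal" and so work out that "it" refers to the animal.

Computation Steps

Step 1: Linear projections. Map the input to three sets of vectors: Query, Key, and Value.

Q = X · W_Q    (Query: what am I looking for)
K = X · W_K    (Key: what can I offer)
V = X · W_V    (Value: my actual content)

Step 2: Attention scores. Measure the similarity between Queries and Keys.

$$\text{Score} = \frac{Q \cdot K^T}{\sqrt{d_k}}$$

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the dimension; overly large scores push Softmax into its saturated region, where gradients vanish.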
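
The effect of the $\sqrt{d_k}$ scaling can be checked empirically. The sketch below assumes the components of the query and key vectors are independent standard normal values, in which case the variance of a raw dot product is about $d_k$ and the scaled one is about 1. The sample size and dimensions are arbitrary.

```python
import torch

torch.manual_seed(0)
stats = {}
for d_k in (16, 256):
    q = torch.randn(10000, d_k)
    k = torch.randn(10000, d_k)
    dots = (q * k).sum(dim=1)                  # 10000 independent dot products
    raw_var = dots.var().item()                # grows roughly like d_k
    scaled_var = (dots / d_k ** 0.5).var().item()   # stays near 1
    stats[d_k] = (raw_var, scaled_var)

for d_k, (raw_var, scaled_var) in stats.items():
    print(d_k, round(raw_var, 1), round(scaled_var, 2))
```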

Step 3: Softmax normalization. Turn the scores into attention weights.

$$\text{Attention\_weights} = \text{softmax}(\text{Score})$$

Step 4: Weighted sum. Combine the Values using the weights.

$$\text{Output} = \text{Attention\_weights} \cdot V$$

Full formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

python

import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Scaled dot-product attention."""
    d_k = Q.size(-1)
    # (batch, heads, seq_len, d_k) @ (batch, heads, d_k, seq_len)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    attention_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)
    return output, attention_weights

# Example
batch_size, seq_len, d_model = 2, 10, 64
Q = torch.randn(batch_size, seq_len, d_model)
K = torch.randn(batch_size, seq_len, d_model)
V = torch.randn(batch_size, seq_len, d_model)

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Output: {output.shape}")              # (2, 10, 64)
print(f"Attention weights: {weights.shape}")  # (2, 10, 10)

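3. Multi-Head Attention

Several attention heads run in parallel, and each head can attend to a different feature subspace (for example syntactic relations, coreference, or positional relations).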

python

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)

        # 1. Linear projections, then split into heads
        # (batch, seq_len, d_model) → (batch, num_heads, seq_len, d_k)
        Q = self.W_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # 2. Each head computes attention independently
        attn_output, attn_weights = scaled_dot_product_attention(Q, K, V, mask)

        # 3. Merge the heads
        # (batch, num_heads, seq_len, d_k) → (batch, seq_len, d_model)
        attn_output = attn_output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )

        # 4. Final linear projection
        output = self.W_o(attn_output)
        return output

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)  # (batch, seq_len, d_model)
output = mha(x, x, x)        # self-attention: Q = K = V = x
print(f"Multi-head output: {output.shape}")  # (2, 10, 512)

4. The Full Transformer Architecture

                 Transformer
    ┌──────────────┐    ┌──────────────┐
    │   Encoder    │    │   Decoder    │
    │              │    │              │
    │  ┌────────┐  │    │  ┌────────┐  │
    │  │  MHA   │  │    │  │Masked  │  │
    │  │(Self)  │  │    │  │  MHA   │  │
    │  └───┬────┘  │    │  └───┬────┘  │
    │  Add & Norm  │    │  Add & Norm  │
    │  ┌────────┐  │    │  ┌────────┐  │
    │  │  FFN   │  │    │  │Cross   │  │
    │  └───┬────┘  │    │  │  MHA   │←─┤── Encoder output
    │  Add & Norm  │    │  └───┬────┘  │
    │              │    │  Add & Norm  │
    │  × N layers  │    │  ┌────────┐  │
    └──────────────┘    │  │  FFN   │  │
                        │  └───┬────┘  │
                        │  Add & Norm  │
                        │              │
                        │  × N layers  │
                        └──────────────┘

Encoder

python

class TransformerEncoderBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),     # d_model → d_ff (typically 4*d_model)
            nn.GELU(),
            nn.Linear(d_ff, d_model),     # d_ff → d_model
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention + residual connection + LayerNorm
        attn_output = self.self_attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        # FFN + residual connection + LayerNorm
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))
        return x

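Decoder

Compared with the Encoder, the Decoder adds two key pieces:

- Masked self-attention: a causal mask prevents position i from seeing position i+1 and anything after it
- Cross-attention: the Queries come from the Decoder, while the Keys and Values come from the Encoder output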
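
A causal mask is usually built as a lower-triangular matrix. The sketch below is a self-contained, single-head illustration with random Queries and Keys; the mask convention (0 means "blocked") is chosen to match the `masked_fill(mask == 0, -inf)` pattern, and the sizes are arbitrary.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 5, 8
Q = torch.randn(1, seq_len, d_k)
K = torch.randn(1, seq_len, d_k)

# Row i has ones in columns 0..i: position i may attend to itself and the past
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
scores = scores.masked_fill(causal_mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)      # -inf scores become exactly 0 after softmax

print(weights[0])
```

In the printed matrix everything above the diagonal is zero, and the first row puts all of its weight on position 0.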

python

class TransformerDecoderBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.masked_self_attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.cross_attention = MultiHeadAttention(d_model, num_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
        # 1. Masked self-attention (cannot see future tokens)
        attn = self.masked_self_attention(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn))

        # 2. Cross-attention (interacts with the Encoder output)
        attn = self.cross_attention(x, encoder_output, encoder_output, src_mask)
        x = self.norm2(x + self.dropout(attn))

        # 3. FFN
        ffn_out = self.ffn(x)
        x = self.norm3(x + self.dropout(ffn_out))
        return x

Positional Encoding

The Transformer processes all tokens in parallel and has no built-in notion of order, so position information must be injected.

python

class PositionalEncoding(nn.Module):
    """Sinusoidal (sine/cosine) positional encoding."""

    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sin
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cos
        pe = pe.unsqueeze(0)  # (1, max_len, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]

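- Sinusoidal encoding: fixed, nothing to learn; used by the original Transformer
- Learned encoding: position embeddings trained as parameters; used by BERT, GPT, and others

5. Transformer Variants

| Variant | Components used | Direction | Pre-training task | Typical applications |
| --- | --- | --- | --- | --- |
| BERT | Encoder only | Bidirectional | Masked LM + NSP | Text classification, NER, question answering |
| GPT | Decoder only | Autoregressive (left to right) | Language modeling (predict the next token) | Text generation, dialogue, code |
| T5/BART | Encoder + Decoder | Encode, then decode | Many tasks cast into one unified format | Translation, summarization |
| ViT | Encoder only | n/a | Supervised image classification (images are split into a sequence of patches) | Image classification |

Decoder-only (GPT-style) is the dominant architecture for today's large language models: the GPT series and LLaMA are decoder-only, and closed models such as GPT-4 and Claude are generally understood to follow the same design.

6. Advantages of the Transformer

- Parallel computation: all positions are processed at once, making full use of GPUs
- Long-range dependencies: self-attention directly models the relationship between any two positions
- Scalability: larger models plus more data have so far kept improving performance (scaling laws)
- Transfer learning: the pre-train then fine-tune paradigm; train once, reuse many times

V. Deep Learning Frameworks

1. PyTorch (Recommended)

PyTorch is the de facto standard for AI research and LLM development. Most mainstream open-source LLMs (LLaMA, Mistral, ChatGLM, and others) are released with PyTorch implementations.

Core Concepts

Tensors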

python

import torch

# Creating tensors
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.zeros(3, 4)                # all zeros
z = torch.randn(2, 3)                # samples from a standard normal distribution
ones = torch.ones(2, 3)              # all ones
eye = torch.eye(3)                   # identity matrix

# Tensor operations
a = torch.randn(3, 4)
b = torch.randn(4, 5)
c = a @ b                            # matrix multiplication, shape (3, 5)
d = a * 2                            # element-wise multiplication by a scalar

# GPU acceleration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = x.to(device)
print(f"Device: {x.device}")

# Shape operations
x = torch.randn(2, 3, 4)
print(x.shape)               # torch.Size([2, 3, 4])
print(x.view(2, 12).shape)   # torch.Size([2, 12])
print(x.permute(0, 2, 1).shape)  # torch.Size([2, 4, 3])

Automatic Differentiation (Autograd)

python

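# PyTorch computes gradients automatically; this is the basis of backpropagation
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 + 3 * x  # y = x² + 3x

y.backward()          # automatic backward pass
print(x.grad)         # dy/dx = 2x + 3 = 7.0

# Gradients accumulate and are not reset automatically; clear them manually
x.grad.zero_()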

Building a Neural Network (nn.Module)

python

import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

model = SimpleNet(784, 128, 10)
print(f"Parameter count: {sum(p.numel() for p in model.parameters()):,}")

A Complete Training Loop

python

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# --- Data preparation ---
X_train = torch.randn(1000, 784)
y_train = torch.randint(0, 10, (1000,))
dataset = TensorDataset(X_train, y_train)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# --- Model, loss, optimizer ---
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleNet(784, 128, 10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# --- Training loop ---
num_epochs = 20

for epoch in range(num_epochs):
    model.train()                  # training mode (Dropout active, BN uses batch statistics)
    total_loss = 0

    for batch_x, batch_y in dataloader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)

        # Forward pass
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)

        # Backward pass
        optimizer.zero_grad()      # clear old gradients
        loss.backward()            # compute gradients
        optimizer.step()           # update parameters

        total_loss += loss.item()

    if (epoch + 1) % 5 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}] Loss: {total_loss/len(dataloader):.4f}")

# --- Evaluation ---
model.eval()                       # evaluation mode (Dropout off, BN uses running statistics)
with torch.no_grad():              # no gradient tracking, which saves memory
    outputs = model(X_train.to(device))
    _, predicted = outputs.max(1)
    accuracy = (predicted == y_train.to(device)).float().mean()
    print(f"Training-set accuracy: {accuracy:.4f}")

# --- Saving and loading the model ---
torch.save(model.state_dict(), 'model.pth')          # save the parameters
model.load_state_dict(torch.load('model.pth', map_location=device))  # load the parameters

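Commonly Used Modules

| Module | Purpose | Common classes |
| --- | --- | --- |
| `torch.nn` | Network layers | Linear, Conv2d, LSTM, Transformer, Embedding |
| `torch.optim` | Optimizers | Adam, AdamW, SGD |
| `torch.utils.data` | Data loading | DataLoader, Dataset, TensorDataset |
| `torchvision` | Computer vision | transforms, datasets, models |

2. TensorFlow / Keras (Optional)

TensorFlow has an edge in industrial deployment scenarios (TensorFlow Serving, TF Lite, TF.js):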

python

import tensorflow as tf
from tensorflow import keras

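# Load MNIST and flatten each image to a 784-dim vector (added so the snippet runs on its own)
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# Keras high-level API: build a model quickly
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.1)
model.evaluate(x_test, y_test)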

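Choosing a framework:

- AI application development / LLM work: PyTorch (rich ecosystem, easy debugging, the mainstream for open-source models)
- Mobile deployment: TensorFlow Lite
- In-browser inference: TensorFlow.js
- Rapid prototyping: Keras

VI. Hands-On Projects

Project 1: MNIST Handwritten Digit Recognition (Beginner)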

python

import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST('./data', train=False, transform=transform)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000)

# Model
class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.net = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, 10)
        )
    def forward(self, x):
        return self.net(self.flatten(x))

# Training
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MNISTNet().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        loss = criterion(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Evaluate on the test set
    model.eval()
    with torch.no_grad():  # no gradients needed for evaluation
        correct = sum(
            (model(img.to(device)).argmax(1) == lab.to(device)).sum().item()
            for img, lab in test_loader
        )
    print(f"Epoch {epoch+1}: accuracy {100.*correct/len(test_data):.2f}%")

Project 2: IMDB Sentiment Analysis (LSTM Practice)

python

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

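# Note: torchtext is no longer actively developed; this code assumes an older
# release that still provides these dataset and vocab APIs.
# Build the vocabulary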
tokenizer = get_tokenizer('basic_english')

def yield_tokens(data_iter):
    for label, text in data_iter:
        yield tokenizer(text)

train_iter = IMDB(split='train')
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=['<unk>', '<pad>'])
vocab.set_default_index(vocab['<unk>'])

# Model
class IMDBClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=1)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            batch_first=True, bidirectional=True, dropout=0.3)
        self.fc = nn.Linear(hidden_dim * 2, 2)

    def forward(self, x):
        embedded = self.embedding(x)
        _, (h_n, _) = self.lstm(embedded)
        # Concatenate the last layer's final forward and backward hidden states
        hidden = torch.cat([h_n[-2], h_n[-1]], dim=1)
        return self.fc(hidden)

model = IMDBClassifier(vocab_size=len(vocab))

Project 3: Fine-Tuning BERT with HuggingFace (Transformer Practice)

python

from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

# 1. Load the data and the model
dataset = load_dataset("imdb")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# 2. Preprocess the data
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# 3. Training configuration
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    evaluation_strategy="epoch",  # newer transformers releases name this eval_strategy
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=500,
)

# 4. Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].select(range(5000)),  # subset for a quick experiment
    eval_dataset=tokenized_datasets["test"].select(range(1000)),
)

trainer.train()

# 5. Inference
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
result = classifier("This movie is absolutely amazing!")
print(result)  # a list like [{'label': ..., 'score': ...}]; labels are LABEL_0 / LABEL_1 unless id2label is set in the model config

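VII. Learning Resources

| Type | Resource | Notes |
| --- | --- | --- |
| Course | Andrew Ng, "Deep Learning Specialization" | Systematic and comprehensive; builds mathematical intuition |
| Course | Hung-yi Lee, "Deep Learning" | Taught in Chinese; accessible yet in-depth |
| Course | Stanford CS231n | Classic computer vision course |
| Course | Stanford CS224n | Classic NLP course |
| Book | "Deep Learning" (the "flower book") | Authoritative on theory; math-heavy |
| Book | "Dive into Deep Learning" | Code-first; PyTorch/TensorFlow |
| Paper | Attention Is All You Need | The original Transformer paper; a must-read |
| Blog | Jay Alammar, "The Illustrated Transformer" | An excellent visual explanation |
| Docs | Official PyTorch documentation | An excellent learning resource |
| Platform | HuggingFace | Transformer model hub and datasets |

VIII. Where Deep Learning Fits in AI Application Development

What to Focus On

| Priority | Topic | Why |
| --- | --- | --- |
| Study in depth | Transformer architecture, self-attention | The technical basis of nearly all large models |
| Be proficient | PyTorch basics, the training workflow, optimization techniques | Required for fine-tuning open-source models |
| Know the basics | CNN, RNN/LSTM | Unless you work on dedicated image or time-series tasks |

Relationship to LLMs

- Deep learning is the foundation; LLMs are deep learning pushed to its limit on natural language
- Once you understand how the Transformer works, the architectural differences between GPT, BERT, and LLaMA become easy to follow
- In practice there is no need to train a large model from scratch; the focus is on using APIs and fine-tuning open-source models

Study Advice

- Transformer first: it is the core of modern AI; derive the attention formula by hand once and work through The Illustrated Transformer
- PyTorch is required: learn the training workflow through hands-on projects; most open-source LLM code is written in PyTorch
- Learn HuggingFace: it is the standard industry tool for working with pre-trained models
- For the math, understanding the concepts is enough: do not get stuck on derivations; focus on engineering practice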
