HuggingFace Accelerate
The simplest distributed training API: add distributed support to any PyTorch script with just 4 lines of code. One unified API for DeepSpeed, FSDP, Megatron, and DDP. Automatic device placement and mixed precision (FP16/BF16/FP8). Interactive configuration, a single launch command. The HuggingFace ecosystem standard.
Skill metadata
| Source | Optional - install with hermes skills install official/mlops/accelerate |
| Path | optional-skills/mlops/accelerate |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | accelerate, torch, transformers |
| Tags | Distributed Training, HuggingFace, Accelerate, DeepSpeed, FSDP, Mixed Precision, PyTorch, DDP, Unified API, Simple |
Reference: the full SKILL.md
Below is the complete skill definition Hermes loads when this skill is triggered - the instructions the agent sees when the skill activates.
HuggingFace Accelerate - Unified Distributed Training
Quick start
Accelerate reduces distributed training to a 4-line change.
Install:
pip install accelerate
Convert a PyTorch script (a 4-line diff):
  import torch
+ from accelerate import Accelerator

+ accelerator = Accelerator()

  model = torch.nn.Transformer()
  optimizer = torch.optim.Adam(model.parameters())
  dataloader = torch.utils.data.DataLoader(dataset)

+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  for batch in dataloader:
      optimizer.zero_grad()
      loss = model(batch)
-     loss.backward()
+     accelerator.backward(loss)
      optimizer.step()
Run (one command):
accelerate launch train.py
Common workflows
Workflow 1: from single GPU to multi-GPU
Original script:
# train.py
import torch

model = torch.nn.Linear(10, 2).to('cuda')
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

for epoch in range(10):
    for batch in dataloader:
        batch = batch.to('cuda')
        optimizer.zero_grad()
        loss = model(batch).mean()
        loss.backward()
        optimizer.step()
With Accelerate (4 lines added):
# train.py
import torch
from accelerate import Accelerator  # +1

accelerator = Accelerator()  # +2

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)  # +3

for epoch in range(10):
    for batch in dataloader:
        # No .to('cuda') needed - automatic!
        optimizer.zero_grad()
        loss = model(batch).mean()
        accelerator.backward(loss)  # +4
        optimizer.step()
Configure (interactive):
accelerate config
It asks:
- What kind of machine? (single/multi GPU, TPU, CPU)
- How many machines? (1)
- Mixed precision? (no/fp16/bf16/fp8)
- DeepSpeed? (no/yes)
Launch (works for any setup):
# Single GPU
accelerate launch train.py
# Multi-GPU (8 GPUs)
accelerate launch --multi_gpu --num_processes 8 train.py
# Multi-node
accelerate launch --multi_gpu --num_processes 16 \
--num_machines 2 --machine_rank 0 \
--main_process_ip $MASTER_ADDR \
train.py
Workflow 2: mixed precision training
Enable FP16/BF16:
from accelerate import Accelerator

# FP16 (with gradient scaling)
accelerator = Accelerator(mixed_precision='fp16')
# BF16 (no scaling, more stable)
accelerator = Accelerator(mixed_precision='bf16')
# FP8 (H100+)
accelerator = Accelerator(mixed_precision='fp8')

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# Everything else is automatic!
for batch in dataloader:
    with accelerator.autocast():  # Optional, done automatically
        loss = model(batch)
    accelerator.backward(loss)
Workflow 3: DeepSpeed ZeRO integration
Enable DeepSpeed ZeRO-2 (the deepspeed_plugin argument takes a DeepSpeedPlugin object, not a plain dict):
from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,                     # ZeRO-2
    offload_optimizer_device="none",  # no CPU offload
    gradient_accumulation_steps=4,
)
accelerator = Accelerator(
    mixed_precision='bf16',
    deepspeed_plugin=deepspeed_plugin,
)

# Same training code as before!
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
Or via config:
accelerate config
# Select: DeepSpeed → ZeRO-2
deepspeed_config.json:
{
"fp16": {"enabled": false},
"bf16": {"enabled": true},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {"device": "cpu"},
"allgather_bucket_size": 5e8,
"reduce_bucket_size": 5e8
}
}
Launch (note: --config_file expects an Accelerate config file, not a DeepSpeed one; pass the DeepSpeed JSON with its dedicated flag):
accelerate launch --use_deepspeed --deepspeed_config_file deepspeed_config.json train.py
Workflow 4: FSDP (Fully Sharded Data Parallel)
Enable FSDP:
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy="FULL_SHARD",            # ZeRO-3 equivalent
    auto_wrap_policy="transformer_based_wrap",
    cpu_offload=False,
)
accelerator = Accelerator(
    mixed_precision='bf16',
    fsdp_plugin=fsdp_plugin,
)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
Or via config:
accelerate config
# Select: FSDP → Full Shard → No CPU Offload
Workflow 5: gradient accumulation
Accumulate gradients:
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):  # Handles accumulation
        optimizer.zero_grad()
        loss = model(batch)
        accelerator.backward(loss)
        optimizer.step()
Effective batch size: batch_size * num_gpus * gradient_accumulation_steps
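For example, per-device batch size 32 on 8 GPUs with 4 accumulation steps:

```python
batch_size = 32                    # per-device micro-batch
num_gpus = 8
gradient_accumulation_steps = 4

effective_batch_size = batch_size * num_gpus * gradient_accumulation_steps
print(effective_batch_size)  # 1024
```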
When to use it, and alternatives
Use Accelerate when you:
- want the simplest distributed training experience
- need a single script that adapts to any hardware
- work within the HuggingFace ecosystem
- need flexibility (DDP/DeepSpeed/FSDP/Megatron)
- are prototyping quickly
Key advantages:
- 4 lines of code: minimal code changes
- Unified API: the same code drives DDP, DeepSpeed, FSDP, and Megatron
- Automation: device placement, mixed precision, sharding
- Interactive config: no launcher setup by hand
- Single launch command: runs anywhere
Switch to an alternative when you need:
- PyTorch Lightning: callbacks, high-level abstractions
- Ray Train: multi-node orchestration, hyperparameter tuning
- DeepSpeed: direct API control, advanced features
- Native DDP: maximum control, minimal abstraction
Common issues
Issue: device placement errors
Don't move tensors to a device manually:
# WRONG
batch = batch.to('cuda')
# CORRECT
# Accelerate handles it automatically after prepare()
Issue: gradient accumulation has no effect
Use the context manager:
# CORRECT
with accelerator.accumulate(model):
    optimizer.zero_grad()
    accelerator.backward(loss)
    optimizer.step()
Issue: saving checkpoints in a distributed run
Use the accelerator's methods:
# Call on ALL processes - with DeepSpeed/FSDP the model and optimizer
# state may be sharded across ranks; save_state coordinates internally
accelerator.save_state('checkpoint/')

# Load on all processes
accelerator.load_state('checkpoint/')

# Gate only rank-specific side effects (logging, uploads) on the main process
if accelerator.is_main_process:
    print('checkpoint written')
Issue: inconsistent FSDP results
Make sure every process uses the same random seed:
from accelerate.utils import set_seed
set_seed(42)
Advanced topics
Megatron integration: see references/megatron-integration.md for tensor-, pipeline-, and sequence-parallel setup.
Custom plugins: see references/custom-plugins.md for writing custom distributed plugins and advanced configuration.
Performance tuning: see references/performance.md for profiling, memory optimization, and best practices.
Hardware requirements
- CPU: works (slower)
- Single GPU: works
- Multi-GPU: DDP (default), DeepSpeed, or FSDP
- Multi-node: DDP, DeepSpeed, FSDP, Megatron
- TPU: supported
- Apple MPS: supported
Launcher requirements:
- DDP: torch.distributed.run (built in)
- DeepSpeed: deepspeed (pip install deepspeed)
- FSDP: PyTorch 1.12+ (built in)
- Megatron: custom setup
Resources
- Docs: https://huggingface.co/docs/accelerate
- GitHub: https://github.com/huggingface/accelerate
- Version: 1.11.0+
- Tutorial: "Accelerate your scripts"
- Examples: https://github.com/huggingface/accelerate/tree/main/examples
- Used by: HuggingFace Transformers, TRL, PEFT, all HF libraries