
Huggingface Accelerate

The simplest distributed training API. Add distributed support to any PyTorch script with just 4 lines of code. A unified API over DeepSpeed/FSDP/Megatron/DDP. Automatic device placement and mixed precision (FP16/BF16/FP8). Interactive configuration, a single launch command. The HuggingFace ecosystem standard.

Skill Metadata

Source: Optional — install with hermes skills install official/mlops/accelerate
Path: optional-skills/mlops/accelerate
Version: 1.0.0
Author: Orchestra Research
License: MIT
Dependencies: accelerate, torch, transformers
Tags: Distributed Training, HuggingFace, Accelerate, DeepSpeed, FSDP, Mixed Precision, PyTorch, DDP, Unified API, Simple

Reference: Full SKILL.md

Info

Below is the full skill definition that Hermes loads when this skill is triggered. These are the instructions the agent sees when the skill is activated.

HuggingFace Accelerate - Unified Distributed Training

Quick Start

Accelerate reduces distributed training to 4 lines of code.

Installation

pip install accelerate

Convert a PyTorch script (4 lines):

  import torch
+ from accelerate import Accelerator

+ accelerator = Accelerator()

  model = torch.nn.Transformer()
  optimizer = torch.optim.Adam(model.parameters())
  dataloader = torch.utils.data.DataLoader(dataset)

+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  for batch in dataloader:
      optimizer.zero_grad()
      loss = model(batch)
-     loss.backward()
+     accelerator.backward(loss)
      optimizer.step()

Run (single command):

accelerate launch train.py

Common Workflows

Workflow 1: From Single GPU to Multi-GPU

Original script:

# train.py
import torch

model = torch.nn.Linear(10, 2).to('cuda')
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

for epoch in range(10):
    for batch in dataloader:
        batch = batch.to('cuda')
        optimizer.zero_grad()
        loss = model(batch).mean()
        loss.backward()
        optimizer.step()

With Accelerate (add 4 lines):

# train.py
import torch
from accelerate import Accelerator  # +1

accelerator = Accelerator()  # +2

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)  # +3

for epoch in range(10):
    for batch in dataloader:
        # No .to('cuda') needed - automatic!
        optimizer.zero_grad()
        loss = model(batch).mean()
        accelerator.backward(loss)  # +4
        optimizer.step()

Configure (interactive):

accelerate config

Questions:

  • What kind of machine? (single/multi GPU/TPU/CPU)
  • How many machines? (1)
  • Mixed precision? (no/fp16/bf16/fp8)
  • DeepSpeed? (no/yes)
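Answering these questions writes a config file (typically under ~/.cache/huggingface/accelerate/). As a rough sketch, a hypothetical 8-GPU, BF16 answer set produces a YAML along these lines (field names from a recent accelerate version; exact contents vary by version and answers):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
use_cpu: false
```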

Launch (works with any setup):

# Single GPU
accelerate launch train.py

# Multi-GPU (8 GPUs)
accelerate launch --multi_gpu --num_processes 8 train.py

# Multi-node
accelerate launch --multi_gpu --num_processes 16 \
    --num_machines 2 --machine_rank 0 \
    --main_process_ip $MASTER_ADDR \
    train.py

Workflow 2: Mixed Precision Training

Enable FP16/BF16:

from accelerate import Accelerator

# FP16 (with gradient scaling)
accelerator = Accelerator(mixed_precision='fp16')

# BF16 (no scaling, more stable)
accelerator = Accelerator(mixed_precision='bf16')

# FP8 (H100+)
accelerator = Accelerator(mixed_precision='fp8')

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# Everything else is automatic!
for batch in dataloader:
    with accelerator.autocast():  # Optional - applied automatically after prepare()
        loss = model(batch)
    accelerator.backward(loss)

Workflow 3: DeepSpeed ZeRO Integration

Enable DeepSpeed ZeRO-2:

from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,                     # ZeRO-2
    gradient_accumulation_steps=4,
    offload_optimizer_device="none"   # no optimizer offload
)

accelerator = Accelerator(
    mixed_precision='bf16',
    deepspeed_plugin=deepspeed_plugin
)

# Same code as before!
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

Or via config:

accelerate config
# Select: DeepSpeed → ZeRO-2

deepspeed_config.json:

{
  "fp16": {"enabled": false},
  "bf16": {"enabled": true},
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {"device": "cpu"},
    "allgather_bucket_size": 5e8,
    "reduce_bucket_size": 5e8
  }
}

Launch:

# Reference the JSON from your accelerate config (the deepspeed_config_file
# field written by `accelerate config`), then launch as usual
accelerate launch train.py

Workflow 4: FSDP (Fully Sharded Data Parallel)

Enable FSDP:

from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy="FULL_SHARD",            # ZeRO-3 equivalent
    auto_wrap_policy="transformer_based_wrap",
    cpu_offload=False
)

accelerator = Accelerator(
    mixed_precision='bf16',
    fsdp_plugin=fsdp_plugin
)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

Or via config:

accelerate config
# Select: FSDP → Full Shard → No CPU Offload

Workflow 5: Gradient Accumulation

Accumulate gradients:

from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):  # Handles accumulation
        optimizer.zero_grad()
        loss = model(batch)
        accelerator.backward(loss)
        optimizer.step()

Effective batch size: batch_size * num_gpus * gradient_accumulation_steps

When to Use and Alternatives

Use Accelerate when:

  • You want the simplest possible distributed training experience
  • You need a single script that adapts to any hardware
  • You work in the HuggingFace ecosystem
  • You need flexibility (DDP/DeepSpeed/FSDP/Megatron)
  • You need rapid prototyping

Key advantages:

  • 4 lines of code: minimal code changes
  • Unified API: the same code for DDP, DeepSpeed, FSDP, and Megatron
  • Automation: device placement, mixed precision, sharding
  • Interactive config: no manual launcher setup
  • Single launch command: runs anywhere

When to use an alternative instead:

  • PyTorch Lightning: you want callbacks and high-level abstractions
  • Ray Train: multi-node orchestration, hyperparameter tuning
  • DeepSpeed: direct API control, advanced features
  • Native DDP: maximum control, minimum abstraction

Troubleshooting

Problem: device placement errors

Don't move tensors to devices manually:

# WRONG
batch = batch.to('cuda')

# CORRECT
# Accelerate handles it automatically after prepare()

Problem: gradient accumulation has no effect

Use the context manager:

# CORRECT
with accelerator.accumulate(model):
    optimizer.zero_grad()
    accelerator.backward(loss)
    optimizer.step()

Problem: checkpoint saving in distributed environments

Use the accelerator methods:

# save_state coordinates across processes - call it on every rank
# (with FSDP/DeepSpeed each rank writes its own shard)
accelerator.save_state('checkpoint/')

# Load on all processes
accelerator.load_state('checkpoint/')

Problem: inconsistent FSDP results

Ensure the same random seed on every process:

from accelerate.utils import set_seed
set_seed(42)

Advanced Topics

Megatron integration: see references/megatron-integration.md for tensor-, pipeline-, and sequence-parallel setup.

Custom plugins: see references/custom-plugins.md for creating custom distributed plugins and advanced configuration.

Performance tuning: see references/performance.md for profiling, memory optimization, and best practices.

Hardware Requirements

  • CPU: works (slower)
  • Single GPU: works
  • Multi-GPU: DDP (default), DeepSpeed, or FSDP
  • Multi-node: DDP, DeepSpeed, FSDP, Megatron
  • TPU: supported
  • Apple MPS: supported

Launcher Requirements

  • DDP: torch.distributed.run (built-in)
  • DeepSpeed: deepspeed (pip install deepspeed)
  • FSDP: PyTorch 1.12+ (built-in)
  • Megatron: custom setup

Resources