slime RL Training

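Provides guidance on post-training large language models (LLMs) with reinforcement learning (RL) using slime, a Megatron + SGLang framework. Use it when training GLM models, implementing custom data generation workflows, or when you need tight Megatron-LM integration for RL scaling.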

Skill metadata

Source	Optional. Install with `hermes skills install official/mlops/slime`
Path	`optional-skills/mlops/slime`
Version	`1.0.0`
Author	Orchestra Research
License	MIT
Dependencies	`sglang-router>=0.2.3`, `ray`, `torch>=2.0.0`, `transformers>=4.40.0`
Tags	`Reinforcement Learning`, `Megatron-LM`, `SGLang`, `GRPO`, `Post-Training`, `GLM`

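Reference: full SKILL.md

Info

Below is the full skill definition that Hermes loads when this skill is triggered. These are the instructions the agent sees while the skill is active.

slime: an LLM post-training framework for RL scaling

slime is an LLM post-training framework developed by the THUDM team at Tsinghua University; it powers GLM-4.5, GLM-4.6, and GLM-4.7. It connects Megatron-LM for training with SGLang for high-throughput rollout generation.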

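When to use slime

Choose slime when you:

Need Megatron-LM-native training combined with SGLang inference
Need custom data generation workflows with a flexible data buffer
Are training GLM, Qwen3, DeepSeek V3, or Llama 3 models
Want a research-grade framework with production-grade backing (Z.ai)

Consider alternatives when you:

Need enterprise-grade stability features → use miles
Want to switch backends flexibly → use verl
Need PyTorch-native abstractions → use torchforge

Key features

Training: Megatron-LM with full parallelism support (TP, PP, DP, SP)
Rollout: high-throughput generation based on SGLang, with a router
Data buffer: flexible prompt management and sample storage
Models: GLM-4.x, Qwen3, DeepSeek V3/R1, Llama 3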

Architecture overview

┌─────────────────────────────────────────────────────────┐
│                    Data Buffer                          │
│ - Prompt initialization and management                  │
│ - Custom data generation and filtering                  │
│ - Rollout sample storage                                │
└─────────────┬───────────────────────────┬───────────────┘
              │                           │
┌─────────────▼───────────┐ ┌─────────────▼───────────────┐
│ Training (Megatron-LM)  │ │ Rollout (SGLang + Router)   │
│ - Actor model training  │ │ - Response generation       │
│ - Critic (optional)     │ │ - Reward/verifier output    │
│ - Weight sync to rollout│ │ - Multi-turn support        │
└─────────────────────────┘ └─────────────────────────────┘

Installation

# Recommended: Docker
docker pull slimerl/slime:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  -it slimerl/slime:latest /bin/bash

# Inside container
cd /root/slime && pip install -e . --no-deps

Install from source

git clone https://github.com/THUDM/slime.git
cd slime
pip install -r requirements.txt
pip install -e .

Quick start: GRPO training

# Source model configuration
source scripts/models/qwen3-4B.sh

# Launch training
python train.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 4 \
    --rollout-num-gpus 4 \
    --advantage-estimator grpo \
    --use-kl-loss --kl-loss-coef 0.001 \
    --rollout-batch-size 32 \
    --n-samples-per-prompt 8 \
    --global-batch-size 256 \
    --num-rollout 3000 \
    --prompt-data /path/to/data.jsonl \
    ${MODEL_ARGS[@]} ${CKPT_ARGS[@]}

Workflow 1: Standard GRPO training

Use this workflow to train reasoning models with group-relative advantages.
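
To make "group-relative advantages" concrete, the sketch below normalizes each response's reward against the other responses sampled for the same prompt. It follows the commonly cited GRPO formulation (reward minus group mean, divided by group standard deviation). slime's own implementation may differ in details, so the function name and the `eps` value are illustrative assumptions rather than slime's API.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses.

    Hypothetical helper for illustration; not part of slime.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps keeps the division finite when every response got the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled responses: two correct (reward 1), two wrong (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)  # correct responses get positive values, wrong ones negative
```

When all responses in a group receive the same reward, every advantage is zero and that prompt contributes no learning signal, which is one reason to sample several responses per prompt.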

Prerequisite checklist

Docker environment, or Megatron-LM + SGLang installed
Model checkpoint (HuggingFace or Megatron format)
Training data in JSONL format

Step 1: Prepare data

# data.jsonl format
{"prompt": "What is 2 + 2?", "label": "4"}
{"prompt": "Solve: 3x = 12", "label": "x = 4"}

Or use the chat format:

{
    "prompt": [
        {"role": "system", "content": "You are a math tutor."},
        {"role": "user", "content": "What is 15 + 27?"}
    ],
    "label": "42"
}
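
Before launching a long run it can help to sanity-check the data file. The sketch below checks that each line parses as JSON and carries the two keys used in this guide (`prompt` and `label`), accepting either a plain string or a chat-style message list as the prompt. The function name and the exact checks are assumptions for illustration; slime may accept or require other fields. Note that in a JSONL file each record occupies a single line, so the chat record shown above would be written on one line.

```python
import json


def validate_jsonl_lines(lines, input_key="prompt", label_key="label"):
    """Return (line_number, problem) pairs for records that look malformed.

    Hypothetical pre-flight check; not part of slime.
    """
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "record is not a JSON object"))
            continue
        prompt = record.get(input_key)
        if isinstance(prompt, list):
            # Chat format: every message needs a role and content.
            ok = all(
                isinstance(m, dict) and "role" in m and "content" in m
                for m in prompt
            )
            if not ok:
                problems.append((lineno, "chat message missing role/content"))
        elif not isinstance(prompt, str):
            problems.append((lineno, f"missing or non-string '{input_key}'"))
        if label_key not in record:
            problems.append((lineno, f"missing '{label_key}'"))
    return problems
```

To check a file, pass the open file object directly, for example `validate_jsonl_lines(open("/path/to/data.jsonl", encoding="utf-8"))`; an empty list means no problems were found.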

Step 2: Configure the model

Pick a preconfigured model script:

# List available models
ls scripts/models/
# glm4-9B.sh, qwen3-4B.sh, qwen3-30B-A3B.sh, deepseek-v3.sh, llama3-8B.sh, ...

# Source your model
source scripts/models/qwen3-4B.sh

Step 3: Launch training

python train.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --advantage-estimator grpo \
    --use-kl-loss \
    --kl-loss-coef 0.001 \
    --prompt-data /path/to/train.jsonl \
    --input-key prompt \
    --label-key label \
    --apply-chat-template \
    --rollout-batch-size 32 \
    --n-samples-per-prompt 8 \
    --global-batch-size 256 \
    --num-rollout 3000 \
    --save-interval 100 \
    --eval-interval 50 \
    ${MODEL_ARGS[@]}

Step 4: Monitor training

Check TensorBoard: tensorboard --logdir outputs/
Verify that the reward curve trends upward
Monitor GPU utilization across nodes
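
If you prefer a number to eyeballing the curve, one rough option is to fit a straight line to the logged rewards and look at the sign of its slope. The sketch below assumes you have already exported the per-rollout mean rewards into a plain Python list (for example from TensorBoard); the function name is hypothetical and the export step is not shown.

```python
def reward_slope(rewards):
    """Least-squares slope of reward vs. rollout index (positive = improving)."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two reward values")
    x_mean = (n - 1) / 2
    y_mean = sum(rewards) / n
    # Standard closed-form slope for simple linear regression.
    num = sum((i - x_mean) * (r - y_mean) for i, r in enumerate(rewards))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


# Toy values only, not real training results.
print(reward_slope([0.10, 0.12, 0.15, 0.21, 0.25]) > 0)
```

RL reward curves are noisy, so a slope over a short window can be misleading; computing it over a few hundred rollouts is usually more informative.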

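Workflow 2: Asynchronous training

Use asynchronous mode to get higher throughput by overlapping rollout and training.

When to use asynchronous mode

Large models with long generation times
High GPU idle time in synchronous mode
Enough memory available for buffering

Launch asynchronous training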

python train_async.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --advantage-estimator grpo \
    --async-buffer-size 4 \
    --prompt-data /path/to/train.jsonl \
    ${MODEL_ARGS[@]}

Async-specific parameters

--async-buffer-size 4        # Number of rollouts to buffer
--update-weights-interval 2  # Sync weights every N rollouts
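
The idea behind a bounded rollout buffer can be illustrated with a toy producer/consumer. In the sketch below a "rollout worker" fills a queue of limited size while a "trainer" drains it, so generation and training overlap until the buffer is full. This is only an analogy built on Python's `asyncio.Queue`: it is not how slime implements async training, and all names and sleep durations are made up for illustration.

```python
import asyncio


async def run_pipeline(num_rollouts=6, buffer_size=4):
    """Toy producer/consumer: rollout generation overlaps with training."""
    queue = asyncio.Queue(maxsize=buffer_size)  # bounded, like a rollout buffer
    trained = []

    async def rollout_worker():
        for rollout_id in range(num_rollouts):
            await asyncio.sleep(0.01)    # stand-in for generation time
            await queue.put(rollout_id)  # waits here when the buffer is full
        await queue.put(None)            # sentinel: no more rollouts

    async def trainer():
        while True:
            rollout_id = await queue.get()
            if rollout_id is None:
                break
            await asyncio.sleep(0.01)    # stand-in for a training step
            trained.append(rollout_id)

    await asyncio.gather(rollout_worker(), trainer())
    return trained


print(asyncio.run(run_pipeline()))
```

A larger buffer lets generation run further ahead of training, at the cost of training on samples produced by older weights.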

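Workflow 3: Multi-turn agent training

Use this workflow to train agents with tool use or multi-step reasoning.

Prerequisites

A custom generation function for the multi-turn logic
A tool/environment interface

Step 1: Define a custom generation function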

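# custom_generate.py
# generate_single, extract_tool_call, execute_tool, and compute_reward are
# assumed to be provided elsewhere (for example by your tool/environment
# interface); they are not defined in this snippet.
async def custom_generate(args, samples, evaluation=False):
    """Multi-turn generation with tool calling."""
    for sample in samples:
        conversation = sample.prompt
        response = ""  # stays empty if args.max_turns is 0

        for turn in range(args.max_turns):
            # Generate the next assistant response
            response = await generate_single(conversation)

            # Stop as soon as the model answers without calling a tool
            tool_call = extract_tool_call(response)
            if not tool_call:
                break

            # Run the tool and feed its result back into the conversation
            tool_result = execute_tool(tool_call)
            conversation.append({"role": "assistant", "content": response})
            conversation.append({"role": "tool", "content": tool_result})

        sample.response = response
        sample.reward = compute_reward(sample)

    return samples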
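
The generation function above relies on an `extract_tool_call` helper that this guide does not define. One possible shape for it is sketched below, assuming the model wraps a JSON object with a `name` field in `<tool_call>...</tool_call>` tags. That tag convention is an assumption made for this example; use whatever format your model's chat template actually emits.

```python
import json
import re

# Assumed convention: the model wraps a JSON object in <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_call(response):
    """Return the first tool call as a dict, or None if there is none."""
    match = TOOL_CALL_RE.search(response)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # malformed call: treat the response as a plain answer
    # Require a JSON object that names the tool to invoke.
    return call if isinstance(call, dict) and "name" in call else None


print(extract_tool_call('<tool_call>{"name": "search", "arguments": {"q": "slime"}}</tool_call>'))
print(extract_tool_call("The answer is 4."))
```

Returning `None` for malformed calls ends the turn loop early; depending on the task you may instead want to feed an error message back to the model so it can retry.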

Step 2: Launch with the custom function

python train.py \
    --custom-generate-function-path custom_generate.py \
    --max-turns 5 \
    --prompt-data /path/to/agent_data.jsonl \
    ${MODEL_ARGS[@]}

See examples/search-r1/ for a complete multi-turn search example.

Configuration reference

Three categories of arguments

slime uses three types of arguments:

1. Megatron arguments (passed through directly):

--tensor-model-parallel-size 2
--pipeline-model-parallel-size 1
--num-layers 32
--hidden-size 4096

2. SGLang arguments (prefixed with --sglang-):

--sglang-mem-fraction-static 0.8
--sglang-context-length 8192
--sglang-log-level INFO

3. slime arguments:

# Resource allocation
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--colocate  # Share GPUs between training/inference

# Data
--prompt-data /path/to/data.jsonl
--input-key prompt
--label-key label

# Training loop
--num-rollout 3000
--rollout-batch-size 32
--n-samples-per-prompt 8
--global-batch-size 256

# Algorithm
--advantage-estimator grpo  # or: gspo, ppo, reinforce_plus_plus
--use-kl-loss
--kl-loss-coef 0.001

Key constraint

rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout

Example: 32 × 8 = 256 × 1
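
The constraint is easy to check before launching a job. The helper below is a hypothetical pre-flight check (not a slime function) that derives `num_steps_per_rollout` from the other three values and rejects combinations where the samples produced by one rollout do not divide evenly into global batches.

```python
def num_steps_per_rollout(rollout_batch_size, n_samples_per_prompt, global_batch_size):
    """Derive training steps per rollout, or raise if the sizes do not line up."""
    total_samples = rollout_batch_size * n_samples_per_prompt
    if total_samples % global_batch_size != 0:
        raise ValueError(
            f"{rollout_batch_size} x {n_samples_per_prompt} = {total_samples} samples "
            f"is not a multiple of global_batch_size={global_batch_size}"
        )
    return total_samples // global_batch_size


print(num_steps_per_rollout(32, 8, 256))  # the example above: 1 step per rollout
print(num_steps_per_rollout(64, 8, 256))  # doubling the prompts gives 2 steps
```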

Data buffer system

slime's data buffer supports flexible data management:

Basic data source

class RolloutDataSource:
    def get_samples(self, num_samples):
        """Fetch prompts from dataset."""
        return self.dataset.sample(num_samples)

    def add_samples(self, samples):
        """Called after generation (no-op by default)."""
        pass

Buffered data source (off-policy)

class RolloutDataSourceWithBuffer(RolloutDataSource):
    def __init__(self):
        self.buffer = []

    def add_samples(self, samples):
        """Store generated samples for reuse."""
        self.buffer.extend(samples)

    def buffer_filter(self, args, buffer, num_samples):
        """Custom selection logic (prioritized, stratified, etc.)."""
        return select_best(buffer, num_samples)
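
`select_best` in the snippet above is a placeholder. As one illustration of custom selection logic, the sketch below picks the highest-reward samples and removes them from the buffer so they are not reused. It assumes each sample exposes a numeric `reward` attribute, as in the multi-turn example in this guide; the selection rule itself is purely illustrative and not a recommendation from the slime authors.

```python
import heapq
from types import SimpleNamespace


def select_best(buffer, num_samples):
    """Remove and return the num_samples highest-reward samples from buffer."""
    chosen = heapq.nlargest(num_samples, buffer, key=lambda s: s.reward)
    chosen_ids = {id(s) for s in chosen}
    # Mutate the list in place so the caller's buffer shrinks too.
    buffer[:] = [s for s in buffer if id(s) not in chosen_ids]
    return chosen


# Stand-in samples; real samples would come from rollout generation.
buffer = [SimpleNamespace(reward=r) for r in (0.1, 0.9, 0.5)]
best = select_best(buffer, 2)
print([s.reward for s in best], [s.reward for s in buffer])
```

Other policies (oldest-first, stratified by difficulty, dropping samples past a staleness limit) fit the same shape: take the buffer and a count, return the chosen samples.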

Common issues and solutions

Issue: SGLang engine crashes

Symptom: the inference engine stops running partway through training

Solutions:

# Enable fault tolerance
--use-fault-tolerance

# Increase memory allocation
--sglang-mem-fraction-static 0.85

# Reduce batch size
--rollout-batch-size 16

Issue: Weight sync timeout

Symptom: training hangs after rollout

Solutions:

# Increase sync interval
--update-weights-interval 5

# Use colocated mode (no network transfer)
--colocate

Issue: OOM (out of memory) during training

Symptom: CUDA OOM during the backward pass

Solutions:

# Enable gradient checkpointing
--recompute-activations

# Reduce micro-batch size
--micro-batch-size 1

# Enable sequence parallelism
--sequence-parallel

Issue: Slow data loading

Symptom: GPUs sit idle while data is fetched

Solutions:

# Increase data workers
--num-data-workers 4

# Use streaming dataset
--streaming-data

Supported models

Model family	Configurations
GLM	GLM-4.5, GLM-4.6, GLM-4.7, GLM-Z1-9B
Qwen	Qwen3 (4B, 8B, 30B-A3B), Qwen3-MoE, Qwen2.5
DeepSeek	V3, V3.1, R1
Llama	Llama 3 (8B, 70B)
Other	Kimi K2, Moonlight-16B

Each model has a preconfigured script in scripts/models/.

Advanced topics

Colocated mode

Share GPUs between training and inference to reduce memory footprint:

python train.py \
    --colocate \
    --actor-num-gpus-per-node 8 \
    --sglang-mem-fraction-static 0.4 \
    ${MODEL_ARGS[@]}

Custom reward model

# custom_rm.py
class CustomRewardModel:
    def __init__(self, model_path):
        self.model = load_model(model_path)

    def compute_reward(self, prompts, responses):
        inputs = self.tokenize(prompts, responses)
        scores = self.model(inputs)
        return scores.tolist()

--custom-rm-path custom_rm.py
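
For tasks with a single verifiable answer, such as the math prompts used earlier in this guide, a rule-based check can stand in for a learned reward model. The sketch below is a hypothetical scoring function that compares the last non-empty line of a response with the label after normalizing case and whitespace. How such a function is wired into slime is not covered here, and the "answer on the last line" convention is an assumption.

```python
def exact_match_reward(response, label):
    """Return 1.0 if the normalized final line of response equals label, else 0.0."""

    def normalize(text):
        # Lowercase and collapse runs of whitespace.
        return " ".join(text.strip().lower().split())

    lines = [line for line in response.strip().splitlines() if line.strip()]
    final_line = lines[-1] if lines else ""
    return 1.0 if normalize(final_line) == normalize(label) else 0.0


print(exact_match_reward("2 + 2 is basic addition.\n4", "4"))
print(exact_match_reward("I think the answer is 5", "4"))
```

Exact matching is strict: "x=4" and "x = 4" would not match under this normalization, so real verifiers usually parse the answer rather than compare strings.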

Multi-task evaluation

--eval-prompt-data aime /path/to/aime.jsonl \
--eval-prompt-data gsm8k /path/to/gsm8k.jsonl \
--n-samples-per-eval-prompt 16
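
Sampling several responses per evaluation prompt makes it possible to report pass@k in addition to mean accuracy. The sketch below implements the widely used unbiased pass@k estimator for one prompt, given `n` samples of which `c` are correct. Whether slime reports this metric itself is not stated in this guide, so treat this as an offline analysis helper with a hypothetical name.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 16 samples per prompt, 4 of them correct.
print(pass_at_k(16, 4, 1))
print(pass_at_k(16, 4, 8))
```

Averaging this value over all prompts in a benchmark gives the benchmark-level pass@k.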

Resources

Documentation: https://thudm.github.io/slime/
GitHub: https://github.com/THUDM/slime
Blog: https://lmsys.org/blog/2025-07-09-slime/
Examples: see the examples/ directory, which contains 14+ complete examples
