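LLM Evaluation Harness

Evaluate large language models (LLMs) on more than 60 academic benchmarks, including MMLU, HumanEval, GSM8K, TruthfulQA, and HellaSwag. Use it to benchmark model quality, compare models, report academic results, or track training progress. It is an industry standard used by EleutherAI, HuggingFace, and major labs. It supports HuggingFace, vLLM, and API backends.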

Skill metadata

Source	Bundled (installed by default)
Path	`skills/mlops/evaluation/lm-evaluation-harness`
Version	`1.0.0`
Author	Orchestra Research
License	MIT
Dependencies	`lm-eval`, `transformers`, `vllm`
Tags	`Evaluation`, `LM Evaluation Harness`, `Benchmarking`, `MMLU`, `HumanEval`, `GSM8K`, `EleutherAI`, `Model Quality`, `Academic Benchmarks`, `Industry Standard`

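Reference: full SKILL.md

Info

The following is the complete skill definition that Hermes loads when this skill is triggered. These are the instructions the agent sees when the skill is active.

lm-evaluation-harness - LLM Benchmarking

Quick start

lm-evaluation-harness evaluates LLMs on more than 60 academic benchmarks using standardized prompts and metrics.

Install: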

pip install lm-eval

Evaluate any HuggingFace model:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu,gsm8k,hellaswag \
  --device cuda:0 \
  --batch_size 8

List the available tasks:

lm_eval --tasks list

Common workflows

Workflow 1: Standard benchmark evaluation

Evaluate a model on core benchmarks (MMLU, GSM8K, HumanEval).

Copy this checklist:

Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model
- [ ] Step 3: Run evaluation
- [ ] Step 4: Analyze results

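Step 1: Choose a benchmark suite

Core reasoning benchmarks:

MMLU (Massive Multitask Language Understanding) - 57 subjects, multiple choice
GSM8K - grade-school math word problems
HellaSwag - commonsense reasoning
TruthfulQA - truthfulness and factuality
ARC (AI2 Reasoning Challenge) - science questions

Code benchmarks:

HumanEval - Python code generation (164 problems)
MBPP (Mostly Basic Python Problems) - Python programming

Standard suite (recommended for model releases):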

--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge

Step 2: Configure the model

HuggingFace model:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
  --tasks mmlu \
  --device cuda:0 \
  --batch_size auto  # Auto-detect optimal batch size

Quantized model (4-bit/8-bit):

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,load_in_4bit=True \
  --tasks mmlu \
  --device cuda:0

Custom checkpoint:

lm_eval --model hf \
  --model_args pretrained=/path/to/my-model,tokenizer=/path/to/tokenizer \
  --tasks mmlu \
  --device cuda:0
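The `--model_args` value in the commands above is a comma-separated list of `key=value` pairs. When evaluation commands are generated from Python, it can be convenient to build that string from a dictionary. The following is a minimal sketch; `format_model_args` is a hypothetical helper written for this guide, not part of lm-eval.

```python
def format_model_args(args: dict) -> str:
    """Join key/value pairs into the comma-separated form used by --model_args."""
    parts = []
    for key, value in args.items():
        text = str(value)
        if "," in text:
            # A comma inside a value would be read as a pair separator.
            raise ValueError(f"value for {key!r} must not contain a comma: {text!r}")
        parts.append(f"{key}={text}")
    return ",".join(parts)


model_args = format_model_args({
    "pretrained": "meta-llama/Llama-2-7b-hf",
    "dtype": "bfloat16",
})
print(model_args)  # pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16
```

Keeping the arguments in a dictionary makes it easier to vary one setting (for example the dtype or a quantization flag) across runs without editing strings by hand.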

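Step 3: Run the evaluation

# Full MMLU evaluation (57 subjects), 5-shot (the standard setting for MMLU).
# --log_samples saves individual predictions.
# Comments sit above the command: a comment after a trailing backslash breaks the line continuation.
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size 8 \
  --output_path results/ \
  --log_samples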

# Multiple benchmarks at once
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge \
  --num_fewshot 5 \
  --batch_size 8 \
  --output_path results/llama2-7b-eval.json

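Step 4: Analyze the results

Results are saved to results/llama2-7b-eval.json. The example below is simplified; recent lm-eval releases suffix metric keys with the filter name (for example `acc,none`), so inspect the file your version writes before parsing it: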

{
  "results": {
    "mmlu": {
      "acc": 0.459,
      "acc_stderr": 0.004
    },
    "gsm8k": {
      "exact_match": 0.142,
      "exact_match_stderr": 0.006
    },
    "hellaswag": {
      "acc_norm": 0.765,
      "acc_norm_stderr": 0.004
    }
  },
  "config": {
    "model": "hf",
    "model_args": "pretrained=meta-llama/Llama-2-7b-hf",
    "num_fewshot": 5
  }
}
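Each metric in the results comes with a standard error, which can be turned into an approximate 95% confidence interval (value ± 1.96 × stderr). The sketch below assumes the simplified JSON shape shown above and also tolerates keys with a `,filter` suffix such as `acc,none`; `summarize` is a hypothetical helper, not an lm-eval function.

```python
import json

# Sample in the simplified shape shown above; real files may differ by version.
SAMPLE = """
{
  "results": {
    "mmlu": {"acc": 0.459, "acc_stderr": 0.004},
    "gsm8k": {"exact_match": 0.142, "exact_match_stderr": 0.006},
    "hellaswag": {"acc_norm": 0.765, "acc_norm_stderr": 0.004}
  }
}
"""


def summarize(results, z=1.96):
    """Return (task, metric, value, low, high) rows for every metric that has a stderr."""
    rows = []
    for task in sorted(results):
        metrics = results[task]
        for name, value in metrics.items():
            # Split an optional ",filter" suffix, e.g. "acc,none" -> ("acc", ",", "none").
            base, sep, suffix = name.partition(",")
            if base.endswith("_stderr") or not isinstance(value, (int, float)):
                continue
            stderr = metrics.get(f"{base}_stderr{sep}{suffix}")
            if not isinstance(stderr, (int, float)):
                continue  # no usable standard error for this metric
            rows.append((task, name, value, value - z * stderr, value + z * stderr))
    return rows


rows = summarize(json.loads(SAMPLE)["results"])
for task, name, value, low, high in rows:
    print(f"{task:10s} {name:12s} {value:.3f}  [{low:.3f}, {high:.3f}]")
```

Overlapping intervals between two models are a hint that the difference may not be meaningful on that benchmark.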

Workflow 2: Track training progress

Evaluate checkpoints during training.

Training Progress Tracking:
- [ ] Step 1: Set up periodic evaluation
- [ ] Step 2: Choose quick benchmarks
- [ ] Step 3: Automate evaluation
- [ ] Step 4: Plot learning curves

Step 1: Set up periodic evaluation

Evaluate every N training steps:

#!/bin/bash
# eval_checkpoint.sh
# Usage: ./eval_checkpoint.sh <checkpoint_dir> <step>

CHECKPOINT_DIR=$1
STEP=$2

# 0-shot for speed
lm_eval --model hf \
  --model_args pretrained="$CHECKPOINT_DIR/checkpoint-$STEP" \
  --tasks gsm8k,hellaswag \
  --num_fewshot 0 \
  --batch_size 16 \
  --output_path "results/step-$STEP.json"

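Step 2: Choose quick benchmarks

Quick benchmarks for frequent evaluation:

HellaSwag: about 10 minutes on 1 GPU
GSM8K: about 5 minutes
PIQA: about 2 minutes

Avoid for frequent evaluation (too slow):

MMLU: about 2 hours (57 subjects)
HumanEval: requires code execution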
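To see why the choice of benchmarks matters, it helps to add up the evaluation time over a whole training run. The sketch below uses the approximate timings listed above; the schedule (10,000 steps, one evaluation every 1,000 steps) is a hypothetical example, and the tasks are assumed to run one after another on a single GPU.

```python
# Approximate per-run times in minutes, taken from the lists above.
QUICK = {"hellaswag": 10, "gsm8k": 5, "piqa": 2}
SLOW = {"mmlu": 120}


def eval_overhead_minutes(task_minutes, total_steps, eval_interval):
    """Total minutes spent evaluating if every task runs once per interval."""
    num_evals = total_steps // eval_interval
    return num_evals * sum(task_minutes.values())


# Hypothetical schedule: 10,000 training steps, evaluate every 1,000 steps.
quick = eval_overhead_minutes(QUICK, total_steps=10_000, eval_interval=1_000)
slow = eval_overhead_minutes(SLOW, total_steps=10_000, eval_interval=1_000)
print(f"quick suite: {quick} min, MMLU only: {slow} min")
# quick suite: 170 min, MMLU only: 1200 min
```

Under these assumptions the three quick benchmarks together cost under three hours across the run, while MMLU alone would add twenty hours.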

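Step 3: Automate evaluation

Integrate with the training script:

import subprocess

# In training loop
if step % eval_interval == 0:
    # Save where eval_checkpoint.sh expects it: <checkpoint_dir>/checkpoint-<step>
    model.save_pretrained(f"checkpoints/checkpoint-{step}")

    # Run evaluation (arguments: checkpoint directory, step number)
    subprocess.run(["./eval_checkpoint.sh", "checkpoints", str(step)], check=True)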

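Or use a PyTorch Lightning callback:

import os

from pytorch_lightning import Callback

class EvalHarnessCallback(Callback):
    def on_validation_epoch_end(self, trainer, pl_module):
        step = trainer.global_step
        checkpoint_path = f"checkpoints/step-{step}"

        # Save in HuggingFace format so that pretrained=... can load it.
        # Assumption: pl_module.model is a transformers PreTrainedModel.
        # trainer.save_checkpoint() would write a Lightning checkpoint file instead.
        pl_module.model.save_pretrained(checkpoint_path)

        # Run lm-eval
        os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")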

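Step 4: Plot learning curves

import glob
import json
import re

import matplotlib.pyplot as plt

def step_of(path):
    # Extract the step number from a name such as results/step-1000.json
    return int(re.search(r"step-(\d+)", path).group(1))

# Load all results in numeric step order (a plain sort is lexicographic)
steps = []
scores = []

for file in sorted(glob.glob("results/step-*.json"), key=step_of):
    with open(file) as f:
        data = json.load(f)
    steps.append(step_of(file))
    # eval_checkpoint.sh runs gsm8k and hellaswag, so plot one of those
    scores.append(data["results"]["hellaswag"]["acc_norm"])

# Plot
plt.plot(steps, scores)
plt.xlabel("Training Step")
plt.ylabel("HellaSwag acc_norm")
plt.title("Training Progress")
plt.savefig("training_curve.png")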

Workflow 3: Compare multiple models

Run the same benchmark suite across several models and compare the results.

Model Comparison:
- [ ] Step 1: Define model list
- [ ] Step 2: Run evaluations
- [ ] Step 3: Generate comparison table

Step 1: Define the model list

# models.txt
meta-llama/Llama-2-7b-hf
meta-llama/Llama-2-13b-hf
mistralai/Mistral-7B-v0.1
microsoft/phi-2

Step 2: Run the evaluations

#!/bin/bash
# eval_all_models.sh

TASKS="mmlu,gsm8k,hellaswag,truthfulqa"

while read -r model; do
    # Skip blank lines and comment lines such as "# models.txt"
    [[ -z "$model" || "$model" == \#* ]] && continue
    echo "Evaluating $model"

    # Replace "/" with "-" to get a file-system-safe output name
    model_name=$(echo "$model" | sed 's/\//-/g')

    lm_eval --model hf \
      --model_args pretrained="$model",dtype=bfloat16 \
      --tasks "$TASKS" \
      --num_fewshot 5 \
      --batch_size auto \
      --output_path "results/$model_name.json"

done < models.txt

Step 3: Generate a comparison table

import json
import pandas as pd

# Original model IDs, as listed in models.txt
models = [
    "meta-llama/Llama-2-7b-hf",
    "meta-llama/Llama-2-13b-hf",
    "mistralai/Mistral-7B-v0.1",
    "microsoft/phi-2"
]

tasks = ["mmlu", "gsm8k", "hellaswag", "truthfulqa"]

results = []
for model in models:
    # eval_all_models.sh replaces "/" with "-" in the output file name.
    # Reversing that is ambiguous ("meta-llama" contains a hyphen), so start from the ID.
    with open(f"results/{model.replace('/', '-')}.json") as f:
        data = json.load(f)
    row = {"Model": model}
    for task in tasks:
        # Use the first metric available, preferring normalized accuracy
        metrics = data["results"][task]
        for key in ("acc_norm", "acc", "exact_match"):
            if key in metrics:
                row[task.upper()] = f"{metrics[key]:.3f}"
                break
    results.append(row)

df = pd.DataFrame(results)
print(df.to_markdown(index=False))  # to_markdown requires the tabulate package

Example output:

| Model                  | MMLU  | GSM8K | HELLASWAG | TRUTHFULQA |
|------------------------|-------|-------|-----------|------------|
| meta-llama/Llama-2-7b  | 0.459 | 0.142 | 0.765     | 0.391      |
| meta-llama/Llama-2-13b | 0.549 | 0.287 | 0.801     | 0.430      |
| mistralai/Mistral-7B   | 0.626 | 0.395 | 0.812     | 0.428      |
| microsoft/phi-2        | 0.560 | 0.613 | 0.682     | 0.447      |
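A single summary number per model is often added to a table like this. The sketch below computes an unweighted average over the four columns, using the scores from the example table above. This is only one possible summary: it mixes different metrics (accuracy and exact match) and treats every benchmark as equally important, and `rank_by_average` is a hypothetical helper.

```python
# Scores copied from the example table above.
SCORES = {
    "meta-llama/Llama-2-7b": {"MMLU": 0.459, "GSM8K": 0.142, "HELLASWAG": 0.765, "TRUTHFULQA": 0.391},
    "meta-llama/Llama-2-13b": {"MMLU": 0.549, "GSM8K": 0.287, "HELLASWAG": 0.801, "TRUTHFULQA": 0.430},
    "mistralai/Mistral-7B": {"MMLU": 0.626, "GSM8K": 0.395, "HELLASWAG": 0.812, "TRUTHFULQA": 0.428},
    "microsoft/phi-2": {"MMLU": 0.560, "GSM8K": 0.613, "HELLASWAG": 0.682, "TRUTHFULQA": 0.447},
}


def rank_by_average(scores):
    """Return (model, average) pairs sorted from highest to lowest average."""
    averages = {model: sum(tasks.values()) / len(tasks) for model, tasks in scores.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_average(SCORES)
for model, avg in ranking:
    print(f"{model:24s} {avg:.3f}")
```

In this example the average puts phi-2 first because of its GSM8K score, even though Mistral-7B leads on MMLU and HellaSwag, which shows why the per-task columns should be reported alongside any average.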

Workflow 4: Evaluate with vLLM (faster inference)

Use the vLLM backend for evaluation that is 5-10x faster.

vLLM Evaluation:
- [ ] Step 1: Install vLLM
- [ ] Step 2: Configure vLLM backend
- [ ] Step 3: Run evaluation

Step 1: Install vLLM

pip install vllm

Step 2: Configure the vLLM backend

lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8 \
  --tasks mmlu \
  --batch_size auto

Step 3: Run the evaluation

vLLM is 5-10x faster than the standard HuggingFace backend:

# Standard HF: ~2 hours for MMLU on 7B model
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu \
  --batch_size 8

# vLLM: ~15-20 minutes for MMLU on 7B model
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2 \
  --tasks mmlu \
  --batch_size auto
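As a quick check, the timings in the comments above can be converted into a speedup factor. This is plain arithmetic on the approximate figures from this guide, not a measurement.

```python
def speedup(baseline_minutes, faster_minutes):
    """How many times faster the second run is than the baseline."""
    return baseline_minutes / faster_minutes


HF_MINUTES = 120          # "~2 hours" for MMLU with the HF backend
VLLM_MINUTES = (15, 20)   # "~15-20 minutes" with the vLLM backend

fastest = speedup(HF_MINUTES, VLLM_MINUTES[0])
slowest = speedup(HF_MINUTES, VLLM_MINUTES[1])
print(f"speedup: {slowest:.0f}x to {fastest:.0f}x")
# speedup: 6x to 8x
```

The result falls inside the quoted 5-10x range. Note that the vLLM command above sets tensor_parallel_size=2 while the HF command uses a single device, so part of the difference may come from the second GPU rather than from the backend alone.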

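When to use vs alternatives

Use lm-evaluation-harness when you are:

Benchmarking models for academic papers
Comparing model quality across standard tasks
Tracking training progress
Reporting standardized metrics (everyone uses the same prompts)
Running evaluations that must be reproducible

Use an alternative instead:

HELM (Stanford): broader evaluation (fairness, efficiency, calibration)
AlpacaEval: instruction-following evaluation with an LLM judge
MT-Bench: multi-turn conversational evaluation
Custom scripts: domain-specific evaluation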

Common issues

Issue: Evaluation is too slow

Use the vLLM backend:

lm_eval --model vllm \
  --model_args pretrained=model-name,tensor_parallel_size=2

Or reduce the number of few-shot examples:

--num_fewshot 0  # Instead of 5

Or evaluate a subset of MMLU:

--tasks mmlu_stem  # Only STEM subjects

Issue: Out of memory

Reduce the batch size:

--batch_size 1  # Or --batch_size auto

Use quantization:

--model_args pretrained=model-name,load_in_8bit=True

Enable CPU offloading:

--model_args pretrained=model-name,device_map=auto,offload_folder=offload

Issue: Results do not match reported numbers

Check the few-shot count:

--num_fewshot 5  # MMLU is usually reported 5-shot; other benchmarks use different counts

Check the exact task name:

--tasks mmlu  # Not mmlu_direct or mmlu_fewshot

Verify that the model and tokenizer match:

--model_args pretrained=model-name,tokenizer=same-model-name
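Before chasing a mismatch, it can help to check whether the gap is larger than the evaluation noise. A rough sketch: treat the reported number as matching if it lies within 1.96 standard errors of the measured score. `within_noise` is a hypothetical helper, and the check ignores any uncertainty in the reported number itself.

```python
def within_noise(measured, stderr, reported, z=1.96):
    """True if the reported score lies inside measured +/- z * stderr."""
    return abs(measured - reported) <= z * stderr


# measured/stderr: the MMLU values from the example results earlier in this guide.
# The "reported" values are hypothetical placeholders, not real published results.
print(within_noise(0.459, 0.004, reported=0.463))  # gap 0.004 vs. tolerance 0.00784
print(within_noise(0.459, 0.004, reported=0.480))  # gap 0.021 vs. tolerance 0.00784
```

A gap inside the tolerance is usually not worth investigating; a larger one points to a real difference in setup, such as the few-shot count, task variant, or tokenizer.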

Issue: HumanEval does not execute code

Install the execution dependency:

pip install human-eval

Enable code execution:

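# Code-executing tasks must be enabled explicitly. Recent lm-eval releases use
# HF_ALLOW_CODE_EVAL=1 together with --confirm_run_unsafe_code; run `lm_eval --help`
# to confirm the flag name for your version.
HF_ALLOW_CODE_EVAL=1 lm_eval --model hf \
  --model_args pretrained=model-name \
  --tasks humaneval \
  --confirm_run_unsafe_code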

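Advanced topics

Benchmark descriptions: see references/benchmark-guide.md for detailed descriptions of all 60+ tasks, what they measure, and how to interpret the results.

Custom tasks: see references/custom-tasks.md to learn how to create domain-specific evaluation tasks.

API evaluation: see references/api-evaluation.md to learn how to evaluate OpenAI, Anthropic, and other API models.

Multi-GPU strategies: see references/distributed-eval.md for data-parallel and tensor-parallel evaluation.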

Hardware requirements

GPU: NVIDIA (CUDA 11.8+); runs on CPU, but very slowly
VRAM:
- 7B model: 16GB (bf16) or 8GB (8-bit)
- 13B model: 28GB (bf16) or 14GB (8-bit)
- 70B model: requires multiple GPUs or quantization
Time (7B model, single A100):
- HellaSwag: 10 minutes
- GSM8K: 5 minutes
- MMLU (full): 2 hours
- HumanEval: 20 minutes
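The VRAM figures above can be sanity-checked with a back-of-the-envelope estimate: the weights alone need roughly (number of parameters) × (bytes per parameter). The sketch below computes only that lower bound; activations, the KV cache, and framework overhead come on top, which is consistent with the listed figures being somewhat higher.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2, "8-bit": 1, "4-bit": 0.5}


def weight_memory_gb(num_params_billion, precision):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params_billion * BYTES_PER_PARAM[precision]


for size in (7, 13, 70):
    row = ", ".join(f"{p}: {weight_memory_gb(size, p):.1f} GB" for p in BYTES_PER_PARAM)
    print(f"{size}B -> {row}")
```

For a 7B model this gives 14 GB in bf16 and 7 GB in 8-bit, compared with the 16GB and 8GB listed above. A 70B model needs about 140 GB in bf16, which is why it requires multiple GPUs or quantization.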

Resources

GitHub: https://github.com/EleutherAI/lm-evaluation-harness
Docs: https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs
Task library: 60+ tasks, including MMLU, GSM8K, HumanEval, TruthfulQA, HellaSwag, ARC, WinoGrande, and more.
Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard (uses this evaluation framework)
