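Hermes Atropos Environments

Build, test, and debug Hermes Agent RL (reinforcement learning) environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, evaluation with tools, wandb logging, and the three CLI modes (serve/process/evaluate). Use this skill when creating, reviewing, or fixing RL environments in the hermes-agent repo.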

Skill metadata

Source	Optional; install with `hermes skills install official/mlops/hermes-atropos-environments`
Path	`optional-skills/mlops/hermes-atropos-environments`
Version	`1.1.0`
Author	Hermes Agent
License	MIT
Tags	`atropos`, `rl`, `environments`, `training`, `reinforcement-learning`, `reward-functions`
Related skills	`axolotl`, `fine-tuning-with-trl`, `lm-evaluation-harness`

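Reference: full SKILL.md

Info

The following is the complete skill definition that Hermes loads when this skill is triggered. These are the instructions the agent sees when the skill is active.

Hermes Agent Atropos Environments

A guide to building RL environments in the hermes-agent repo that integrate with the Atropos training framework.

Architecture overview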

Atropos BaseEnv (atroposlib/envs/base.py)
    └── HermesAgentBaseEnv (environments/hermes_base_env.py)
            ├── Handles agent loop orchestration
            ├── Handles tool resolution per group
            ├── Handles ToolContext for reward verification
            └── YOUR ENVIRONMENT (environments/your_env.py)
                    Only implements: setup, get_next_item, format_prompt,
                                    compute_reward, evaluate, wandb_log

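What sets Hermes environments apart is that they run a multi-turn agent loop with tool calling rather than a single-turn completion. The base environment handles that loop; you implement only the task and the scoring logic.

File locations

File	Purpose
`environments/hermes_base_env.py`	Base class containing the agent loop and tool resolution
`environments/agent_loop.py`	`HermesAgentLoop` and the `AgentResult` dataclass
`environments/tool_context.py`	`ToolContext` for reward verification
`environments/tool_call_parsers.py`	Phase 2 tool-call parsers (hermes, mistral, etc.)
`environments/your_env.py`	Your environment implementation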

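Inference setup: ask the user first

Important: Before running any test, evaluation, or data-generation command, always ask the user how they want to handle inference. Do not assume OpenRouter or any particular endpoint. Offer these options:

OpenRouter: ask which model they want to use (for example anthropic/claude-sonnet-4.5, google/gemini-2.5-pro, or meta-llama/llama-3.3-70b-instruct). Requires OPENROUTER_API_KEY to be set in the environment.
Self-hosted vLLM endpoint: ask for the base URL (for example http://localhost:8000/v1) and the model name. Set --openai.server_type vllm.
Other OpenAI-compatible API: ask for the base URL, the model name, and any required API key. Set --openai.server_type openai and --openai.health_check false.
Local Atropos training server: for serve mode with a live training loop. Defaults to http://localhost:8000/v1.

Once the user has told you their setup, use those values in every CLI command for the rest of the session. Example prompt: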

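“Before I run this command, how would you like to handle inference?

OpenRouter (I'll need your preferred model, e.g. claude-sonnet-4.5)

Self-hosted vLLM endpoint (give me the URL and model name)

Other OpenAI-compatible API (give me the URL, model, and any authentication details)

Local Atropos training server (serve mode)”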

Key flags per provider:

Provider	`--openai.server_type`	`--openai.health_check`	`--openai.api_key`
OpenRouter	`openai`	`false`	`$OPENROUTER_API_KEY`
vLLM (self-hosted)	`vllm`	(default)	(not needed)
Other OpenAI-compatible	`openai`	`false`	As needed
Local Atropos	(default)	(default)	(not needed)
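
The table above can be captured in a small lookup so that the same flags are reused for every command in a session. The sketch below is illustrative only: `PROVIDER_FLAGS` and `build_openai_flags` are hypothetical names, not part of hermes-agent or Atropos, and only the flag names and values come from the table.

```python
import os

# Flag values taken from the table above; None means "leave the default".
PROVIDER_FLAGS = {
    "openrouter": {"server_type": "openai", "health_check": "false", "needs_key": True},
    "vllm": {"server_type": "vllm", "health_check": None, "needs_key": False},
    "openai_compatible": {"server_type": "openai", "health_check": "false", "needs_key": False},
    "local_atropos": {"server_type": None, "health_check": None, "needs_key": False},
}


def build_openai_flags(provider, base_url, model_name=None, api_key=None):
    """Return the --openai.* CLI arguments for one provider as a list of strings."""
    spec = PROVIDER_FLAGS[provider]
    args = ["--openai.base_url", base_url]
    if model_name:
        args += ["--openai.model_name", model_name]
    if spec["server_type"]:
        args += ["--openai.server_type", spec["server_type"]]
    if spec["health_check"]:
        args += ["--openai.health_check", spec["health_check"]]
    # OpenRouter reads its key from the environment when none is passed in.
    if spec["needs_key"] and api_key is None:
        api_key = os.environ.get("OPENROUTER_API_KEY")
    if api_key:
        args += ["--openai.api_key", api_key]
    return args
```

The returned list can be appended to the `python environments/my_env.py process ...` or `evaluate ...` argument list once the user has stated their provider, base URL, and model.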

Required methods

1. `setup()`: load the dataset and initialize state

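import random

async def setup(self) -> None:
    """Called once at startup. Load datasets, initialize state."""
    # Try HuggingFace first, fall back to built-in samples
    try:
        from datasets import load_dataset
        ds = load_dataset("your/dataset", split="test")
        self._items = [...]  # placeholder: build item dicts from ds
    except Exception:
        # Copy so the in-place shuffle below does not mutate the constant
        self._items = list(BUILTIN_SAMPLES)

    # Always split into train/eval; cap the eval split at half the data
    # so a small fallback dataset still leaves training items
    random.shuffle(self._items)
    eval_size = min(max(20, int(len(self._items) * 0.1)), len(self._items) // 2)
    self._eval_items = self._items[:eval_size]
    self._items = self._items[eval_size:]
    self._index = 0  # cursor read by get_next_item()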

2. `get_next_item()`: return the next training item

async def get_next_item(self) -> dict:
    """Return next item, cycling through dataset."""
    item = self._items[self._index % len(self._items)]
    self._index += 1
    return item

3. `format_prompt(item)`: convert an item into the user message

def format_prompt(self, item: dict) -> str:
    """Convert a dataset item into the user-facing prompt."""
    return f"Research this question: {item['question']}"

4. `compute_reward(item, result, ctx)`: score the rollout

Critical: result is an AgentResult, not a dict. It has the following attributes:

result.messages: list of message dicts (OpenAI format)
result.turns_used: number of LLM calls made
result.finished_naturally: True if the model stopped on its own
result.tool_errors: list of ToolError objects

AgentResult does not have final_response, tool_calls, or tools_used. You have to extract them from result.messages:

async def compute_reward(self, item, result: AgentResult, ctx: ToolContext) -> float:
    # Extract final response (last assistant message with content)
    final_response = ""
    tools_used = []
    for msg in reversed(result.messages):
        if msg.get("role") == "assistant" and msg.get("content") and not final_response:
            final_response = msg["content"]
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            for tc in msg["tool_calls"]:
                fn = tc.get("function", {}) if isinstance(tc, dict) else {}
                name = fn.get("name", "")
                if name:
                    tools_used.append(name)

    # Score using LLM judge, heuristic, or ToolContext verification
    correctness = await self._llm_judge(item, final_response)
    return correctness

ctx (a ToolContext) gives you terminal and file access to the agent's sandbox for verification:

# Run tests in the agent's sandbox
result = ctx.terminal("pytest /workspace/test.py")
return 1.0 if result["exit_code"] == 0 else 0.0
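
The message-walking logic in compute_reward can be factored into a standalone helper and tested without a model. This is a minimal sketch that assumes only the OpenAI-format message dicts described above; `summarize_messages` is a hypothetical name, not part of hermes-agent. Unlike the reversed loop above, it walks forward so that tool names come out in call order.

```python
def summarize_messages(messages):
    """Return (final_response, tools_used) from OpenAI-format message dicts."""
    final_response = ""
    tools_used = []
    # Walk forward so tool names stay in call order; the last assistant
    # message with content wins as the final response.
    for msg in messages:
        if msg.get("role") != "assistant":
            continue
        if msg.get("content"):
            final_response = msg["content"]
        for tc in msg.get("tool_calls") or []:
            fn = tc.get("function", {}) if isinstance(tc, dict) else {}
            name = fn.get("name", "")
            if name:
                tools_used.append(name)
    return final_response, tools_used


messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": None,
     "tool_calls": [{"function": {"name": "terminal", "arguments": "{}"}}]},
    {"role": "tool", "content": "4"},
    {"role": "assistant", "content": "The answer is 4."},
]
print(summarize_messages(messages))  # ('The answer is 4.', ['terminal'])
```

Inside compute_reward, the helper would be called as `summarize_messages(result.messages)`.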

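5. `evaluate()`: periodic evaluation with the full agent loop

You must use the full agent loop with tools, not a single-turn chat_completion. Agentic evaluation is the whole point of hermes-agent environments: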

async def evaluate(self, *args, **kwargs) -> None:
    import time, uuid
    from environments.agent_loop import HermesAgentLoop
    from environments.tool_context import ToolContext

    start_time = time.time()
    tools, valid_names = self._resolve_tools_for_group()
    samples = []

    for item in self._eval_items[:self.config.eval_size]:
        task_id = str(uuid.uuid4())
        messages = []
        if self.config.system_prompt:
            messages.append({"role": "system", "content": self.config.system_prompt})
        messages.append({"role": "user", "content": self.format_prompt(item)})

        agent = HermesAgentLoop(
            server=self.server,
            tool_schemas=tools,
            valid_tool_names=valid_names,
            max_turns=self.config.max_agent_turns,
            task_id=task_id,
            temperature=0.0,  # Deterministic for eval
            max_tokens=self.config.max_token_length,
            extra_body=self.config.extra_body,
        )
        result = await agent.run(messages)

        ctx = ToolContext(task_id)
        try:
            reward = await self.compute_reward(item, result, ctx)
        finally:
            ctx.cleanup()

        samples.append({"prompt": ..., "response": ..., "reward": reward})

    eval_metrics = {"eval/mean_reward": ...}
    await self.evaluate_log(metrics=eval_metrics, samples=samples,
                            start_time=start_time, end_time=time.time())

6. `wandb_log()`: custom metric logging

Always call super().wandb_log() at the end:

async def wandb_log(self, wandb_metrics=None):
    if wandb_metrics is None:
        wandb_metrics = {}
    if self._reward_buffer:
        n = len(self._reward_buffer)
        wandb_metrics["train/mean_reward"] = sum(self._reward_buffer) / n
        self._reward_buffer.clear()
    await super().wandb_log(wandb_metrics)  # MUST call super

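Pitfall: compute_reward appends to the metric buffers. During evaluation this pollutes the training metrics. Roll back any buffer entries added during evaluation.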
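
One way to do the rollback is to record each buffer's length before evaluation and truncate back to it afterwards. The sketch below uses plain lists; `rollback_buffers` is a hypothetical helper, and `_reward_buffer` is the only buffer name taken from the example above.

```python
from contextlib import contextmanager


@contextmanager
def rollback_buffers(*buffers):
    """Discard anything appended to the given lists inside the with-block."""
    lengths = [len(buf) for buf in buffers]
    try:
        yield
    finally:
        for buf, length in zip(buffers, lengths):
            del buf[length:]  # in-place truncate; earlier entries are kept


reward_buffer = [0.2, 0.9]          # stands in for self._reward_buffer
with rollback_buffers(reward_buffer):
    reward_buffer.append(1.0)       # what compute_reward would do during eval
print(reward_buffer)                # [0.2, 0.9]
```

In evaluate(), the per-item loop could run under `with rollback_buffers(self._reward_buffer):`. Any value you need from a buffer has to be read before the block exits.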

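Config class

Always create a custom config subclass using Pydantic Field descriptors. Key inherited fields you can tune: enabled_toolsets, max_agent_turns, agent_temperature, system_prompt, terminal_backend, group_size, steps_per_eval, total_steps.

config_init(): default configuration

A classmethod that returns (YourEnvConfig, [APIServerConfig(...)]). Set server_type to "openai" for OpenRouter and other external APIs. Load the API key from an environment variable.

Three CLI modes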

# SERVE — Full training loop (connects to Atropos API server)
python environments/my_env.py serve --openai.base_url http://localhost:8000/v1

# PROCESS — Offline data generation (saves JSONL)
python environments/my_env.py process --env.total_steps 10 --env.group_size 1 \
    --env.use_wandb false --env.data_path_to_save_groups output.jsonl \
    --openai.base_url "<USER_BASE_URL>" \
    --openai.model_name "<USER_MODEL>" \
    --openai.server_type <USER_SERVER_TYPE> --openai.health_check false

# EVALUATE — Standalone eval (runs setup + evaluate only)
python environments/my_env.py evaluate --env.eval_size 20 \
    --env.data_dir_to_save_evals /tmp/eval_results \
    --openai.base_url "<USER_BASE_URL>" \
    --openai.model_name "<USER_MODEL>" \
    --openai.server_type <USER_SERVER_TYPE> --openai.health_check false

Config priority: CLI arguments > YAML file > config_init() defaults.
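
The precedence rule amounts to a layered merge in which later layers override earlier ones. The sketch below only illustrates the ordering with plain dicts; it is not how Atropos resolves its configuration internally, and `resolve_config` is a hypothetical name.

```python
def resolve_config(defaults, yaml_values=None, cli_values=None):
    """Merge three config layers; CLI beats YAML, YAML beats defaults."""
    merged = dict(defaults)
    for layer in (yaml_values or {}, cli_values or {}):
        # Skip None so an unset option does not erase a lower-priority value.
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged


config = resolve_config(
    defaults={"group_size": 4, "total_steps": 1000, "use_wandb": True},
    yaml_values={"group_size": 8, "total_steps": 500},
    cli_values={"group_size": 1},
)
print(config)  # {'group_size': 1, 'total_steps': 500, 'use_wandb': True}
```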

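Common pitfalls

AgentResult has .messages, not .final_response: extract the final response by iterating result.messages in reverse and taking the last assistant message that has content.
evaluate() must use HermesAgentLoop, not chat_completion: a single-turn chat_completion has no tool support. Agentic evaluation with tool use is the whole point of the hermes-agent benchmarks.
Do not call _llm_judge twice: if compute_reward already calls it, pull the score from the buffer instead of calling the judge separately in evaluate().
Evaluation pollutes the training buffers: compute_reward appends to the metric buffers. During evaluation, roll back the buffer entries to keep training metrics clean.
Always set health_check=false for OpenRouter: OpenRouter has no /health endpoint.
Set data_dir_to_save_evals in evaluate mode: otherwise the results are not saved.
default_toolsets class variable vs. the enabled_toolsets config field: the class variable is only a hint; the config field is what actually controls tool resolution.
Parsing tool calls in messages: tool calls are dicts of the form {"function": {"name": ..., "arguments": ...}}. Always check isinstance(tc, dict).
ToolContext.cleanup(): always call it in a finally block to release sandbox resources.
server_type must be "openai" for external APIs: otherwise Atropos assumes a local vLLM server.
Always ask the user for their inference setup: never hard-code or assume a particular provider or model. See the “Inference setup” section above.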

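Reward function patterns

LLM judge (for open-ended tasks)

Use self.server.chat_completion() with a grading prompt. Parse the JSON response to get a float score. Always include a heuristic fallback (keyword overlap) for when the judge call fails.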
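
The parsing and fallback steps can be sketched independently of the judge call itself. The example assumes the grading prompt asks the judge to reply with a JSON object containing a "score" key; that key name, `parse_judge_score`, and `keyword_overlap` are all illustrative choices rather than hermes-agent APIs.

```python
import json
import re


def keyword_overlap(reference, response):
    """Fraction of reference words that also appear in the response."""
    ref_words = set(re.findall(r"\w+", reference.lower()))
    resp_words = set(re.findall(r"\w+", response.lower()))
    if not ref_words:
        return 0.0
    return len(ref_words & resp_words) / len(ref_words)


def parse_judge_score(judge_text, reference, response):
    """Read {"score": <float>} from the judge reply, else use the heuristic."""
    try:
        # Judges often wrap JSON in prose or code fences; grab the {...} span.
        match = re.search(r"\{.*\}", judge_text, re.DOTALL)
        score = float(json.loads(match.group(0))["score"])
    except (AttributeError, KeyError, TypeError, ValueError):
        # No JSON found, malformed JSON, or a missing/non-numeric score.
        score = keyword_overlap(reference, response)
    return min(1.0, max(0.0, score))
```

If the judge request itself raises, the caller can catch the exception and pass an empty string as `judge_text`, which routes to the same heuristic.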

Binary verification (for code/terminal tasks)

Use ctx.terminal("pytest test.py -q") to run tests in the agent's sandbox. Return 1.0 on pass and 0.0 on failure.
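
A small wrapper makes the pass/fail mapping explicit and lets it be exercised offline. The sketch assumes, as in the earlier example, that ctx.terminal returns a dict with an "exit_code" key; `binary_test_reward` and `FakeContext` are hypothetical names, and the "output" key in the stub is an assumption.

```python
def binary_test_reward(ctx, command="pytest test.py -q"):
    """Return 1.0 if the command exits with code 0 in the sandbox, else 0.0."""
    try:
        result = ctx.terminal(command)
    except Exception:
        # Treat an unreachable sandbox as a failed check instead of crashing.
        return 0.0
    return 1.0 if result.get("exit_code") == 0 else 0.0


class FakeContext:
    """Stand-in for ToolContext so the helper can be tested without a sandbox."""

    def __init__(self, exit_code):
        self.exit_code = exit_code

    def terminal(self, command):
        # "output" is an assumed key; only "exit_code" appears in this guide.
        return {"exit_code": self.exit_code, "output": ""}


print(binary_test_reward(FakeContext(0)), binary_test_reward(FakeContext(1)))
```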

Multi-signal (combining several metrics)

Weighted sum: correctness (0.6) + tool usage (0.2) + efficiency (0.2) + optional bonuses. Clamp the result to [0, 1].
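
The weighting can be written out as follows. The weights come from the line above; how tool usage and efficiency are scored is not specified here, so the definitions below (a 0/1 tool flag and a linear penalty on turns used) are assumptions, as is the name `multi_signal_reward`.

```python
def multi_signal_reward(correctness, used_tools, turns_used, max_turns, bonus=0.0):
    """Combine correctness, tool use, and efficiency into one score in [0, 1]."""
    tool_score = 1.0 if used_tools else 0.0
    # Assumed efficiency: 1.0 for a one-turn answer, 0.0 at the turn limit.
    if max_turns > 1:
        turns = min(max(turns_used, 1), max_turns)
        efficiency = 1.0 - (turns - 1) / (max_turns - 1)
    else:
        efficiency = 1.0
    score = 0.6 * correctness + 0.2 * tool_score + 0.2 * efficiency + bonus
    # Clamp so optional bonuses cannot push the reward outside [0, 1].
    return min(1.0, max(0.0, score))
```

In compute_reward, `turns_used` would come from result.turns_used and `max_turns` from self.config.max_agent_turns.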

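Testing your environment

Import test: python -c "from environments.my_env import MyEnv; print('OK')"
Ask the user for their inference setup (see the “Inference setup” section above)
Process mode (1 item): verify that the JSONL output has valid tokens, masks, and scores
Evaluate mode: verify that the full agent loop runs with tools and that metrics are logged correctly
Check the reward range: scores should fall within [0, 1] and should not all be identical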
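
The process-mode and reward-range checks can be partly automated. The sketch below assumes each JSONL line is an object with "tokens", "masks", and "scores" keys holding one entry per rollout; that layout is inferred from the checklist wording and should be verified against real output. `check_groups` is a hypothetical helper.

```python
import json


def check_groups(lines):
    """Return a list of problems found in process-mode JSONL lines."""
    problems = []
    all_scores = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        group = json.loads(line)
        missing = [k for k in ("tokens", "masks", "scores") if k not in group]
        if missing:
            problems.append(f"line {lineno}: missing {missing}")
            continue
        if not (len(group["tokens"]) == len(group["masks"]) == len(group["scores"])):
            problems.append(f"line {lineno}: tokens/masks/scores differ in length")
        for tokens, mask in zip(group["tokens"], group["masks"]):
            if len(tokens) != len(mask):
                problems.append(f"line {lineno}: token/mask length mismatch")
        for score in group["scores"]:
            if not 0.0 <= score <= 1.0:
                problems.append(f"line {lineno}: score {score} outside [0, 1]")
        all_scores.extend(group["scores"])
    # Identical scores across every rollout give the trainer no signal.
    if len(all_scores) > 1 and len(set(all_scores)) == 1:
        problems.append("all scores are identical")
    return problems
```

A file can be checked with `with open("output.jsonl") as f: print(check_groups(f))`; an empty list means none of these checks failed.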

Minimal implementation checklist

class MyEnv(HermesAgentBaseEnv):
    name = "my-env"
    env_config_cls = MyEnvConfig

    @classmethod
    def config_init(cls): ...          # Default server + env config
    async def setup(self): ...         # Load dataset + train/eval split
    async def get_next_item(self): ... # Cycle through training items
    def format_prompt(self, item): ... # Item → user message string
    async def compute_reward(self, item, result, ctx): ...  # Score rollout
    async def evaluate(self, *args, **kwargs): ...  # Full agent loop eval
    async def wandb_log(self, wandb_metrics=None): ...  # Custom metrics + super()

if __name__ == "__main__":
    MyEnv.cli()
