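Context Compression and Caching

Hermes Agent uses a dual-layer compression system and Anthropic prompt caching to manage context window usage efficiently across long conversations.

Source files: agent/context_engine.py (ABC interface), agent/context_compressor.py (default engine), agent/prompt_caching.py, gateway/run.py (session hygiene), run_agent.py (search for _compress_context)

Pluggable Context Engine

Context management is built on the ContextEngine ABC (agent/context_engine.py). The built-in ContextCompressor is the default implementation, but a plugin can replace it with another engine (for example, lossless context management).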

context:
  engine: "compressor"    # default: built-in lossy summarization
  # engine: "lcm"         # example: a plugin that provides lossless context

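The engine is responsible for:

Deciding whether compression should trigger (should_compress())
Performing the compression (compress())
Optionally exposing tools the agent can call (for example, lcm_grep)
Tracking token usage from API responses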
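The four responsibilities above can be pictured as an abstract base class. This is a minimal sketch, not the actual interface in agent/context_engine.py: only should_compress() and compress() are named in this document, while get_tools(), update_usage(), NoopEngine, and every signature shown are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any


class ContextEngine(ABC):
    """Hypothetical shape of the engine interface; real signatures may differ."""

    @abstractmethod
    def should_compress(self, prompt_tokens: int) -> bool:
        """Return True when the conversation should be compressed."""

    @abstractmethod
    def compress(self, messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
        """Return a (usually shorter) replacement message list."""

    def get_tools(self) -> list[dict[str, Any]]:
        """Optional hook (assumed name): tool schemas the agent may call."""
        return []

    def update_usage(self, prompt_tokens: int) -> None:
        """Assumed name: record the token count reported by an API response."""
        self.last_prompt_tokens = prompt_tokens


class NoopEngine(ContextEngine):
    """Illustrative engine that never compresses."""

    def should_compress(self, prompt_tokens: int) -> bool:
        return False

    def compress(self, messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
        return list(messages)
```

A plugin engine would subclass the base class and override the two abstract methods; the optional hooks keep their defaults unless the engine needs them.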

The engine is selected through context.engine in config.yaml. It is resolved in this order:

Check the plugins/context_engine/<name>/ directory
Check the general plugin system (register_context_engine())
Fall back to the built-in ContextCompressor
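The resolution order above might look roughly like the following. This is a sketch under stated assumptions: the _REGISTRY dictionary, the signature of register_context_engine(), and the resolve_engine_source() helper are all hypothetical, and the real loader returns an engine instance rather than a label.

```python
from pathlib import Path
from typing import Callable

# Hypothetical registry filled by a register_context_engine()-style call.
_REGISTRY: dict[str, Callable[[], object]] = {}


def register_context_engine(name: str, factory: Callable[[], object]) -> None:
    """Assumed signature: remember a factory under the engine's name."""
    _REGISTRY[name] = factory


def resolve_engine_source(name: str, plugins_root: Path) -> str:
    """Report where the engine called `name` would be loaded from."""
    if name != "compressor":  # the default always maps to the built-in engine
        # 1. Directory-based plugin: plugins/context_engine/<name>/
        if (plugins_root / "context_engine" / name).is_dir():
            return "plugin_dir"
        # 2. General plugin system
        if name in _REGISTRY:
            return "registry"
    # 3. Fall back to the built-in ContextCompressor
    return "builtin"
```

An unknown name falls through to the built-in engine in this sketch, which mirrors the fallback step in the list.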

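Plugin engines are never activated automatically; the user must explicitly set context.engine to the plugin's name. The default value "compressor" always uses the built-in engine.

Configure it through hermes plugins → Provider Plugins → Context Engine, or edit config.yaml directly.

To develop a context engine plugin, see Context Engine Plugins.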

Dual-Layer Compression System

Hermes has two compression layers that operate independently:

                     ┌──────────────────────────┐
  Incoming message   │   Gateway Session Hygiene │  Fires at 85% of context
  ─────────────────► │   (pre-agent, rough est.) │  Safety net for large sessions
                     └─────────────┬────────────┘
                                   │
                                   ▼
                     ┌──────────────────────────┐
                     │   Agent ContextCompressor │  Fires at 50% of context (default)
                     │   (in-loop, real tokens)  │  Normal context management
                     └──────────────────────────┘

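1. Gateway Session Hygiene (85% threshold)

Located in gateway/run.py (search for _maybe_compress_session). This is a safety net that runs before the agent processes a message. It prevents API failures when a session grows too large between turns (for example, messages accumulating overnight in Telegram or Discord).

Threshold: fixed at 85% of the model's context length
Token source: prefers the actual token count reported by the API on the previous turn; falls back to a rough character-based estimate (estimate_messages_tokens_rough) when that is unavailable
Trigger: only when len(history) >= 4 and compression is enabled
Purpose: catch sessions that escaped the agent's own compressor

The gateway threshold is intentionally higher than the agent compressor's. Setting it to 50% (the same as the agent) would cause premature compression on every turn in long sessions.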
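The gateway check described above can be sketched as a single predicate. The function names, the 4-characters-per-token heuristic, and the use of >= at the threshold are assumptions for illustration; only the 85% figure, the len(history) >= 4 guard, and the preference for API-reported counts come from the text.

```python
from typing import Optional

GATEWAY_THRESHOLD = 0.85  # fixed fraction of the model's context length


def rough_token_estimate(history: list[dict]) -> int:
    """Stand-in for estimate_messages_tokens_rough (assumes ~4 chars per token)."""
    chars = sum(len(str(msg.get("content") or "")) for msg in history)
    return chars // 4


def needs_gateway_compression(
    history: list[dict],
    context_length: int,
    last_reported_tokens: Optional[int],
    enabled: bool = True,
) -> bool:
    """Decide whether the pre-agent safety net should compress the session."""
    if not enabled or len(history) < 4:
        return False
    # Prefer the real count from the previous API response when available.
    if last_reported_tokens is not None:
        tokens = last_reported_tokens
    else:
        tokens = rough_token_estimate(history)
    return tokens >= GATEWAY_THRESHOLD * context_length
```

With a 200K-token model, a session reported at 180,000 tokens would trip the check, while one at 160,000 would be left for the in-loop compressor.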

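2. Agent ContextCompressor (50% threshold, configurable)

Located in agent/context_compressor.py. This is the primary compression system. It runs inside the agent's tool loop and has access to accurate, API-reported token counts.

Configuration

All compression settings are read from the compression key in config.yaml: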

compression:
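  enabled: true              # enable/disable compression (default: true)
  threshold: 0.50            # fraction of the context window (default: 0.50 = 50%)
  target_ratio: 0.20         # fraction of the threshold kept as tail (default: 0.20)
  protect_last_n: 20         # minimum number of protected tail messages (default: 20)
  summary_model: null        # override the model used for summaries (default: the auxiliary model)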

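Parameter Details

Parameter	Default	Range	Description
`threshold`	`0.50`	0.0–1.0	Compression triggers when prompt tokens ≥ `threshold × context_length`
`target_ratio`	`0.20`	0.10–0.80	Controls the tail-protection token budget: `threshold_tokens × target_ratio`
`protect_last_n`	`20`	≥1	Minimum number of most recent messages that are always kept
`protect_first_n`	`3`	(hardcoded)	The system prompt and the first exchange are always kept

Computed values (200K-context model with default settings)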

context_length       = 200,000
threshold_tokens     = 200,000 × 0.50 = 100,000
tail_token_budget    = 100,000 × 0.20 = 20,000
max_summary_tokens   = min(200,000 × 0.05, 12,000) = 10,000
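The same arithmetic can be reproduced for any context length. The helper below is a small sketch; the function name and the returned dictionary layout are illustrative, while the formulas are the ones listed above.

```python
def derived_budgets(
    context_length: int,
    threshold: float = 0.50,
    target_ratio: float = 0.20,
) -> dict[str, int]:
    """Compute the token budgets that follow from the compression settings."""
    threshold_tokens = round(context_length * threshold)
    return {
        "threshold_tokens": threshold_tokens,
        "tail_token_budget": round(threshold_tokens * target_ratio),
        # The summary ceiling is 5% of the context, capped at 12,000 tokens.
        "max_summary_tokens": min(round(context_length * 0.05), 12_000),
    }


print(derived_budgets(200_000))
# {'threshold_tokens': 100000, 'tail_token_budget': 20000, 'max_summary_tokens': 10000}
```

For a 1M-token model the 12,000-token cap becomes the binding limit on the summary, since 5% of the context would be 50,000.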

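Compression Algorithm

The ContextCompressor.compress() method follows a four-phase algorithm:

Phase 1: Prune old tool results (cheap, no LLM call)

Old tool results (>200 characters) outside the protected tail are replaced with: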

[Old tool output cleared to save context space]

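This is a cheap preprocessing step that saves a significant number of tokens from verbose tool output (file contents, terminal output, search results).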
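A minimal sketch of this pruning pass follows. It assumes OpenAI-style message dictionaries with role and content keys and a precomputed tail_start index; the function name and parameters are hypothetical, while the placeholder string and the 200-character cutoff are taken from the text.

```python
PLACEHOLDER = "[Old tool output cleared to save context space]"


def clear_old_tool_results(
    messages: list[dict],
    tail_start: int,
    min_chars: int = 200,
) -> list[dict]:
    """Return a copy in which long tool results before the tail are replaced."""
    cleaned = []
    for index, msg in enumerate(messages):
        content = msg.get("content")
        if (
            index < tail_start                 # outside the protected tail
            and msg.get("role") == "tool"      # only tool results are touched
            and isinstance(content, str)
            and len(content) > min_chars       # short results are kept as-is
        ):
            msg = {**msg, "content": PLACEHOLDER}
        cleaned.append(msg)
    return cleaned
```

Because the pass only swaps content strings, message count and ordering are unchanged, so tool_call/tool_result pairing is not affected at this stage.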

Phase 2: Determine boundaries

┌─────────────────────────────────────────────────────────────┐
│  Message list                                               │
│                                                             │
│  [0..2]  ← protect_first_n (system + first exchange)        │
│  [3..N]  ← middle turns → SUMMARIZED                        │
│  [N..end] ← tail (by token budget OR protect_last_n)        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

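Tail protection is token-budget based: walk backward from the end, accumulating tokens until the budget is exhausted. If the budget protects fewer messages than protect_last_n, fall back to the fixed count.

Boundaries are aligned so that tool_call/tool_result groups are never split. The _align_boundary_backward() method skips back over consecutive tool results to the parent assistant message, keeping the group intact.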
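The tail selection and boundary alignment can be sketched as below. This is an illustration under assumptions: per-message token counts are passed in as a list, tool results are assumed to directly follow the assistant message that issued them, head protection is ignored, and find_tail_start() is a hypothetical name (only _align_boundary_backward() appears in the text).

```python
def align_boundary_backward(messages: list[dict], start: int) -> int:
    """Move the boundary back so the tail never begins with a tool result."""
    while 0 < start < len(messages) and messages[start]["role"] == "tool":
        start -= 1  # step over tool results until the parent message
    return start


def find_tail_start(
    messages: list[dict],
    token_counts: list[int],
    tail_token_budget: int,
    protect_last_n: int,
) -> int:
    """Return the index of the first message in the protected tail."""
    used = 0
    start = len(messages)
    # Walk backward, accumulating tokens until the budget is exhausted.
    while start > 0 and used + token_counts[start - 1] <= tail_token_budget:
        start -= 1
        used += token_counts[start]
    # Fall back to the fixed count if the budget protected too few messages.
    start = min(start, max(0, len(messages) - protect_last_n))
    return align_boundary_backward(messages, start)
```

If the budget would cut between two tool results, the boundary moves back to the assistant message that issued the calls, so the whole group stays in the tail.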

Phase 3: Generate a structured summary

The middle turns are summarized by the auxiliary LLM using a structured template:

## Goal
[What the user is trying to accomplish]

## Constraints & Preferences
[User preferences, coding style, constraints, important decisions]

## Progress
### Done
[Completed work — specific file paths, commands run, results]
### In Progress
[Work currently underway]
### Blocked
[Any blockers or issues encountered]

## Key Decisions
[Important technical decisions and why]

## Relevant Files
[Files read, modified, or created — with brief note on each]

## Next Steps
[What needs to happen next]

## Critical Context
[Specific values, error messages, configuration details]

The summary budget scales with the amount of content being compressed:

Formula: content_tokens × 0.20 (the _SUMMARY_RATIO constant)
Minimum: 2,000 tokens
Maximum: min(context_length × 0.05, 12,000) tokens
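Put together, the three rules above amount to a clamp. The sketch below assumes the ceiling is applied before the floor; the text does not say which wins when they conflict (for example, on a model with a context under 40K tokens), and the function name is hypothetical.

```python
_SUMMARY_RATIO = 0.20         # share of the compressed content given to the summary
_MIN_SUMMARY_TOKENS = 2_000
_MAX_SUMMARY_TOKENS = 12_000


def summary_token_budget(content_tokens: int, context_length: int) -> int:
    """Scale the summary budget with the content, within the stated bounds."""
    ceiling = min(round(context_length * 0.05), _MAX_SUMMARY_TOKENS)
    proposed = round(content_tokens * _SUMMARY_RATIO)
    # Assumption: apply the ceiling first, then enforce the 2,000-token floor.
    return max(_MIN_SUMMARY_TOKENS, min(proposed, ceiling))


print(summary_token_budget(30_000, 200_000))   # 6000
print(summary_token_budget(5_000, 200_000))    # 2000 (floor applies)
print(summary_token_budget(100_000, 200_000))  # 10000 (ceiling applies)
```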

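Phase 4: Assemble the compressed messages

The compressed message list consists of:

Head messages (on the first compression, a note is appended to the system prompt)
Summary message (its role is chosen to avoid consecutive same-role violations)
Tail messages (unchanged)

Orphaned tool_call/tool_result pairs are cleaned up by _sanitize_tool_pairs():

Tool results that reference a removed call → removed
Tool calls whose results were removed → a stub result is injected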
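The two cleanup rules can be sketched as follows. This is not the real _sanitize_tool_pairs(): it assumes OpenAI-style messages in which assistant messages carry a tool_calls list with id fields and tool messages carry a tool_call_id, and the stub text is invented for the example.

```python
STUB_RESULT = "[Tool result removed during compaction]"  # hypothetical wording


def sanitize_tool_pairs(messages: list[dict]) -> list[dict]:
    """Drop orphaned tool results and stub out calls that lost their result."""
    call_ids = {
        call["id"]
        for msg in messages
        if msg.get("role") == "assistant"
        for call in msg.get("tool_calls") or []
    }
    result_ids = {
        msg["tool_call_id"] for msg in messages if msg.get("role") == "tool"
    }

    sanitized = []
    for msg in messages:
        if msg.get("role") == "tool" and msg["tool_call_id"] not in call_ids:
            continue  # result references a call that no longer exists
        sanitized.append(msg)
        if msg.get("role") == "assistant":
            for call in msg.get("tool_calls") or []:
                if call["id"] not in result_ids:
                    # The call survived but its result did not: inject a stub.
                    sanitized.append(
                        {"role": "tool", "tool_call_id": call["id"], "content": STUB_RESULT}
                    )
    return sanitized
```

Keeping every call paired with some result matters because chat APIs generally reject a request in which a tool call has no matching tool message.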

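Iterative Re-compression

On subsequent compressions, the previous summary is passed to the LLM with an instruction to update it rather than summarize from scratch. This preserves information across multiple compressions: items move from "In Progress" to "Done", new progress is added, and stale information is removed.

The _previous_summary field on the compressor instance stores the last summary text for this purpose.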
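The update-instead-of-restart behavior can be illustrated with a small class. Only the _previous_summary field comes from the text; the class name, the prompt wording, and the llm callable (any function that maps a prompt string to a completion string) are assumptions.

```python
from typing import Callable, Optional


class IterativeSummarizer:
    """Sketch of a summarizer that carries its last summary forward."""

    def __init__(self, llm: Callable[[str], str]) -> None:
        self._llm = llm
        self._previous_summary: Optional[str] = None

    def summarize(self, transcript: str) -> str:
        if self._previous_summary is None:
            # First compression: summarize from scratch.
            prompt = f"Summarize the following turns:\n{transcript}"
        else:
            # Later compressions: ask for an update of the stored summary.
            prompt = (
                "Update the existing summary with the new turns "
                "instead of starting over.\n"
                f"Existing summary:\n{self._previous_summary}\n"
                f"New turns:\n{transcript}"
            )
        self._previous_summary = self._llm(prompt)
        return self._previous_summary
```

Because the stored summary is part of the second prompt, details from turns that were already compressed away can still reach the new summary.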

Before/After Example

Before compression (45 messages, ~95K tokens)

[0] system:    "You are a helpful assistant..." (system prompt)
[1] user:      "Help me set up a FastAPI project"
[2] assistant: <tool_call> terminal: mkdir project </tool_call>
[3] tool:      "directory created"
[4] assistant: <tool_call> write_file: main.py </tool_call>
[5] tool:      "file written (2.3KB)"
    ... 30 more turns of file editing, testing, debugging ...
[38] assistant: <tool_call> terminal: pytest </tool_call>
[39] tool:      "8 passed, 2 failed\n..."  (5KB output)
[40] user:      "Fix the failing tests"
[41] assistant: <tool_call> read_file: tests/test_api.py </tool_call>
[42] tool:      "import pytest\n..."  (3KB)
[43] assistant: "I see the issue with the test fixtures..."
[44] user:      "Great, also add error handling"

After compression (25 messages, ~45K tokens)

[0] system:    "You are a helpful assistant...
               [Note: Some earlier conversation turns have been compacted...]"
[1] user:      "Help me set up a FastAPI project"
[2] assistant: "[CONTEXT COMPACTION] Earlier turns were compacted...

               ## Goal
               Set up a FastAPI project with tests and error handling

               ## Progress
               ### Done
               - Created project structure: main.py, tests/, requirements.txt
               - Implemented 5 API endpoints in main.py
               - Wrote 10 test cases in tests/test_api.py
               - 8/10 tests passing

               ### In Progress
               - Fixing 2 failing tests (test_create_user, test_delete_user)

               ## Relevant Files
               - main.py — FastAPI app with 5 endpoints
               - tests/test_api.py — 10 test cases
               - requirements.txt — fastapi, pytest, httpx

               ## Next Steps
               - Fix failing test fixtures
               - Add error handling"
[3] user:      "Fix the failing tests"
[4] assistant: <tool_call> read_file: tests/test_api.py </tool_call>
[5] tool:      "import pytest\n..."
[6] assistant: "I see the issue with the test fixtures..."
[7] user:      "Great, also add error handling"

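Prompt Caching (Anthropic)

Source: agent/prompt_caching.py

Caching the conversation prefix reduces input token cost on multi-turn conversations by roughly 75%. It uses Anthropic's cache_control breakpoints.

Strategy: system_and_3

Anthropic allows at most 4 cache_control breakpoints per request. Hermes uses the "system_and_3" strategy: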

Breakpoint 1: System prompt           (stable across all turns)
Breakpoint 2: 3rd-to-last non-system message  ─┐
Breakpoint 3: 2nd-to-last non-system message   ├─ Rolling window
Breakpoint 4: Last non-system message          ─┘
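Choosing the four breakpoint positions can be sketched in a few lines. The function name is hypothetical, and the sketch assumes messages are dictionaries with a role key and that at most one system message matters.

```python
def breakpoint_indices(messages: list[dict]) -> list[int]:
    """Pick the system prompt plus the last three non-system messages."""
    system = [i for i, m in enumerate(messages) if m["role"] == "system"][:1]
    non_system = [i for i, m in enumerate(messages) if m["role"] != "system"]
    return system + non_system[-3:]


roles = ["system", "user", "assistant", "user", "assistant", "user"]
print(breakpoint_indices([{"role": r} for r in roles]))  # [0, 3, 4, 5]
```

Short conversations simply use fewer than four breakpoints, since the slice returns whatever non-system messages exist.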

How It Works

apply_anthropic_cache_control() deep-copies the messages and injects cache_control markers:

# Cache marker format
marker = {"type": "ephemeral"}
# Or, for a 1-hour TTL:
marker = {"type": "ephemeral", "ttl": "1h"}

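Where the marker is placed depends on the content type:

Content type	Marker placement
String content	Converted to `[{"type": "text", "text": ..., "cache_control": ...}]`
List content	Added to the dict of the last element
None/empty	Added as `msg["cache_control"]`
Tool messages	Added as `msg["cache_control"]` (native Anthropic only)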
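The placement rules in the table can be sketched for a single message. This is an illustration, not the body of apply_anthropic_cache_control(): the function name, the order in which the cases are checked, and the choice to leave tool messages unmarked on non-native providers are assumptions.

```python
import copy


def add_cache_marker(msg: dict, ttl: str = "5m", native_anthropic: bool = True) -> dict:
    """Return a deep copy of `msg` carrying a cache_control marker."""
    marker = {"type": "ephemeral"}
    if ttl == "1h":
        marker["ttl"] = "1h"

    msg = copy.deepcopy(msg)  # never mutate the caller's history
    content = msg.get("content")

    if msg.get("role") == "tool":
        # Tool messages are marked at the message level, native Anthropic only.
        if native_anthropic:
            msg["cache_control"] = marker
    elif isinstance(content, str) and content:
        # String content becomes a single text block that holds the marker.
        msg["content"] = [{"type": "text", "text": content, "cache_control": marker}]
    elif isinstance(content, list) and content:
        # List content: mark the last block (assumed to be a dict).
        content[-1]["cache_control"] = marker
    else:
        # None or empty content: mark the message itself.
        msg["cache_control"] = marker
    return msg
```

Working on a deep copy means the stored conversation never accumulates stale markers as the rolling window moves forward.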

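Cache-Aware Design Patterns

Stable system prompt: the system prompt is breakpoint 1 and is cached across all turns. Avoid modifying it mid-conversation (compression appends a note only on the first compression).
Message order matters: cache hits require a matching prefix. Inserting or removing a message in the middle invalidates the cache for everything after it.
Compression and cache interaction: after compression, the cache for the compressed region is invalidated, but the system prompt cache survives. The rolling 3-message window re-establishes caching within 1-2 turns.
TTL selection: the default is 5m (5 minutes). Use 1h for long-running sessions where the user pauses for a long time between turns.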

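Enabling Prompt Caching

Prompt caching is enabled automatically when both conditions hold:

The model is an Anthropic Claude model (detected from the model name)
The provider supports cache_control (native Anthropic API or OpenRouter)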

# config.yaml: the TTL is configurable
model:
  cache_ttl: "5m"   # "5m" or "1h"

The CLI shows the cache status at startup:

💾 Prompt caching: ENABLED (Claude via OpenRouter, 5m TTL)

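Context Pressure Warning

When usage reaches 85% of the compression threshold (not 85% of the total context, but 85% of the threshold, which is itself 50% of the total context by default), the agent emits a context pressure warning: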

⚠️  Context is 85% to compaction threshold (42,500/50,000 tokens)

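After compression, the warning state is cleared if usage drops below 85% of the threshold. If compression fails to bring usage below the warning level (the conversation is too dense), the warning persists, but compression does not trigger again until usage exceeds the threshold once more.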
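The warning arithmetic can be checked with a short helper. The function name and parameters are hypothetical; the 85% warning level, the 50% default threshold, and the message wording follow the text (the leading emoji is omitted here).

```python
from typing import Optional


def pressure_warning(
    prompt_tokens: int,
    context_length: int,
    threshold: float = 0.50,
    warn_percent: int = 85,
) -> Optional[str]:
    """Return a warning once usage reaches warn_percent of the threshold."""
    threshold_tokens = round(context_length * threshold)
    # Integer comparison avoids floating-point edge cases at the boundary.
    if prompt_tokens * 100 < warn_percent * threshold_tokens:
        return None
    percent = prompt_tokens * 100 // threshold_tokens
    return (
        f"Context is {percent}% to compaction threshold "
        f"({prompt_tokens:,}/{threshold_tokens:,} tokens)"
    )


print(pressure_warning(42_500, 100_000))
# Context is 85% to compaction threshold (42,500/50,000 tokens)
```

The example uses a 100K-token context, whose 50% threshold is 50,000 tokens; that is the model size implied by the 42,500/50,000 figures in the sample warning.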
