
Chroma

Open-source embedding database for AI applications. Store embeddings and metadata, run vector and full-text search, and filter by metadata. Simple four-function API. Scales from a notebook environment to a production cluster. Suited to semantic search, RAG applications, and document retrieval. Best for local development and open-source projects.

Skill Metadata

Source: Optional (install with hermes skills install official/mlops/chroma)
Path: optional-skills/mlops/chroma
Version: 1.0.0
Author: Orchestra Research
License: MIT
Dependencies: chromadb, sentence-transformers
Tags: RAG, Chroma, Vector Database, Embeddings, Semantic Search, Open Source, Self-Hosted, Document Retrieval, Metadata Filtering

Reference: Full SKILL.md

Info

The following is the full skill definition that Hermes loads when this skill is triggered. These are the instructions the agent sees when the skill is activated.

Chroma - Open-Source Embedding Database

The AI-native database for building LLM applications with memory.

When to Use Chroma

Use Chroma when:

  • Building RAG (retrieval-augmented generation) applications
  • You need a local/self-hosted vector database
  • You want an open-source solution (Apache 2.0)
  • Prototyping in notebooks
  • Running semantic search over documents
  • Storing embeddings alongside metadata

Metrics

  • 24,300+ GitHub stars
  • 1,900+ forks
  • v1.3.3 (stable, weekly releases)
  • Apache 2.0 license

When to Use an Alternative

  • Pinecone: managed cloud service, automatic scaling
  • FAISS: pure similarity search, no metadata support
  • Weaviate: production-oriented, ML-native database
  • Qdrant: high performance, Rust-based

Quick Start

Installation

# Python
pip install chromadb

# JavaScript/TypeScript
npm install chromadb @chroma-core/default-embed

Basic Usage (Python)

import chromadb

# Create client (in-memory by default)
client = chromadb.Client()

# Create collection
collection = client.create_collection(name="my_collection")

# Add documents
collection.add(
    documents=["This is document 1", "This is document 2"],
    metadatas=[{"source": "doc1"}, {"source": "doc2"}],
    ids=["id1", "id2"]
)

# Query
results = collection.query(
    query_texts=["document about topic"],
    n_results=2
)

print(results)

Core Operations

1. Create Collections

# Simple collection
collection = client.create_collection("my_docs")

# With custom embedding function
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)

collection = client.create_collection(
    name="my_docs",
    embedding_function=openai_ef
)

# Get existing collection
collection = client.get_collection("my_docs")

# Delete collection
client.delete_collection("my_docs")

2. Add Documents

# Add documents with metadata (IDs are required and must be unique)
collection.add(
    documents=["Doc 1", "Doc 2", "Doc 3"],
    metadatas=[
        {"source": "web", "category": "tutorial"},
        {"source": "pdf", "page": 5},
        {"source": "api", "timestamp": "2025-01-01"}
    ],
    ids=["id1", "id2", "id3"]
)

# Add with custom embeddings
collection.add(
    embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],
    documents=["Doc 1", "Doc 2"],
    ids=["id1", "id2"]
)
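
Collections also expose an upsert method (insert new IDs, overwrite existing ones), which is useful when re-ingesting documents. A brief sketch, assuming the collection above:

# Upsert: "id1" is updated, "id4" is newly inserted
collection.upsert(
    documents=["Doc 1 (revised)", "Doc 4"],
    metadatas=[{"source": "web"}, {"source": "web"}],
    ids=["id1", "id4"]
)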

3. Query Documents

# Basic query
results = collection.query(
    query_texts=["machine learning tutorial"],
    n_results=5
)

# Query with filters
results = collection.query(
    query_texts=["Python programming"],
    n_results=3,
    where={"source": "web"}
)

# Query with combined metadata filters
results = collection.query(
    query_texts=["advanced topics"],
    where={
        "$and": [
            {"category": "tutorial"},
            {"difficulty": {"$gte": 3}}
        ]
    }
)

# Access results
print(results["documents"])   # List of matching documents
print(results["metadatas"])   # Metadata for each doc
print(results["distances"])   # Distances (lower = more similar)
print(results["ids"])         # Document IDs

4. Get Documents

# Get by IDs
docs = collection.get(
    ids=["id1", "id2"]
)

# Get with filters
docs = collection.get(
    where={"category": "tutorial"},
    limit=10
)

# Get all documents
docs = collection.get()

5. Update Documents

# Update document content and metadata
collection.update(
    ids=["id1"],
    documents=["Updated content"],
    metadatas=[{"source": "updated"}]
)

6. Delete Documents

# Delete by IDs
collection.delete(ids=["id1", "id2"])

# Delete with filter
collection.delete(
    where={"source": "outdated"}
)

Persistent Storage

# Persist to disk
client = chromadb.PersistentClient(path="./chroma_db")

collection = client.create_collection("my_docs")
collection.add(documents=["Doc 1"], ids=["id1"])

# Data persisted automatically
# Reload later with same path
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("my_docs")
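
To verify that the data actually survived the reload, collection.count() is a quick check (a small sketch, assuming the collection created above):

print(collection.count())  # 1 — the single document added before the reload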

Embedding Functions

Default (Sentence Transformers)

# Uses sentence-transformers by default
collection = client.create_collection("my_docs")
# Default model: all-MiniLM-L6-v2
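
To use a specific sentence-transformers model instead of the default, chromadb.utils.embedding_functions provides SentenceTransformerEmbeddingFunction; the model and collection names below are only examples:

from chromadb.utils import embedding_functions

# Explicitly pick a sentence-transformers model (example model name)
st_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)
collection = client.create_collection("st_docs", embedding_function=st_ef)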

OpenAI

from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)

collection = client.create_collection(
    name="openai_docs",
    embedding_function=openai_ef
)

HuggingFace

huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key="your-key",
    model_name="sentence-transformers/all-mpnet-base-v2"
)

collection = client.create_collection(
    name="hf_docs",
    embedding_function=huggingface_ef
)

Custom Embedding Functions

from chromadb import Documents, EmbeddingFunction, Embeddings

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Your embedding logic: return one vector per input document
        return embeddings

my_ef = MyEmbeddingFunction()
collection = client.create_collection(
    name="custom_docs",
    embedding_function=my_ef
)
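
As a concrete illustration, here is a minimal implementation wrapping sentence-transformers (one of this skill's dependencies). The class name and model choice are illustrative, not part of Chroma's API:

from chromadb import Documents, EmbeddingFunction, Embeddings
from sentence_transformers import SentenceTransformer

class STEmbeddingFunction(EmbeddingFunction):
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def __call__(self, input: Documents) -> Embeddings:
        # Encode each document and return plain Python lists of floats
        return self.model.encode(list(input)).tolist()

collection = client.create_collection(
    name="st_custom_docs",
    embedding_function=STEmbeddingFunction()
)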

Metadata Filtering

# Exact match
results = collection.query(
    query_texts=["query"],
    where={"category": "tutorial"}
)

# Comparison operators
results = collection.query(
    query_texts=["query"],
    where={"page": {"$gt": 10}}  # $gt, $gte, $lt, $lte, $ne
)

# Logical operators
results = collection.query(
    query_texts=["query"],
    where={
        "$and": [
            {"category": "tutorial"},
            {"difficulty": {"$lte": 3}}
        ]
    }  # Also: $or
)

# Membership: metadata value is one of the listed values
results = collection.query(
    query_texts=["query"],
    where={"tags": {"$in": ["python", "ml"]}}
)
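
Chroma can also filter on the document text itself (the full-text search mentioned in the overview) through where_document. A brief sketch:

# Filter on document content rather than metadata
results = collection.query(
    query_texts=["query"],
    where_document={"$contains": "vector database"}
)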

LangChain Integration

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
docs = text_splitter.split_documents(documents)

# Create Chroma vector store
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db"
)

# Query
results = vectorstore.similarity_search("machine learning", k=3)

# As retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

LlamaIndex Integration

from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex, StorageContext
import chromadb

# Initialize Chroma
db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_or_create_collection("my_collection")

# Create vector store
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create index
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context
)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("What is machine learning?")

Server Mode

# Run Chroma server
# Terminal: chroma run --path ./chroma_db --port 8000

# Connect to server
import chromadb
from chromadb.config import Settings

client = chromadb.HttpClient(
    host="localhost",
    port=8000,
    settings=Settings(anonymized_telemetry=False)
)

# Use as normal
collection = client.get_or_create_collection("my_docs")
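
To confirm the client can actually reach the server, a heartbeat call is a quick sanity check (a sketch against the client above):

# Returns a nanosecond timestamp when the server is reachable
print(client.heartbeat())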

Best Practices

  1. Use a persistent client - avoid losing data on restart
  2. Add metadata - enables filtering and provenance tracking
  3. Batch operations - add many documents in one call (see the sketch after this list)
  4. Choose an appropriate embedding model - balance speed and quality
  5. Use filters - narrow the search space
  6. Use unique IDs - avoid collisions
  7. Back up regularly - copy the chroma_db directory
  8. Monitor collection size - scale when necessary
  9. Test your embedding function - verify quality
  10. Use server mode in production - better suited to multi-user scenarios
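
A minimal sketch tying several of these practices together: a persistent client, batched adds with deterministic unique IDs, and metadata attached for later filtering. The batch size and the content-hash ID scheme are illustrative assumptions, not Chroma requirements.

import hashlib
import chromadb

# Persistent client so data survives restarts (practice 1)
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("articles")

docs = [
    {"text": "Intro to vector search", "source": "web"},
    {"text": "Scaling RAG pipelines", "source": "pdf"},
]

BATCH_SIZE = 500  # illustrative; tune for your workload

# Batched adds (practice 3) with metadata (practice 2) and unique IDs (practice 6)
for start in range(0, len(docs), BATCH_SIZE):
    batch = docs[start:start + BATCH_SIZE]
    collection.add(
        documents=[d["text"] for d in batch],
        metadatas=[{"source": d["source"]} for d in batch],
        ids=[hashlib.sha1(d["text"].encode()).hexdigest() for d in batch],
    )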

Performance

Operation | Latency | Notes
Add 100 documents | ~1-3 s | includes embedding computation
Query (top 10 results) | ~50-200 ms | depends on collection size
Metadata filtering | ~10-50 ms | fast with appropriate indexing
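
These figures depend on hardware and the embedding model; a quick, self-contained way to measure them on your own setup (collection name and sample texts are illustrative):

import time
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("perf_test")

docs = [f"Sample document number {i}" for i in range(100)]

start = time.perf_counter()
collection.add(documents=docs, ids=[f"perf-{i}" for i in range(100)])
print(f"Add 100 docs: {time.perf_counter() - start:.2f}s")  # includes embedding computation

start = time.perf_counter()
collection.query(query_texts=["sample document"], n_results=10)
print(f"Query top 10: {(time.perf_counter() - start) * 1000:.0f}ms")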

Resources