Chroma
The open-source embedding database for AI applications. Store embeddings and metadata, run vector and full-text search, and filter by metadata. A simple four-function API. Scales from a notebook to a production cluster. Suited to semantic search, RAG applications, and document retrieval. Best for local development and open-source projects.
Skill Metadata
| Source | Optional - install with hermes skills install official/mlops/chroma |
| Path | optional-skills/mlops/chroma |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | chromadb, sentence-transformers |
| Tags | RAG, Chroma, Vector Database, Embeddings, Semantic Search, Open Source, Self-Hosted, Document Retrieval, Metadata Filtering |
Reference: full SKILL.md
Info
Below is the full skill definition that Hermes loads when this skill is triggered. These are the instructions the agent sees when the skill is activated.
Chroma - Open-Source Embedding Database
An AI-native database for building LLM applications with memory.
When to Use Chroma
Use Chroma when:
- Building RAG (retrieval-augmented generation) applications
- You need a local, self-hosted vector database
- You want an open-source solution (Apache 2.0)
- Prototyping in notebooks
- Running semantic search over documents
- Storing embeddings together with metadata
Metrics:
- 24,300+ GitHub stars
- 1,900+ forks
- v1.3.3 (stable, weekly releases)
- Apache 2.0 license
Consider an alternative instead:
- Pinecone: managed cloud service with autoscaling
- FAISS: pure similarity search, no metadata support
- Weaviate: production-oriented, ML-native database
- Qdrant: high performance, built in Rust
Quick Start
Installation
# Python
pip install chromadb
# JavaScript/TypeScript
npm install chromadb @chroma-core/default-embed
Basic Usage (Python)
import chromadb
# Create client
client = chromadb.Client()
# Create collection
collection = client.create_collection(name="my_collection")
# Add documents
collection.add(
    documents=["This is document 1", "This is document 2"],
    metadatas=[{"source": "doc1"}, {"source": "doc2"}],
    ids=["id1", "id2"]
)
# Query
results = collection.query(
    query_texts=["document about topic"],
    n_results=2
)
print(results)
Core Operations
1. Create a Collection
# Simple collection
collection = client.create_collection("my_docs")
# With a custom embedding function
from chromadb.utils import embedding_functions
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)
collection = client.create_collection(
    name="my_docs",
    embedding_function=openai_ef
)
# Get an existing collection
collection = client.get_collection("my_docs")
# Delete a collection
client.delete_collection("my_docs")
2. Add Documents
# Add documents with metadata and explicit IDs
collection.add(
    documents=["Doc 1", "Doc 2", "Doc 3"],
    metadatas=[
        {"source": "web", "category": "tutorial"},
        {"source": "pdf", "page": 5},
        {"source": "api", "timestamp": "2025-01-01"}
    ],
    ids=["id1", "id2", "id3"]
)
# Add with precomputed embeddings
collection.add(
    embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],
    documents=["Doc 1", "Doc 2"],
    ids=["id1", "id2"]
)
3. Query (Similarity Search)
# Basic query
results = collection.query(
    query_texts=["machine learning tutorial"],
    n_results=5
)
# Query with a metadata filter
results = collection.query(
    query_texts=["Python programming"],
    n_results=3,
    where={"source": "web"}
)
# Query with combined metadata filters
results = collection.query(
    query_texts=["advanced topics"],
    where={
        "$and": [
            {"category": "tutorial"},
            {"difficulty": {"$gte": 3}}
        ]
    }
)
# Access results
print(results["documents"]) # List of matching documents
print(results["metadatas"]) # Metadata for each doc
print(results["distances"]) # Distances (lower = more similar)
print(results["ids"]) # Document IDs
4. Get Documents
# Get by IDs
docs = collection.get(
    ids=["id1", "id2"]
)
# Get with filters
docs = collection.get(
    where={"category": "tutorial"},
    limit=10
)
# Get all documents
docs = collection.get()
5. Update Documents
# Update document content and metadata
collection.update(
    ids=["id1"],
    documents=["Updated content"],
    metadatas=[{"source": "updated"}]
)
6. Delete Documents
# Delete by IDs
collection.delete(ids=["id1", "id2"])
# Delete with a filter
collection.delete(
    where={"source": "outdated"}
)
Persistent Storage
# Persist to disk
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.create_collection("my_docs")
collection.add(documents=["Doc 1"], ids=["id1"])
# Data persisted automatically
# Reload later with same path
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("my_docs")
Embedding Functions
Default (Sentence Transformers)
# Uses sentence-transformers by default
collection = client.create_collection("my_docs")
# Default model: all-MiniLM-L6-v2
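To be explicit about the default model, or to pin a different sentence-transformers model, chromadb exposes helpers in chromadb.utils.embedding_functions. A minimal sketch (collection name is illustrative; assumes the sentence-transformers package is installed):
from chromadb.utils import embedding_functions
# Explicit default embedding function (wraps all-MiniLM-L6-v2)
default_ef = embedding_functions.DefaultEmbeddingFunction()
# Or pin a specific sentence-transformers model
st_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
collection = client.create_collection(
    name="default_docs",  # illustrative name
    embedding_function=st_ef
)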
OpenAI
from chromadb.utils import embedding_functions
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)
collection = client.create_collection(
    name="openai_docs",
    embedding_function=openai_ef
)
HuggingFace
huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key="your-key",
    model_name="sentence-transformers/all-mpnet-base-v2"
)
collection = client.create_collection(
    name="hf_docs",
    embedding_function=huggingface_ef
)
Custom Embedding Function
from chromadb import Documents, EmbeddingFunction, Embeddings
class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Your embedding logic: return one vector per input document
        return [[0.0] * 384 for _ in input]  # placeholder vectors
my_ef = MyEmbeddingFunction()
collection = client.create_collection(
    name="custom_docs",
    embedding_function=my_ef
)
Metadata Filtering
# Exact match
results = collection.query(
    query_texts=["query"],
    where={"category": "tutorial"}
)
# Comparison operators
results = collection.query(
    query_texts=["query"],
    where={"page": {"$gt": 10}} # $gt, $gte, $lt, $lte, $ne
)
# Logical operators
results = collection.query(
    query_texts=["query"],
    where={
        "$and": [
            {"category": "tutorial"},
            {"difficulty": {"$lte": 3}}
        ]
    } # Also: $or
)
# Membership: metadata value is one of the listed values
results = collection.query(
    query_texts=["query"],
    where={"tags": {"$in": ["python", "ml"]}}
)
LangChain Integration
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Split documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
docs = text_splitter.split_documents(documents)
# Create Chroma vector store
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db"
)
# Query
results = vectorstore.similarity_search("machine learning", k=3)
# As retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
LlamaIndex Integration
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex, StorageContext
import chromadb
# Initialize Chroma
db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_or_create_collection("my_collection")
# Create vector store
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Create index
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context
)
# Query
query_engine = index.as_query_engine()
response = query_engine.query("What is machine learning?")
Server Mode
# Run the Chroma server
# Terminal: chroma run --path ./chroma_db --port 8000
# Connect to the server
import chromadb
from chromadb.config import Settings
client = chromadb.HttpClient(
    host="localhost",
    port=8000,
    settings=Settings(anonymized_telemetry=False)
)
# Use as normal
collection = client.get_or_create_collection("my_docs")
Best Practices
- Use a persistent client - avoid losing data across restarts
- Add metadata - enables filtering and provenance tracking
- Batch operations - add many documents in one call (see the sketch after this list)
- Pick an appropriate embedding model - balance speed and quality
- Use filters - narrow the search space
- Use unique IDs - avoid collisions
- Back up regularly - copy the chroma_db directory
- Monitor collection size - scale when necessary
- Test your embedding function - verify quality
- Use server mode in production - better for multi-user scenarios
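A minimal sketch combining several of these practices (persistent client, a single batched add, deterministic unique IDs); the collection name and ID scheme are illustrative assumptions:
import hashlib
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")   # survives restarts
collection = client.get_or_create_collection("kb_docs")  # illustrative name
docs = ["Doc 1", "Doc 2", "Doc 3"]
metadatas = [{"source": "web"}, {"source": "pdf"}, {"source": "api"}]
# Content-derived IDs help avoid accidental duplicate entries
ids = [hashlib.sha1(d.encode("utf-8")).hexdigest() for d in docs]
# One batched call instead of one add() per document
collection.add(documents=docs, metadatas=metadatas, ids=ids)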
Performance
| Operation | Latency | Notes |
|---|---|---|
| Add 100 documents | ~1-3 s | Includes embedding computation |
| Query (top 10 results) | ~50-200 ms | Depends on collection size |
| Metadata filtering | ~10-50 ms | Fast with appropriate indexing |
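These numbers vary with hardware, embedding model, and collection size; a rough, illustrative way to check them on your own setup (collection name is an assumption):
import time
import chromadb
client = chromadb.Client()
collection = client.create_collection("perf_test")  # illustrative name
docs = [f"sample document {i}" for i in range(100)]
ids = [f"id{i}" for i in range(100)]
t0 = time.perf_counter()
collection.add(documents=docs, ids=ids)  # includes embedding computation
print(f"add 100 docs: {time.perf_counter() - t0:.2f}s")
t0 = time.perf_counter()
collection.query(query_texts=["sample"], n_results=10)
print(f"query top-10: {(time.perf_counter() - t0) * 1000:.0f}ms")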
Resources
- GitHub: https://github.com/chroma-core/chroma ⭐ 24,300+
- Documentation: https://docs.trychroma.com
- Discord: https://discord.gg/MMeYNTmh3x
- Version: 1.3.3+
- License: Apache 2.0