
Chroma

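Open-source embedding database for AI applications. It stores embeddings and metadata, runs vector and full-text search, and filters by metadata. The API is a simple set of four functions. It scales from notebook environments to production clusters. Suited to semantic search, RAG applications, and document retrieval. Best for local development and open-source projects.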

Skill metadata

Source: Optional. Install with hermes skills install official/mlops/chroma
Path: optional-skills/mlops/chroma
Version: 1.0.0
Author: Orchestra Research
License: MIT
Dependencies: chromadb, sentence-transformers
Tags: RAG, Chroma, Vector Database, Embeddings, Semantic Search, Open Source, Self-Hosted, Document Retrieval, Metadata Filtering

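Reference: full SKILL.md

Info

The following is the complete skill definition that Hermes loads when this skill is triggered. These are the instructions the agent sees while the skill is active.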

Chroma - Open-Source Embedding Database

An AI-native database for building LLM applications with memory.

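When to use Chroma

Use Chroma when:

  • You are building RAG (retrieval-augmented generation) applications
  • You need a local or self-hosted vector database
  • You want an open-source solution (Apache 2.0)
  • You are prototyping in notebooks
  • You need semantic search over documents
  • You are storing embeddings together with metadata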

Metrics

  • 24,300+ GitHub stars
  • 1,900+ forks
  • v1.3.3 (stable, weekly releases)
  • Apache 2.0 license

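Use an alternative instead

  • Pinecone: managed cloud service with auto-scaling
  • FAISS: pure similarity search, no metadata support
  • Weaviate: production-oriented, ML-native database
  • Qdrant: high performance, Rust-based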

Quick start

Installation

# Python
pip install chromadb

# JavaScript/TypeScript
npm install chromadb @chroma-core/default-embed

Basic usage (Python)

import chromadb

# Create an in-memory client (data is lost when the process exits)
client = chromadb.Client()

# Create collection
collection = client.create_collection(name="my_collection")

# Add documents
collection.add(
    documents=["This is document 1", "This is document 2"],
    metadatas=[{"source": "doc1"}, {"source": "doc2"}],
    ids=["id1", "id2"]
)

# Query
results = collection.query(
    query_texts=["document about topic"],
    n_results=2
)

print(results)

Core operations

1. Create a collection

# Simple collection
collection = client.create_collection("my_docs")

# With custom embedding function
# (an alternative to the simple call above; creating the same name twice raises an error)
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)

collection = client.create_collection(
    name="my_docs",
    embedding_function=openai_ef
)

# Get existing collection
collection = client.get_collection("my_docs")

# Delete collection
client.delete_collection("my_docs")

2. Add documents

# Add documents with metadata and explicit IDs
collection.add(
    documents=["Doc 1", "Doc 2", "Doc 3"],
    metadatas=[
        {"source": "web", "category": "tutorial"},
        {"source": "pdf", "page": 5},
        {"source": "api", "timestamp": "2025-01-01"}
    ],
    ids=["id1", "id2", "id3"]
)

# Add with custom embeddings ("..." stands for the remaining vector components)
collection.add(
    embeddings=[[0.1, 0.2, ...], [0.3, 0.4, ...]],
    documents=["Doc 1", "Doc 2"],
    ids=["id1", "id2"]
)

3. Query documents

# Basic query
results = collection.query(
    query_texts=["machine learning tutorial"],
    n_results=5
)

# Query with filters
results = collection.query(
    query_texts=["Python programming"],
    n_results=3,
    where={"source": "web"}
)

# Query with metadata filters
results = collection.query(
    query_texts=["advanced topics"],
    where={
        "$and": [
            {"category": "tutorial"},
            {"difficulty": {"$gte": 3}}
        ]
    }
)

# Access results (each field holds one inner list per query text)
print(results["documents"])  # Matching documents
print(results["metadatas"])  # Metadata for each document
print(results["distances"])  # Distances (lower means more similar)
print(results["ids"])        # Document IDs
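
Each field in a query result is a list of lists: the outer list has one entry per query text, the inner list one entry per hit. A minimal sketch of turning those parallel lists into one record per hit follows. The helper name rows_for_query and the distance values in the sample are made up for illustration; only the nested shape reflects what the query call returns.

```python
from typing import Any


def rows_for_query(results: dict[str, Any], query_index: int = 0) -> list[dict[str, Any]]:
    """Zip the parallel result lists for one query into one dict per hit."""
    ids = results["ids"][query_index]
    rows = []
    for position, doc_id in enumerate(ids):
        row: dict[str, Any] = {"id": doc_id}
        for key in ("documents", "metadatas", "distances"):
            column = results.get(key)
            # Assumption: a field that was not requested is missing or None.
            row[key[:-1]] = column[query_index][position] if column else None
        rows.append(row)
    return rows


# Hand-written sample in the same nested shape (distances are placeholders).
example = {
    "ids": [["id1", "id2"]],
    "documents": [["This is document 1", "This is document 2"]],
    "metadatas": [[{"source": "doc1"}, {"source": "doc2"}]],
    "distances": [[0.25, 0.75]],
}

rows = rows_for_query(example)
for row in rows:
    print(row["id"], row["distance"], row["document"])
```

With a real collection, the dictionary returned by collection.query could be passed in place of the hand-written sample.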

4. Get documents

# Get by IDs
docs = collection.get(
    ids=["id1", "id2"]
)

# Get with filters
docs = collection.get(
    where={"category": "tutorial"},
    limit=10
)

# Get all documents
docs = collection.get()

5. Update documents

# Update document content
collection.update(
    ids=["id1"],
    documents=["Updated content"],
    metadatas=[{"source": "updated"}]
)

6. Delete documents

# Delete by IDs
collection.delete(ids=["id1", "id2"])

# Delete with filter
collection.delete(
    where={"source": "outdated"}
)

Persistent storage

# Persist to disk
client = chromadb.PersistentClient(path="./chroma_db")

collection = client.create_collection("my_docs")
collection.add(documents=["Doc 1"], ids=["id1"])

# Data persisted automatically
# Reload later with same path
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("my_docs")
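
Because the persistent client keeps its data under the given path, a backup can be sketched as a copy of that directory. The example below is a rough illustration only: the helper name backup_directory and the timestamped folder layout are assumptions, and it presumes no process is writing to the database while the copy runs. It uses a stand-in directory so that it runs without Chroma installed.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def backup_directory(db_path: Path, backup_root: Path) -> Path:
    """Copy db_path into a timestamped folder under backup_root and return it."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = backup_root / f"{db_path.name}-{stamp}"
    shutil.copytree(db_path, target)  # creates backup_root if it is missing
    return target


# Demo with a stand-in directory instead of a real ./chroma_db folder.
with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "chroma_db"
    db_path.mkdir()
    (db_path / "placeholder.bin").write_bytes(b"data")

    backup = backup_directory(db_path, Path(tmp) / "backups")
    copied = sorted(p.name for p in backup.iterdir())
    print(backup.name, copied)
```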

Embedding functions

Default (Sentence Transformers)

# Uses sentence-transformers by default
collection = client.create_collection("my_docs")
# Default model: all-MiniLM-L6-v2

OpenAI

from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)

collection = client.create_collection(
    name="openai_docs",
    embedding_function=openai_ef
)

HuggingFace

from chromadb.utils import embedding_functions

huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key="your-key",
    model_name="sentence-transformers/all-mpnet-base-v2"
)

collection = client.create_collection(
    name="hf_docs",
    embedding_function=huggingface_ef
)

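Custom embedding function

from chromadb import Documents, EmbeddingFunction, Embeddings

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Your embedding logic goes here: return one vector per input document.
        # Placeholder so the class runs as written; replace with real vectors.
        embeddings = [[0.0, 0.0, 0.0] for _ in input]
        return embeddings

my_ef = MyEmbeddingFunction()
collection = client.create_collection(
    name="custom_docs",
    embedding_function=my_ef
)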
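
The contract a custom embedding function has to meet is small: a list of strings goes in, and one fixed-length vector per string comes out. The standalone sketch below illustrates that contract with a toy hashing embedder. The class name HashingEmbedder and the bucket scheme are invented for illustration and carry no semantic meaning; a real function would call an embedding model and would subclass EmbeddingFunction as in the snippet above.

```python
import hashlib
import math


class HashingEmbedder:
    """Toy embedder: hash each token into one of `dim` buckets, then L2-normalize."""

    def __init__(self, dim: int = 16) -> None:
        self.dim = dim

    def __call__(self, input: list[str]) -> list[list[float]]:
        embeddings = []
        for text in input:
            vector = [0.0] * self.dim
            for token in text.lower().split():
                digest = hashlib.sha256(token.encode("utf-8")).digest()
                bucket = int.from_bytes(digest[:4], "big") % self.dim
                vector[bucket] += 1.0
            # Guard against empty input text, which would give a zero norm.
            norm = math.sqrt(sum(v * v for v in vector)) or 1.0
            embeddings.append([v / norm for v in vector])
        return embeddings


embedder = HashingEmbedder(dim=16)
vectors = embedder(["This is document 1", "This is document 2"])
print(len(vectors), len(vectors[0]))
```

The hash is deterministic, so the same text always maps to the same vector, which makes a sketch like this handy for checking plumbing before wiring in a real model.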

Metadata filtering

# Exact match
results = collection.query(
    query_texts=["query"],
    where={"category": "tutorial"}
)

# Comparison operators
results = collection.query(
    query_texts=["query"],
    where={"page": {"$gt": 10}}  # $gt, $gte, $lt, $lte, $ne
)

# Logical operators
results = collection.query(
    query_texts=["query"],
    where={
        "$and": [
            {"category": "tutorial"},
            {"difficulty": {"$lte": 3}}
        ]
    }  # Also: $or
)

# Inclusion: the metadata value is one of the listed options (also: $nin)
results = collection.query(
    query_texts=["query"],
    where={"tags": {"$in": ["python", "ml"]}}
)
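
To make the filter semantics concrete, here is a small pure-Python evaluator for the same where syntax. It is an illustration of how the operators combine, not Chroma's implementation: the function name matches is hypothetical, and treating a missing metadata key as a non-match is a simplifying assumption.

```python
from typing import Any

# Comparison operators applied as op(metadata_value, filter_target).
COMPARISONS = {
    "$eq": lambda value, target: value == target,
    "$ne": lambda value, target: value != target,
    "$gt": lambda value, target: value > target,
    "$gte": lambda value, target: value >= target,
    "$lt": lambda value, target: value < target,
    "$lte": lambda value, target: value <= target,
    "$in": lambda value, target: value in target,
    "$nin": lambda value, target: value not in target,
}


def matches(metadata: dict[str, Any], where: dict[str, Any]) -> bool:
    """Return True when metadata satisfies every condition in where."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, clause) for clause in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, clause) for clause in condition):
                return False
        elif isinstance(condition, dict):
            if key not in metadata:  # simplification: missing key never matches
                return False
            for operator, target in condition.items():
                if not COMPARISONS[operator](metadata[key], target):
                    return False
        elif metadata.get(key) != condition:  # bare value means exact match
            return False
    return True


records = [
    {"category": "tutorial", "difficulty": 2, "page": 12},
    {"category": "tutorial", "difficulty": 5, "page": 3},
    {"category": "reference", "difficulty": 1, "page": 40},
]
where = {"$and": [{"category": "tutorial"}, {"difficulty": {"$lte": 3}}]}
selected = [record for record in records if matches(record, where)]
print(selected)
```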

LangChain integration

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split documents (`documents` is a list of LangChain Document objects loaded beforehand)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
docs = text_splitter.split_documents(documents)

# Create Chroma vector store
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db"
)

# Query
results = vectorstore.similarity_search("machine learning", k=3)

# As retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

LlamaIndex integration

from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex, StorageContext
import chromadb

# Initialize Chroma
db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_or_create_collection("my_collection")

# Create vector store
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create index (`documents` is a list of LlamaIndex Document objects loaded beforehand)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context
)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("What is machine learning?")

Server mode

# Run Chroma server
# Terminal: chroma run --path ./chroma_db --port 8000

# Connect to server
import chromadb
from chromadb.config import Settings

client = chromadb.HttpClient(
    host="localhost",
    port=8000,
    settings=Settings(anonymized_telemetry=False)
)

# Use as normal
collection = client.get_or_create_collection("my_docs")

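Best practices

  1. Use a persistent client - avoids losing data on restart
  2. Add metadata - enables filtering and tracking
  3. Batch operations - add multiple documents in one call
  4. Choose a suitable embedding model - balance speed and quality
  5. Use filters - narrow the search space
  6. Use unique IDs - avoid collisions
  7. Back up regularly - copy the chroma_db directory
  8. Monitor collection size - scale up when needed
  9. Test embedding functions - verify quality
  10. Use server mode in production - better suited to multi-user setups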
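
Batching and unique IDs can be sketched together. The helpers below, batched and stable_id, are hypothetical names: one splits a document list into fixed-size chunks, the other derives an ID from the document text. Content-derived IDs are one possible scheme and an assumption here; note that two documents with identical text would receive the same ID.

```python
import hashlib
from collections.abc import Iterator


def stable_id(text: str) -> str:
    """Derive a repeatable ID from the document text (identical texts collide)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


documents = [f"Doc {n}" for n in range(1, 8)]
batches = list(batched(documents, batch_size=3))
ids_per_batch = [[stable_id(doc) for doc in batch] for batch in batches]

# With a real collection, each batch would go into a single add call, e.g.:
# collection.add(documents=batch, ids=[stable_id(doc) for doc in batch])
print([len(batch) for batch in batches])
```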

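Performance

Operation | Latency | Notes
Add 100 documents | ~1-3 s | Includes embedding computation
Query (top 10 results) | ~50-200 ms | Depends on collection size
Metadata filtering | ~10-50 ms | Fast with proper indexing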
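
Latency figures like these vary with hardware, embedding model, and collection size, so it can help to measure them on your own setup. The following is a minimal timing sketch; the helper name time_call is hypothetical, and the workload is a stand-in so that the example runs without Chroma installed.

```python
import statistics
import time
from collections.abc import Callable


def time_call(operation: Callable[[], object], repeats: int = 5) -> dict[str, float]:
    """Run operation several times and report median and max latency in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {"median_ms": statistics.median(samples), "max_ms": max(samples)}


# Stand-in workload. With a real collection this could instead be:
# time_call(lambda: collection.query(query_texts=["machine learning tutorial"], n_results=10))
stats = time_call(lambda: sum(range(10_000)))
print(stats)
```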

Resources