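CLIP
OpenAI's model connecting vision and language. Supports zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400 million image-text pairs. Use it for image search, content moderation, or vision-language tasks without fine-tuning. Best suited to general-purpose image understanding.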
Skill metadata
| Source | Optional; install with hermes skills install official/mlops/clip |
| Path | optional-skills/mlops/clip |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | transformers, torch, pillow |
| Tags | Multimodal, CLIP, Vision-Language, Zero-Shot, Image Classification, OpenAI, Image Search, Cross-Modal Retrieval, Content Moderation |
Reference: full SKILL.md
Info
Below is the complete skill definition that Hermes loads when this skill is triggered. These are the instructions the agent sees when the skill is active.
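CLIP - Contrastive Language-Image Pre-Training
OpenAI's model for understanding images through natural language.
When to use CLIP
Use when:
- Zero-shot image classification (no training data required)
- Image-text similarity and matching
- Semantic image search
- Content moderation (detecting NSFW or violent content)
- Visual question answering
- Cross-modal retrieval (image→text, text→image)
Metrics:
- 25,300+ GitHub stars
- Trained on 400 million image-text pairs
- Matches ResNet-50 on ImageNet (zero-shot)
- MIT License
Use an alternative instead:
- BLIP-2: better image captioning
- LLaVA: vision-language chat
- Segment Anything: image segmentation
Quick start
Installation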
pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision ftfy regex tqdm
Zero-shot classification
import torch
import clip
from PIL import Image
# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
# Load image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
# Define possible labels
text = clip.tokenize(["a dog", "a cat", "a bird", "a car"]).to(device)
# Compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Scaled cosine-similarity logits (image-to-text, text-to-image)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
# Print results
labels = ["a dog", "a cat", "a bird", "a car"]
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")
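The probabilities printed above come from a softmax over scaled cosine similarities. The following is a minimal pure-Python sketch of that step. It assumes a logit scale of 100 (the upper bound CLIP clamps its learned temperature to; the actual value for a checkpoint is model.logit_scale.exp()), and the similarity values are made up for illustration.

```python
import math


def softmax_from_similarities(similarities, logit_scale=100.0):
    """Turn cosine similarities into label probabilities."""
    logits = [logit_scale * s for s in similarities]
    # Subtract the maximum before exponentiating for numerical stability
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cosine similarities for the four labels used above
example_labels = ["a dog", "a cat", "a bird", "a car"]
example_similarities = [0.28, 0.25, 0.21, 0.19]
example_probs = softmax_from_similarities(example_similarities)
for label, p in zip(example_labels, example_probs):
    print(f"{label}: {p:.2%}")
```

Because of the large scale factor, small gaps in cosine similarity turn into confident-looking probabilities, so the output is best read as a ranking over the supplied labels rather than as a calibrated confidence.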
Available models
# Models (sorted by size)
models = [
    "RN50",      # ResNet-50
    "RN101",     # ResNet-101
    "ViT-B/32",  # Vision Transformer (recommended)
    "ViT-B/16",  # Better quality, slower
    "ViT-L/14",  # Best quality, slowest
]
model, preprocess = clip.load("ViT-B/32")
| Model | Parameters | Speed | Quality |
|---|---|---|---|
| RN50 | 102M | Fast | Good |
| ViT-B/32 | 151M | Medium | Better |
| ViT-L/14 | 428M | Slow | Best |
Image-text similarity
# Compute embeddings for one image and one text prompt
# (.item() below requires a single similarity value)
text = clip.tokenize(["a dog"]).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
# Normalize
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
# Cosine similarity
similarity = (image_features @ text_features.T).item()
print(f"Similarity: {similarity:.4f}")
Semantic image search
# Index images
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = []
for img_path in image_paths:
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image)
        embedding /= embedding.norm(dim=-1, keepdim=True)
    image_embeddings.append(embedding)
image_embeddings = torch.cat(image_embeddings)
# Search with text query
query = "a sunset over the ocean"
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text_input)
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)
# Find most similar images
similarities = (text_embedding @ image_embeddings.T).squeeze(0)
top_k = similarities.topk(3)
for idx, score in zip(top_k.indices, top_k.values):
    print(f"{image_paths[idx]}: {score:.3f}")
Content moderation
# Define categories
categories = [
    "safe for work",
    "not safe for work",
    "violent content",
    "graphic content"
]
text = clip.tokenize(categories).to(device)
# Check image
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)
# Get classification
max_idx = probs.argmax().item()
max_prob = probs[0, max_idx].item()
print(f"Category: {categories[max_idx]} ({max_prob:.2%})")
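Since softmax probabilities always sum to 1, taking the argmax assigns every image to one of the listed categories, even when none of them fits well. One possible refinement is to act only on confident predictions and route the rest to a human. The sketch below illustrates that idea; the function name moderate, the 0.8 threshold, and the "review" fallback are illustrative assumptions, not part of CLIP.

```python
def moderate(probs, categories, safe_label="safe for work", threshold=0.8):
    """Return (decision, category, probability) for one image.

    probs: per-category probabilities for a single image,
           for example probs[0].tolist() from the snippet above.
    """
    best_idx = max(range(len(probs)), key=lambda i: probs[i])
    best_label, best_prob = categories[best_idx], probs[best_idx]
    if best_prob < threshold:
        # Not confident enough to decide automatically
        return ("review", best_label, best_prob)
    decision = "allow" if best_label == safe_label else "block"
    return (decision, best_label, best_prob)


moderation_categories = [
    "safe for work",
    "not safe for work",
    "violent content",
    "graphic content",
]
# Hypothetical probability vectors for two images
confident = moderate([0.91, 0.05, 0.03, 0.01], moderation_categories)
uncertain = moderate([0.40, 0.35, 0.15, 0.10], moderation_categories)
print(confident)
print(uncertain)
```

The threshold would need tuning on labeled examples from the target domain before being relied on.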
Batch processing
# Process multiple images
images = [preprocess(Image.open(f"img{i}.jpg")) for i in range(10)]
images = torch.stack(images).to(device)
with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)
# Batch text
texts = ["a dog", "a cat", "a bird"]
text_tokens = clip.tokenize(texts).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)
# Similarity matrix (10 images × 3 texts)
similarities = image_features @ text_features.T
print(similarities.shape) # (10, 3)
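Stacking every image into a single tensor works for ten files but may exhaust memory for a large collection. A minimal sketch of splitting the file list into fixed-size chunks follows; the helper name batched and the batch size are hypothetical choices for illustration.

```python
def batched(items, batch_size=32):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


paths = [f"img{i}.jpg" for i in range(10)]
batches = list(batched(paths, batch_size=4))
print([len(b) for b in batches])
```

Each chunk can then be preprocessed, stacked with torch.stack, encoded as shown above, and the per-chunk features joined with torch.cat.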
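Vector database integration
# Store CLIP embeddings in Chroma (the same vectors could go into FAISS)
import chromadb
client = chromadb.Client()
collection = client.create_collection("image_embeddings")
# Add image embeddings (normalized vectors from the search example)
for img_path, embedding in zip(image_paths, image_embeddings):
    collection.add(
        embeddings=[embedding.cpu().numpy().tolist()],
        metadatas=[{"path": img_path}],
        ids=[img_path]
    )
# Query with text: encode without gradients and normalize like the image vectors
query = "a sunset"
with torch.no_grad():
    text_embedding = model.encode_text(clip.tokenize([query]).to(device))
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)
results = collection.query(
    # text_embedding has shape (1, dim), so tolist() already yields a list of vectors
    query_embeddings=text_embedding.cpu().numpy().tolist(),
    n_results=5
)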
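Best practices
- Use ViT-B/32 in most cases - a good balance of speed and quality
- Normalize embeddings - required for cosine similarity
- Process in batches - more efficient
- Cache embeddings - recomputing them is expensive
- Use descriptive labels - better zero-shot performance
- Prefer a GPU - 10-50x faster
- Preprocess images - use the preprocess function returned by clip.load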
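As an illustration of the caching advice, here is a minimal in-memory sketch keyed by a hash of the image bytes. The class name EmbeddingCache and the embed_fn callable are hypothetical; in practice embed_fn would wrap preprocess and model.encode_image, and the store might be a file or database rather than a dict.

```python
import hashlib


class EmbeddingCache:
    """In-memory cache keyed by the SHA-256 hash of the image bytes."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # hypothetical callable: bytes -> list of floats
        self._store = {}
        self.misses = 0

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            # Only compute the embedding the first time this content is seen
            self.misses += 1
            self._store[key] = self.embed_fn(image_bytes)
        return self._store[key]


def fake_embed(image_bytes):
    """Stand-in embedder for demonstration only; not a real CLIP call."""
    return [float(len(image_bytes)), float(image_bytes[0])]


cache = EmbeddingCache(fake_embed)
first = cache.get(b"abc")
second = cache.get(b"abc")   # served from the cache
third = cache.get(b"xyz!")   # different content, computed once
print(cache.misses)
```

Hashing the content rather than the path means renamed or duplicated files reuse the same embedding.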
Performance
| Operation | CPU | GPU (V100) |
|---|---|---|
| Image encoding | ~200ms | ~20ms |
| Text encoding | ~50ms | ~5ms |
| Similarity computation | <1ms | <1ms |
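To put the per-image figures in the table into perspective, the arithmetic below gives a rough estimate of how long indexing a collection would take. It assumes strictly sequential encoding at the quoted latencies and ignores image loading, preprocessing, and any batching gains; the collection size of 100,000 is an arbitrary example.

```python
def estimate_indexing_seconds(num_images, ms_per_image):
    """Estimated seconds to encode `num_images` one after another."""
    return num_images * ms_per_image / 1000.0


cpu_seconds = estimate_indexing_seconds(100_000, 200)  # ~200 ms per image on CPU
gpu_seconds = estimate_indexing_seconds(100_000, 20)   # ~20 ms per image on GPU
print(f"CPU: {cpu_seconds / 3600:.1f} h, GPU: {gpu_seconds / 60:.1f} min")
```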
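Limitations
- Not suited to fine-grained tasks - works best with broad categories
- Needs descriptive text - vague labels perform poorly
- Trained on web data - may carry dataset biases
- No bounding boxes - whole-image only
- Limited spatial understanding - weak at position and counting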
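One common way to work around the need for descriptive text is to wrap bare class names in prompt templates such as "a photo of a {}", a pattern used in the CLIP repository's own examples. The sketch below shows the string handling only; the helper name build_prompts and the second template are illustrative assumptions.

```python
def build_prompts(class_names, templates=("a photo of a {}",)):
    """Return {class_name: [prompt, ...]} by filling each template."""
    return {name: [t.format(name) for t in templates] for name in class_names}


prompts = build_prompts(
    ["dog", "cat"],
    templates=("a photo of a {}", "a close-up photo of a {}"),
)
# Flatten in class order so the prompts can be tokenized in one call
flat_prompts = [p for group in prompts.values() for p in group]
print(flat_prompts)
```

The flattened list can be passed to clip.tokenize; when several templates are used per class, averaging the resulting text embeddings for each class is one option for combining them.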
Resources
- GitHub: https://github.com/openai/CLIP ⭐ 25,300+
- Paper: https://arxiv.org/abs/2103.00020
- Colab: https://colab.research.google.com/github/openai/clip/
- License: MIT