
CLIP

OpenAI's model connecting vision and language. Supports zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400 million image-text pairs. Suited for image search, content moderation, and other vision-language tasks without fine-tuning. Best for general-purpose image understanding.

Skill Metadata

Source: Optional — install with hermes skills install official/mlops/clip
Path: optional-skills/mlops/clip
Version: 1.0.0
Author: Orchestra Research
License: MIT
Dependencies: transformers, torch, pillow
Tags: Multimodal, CLIP, Vision-Language, Zero-Shot, Image Classification, OpenAI, Image Search, Cross-Modal Retrieval, Content Moderation

Reference: full SKILL.md

Info

The following is the complete skill definition that Hermes loads when this skill is triggered. These are the instructions the agent sees when the skill is activated.

CLIP - Contrastive Language-Image Pre-Training

OpenAI's model for understanding images through natural language.

When to Use CLIP

Use when you need:

  • Zero-shot image classification (no training data required)
  • Image-text similarity / matching
  • Semantic image search
  • Content moderation (detecting explicit or violent content)
  • Visual question answering
  • Cross-modal retrieval (image→text, text→image)

Metrics

  • 25,300+ GitHub stars
  • Trained on 400 million image-text pairs
  • Zero-shot ImageNet performance on par with ResNet-50
  • MIT license

Alternatives

  • BLIP-2: better image captioning
  • LLaVA: vision-language chat
  • Segment Anything: image segmentation

Quick Start

Installation

pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision ftfy regex tqdm
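The skill's dependency list above also includes transformers, so the same checkpoints can alternatively be loaded through Hugging Face instead of the openai/CLIP package. A minimal sketch, assuming the openai/clip-vit-base-patch32 checkpoint:

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load CLIP ViT-B/32 from the Hugging Face Hub
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Score an image against candidate labels
image = Image.open("photo.jpg")
inputs = processor(text=["a dog", "a cat"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)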

Zero-Shot Classification

import torch
import clip
from PIL import Image

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Define possible labels
text = clip.tokenize(["a dog", "a cat", "a bird", "a car"]).to(device)

# Compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Cosine similarity
logits_per_image, logits_per_text = model(image, text)
probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print results
labels = ["a dog", "a cat", "a bird", "a car"]
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")

Available Models

# Models (sorted by size)
models = [
    "RN50",      # ResNet-50
    "RN101",     # ResNet-101
    "ViT-B/32",  # Vision Transformer (recommended)
    "ViT-B/16",  # Better quality, slower
    "ViT-L/14",  # Best quality, slowest
]

model, preprocess = clip.load("ViT-B/32")
Model      Params   Speed    Quality
RN50       102M     Fast     Good
ViT-B/32   151M     Medium   Better
ViT-L/14   428M     Slowest  Best
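
The openai/CLIP package can also list its bundled checkpoints at runtime, which is a quick way to see every name accepted by clip.load:

import clip

# Names accepted by clip.load(); includes the RN* and ViT-* variants above
print(clip.available_models())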

Image-Text Similarity

# Compute embeddings
image_features = model.encode_image(image)
text_features = model.encode_text(text)

# Normalize
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity (.item() assumes a single image and a single text prompt)
similarity = (image_features @ text_features.T).item()
print(f"Similarity: {similarity:.4f}")

Semantic Image Search

# Index images
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = []

for img_path in image_paths:
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image)
        embedding /= embedding.norm(dim=-1, keepdim=True)
    image_embeddings.append(embedding)

image_embeddings = torch.cat(image_embeddings)

# Search with text query
query = "a sunset over the ocean"
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text_input)
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

# Find most similar images
similarities = (text_embedding @ image_embeddings.T).squeeze(0)
top_k = similarities.topk(3)

for idx, score in zip(top_k.indices, top_k.values):
    print(f"{image_paths[idx]}: {score:.3f}")

Content Moderation

# Define categories
categories = [
    "safe for work",
    "not safe for work",
    "violent content",
    "graphic content"
]

text = clip.tokenize(categories).to(device)

# Check image
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# Get classification
max_idx = probs.argmax().item()
max_prob = probs[0, max_idx].item()

print(f"Category: {categories[max_idx]} ({max_prob:.2%})")

Batch Processing

# Process multiple images
images = [preprocess(Image.open(f"img{i}.jpg")) for i in range(10)]
images = torch.stack(images).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Batch text
texts = ["a dog", "a cat", "a bird"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity matrix (10 images × 3 texts)
similarities = image_features @ text_features.T
print(similarities.shape) # (10, 3)
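
To turn the similarity matrix into per-image predictions, one option is to take the best-matching text for each row; a minimal sketch reusing the variables above:

# Best-matching label per image (argmax over the text axis)
best_idx = similarities.argmax(dim=-1)
for i, idx in enumerate(best_idx.tolist()):
    print(f"img{i}.jpg -> {texts[idx]}")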

Integration with Vector Databases

# Store CLIP embeddings in Chroma/FAISS
import chromadb

client = chromadb.Client()
collection = client.create_collection("image_embeddings")

# Add image embeddings
for img_path, embedding in zip(image_paths, image_embeddings):
    collection.add(
        embeddings=[embedding.cpu().numpy().tolist()],
        metadatas=[{"path": img_path}],
        ids=[img_path]
    )

# Query with text (normalize so the space matches the stored image embeddings)
query = "a sunset"
with torch.no_grad():
    text_embedding = model.encode_text(clip.tokenize([query]).to(device))
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

results = collection.query(
    query_embeddings=text_embedding.cpu().numpy().tolist(),  # (1, dim) -> list of one vector
    n_results=5
)
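
The comment above also mentions FAISS; here is a minimal sketch of the same search with faiss-cpu (assumed installed), reusing the normalized image_embeddings and text_embedding from the image search example. With unit-norm vectors, inner product equals cosine similarity:

import faiss

# Build an inner-product index over the normalized CLIP image embeddings
vectors = image_embeddings.cpu().numpy().astype("float32")
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

# Search with the normalized text embedding
query_vec = text_embedding.cpu().numpy().astype("float32")
scores, indices = index.search(query_vec, 3)

for idx, score in zip(indices[0], scores[0]):
    print(f"{image_paths[idx]}: {score:.3f}")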

Best Practices

  1. Use ViT-B/32 for most cases - a good balance of speed and quality
  2. Normalize embeddings - required for cosine similarity
  3. Process in batches - more efficient
  4. Cache embeddings - recomputing is expensive
  5. Use descriptive labels - better zero-shot performance (see the prompt-template sketch after this list)
  6. Use a GPU when available - 10-50x faster
  7. Preprocess images - use the provided preprocess function
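
For points 4 and 5, a small sketch of prompt templating: encoding each label under several phrasings and averaging the normalized embeddings tends to be more robust than a bare class name, and the resulting text features can be cached and reused. The template strings below are illustrative, not an official list:

# Encode each label under several prompt templates and average the embeddings
templates = ["a photo of a {}", "a blurry photo of a {}", "a close-up photo of a {}"]
labels = ["dog", "cat", "bird"]

label_features = []
with torch.no_grad():
    for label in labels:
        tokens = clip.tokenize([t.format(label) for t in templates]).to(device)
        feats = model.encode_text(tokens)
        feats /= feats.norm(dim=-1, keepdim=True)
        mean_feat = feats.mean(dim=0)
        label_features.append(mean_feat / mean_feat.norm())

text_features = torch.stack(label_features)  # (num_labels, embedding_dim)

# Cache to disk so the text side never has to be recomputed
torch.save(text_features, "label_features.pt")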

Performance

Operation               CPU      GPU (V100)
Image encoding          ~200ms   ~20ms
Text encoding           ~50ms    ~5ms
Similarity computation  <1ms     <1ms
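
These numbers vary with hardware, model variant, and batch size; a quick way to measure on your own machine is a sketch like the following, reusing model, image, and device from the quick start:

import time

with torch.no_grad():
    start = time.perf_counter()
    model.encode_image(image)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the GPU kernel to finish before timing
    elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Image encoding: {elapsed_ms:.1f} ms")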

Limitations

  1. Not suited for fine-grained tasks - works best on broad categories
  2. Needs descriptive text - vague labels perform poorly
  3. Biased toward its web training data - dataset biases can surface
  4. No bounding boxes - whole-image understanding only
  5. Limited spatial understanding - weak at localization and counting

Resources