Segment Anything Model

用於零樣本遷移的圖像分割基礎模型。當您需要使用點、框或掩碼作為提示來分割圖像中的任意對象，或自動生成圖像中所有對象的掩碼時，請使用此模型。

技能元數據


來源	捆綁（默認安裝）
路徑	`skills/mlops/models/segment-anything`
版本	`1.0.0`
作者	Orchestra Research
許可證	MIT
依賴項	`segment-anything`, `transformers>=4.30.0`, `torch>=1.7.0`
標籤	`Multimodal`, `Image Segmentation`, `Computer Vision`, `SAM`, `Zero-Shot`

參考：完整 SKILL.md

信息

以下是 Hermes 在觸發此技能時加載的完整技能定義。這是技能激活時代理看到的指令。

Segment Anything Model (SAM)

使用 Meta AI 的 Segment Anything Model 進行零樣本圖像分割的綜合指南。

何時使用 SAM

在以下情況使用 SAM：

需要在無需特定任務訓練的情況下分割圖像中的任意對象
構建帶有點擊/框提示的交互式標註工具
為其他視覺模型生成訓練數據
需要零樣本遷移到新的圖像領域
構建對象檢測/分割流水線
處理醫療、衛星或特定領域的圖像

主要特性：

零樣本分割：無需微調即可適用於任何圖像領域
靈活的提示：支持點、邊界框或之前的掩碼
自動分割：自動生成所有對象掩碼
高質量：基於來自 1100 萬張圖像的 11 億個掩碼進行訓練
多種模型尺寸：ViT-B（最快）、ViT-L、ViT-H（最準確）
ONNX 導出：可在瀏覽器和邊緣設備中部署

改用其他替代方案：

YOLO/Detectron2：用於帶類別的實時對象檢測
Mask2Former：用於帶類別的語義/全景分割
GroundingDINO + SAM：用於文本提示分割
SAM 2：用於視頻分割任務

快速開始

安裝

# From GitHub
pip install git+https://github.com/facebookresearch/segment-anything.git

# Optional dependencies
pip install opencv-python pycocotools matplotlib

# Or use HuggingFace transformers
pip install transformers

下載檢查點

# ViT-H (largest, most accurate) - 2.4GB
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

# ViT-L (medium) - 1.2GB
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_l_0b3195.pth

# ViT-B (smallest, fastest) - 375MB
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth

使用 SamPredictor 的基本用法

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load model
sam = sam_model_registry["vit_h"](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/models/segment-anything/checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")

# Create predictor
predictor = SamPredictor(sam)

# Set image (computes embeddings once)
image = cv2.imread("image.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Predict with point prompts
input_point = np.array([[500, 375]])  # (x, y) coordinates
input_label = np.array([1])  # 1 = foreground, 0 = background

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True  # Returns 3 mask options
)

# Select best mask
best_mask = masks[np.argmax(scores)]

HuggingFace Transformers

import torch
from PIL import Image
from transformers import SamModel, SamProcessor

# Load model and processor
model = SamModel.from_pretrained("facebook/sam-vit-huge")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
model.to("cuda")

# Process image with point prompt
image = Image.open("image.jpg")
input_points = [[[450, 600]]]  # Batch of points

inputs = processor(image, input_points=input_points, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

# Generate masks
with torch.no_grad():
    outputs = model(**inputs)

# Post-process masks to original size
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu()
)

核心概念

模型架構

SAM Architecture:
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Image Encoder  │────▶│ Prompt Encoder  │────▶│  Mask Decoder   │
│     (ViT)       │     │ (Points/Boxes)  │     │ (Transformer)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
        │                       │                       │
   Image Embeddings      Prompt Embeddings         Masks + IoU
   (computed once)       (per prompt)             predictions

模型變體

模型	檢查點	大小	速度	準確度
ViT-H	`vit_h`	2.4 GB	最慢	最佳
ViT-L	`vit_l`	1.2 GB	中等	良好
ViT-B	`vit_b`	375 MB	最快	良好

提示類型

提示	描述	用例
點（前景）	點擊對象	單個對象選擇
點（背景）	點擊對象外部	排除區域
邊界框	對象周圍的矩形	較大對象
之前的掩碼	低分辨率掩碼輸入	迭代優化

交互式分割

點提示

# Single foreground point
input_point = np.array([[500, 375]])
input_label = np.array([1])

masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True
)

# Multiple points (foreground + background)
input_points = np.array([[500, 375], [600, 400], [450, 300]])
input_labels = np.array([1, 1, 0])  # 2 foreground, 1 background

masks, scores, logits = predictor.predict(
    point_coords=input_points,
    point_labels=input_labels,
    multimask_output=False  # Single mask when prompts are clear
)

框提示

# Bounding box [x1, y1, x2, y2]
input_box = np.array([425, 600, 700, 875])

masks, scores, logits = predictor.predict(
    box=input_box,
    multimask_output=False
)

組合提示

# Box + points for precise control
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    box=np.array([400, 300, 700, 600]),
    multimask_output=False
)

# Initial prediction
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True
)

# Refine with additional point using previous mask
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375], [550, 400]]),
    point_labels=np.array([1, 0]),  # Add background point
    mask_input=logits[np.argmax(scores)][None, :, :],  # Use best mask
    multimask_output=False
)

自動掩碼生成

基本自動分割

from segment_anything import SamAutomaticMaskGenerator

# Create generator
mask_generator = SamAutomaticMaskGenerator(sam)

# Generate all masks
masks = mask_generator.generate(image)

# Each mask contains:
# - segmentation: binary mask
# - bbox: [x, y, w, h]
# - area: pixel count
# - predicted_iou: quality score
# - stability_score: robustness score
# - point_coords: generating point

自定義生成

mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=32,          # Grid density (more = more masks)
    pred_iou_thresh=0.88,        # Quality threshold
    stability_score_thresh=0.95,  # Stability threshold
    crop_n_layers=1,             # Multi-scale crops
    crop_n_points_downscale_factor=2,
    min_mask_region_area=100,    # Remove tiny masks
)

masks = mask_generator.generate(image)

過濾掩碼

# Sort by area (largest first)
masks = sorted(masks, key=lambda x: x['area'], reverse=True)

# Filter by predicted IoU
high_quality = [m for m in masks if m['predicted_iou'] > 0.9]

# Filter by stability score
stable_masks = [m for m in masks if m['stability_score'] > 0.95]

批量推理

多張圖像

# Process multiple images efficiently
images = [cv2.imread(f"image_{i}.jpg") for i in range(10)]

all_masks = []
for image in images:
    predictor.set_image(image)
    masks, _, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True
    )
    all_masks.append(masks)

每張圖像多個提示

# Process multiple prompts efficiently (one image encoding)
predictor.set_image(image)

# Batch of point prompts
points = [
    np.array([[100, 100]]),
    np.array([[200, 200]]),
    np.array([[300, 300]])
]

all_masks = []
for point in points:
    masks, scores, _ = predictor.predict(
        point_coords=point,
        point_labels=np.array([1]),
        multimask_output=True
    )
    all_masks.append(masks[np.argmax(scores)])

ONNX 部署

導出模型

python scripts/export_onnx_model.py \
    --checkpoint sam_vit_h_4b8939.pth \
    --model-type vit_h \
    --output sam_onnx.onnx \
    --return-single-mask

使用 ONNX 模型

import onnxruntime

# Load ONNX model
ort_session = onnxruntime.InferenceSession("sam_onnx.onnx")

# Run inference (image embeddings computed separately)
masks = ort_session.run(
    None,
    {
        "image_embeddings": image_embeddings,
        "point_coords": point_coords,
        "point_labels": point_labels,
        "mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32),
        "has_mask_input": np.array([0], dtype=np.float32),
        "orig_im_size": np.array([h, w], dtype=np.float32)
    }
)

常見工作流

工作流 1：標註工具

import cv2

# Load model
predictor = SamPredictor(sam)
predictor.set_image(image)

def on_click(event, x, y, flags, param):
    if event == cv2.EVENT_LBUTTONDOWN:
        # Foreground point
        masks, scores, _ = predictor.predict(
            point_coords=np.array([[x, y]]),
            point_labels=np.array([1]),
            multimask_output=True
        )
        # Display best mask
        display_mask(masks[np.argmax(scores)])

工作流 2：對象提取

def extract_object(image, point):
    """Extract object at point with transparent background."""
    predictor.set_image(image)

    masks, scores, _ = predictor.predict(
        point_coords=np.array([point]),
        point_labels=np.array([1]),
        multimask_output=True
    )

    best_mask = masks[np.argmax(scores)]

    # Create RGBA output
    rgba = np.zeros((image.shape[0], image.shape[1], 4), dtype=np.uint8)
    rgba[:, :, :3] = image
    rgba[:, :, 3] = best_mask * 255

    return rgba

工作流 3：醫學圖像分割

# Process medical images (grayscale to RGB)
medical_image = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)
rgb_image = cv2.cvtColor(medical_image, cv2.COLOR_GRAY2RGB)

predictor.set_image(rgb_image)

# Segment region of interest
masks, scores, _ = predictor.predict(
    box=np.array([x1, y1, x2, y2]),  # ROI bounding box
    multimask_output=True
)

輸出格式

掩碼數據結構

# SamAutomaticMaskGenerator output
{
    "segmentation": np.ndarray,  # H×W binary mask
    "bbox": [x, y, w, h],        # Bounding box
    "area": int,                 # Pixel count
    "predicted_iou": float,      # 0-1 quality score
    "stability_score": float,    # 0-1 robustness score
    "crop_box": [x, y, w, h],    # Generation crop region
    "point_coords": [[x, y]],    # Input point
}

COCO RLE 格式

from pycocotools import mask as mask_utils

# Encode mask to RLE
rle = mask_utils.encode(np.asfortranarray(mask.astype(np.uint8)))
rle["counts"] = rle["counts"].decode("utf-8")

# Decode RLE to mask
decoded_mask = mask_utils.decode(rle)

性能優化

GPU 顯存

# Use smaller model for limited VRAM
sam = sam_model_registry["vit_b"](https://github.com/NousResearch/hermes-agent/blob/main/skills/mlops/models/segment-anything/checkpoint="sam_vit_b_01ec64.pth")

# Process images in batches
# Clear CUDA cache between large batches
torch.cuda.empty_cache()

速度優化

# Use half precision
sam = sam.half()

# Reduce points for automatic generation
mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=16,  # Default is 32
)

# Use ONNX for deployment
# Export with --return-single-mask for faster inference

常見問題

問題	解決方案
內存不足	使用 ViT-B 模型，減小圖像尺寸
推理緩慢	使用 ViT-B，減少 points_per_side
掩碼質量差	嘗試不同的提示，使用框 + 點
邊緣偽影	使用 stability_score 過濾
遺漏小對象	增加 points_per_side

參考資料

高級用法 - 批處理、微調、集成
故障排除 - 常見問題及解決方案

資源

GitHub: https://github.com/facebookresearch/segment-anything
論文: https://arxiv.org/abs/2304.02643
演示: https://segment-anything.com
SAM 2（視頻）: https://github.com/facebookresearch/segment-anything-2
HuggingFace: https://huggingface.co/facebook/sam-vit-huge

技能元數據​

參考：完整 SKILL.md​

Segment Anything Model (SAM)

何時使用 SAM​

快速開始​

安裝​

下載檢查點​

使用 SamPredictor 的基本用法​

HuggingFace Transformers​

核心概念​

模型架構​

模型變體​

提示類型​

交互式分割​

點提示​

框提示​

組合提示​

迭代優化​

自動掩碼生成​

基本自動分割​

自定義生成​

過濾掩碼​

批量推理​

多張圖像​

每張圖像多個提示​

ONNX 部署​

導出模型​

使用 ONNX 模型​

常見工作流​

工作流 1：標註工具​

工作流 2：對象提取​

工作流 3：醫學圖像分割​

輸出格式​

掩碼數據結構​

COCO RLE 格式​

性能優化​

GPU 顯存​

速度優化​

常見問題​

參考資料​

資源​