NeMo Curator
GPU-accelerated data curation for large language model (LLM) training. Supports text, image, video, and audio. Provides fuzzy deduplication (16× faster), quality filtering (30+ heuristic rules), semantic deduplication, personally identifiable information (PII) redaction, and NSFW detection. Scales across GPUs via RAPIDS. Use it to prepare high-quality training datasets, clean web-scraped data, or deduplicate large corpora.
Skill Metadata
| Source | Optional — install with hermes skills install official/mlops/nemo-curator |
| Path | optional-skills/mlops/nemo-curator |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | nemo-curator, cudf, dask, rapids |
| Tags | Data Processing, NeMo Curator, Data Curation, GPU Acceleration, Deduplication, Quality Filtering, NVIDIA, RAPIDS, PII Redaction, Multimodal, LLM Training Data |
Reference: Full SKILL.md
Info
Below is the full skill definition that Hermes loads when this skill is triggered. These are the instructions the agent sees when the skill activates.
NeMo Curator - GPU-Accelerated Data Curation
NVIDIA's toolkit for preparing high-quality training data for large language models (LLMs).
When to Use NeMo Curator
Use NeMo Curator when you need to:
- Prepare LLM training data from web scrapes (Common Crawl)
- Deduplicate quickly (16× faster than CPU)
- Curate multimodal datasets (text, image, video, audio)
- Filter out low-quality or toxic content
- Scale data processing across GPU clusters
Performance:
- 16× faster fuzzy deduplication (8TB RedPajama v2)
- 40% lower total cost of ownership (TCO) than CPU alternatives
- Near-linear scaling across GPU nodes
Use an alternative instead:
- datatrove: open-source, CPU-based data processing
- dolma: Allen AI's data toolkit
- Ray Data: general-purpose ML data processing (no dedicated curation features)
Quick Start
Installation
# Text curation (CUDA 12)
uv pip install "nemo-curator[text_cuda12]"
# All modalities
uv pip install "nemo-curator[all_cuda12]"
# CPU-only (slower)
uv pip install "nemo-curator[cpu]"
Basic Text Curation Pipeline
from nemo_curator import ScoreFilter, Modify
from nemo_curator.datasets import DocumentDataset
import pandas as pd
# Load data
df = pd.DataFrame({"text": ["Good document", "Bad doc", "Excellent text"]})
dataset = DocumentDataset(df)
# Quality filtering
def quality_score(doc):
    return len(doc["text"].split()) > 5  # Filter out short docs
filtered = ScoreFilter(quality_score)(dataset)
# Deduplication
from nemo_curator.modules import ExactDuplicates
deduped = ExactDuplicates()(filtered)
# Save
deduped.to_parquet("curated_data/")
Data Curation Pipeline
Stage 1: Quality Filtering
from nemo_curator.filters import (
    WordCountFilter,
    RepeatedLinesFilter,
    UrlRatioFilter,
    NonAlphaNumericFilter,
)
# Apply 30+ heuristic filters
from nemo_curator import ScoreFilter
# Word count filter
dataset = dataset.filter(WordCountFilter(min_words=50, max_words=100000))
# Remove repetitive content
dataset = dataset.filter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))
# URL ratio filter
dataset = dataset.filter(UrlRatioFilter(max_url_ratio=0.2))
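The heuristic filters above reduce to simple document statistics. A pure-Python sketch of the word-count and URL-ratio checks (hypothetical helper names, not the NeMo Curator filter classes):

```python
import re

def word_count_ok(text: str, min_words: int = 50, max_words: int = 100_000) -> bool:
    """Keep documents whose whitespace-token count is within bounds."""
    n = len(text.split())
    return min_words <= n <= max_words

def url_ratio_ok(text: str, max_url_ratio: float = 0.2) -> bool:
    """Keep documents where URLs make up at most max_url_ratio of tokens."""
    tokens = text.split()
    if not tokens:
        return False
    urls = sum(1 for t in tokens if re.match(r"https?://", t))
    return urls / len(tokens) <= max_url_ratio

doc = "short doc with a link http://example.com"
print(word_count_ok(doc, min_words=3))  # True: 6 tokens
print(url_ratio_ok(doc))                # True: 1 URL out of 6 tokens ≈ 0.17
```

The real filters are applied per-partition on the Dask-backed dataset, but each one is conceptually just a predicate like these.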
Stage 2: Deduplication
Exact deduplication:
from nemo_curator.modules import ExactDuplicates
# Remove exact duplicates
deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)
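Exact dedup amounts to hashing each document's text and keeping only the first occurrence per hash. A minimal sketch of the idea (not the ExactDuplicates internals):

```python
import hashlib

def exact_dedup(docs):
    """Keep the first occurrence of each exact text; drop later copies."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = [{"id": 1, "text": "hello"}, {"id": 2, "text": "hello"}, {"id": 3, "text": "world"}]
print([d["id"] for d in exact_dedup(docs)])  # [1, 3]
```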
Fuzzy deduplication (16× faster on GPU):
from nemo_curator.modules import FuzzyDuplicates
# MinHash + LSH deduplication
fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,  # MinHash parameters
    num_buckets=20,
    hash_method="md5",
)
deduped = fuzzy_dedup(dataset)
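FuzzyDuplicates is built on MinHash: near-duplicate documents share many per-hash minima over their shingle sets, so signature agreement estimates Jaccard similarity. A toy sketch with small parameters (hypothetical helpers, far simpler than the GPU implementation, which adds LSH bucketing over the signatures):

```python
import hashlib

def shingles(text, k=3):
    """Character k-grams of a document."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(text, num_hashes=20):
    """One minimum per seeded hash function over the shingle set."""
    sig = []
    for seed in range(num_hashes):
        h = lambda s: int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
        sig.append(min(h(s) for s in shingles(text)))
    return sig

def estimated_jaccard(a, b):
    """Fraction of agreeing signature slots ≈ Jaccard similarity."""
    sig_a, sig_b = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

print(estimated_jaccard("the quick brown fox", "the quick brown fox!"))  # high (near-duplicates)
print(estimated_jaccard("the quick brown fox", "completely different"))  # low (unrelated)
```

With num_hashes=260 and num_buckets=20 as above, each bucket covers 13 signature slots, and documents colliding in any bucket become candidate duplicate pairs.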
Semantic deduplication:
from nemo_curator.modules import SemanticDuplicates
# Embedding-based deduplication
semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.8,  # Cosine similarity threshold
)
deduped = semantic_dedup(dataset)
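Semantic dedup marks a pair as duplicate when the cosine similarity of their embeddings meets the threshold. The comparison step, sketched with stand-in vectors instead of a real embedding model:

```python
import math

def cosine_similarity(a, b):
    """Dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def is_semantic_duplicate(emb_a, emb_b, threshold=0.8):
    return cosine_similarity(emb_a, emb_b) >= threshold

# Stand-in embeddings; a real pipeline would use e.g. all-MiniLM-L6-v2 vectors.
print(is_semantic_duplicate([1.0, 0.0, 1.0], [0.9, 0.1, 1.0]))  # True
print(is_semantic_duplicate([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # False
```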
Stage 3: PII Redaction
from nemo_curator.modules import Modify
from nemo_curator.modifiers import PIIRedactor
# Redact personally identifiable information
pii_redactor = PIIRedactor(
    supported_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "LOCATION"],
    anonymize_action="replace",  # or "redact"
)
redacted = Modify(pii_redactor)(dataset)
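Conceptually, PII redaction is entity detection followed by replacement. A regex-only sketch for the EMAIL_ADDRESS entity (real redactors combine patterns with NER models and cover many more entity types):

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_emails(text: str, replacement: str = "[EMAIL_ADDRESS]") -> str:
    """Replace every email-looking span with a placeholder token."""
    return EMAIL_RE.sub(replacement, text)

print(redact_emails("Contact alice@example.com for details."))
# Contact [EMAIL_ADDRESS] for details.
```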
Stage 4: Classifier-Based Filtering
from nemo_curator.classifiers import QualityClassifier
# Quality classification
quality_clf = QualityClassifier(
    model_path="nvidia/quality-classifier-deberta",
    batch_size=256,
    device="cuda",
)
# Filter low-quality documents
high_quality = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)
GPU Acceleration
GPU vs. CPU Performance
| Operation | CPU (16 cores) | GPU (A100) | Speedup |
|---|---|---|---|
| Fuzzy dedup (8TB) | 120 h | 7.5 h | 16× |
| Exact dedup (1TB) | 8 h | 0.5 h | 16× |
| Quality filtering | 2 h | 0.2 h | 10× |
Multi-GPU Scaling
from nemo_curator import get_client
import dask_cuda
# Initialize GPU cluster
client = get_client(cluster_type="gpu", n_workers=8)
# Process with 8 GPUs
deduped = FuzzyDuplicates(...)(dataset)
Multimodal Curation
Image Curation
from nemo_curator.image import (
    AestheticFilter,
    NSFWFilter,
    CLIPEmbedder,
)
# Aesthetic scoring
aesthetic_filter = AestheticFilter(threshold=5.0)
filtered_images = aesthetic_filter(image_dataset)
# NSFW detection
nsfw_filter = NSFWFilter(threshold=0.9)
safe_images = nsfw_filter(filtered_images)
# Generate CLIP embeddings
clip_embedder = CLIPEmbedder(model="openai/clip-vit-base-patch32")
image_embeddings = clip_embedder(safe_images)
Video Curation
from nemo_curator.video import (
    SceneDetector,
    ClipExtractor,
    InternVideo2Embedder,
)
# Detect scenes
scene_detector = SceneDetector(threshold=27.0)
scenes = scene_detector(video_dataset)
# Extract clips
clip_extractor = ClipExtractor(min_duration=2.0, max_duration=10.0)
clips = clip_extractor(scenes)
# Generate embeddings
video_embedder = InternVideo2Embedder()
video_embeddings = video_embedder(clips)
Audio Curation
from nemo_curator.audio import (
    ASRInference,
    WERFilter,
    DurationFilter,
)
# ASR transcription
asr = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")
transcribed = asr(audio_dataset)
# Filter by WER (word error rate)
wer_filter = WERFilter(max_wer=0.3)
high_quality_audio = wer_filter(transcribed)
# Duration filtering
duration_filter = DurationFilter(min_duration=1.0, max_duration=30.0)
filtered_audio = duration_filter(high_quality_audio)
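The WER filter compares each ASR transcript against a reference; word error rate is word-level edit distance divided by reference length. A sketch of the computation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "the bat sat"))  # 1 substitution / 3 words ≈ 0.33
```

With max_wer=0.3 as above, a three-word utterance with one misrecognized word (WER ≈ 0.33) would be dropped.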
Common Patterns
Web-Scrape Curation (Common Crawl)
from nemo_curator import ScoreFilter, Modify
from nemo_curator.filters import *
from nemo_curator.modules import *
from nemo_curator.datasets import DocumentDataset
# Load Common Crawl data
dataset = DocumentDataset.read_parquet("common_crawl/*.parquet")
# Pipeline
pipeline = [
    # 1. Quality filtering
    WordCountFilter(min_words=100, max_words=50000),
    RepeatedLinesFilter(max_repeated_line_fraction=0.2),
    SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3),
    UrlRatioFilter(max_url_ratio=0.3),
    # 2. Language filtering
    LanguageIdentificationFilter(target_languages=["en"]),
    # 3. Deduplication
    ExactDuplicates(id_field="id", text_field="text"),
    FuzzyDuplicates(id_field="id", text_field="text", num_hashes=260),
    # 4. PII redaction
    PIIRedactor(),
    # 5. NSFW filtering
    NSFWClassifier(threshold=0.8),
]
# Execute
for stage in pipeline:
    dataset = stage(dataset)
# Save
dataset.to_parquet("curated_common_crawl/")
Distributed Processing
from nemo_curator import get_client
from dask_cuda import LocalCUDACluster
# Multi-GPU cluster
cluster = LocalCUDACluster(n_workers=8)
client = get_client(cluster=cluster)
# Process large dataset
dataset = DocumentDataset.read_parquet("s3://large_dataset/*.parquet")
deduped = FuzzyDuplicates(...)(dataset)
# Cleanup
client.close()
cluster.close()
Performance Benchmarks
Fuzzy Deduplication (8TB RedPajama v2)
- CPU (256 cores): 120 hours
- GPU (8× A100): 7.5 hours
- Speedup: 16×
Exact Deduplication (1TB)
- CPU (64 cores): 8 hours
- GPU (4× A100): 0.5 hours
- Speedup: 16×
Quality Filtering (100GB)
- CPU (32 cores): 2 hours
- GPU (2× A100): 0.2 hours
- Speedup: 10×
Cost Comparison
CPU-based curation (AWS c5.18xlarge × 10):
- Cost: $3.60/hour × 10 = $36/hour
- Time for 8TB: 120 hours
- Total: $4,320
GPU-based curation (AWS p4d.24xlarge × 2):
- Cost: $32.77/hour × 2 = $65.54/hour
- Time for 8TB: 7.5 hours
- Total: $491.55
Savings: 89% lower cost ($3,828 saved)
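The cost comparison above is plain arithmetic; reproducing the figures listed (instance prices vary by region and over time):

```python
def total_cost(hourly_rate: float, instances: int, hours: float) -> float:
    """Fleet cost = per-instance hourly rate x instance count x runtime."""
    return hourly_rate * instances * hours

cpu_total = total_cost(3.60, 10, 120)  # c5.18xlarge fleet, 120 h
gpu_total = total_cost(32.77, 2, 7.5)  # p4d.24xlarge pair, 7.5 h
print(f"CPU: ${cpu_total:,.2f}")       # CPU: $4,320.00
print(f"GPU: ${gpu_total:,.2f}")       # GPU: $491.55
print(f"Savings: {1 - gpu_total / cpu_total:.0%}")  # Savings: 89%
```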
Supported Data Formats
- Input: Parquet, JSONL, CSV
- Output: Parquet (recommended), JSONL
- WebDataset: TAR archives for multimodal data
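JSONL input is one JSON object per line. A stdlib round-trip showing the format (independent of NeMo Curator's own readers):

```python
import json
import os
import tempfile

docs = [{"id": 1, "text": "First document"}, {"id": 2, "text": "Second document"}]

path = os.path.join(tempfile.gettempdir(), "sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")  # one JSON object per line

with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded[0]["text"])  # First document
```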
Use Cases
Production deployments:
- NVIDIA used NeMo Curator to prepare training data for Nemotron-4
- Curated open datasets: RedPajama v2, The Pile
References
Resources
- GitHub: https://github.com/NVIDIA/NeMo-Curator ⭐ 500+
- Docs: https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/
- Version: 0.4.0+
- License: Apache 2.0