Stable Diffusion Image Generation
State-of-the-art text-to-image generation with Stable Diffusion models via HuggingFace Diffusers. Use this skill to generate images from text prompts, perform image-to-image translation, inpaint masked regions, or build custom diffusion pipelines.
Skill Metadata
| Source | Optional: install with hermes skills install official/mlops/stable-diffusion |
| Path | optional-skills/mlops/stable-diffusion |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | diffusers>=0.30.0, transformers>=4.41.0, accelerate>=0.31.0, torch>=2.0.0 |
| Tags | Image Generation, Stable Diffusion, Diffusers, Text-to-Image, Multimodal, Computer Vision |
Reference: Full SKILL.md
Info
Below is the full skill definition that Hermes loads when this skill is triggered. This is the instruction set the agent sees when the skill is activated.
Stable Diffusion Image Generation
A comprehensive guide to generating images with Stable Diffusion using the HuggingFace Diffusers library.
When to Use Stable Diffusion
Use Stable Diffusion when you need to:
- Generate images from text descriptions
- Perform image-to-image translation (style transfer, enhancement)
- Inpaint images (fill in masked regions)
- Outpaint images (extend an image beyond its borders)
- Create variations of existing images
- Build custom image-generation workflows
Key capabilities:
- Text-to-image: generate images from natural-language prompts
- Image-to-image: transform existing images with text guidance
- Inpainting: fill masked regions with context-aware content
- ControlNet: add spatial conditioning (edges, pose, depth)
- LoRA support: efficient fine-tuning and style adaptation
- Multiple models: SD 1.5, SDXL, SD 3.0, Flux
Use an alternative instead:
- DALL-E 3: for API-based generation without a GPU
- Midjourney: for artistic, stylized output
- Imagen: for Google Cloud integration
- Leonardo.ai: for web-based creative workflows
Quick Start
Installation
pip install diffusers transformers accelerate torch
pip install xformers # Optional: memory-efficient attention
Basic text-to-image
from diffusers import DiffusionPipeline
import torch

# Load pipeline (auto-detects model type)
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)
pipe.to("cuda")

# Generate image
image = pipe(
    "A serene mountain landscape at sunset, highly detailed",
    num_inference_steps=50,
    guidance_scale=7.5
).images[0]
image.save("output.png")
Using SDXL (higher quality)
from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)
# Enable memory optimization; model CPU offload manages device
# placement itself, so do not also call pipe.to("cuda")
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="A futuristic city with flying cars, cinematic lighting",
    height=1024,
    width=1024,
    num_inference_steps=30
).images[0]
Architecture Overview
Three-pillar design
Diffusers is built around three core components:
Pipeline (orchestration)
├── Model (neural networks)
│ ├── UNet / Transformer (noise prediction)
│ ├── VAE (latent encoding/decoding)
│ └── Text Encoder (CLIP/T5)
└── Scheduler (denoising algorithm)
Pipeline inference flow
Text Prompt → Text Encoder → Text Embeddings
                                    ↓
Random Latent Noise → [Denoising Loop] ← Scheduler
                            ↓
                    Denoised Latents
                            ↓
                      VAE Decoder → Final Image
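A quick way to see these pillars concretely: every loaded pipeline exposes its components as attributes. A minimal sketch (SD 1.5 component classes shown in the comments):
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)
# The three pillars, exposed as pipeline attributes
print(type(pipe.unet).__name__)          # UNet2DConditionModel (noise prediction)
print(type(pipe.vae).__name__)           # AutoencoderKL (latent encode/decode)
print(type(pipe.text_encoder).__name__)  # CLIPTextModel (prompt embedding)
print(type(pipe.scheduler).__name__)     # e.g. PNDMScheduler (denoising algorithm)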
Core Concepts
Pipelines
Pipelines orchestrate the complete workflow:
| Pipeline | Purpose |
|---|---|
| StableDiffusionPipeline | Text-to-image (SD 1.x/2.x) |
| StableDiffusionXLPipeline | Text-to-image (SDXL) |
| StableDiffusion3Pipeline | Text-to-image (SD 3.0) |
| FluxPipeline | Text-to-image (Flux models) |
| StableDiffusionImg2ImgPipeline | Image-to-image |
| StableDiffusionInpaintPipeline | Inpainting |
Schedulers
Schedulers control the denoising process:
| Scheduler | Steps | Quality | Use case |
|---|---|---|---|
| EulerDiscreteScheduler | 20-50 | Good | Default choice |
| EulerAncestralDiscreteScheduler | 20-50 | Good | More variation |
| DPMSolverMultistepScheduler | 15-25 | Excellent | Fast, high quality |
| DDIMScheduler | 50-100 | Good | Deterministic |
| LCMScheduler | 4-8 | Good | Extremely fast |
| UniPCMultistepScheduler | 15-25 | Excellent | Fast convergence |
Swapping schedulers
from diffusers import DPMSolverMultistepScheduler

# Swap for faster generation
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)
# Now generate with fewer steps
image = pipe(prompt, num_inference_steps=20).images[0]
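To see which scheduler classes can be swapped into the loaded pipeline, the current scheduler exposes a compatibles list:
# Scheduler classes that are drop-in compatible with this pipeline
for scheduler_class in pipe.scheduler.compatibles:
    print(scheduler_class.__name__)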
Generation Parameters
Key parameters
| Parameter | Default | Description |
|---|---|---|
| prompt | required | Text description of the desired image |
| negative_prompt | None | What to avoid in the image |
| num_inference_steps | 50 | Number of denoising steps (more = higher quality) |
| guidance_scale | 7.5 | Prompt adherence (typically 7-12) |
| height, width | 512 (SD 1.x) / 1024 (SDXL) | Output dimensions; must be multiples of 8 |
| generator | None | Torch generator for reproducibility |
| num_images_per_prompt | 1 | Batch size |
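Several of these parameters combined in a single call (the prompt text is illustrative only):
import torch

generator = torch.Generator(device="cuda").manual_seed(0)
images = pipe(
    prompt="An astronaut riding a horse on Mars",
    negative_prompt="blurry, low quality",
    num_inference_steps=30,
    guidance_scale=7.5,
    height=512,
    width=512,
    generator=generator,
    num_images_per_prompt=2
).images  # list of 2 PIL images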
Reproducible generation
import torch

generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    prompt="A cat wearing a top hat",
    generator=generator,
    num_inference_steps=50
).images[0]
Negative prompts
image = pipe(
    prompt="Professional photo of a dog in a garden",
    negative_prompt="blurry, low quality, distorted, ugly, bad anatomy",
    guidance_scale=7.5
).images[0]
Image-to-Image
Transform existing images with text guidance:
from diffusers import AutoPipelineForImage2Image
from PIL import Image
import torch

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.jpg").resize((512, 512))
image = pipe(
    prompt="A watercolor painting of the scene",
    image=init_image,
    strength=0.75,  # How much to transform (0-1)
    num_inference_steps=50
).images[0]
Inpainting
Fill in masked regions:
from diffusers import AutoPipelineForInpainting
from PIL import Image
import torch

pipe = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.jpg")
mask = Image.open("mask.png")  # White = inpaint region
result = pipe(
    prompt="A red car parked on the street",
    image=image,
    mask_image=mask,
    num_inference_steps=50
).images[0]
ControlNet
Add spatial conditioning for precise control:
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch

# Load ControlNet for edge conditioning
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Use a Canny edge image as the control (helper sketched below)
control_image = get_canny_image(input_image)
image = pipe(
    prompt="A beautiful house in the style of Van Gogh",
    image=control_image,
    num_inference_steps=30
).images[0]
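get_canny_image above is not a Diffusers function; a minimal sketch using opencv-python and numpy (assumed installed via pip install opencv-python) could look like this:
import cv2
import numpy as np
from PIL import Image

def get_canny_image(image, low_threshold=100, high_threshold=200):
    # Detect edges, then replicate to 3 channels for ControlNet input
    edges = cv2.Canny(np.array(image), low_threshold, high_threshold)
    edges = np.stack([edges] * 3, axis=-1)
    return Image.fromarray(edges)

input_image = Image.open("input.jpg").resize((512, 512))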
Available ControlNets
| ControlNet | Input type | Use case |
|---|---|---|
| canny | Edge map | Preserve structure |
| openpose | Pose skeleton | Human poses |
| depth | Depth map | 3D-aware generation |
| normal | Normal map | Surface detail |
| mlsd | Line segments | Architectural lines |
| scribble | Rough sketch | Sketch-to-image |
LoRA Adapters
Load fine-tuned style adapters:
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Load LoRA weights
pipe.load_lora_weights("path/to/lora", weight_name="style.safetensors")
# Generate with LoRA style
image = pipe("A portrait in the trained style").images[0]
# Adjust LoRA strength
pipe.fuse_lora(lora_scale=0.8)
# Unload LoRA
pipe.unload_lora_weights()
Multiple LoRAs
# Load multiple LoRAs
pipe.load_lora_weights("lora1", adapter_name="style")
pipe.load_lora_weights("lora2", adapter_name="character")
# Set weights for each
pipe.set_adapters(["style", "character"], adapter_weights=[0.7, 0.5])
image = pipe("A portrait").images[0]
Memory Optimization
Enabling CPU offload
# Model CPU offload - moves models to CPU when not in use
pipe.enable_model_cpu_offload()
# Sequential CPU offload - more aggressive, slower
pipe.enable_sequential_cpu_offload()
Attention slicing
# Reduce memory by computing attention in chunks
pipe.enable_attention_slicing()
# Or specific chunk size
pipe.enable_attention_slicing("max")
xFormers memory-efficient attention
# Requires xformers package
pipe.enable_xformers_memory_efficient_attention()
VAE slicing and tiling for large images
# Decode latents in tiles for large images
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()
Model Variants
Loading different precisions
# FP16 (recommended for GPU)
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.float16,
    variant="fp16"
)
# BF16 (better precision, requires Ampere+ GPU)
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.bfloat16
)
Loading specific components
from diffusers import UNet2DConditionModel, AutoencoderKL

# Load custom VAE
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
# Use with pipeline
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    vae=vae,
    torch_dtype=torch.float16
)
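The same pattern works for other components; for instance, the UNet2DConditionModel imported above can be loaded from a checkpoint's unet subfolder and passed to the pipeline:
import torch
from diffusers import DiffusionPipeline, UNet2DConditionModel

# Load the UNet from the checkpoint's "unet" subfolder
unet = UNet2DConditionModel.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    subfolder="unet",
    torch_dtype=torch.float16
)
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    unet=unet,
    torch_dtype=torch.float16
)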
Batch Generation
Generate multiple images efficiently:
# Multiple prompts
prompts = [
    "A cat playing piano",
    "A dog reading a book",
    "A bird painting a picture"
]
images = pipe(prompts, num_inference_steps=30).images

# Multiple images per prompt
images = pipe(
    "A beautiful sunset",
    num_images_per_prompt=4,
    num_inference_steps=30
).images
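For reproducible batches, generator also accepts a list with one torch.Generator per image:
import torch

# One seeded generator per image makes each batch item reproducible
generators = [torch.Generator(device="cuda").manual_seed(i) for i in range(4)]
images = pipe(
    "A beautiful sunset",
    num_images_per_prompt=4,
    generator=generators,
    num_inference_steps=30
).images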
Common Workflows
Workflow 1: high-quality generation
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch

# 1. Load SDXL with optimizations
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# Model CPU offload handles device placement; skip pipe.to("cuda")
pipe.enable_model_cpu_offload()

# 2. Generate with quality settings
image = pipe(
    prompt="A majestic lion in the savanna, golden hour lighting, 8k, detailed fur",
    negative_prompt="blurry, low quality, cartoon, anime, sketch",
    num_inference_steps=30,
    guidance_scale=7.5,
    height=1024,
    width=1024
).images[0]
Workflow 2: fast prototyping
from diffusers import AutoPipelineForText2Image, LCMScheduler
import torch

# Use LCM for 4-8 step generation
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
).to("cuda")

# Load LCM LoRA for fast generation
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.fuse_lora()

# Generate in ~1 second
image = pipe(
    "A beautiful landscape",
    num_inference_steps=4,
    guidance_scale=1.0
).images[0]
Common Issues
CUDA out of memory:
# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
# Or use lower precision
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
Black or noisy images:
# Safety checker: flagged outputs are replaced with black images;
# disable it (use responsibly) to rule that out
pipe.safety_checker = None
# Ensure all components share a consistent dtype
pipe = pipe.to(dtype=torch.float16)
# Also check the VAE: fp16 overflow can produce black images (fix below)
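With SDXL in float16, black images are often due to numerical overflow in the stock VAE. A common remedy, assuming the community madebyollin/sdxl-vae-fp16-fix checkpoint, is to swap in an fp16-safe VAE:
import torch
from diffusers import AutoencoderKL

# fp16-safe SDXL VAE that avoids the overflow producing black images
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix",
    torch_dtype=torch.float16
)
pipe.vae = vae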
Slow generation:
# Use faster scheduler
from diffusers import DPMSolverMultistepScheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# Reduce steps
image = pipe(prompt, num_inference_steps=20).images[0]