
Stable Diffusion Image Generation

State-of-the-art text-to-image generation with Stable Diffusion models via HuggingFace Diffusers. Use it to generate images from text prompts, perform image-to-image translation, inpaint masked regions, or build custom diffusion pipelines.

Skill Metadata

Source: Optional — install with hermes skills install official/mlops/stable-diffusion
Path: optional-skills/mlops/stable-diffusion
Version: 1.0.0
Author: Orchestra Research
License: MIT
Dependencies: diffusers>=0.30.0, transformers>=4.41.0, accelerate>=0.31.0, torch>=2.0.0
Tags: Image Generation, Stable Diffusion, Diffusers, Text-to-Image, Multimodal, Computer Vision

Reference: full SKILL.md

Info

Below is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is activated.

Stable Diffusion Image Generation

A comprehensive guide to generating images with Stable Diffusion through the HuggingFace Diffusers library.

When to Use Stable Diffusion

Use Stable Diffusion when you need to:

  • Generate images from text descriptions
  • Perform image-to-image translation (style transfer, enhancement)
  • Inpaint (fill in masked regions)
  • Outpaint (extend an image beyond its borders)
  • Create variations of an existing image
  • Build custom image-generation workflows

Key capabilities:

  • Text-to-image: generate images from natural-language prompts
  • Image-to-image: transform existing images with text guidance
  • Inpainting: fill masked regions with context-aware content
  • ControlNet: add spatial conditioning (edges, pose, depth)
  • LoRA support: efficient fine-tuning and style adaptation
  • Multiple models: SD 1.5, SDXL, SD 3.0, and Flux

Consider an alternative instead:

  • DALL-E 3: API-based generation with no GPU required
  • Midjourney: artistic, stylized output
  • Imagen: Google Cloud integration
  • Leonardo.ai: web-based creative workflows

Quick Start

Installation

pip install diffusers transformers accelerate torch
pip install xformers # Optional: memory-efficient attention

Basic Text-to-Image

from diffusers import DiffusionPipeline
import torch

# Load pipeline (auto-detects model type)
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)
pipe.to("cuda")

# Generate image
image = pipe(
    "A serene mountain landscape at sunset, highly detailed",
    num_inference_steps=50,
    guidance_scale=7.5
).images[0]

image.save("output.png")

Using SDXL (Higher Quality)

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)

# Enable memory optimization (handles device placement itself,
# so do not also call pipe.to("cuda"))
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="A futuristic city with flying cars, cinematic lighting",
    height=1024,
    width=1024,
    num_inference_steps=30
).images[0]

Architecture Overview

Three-Pillar Design

Diffusers is built around three core components:

Pipeline (orchestration)
├── Model (neural networks)
│   ├── UNet / Transformer (noise prediction)
│   ├── VAE (latent encoding/decoding)
│   └── Text Encoder (CLIP/T5)
└── Scheduler (denoising algorithm)

Pipeline Inference Flow

Text Prompt → Text Encoder → Text Embeddings
                                  ↓
Random Noise → [Denoising Loop] ← Scheduler
                     ↓
              Predicted Noise
                     ↓
        VAE Decoder → Final Image

Core Concepts

Pipelines

Pipelines orchestrate the complete workflow:

Pipeline                           Purpose
StableDiffusionPipeline            Text-to-image (SD 1.x/2.x)
StableDiffusionXLPipeline          Text-to-image (SDXL)
StableDiffusion3Pipeline           Text-to-image (SD 3.0)
FluxPipeline                       Text-to-image (Flux models)
StableDiffusionImg2ImgPipeline     Image-to-image
StableDiffusionInpaintPipeline     Inpainting

Schedulers

Schedulers control the denoising process:

Scheduler                          Steps    Quality    Use case
EulerDiscreteScheduler             20-50    Good       Solid default
EulerAncestralDiscreteScheduler    20-50    Good       More variation
DPMSolverMultistepScheduler        15-25    Excellent  Fast, high quality
DDIMScheduler                      50-100   Good       Deterministic
LCMScheduler                       4-8      Good       Extremely fast
UniPCMultistepScheduler            15-25    Excellent  Fast convergence

Swapping Schedulers

from diffusers import DPMSolverMultistepScheduler

# Swap for faster generation
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)

# Now generate with fewer steps
image = pipe(prompt, num_inference_steps=20).images[0]

Generation Parameters

Key Parameters

Parameter               Default    Description
prompt                  required   Text description of the desired image
negative_prompt         None       What to avoid in the image
num_inference_steps     50         Number of denoising steps (more = higher quality)
guidance_scale          7.5        Prompt adherence (typically 7-12)
height, width           512/1024   Output dimensions (multiples of 8)
generator               None       Torch generator for reproducibility
num_images_per_prompt   1          Batch size

Reproducible Generation

import torch

generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    prompt="A cat wearing a top hat",
    generator=generator,
    num_inference_steps=50
).images[0]

Negative Prompts

image = pipe(
    prompt="Professional photo of a dog in a garden",
    negative_prompt="blurry, low quality, distorted, ugly, bad anatomy",
    guidance_scale=7.5
).images[0]

Image-to-Image

Transform an existing image with text guidance:

from diffusers import AutoPipelineForImage2Image
from PIL import Image
import torch

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.jpg").resize((512, 512))

image = pipe(
    prompt="A watercolor painting of the scene",
    image=init_image,
    strength=0.75,  # How much to transform (0-1)
    num_inference_steps=50
).images[0]

Inpainting

Fill in masked regions:

from diffusers import AutoPipelineForInpainting
from PIL import Image
import torch

pipe = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.jpg")
mask = Image.open("mask.png")  # White = inpaint region

result = pipe(
    prompt="A red car parked on the street",
    image=image,
    mask_image=mask,
    num_inference_steps=50
).images[0]

ControlNet

Add spatial conditioning for precise control:

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image
import cv2
import numpy as np
import torch

# Load ControlNet for edge conditioning
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Build a Canny edge map to use as the control image
edges = cv2.Canny(np.array(Image.open("input.jpg")), 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

image = pipe(
    prompt="A beautiful house in the style of Van Gogh",
    image=control_image,
    num_inference_steps=30
).images[0]

Available ControlNets

ControlNet    Input type       Use case
canny         Edge map         Structure preservation
openpose      Pose skeleton    Human poses
depth         Depth map        3D-aware generation
normal        Normal map       Surface detail
mlsd          Line segments    Architectural lines
scribble      Rough sketch     Sketch-to-image

LoRA Adapters

Load fine-tuned style adapters:

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Load LoRA weights
pipe.load_lora_weights("path/to/lora", weight_name="style.safetensors")

# Generate with the LoRA style
image = pipe("A portrait in the trained style").images[0]

# Adjust LoRA strength
pipe.fuse_lora(lora_scale=0.8)

# Unload LoRA
pipe.unload_lora_weights()

Multiple LoRAs

# Load multiple LoRAs
pipe.load_lora_weights("lora1", adapter_name="style")
pipe.load_lora_weights("lora2", adapter_name="character")

# Set a weight for each adapter
pipe.set_adapters(["style", "character"], adapter_weights=[0.7, 0.5])

image = pipe("A portrait").images[0]

Memory Optimization

Enable CPU Offload

# Model CPU offload - moves models to CPU when not in use
pipe.enable_model_cpu_offload()

# Sequential CPU offload - more aggressive, but slower
pipe.enable_sequential_cpu_offload()

Attention Slicing

# Reduce memory by computing attention in chunks
pipe.enable_attention_slicing()

# Or "max" for the most aggressive slicing
pipe.enable_attention_slicing("max")

xFormers Memory-Efficient Attention

# Requires the xformers package
pipe.enable_xformers_memory_efficient_attention()

VAE Slicing for Large Images

# Decode latents in slices/tiles for large images
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()

Model Variants

Loading Different Precisions

# FP16 (recommended for GPU)
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.float16,
    variant="fp16"
)

# BF16 (better numeric range, requires an Ampere or newer GPU)
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.bfloat16
)

Loading Specific Components

from diffusers import AutoencoderKL, DiffusionPipeline
import torch

# Load a custom VAE
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

# Use it with a pipeline
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    vae=vae,
    torch_dtype=torch.float16
)

Batch Generation

Generate multiple images efficiently:

# Multiple prompts
prompts = [
    "A cat playing piano",
    "A dog reading a book",
    "A bird painting a picture"
]

images = pipe(prompts, num_inference_steps=30).images

# Multiple images per prompt
images = pipe(
    "A beautiful sunset",
    num_images_per_prompt=4,
    num_inference_steps=30
).images

Common Workflows

Workflow 1: High-Quality Generation

from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch

# 1. Load SDXL with optimizations
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()  # handles device placement; no pipe.to("cuda") needed

# 2. Generate with quality settings
image = pipe(
    prompt="A majestic lion in the savanna, golden hour lighting, 8k, detailed fur",
    negative_prompt="blurry, low quality, cartoon, anime, sketch",
    num_inference_steps=30,
    guidance_scale=7.5,
    height=1024,
    width=1024
).images[0]

Workflow 2: Fast Prototyping

from diffusers import AutoPipelineForText2Image, LCMScheduler
import torch

# Use LCM for 4-8 step generation
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
).to("cuda")

# Load the LCM LoRA for fast generation
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.fuse_lora()

# Generate in roughly a second
image = pipe(
    "A beautiful landscape",
    num_inference_steps=4,
    guidance_scale=1.0
).images[0]

Common Issues

CUDA out of memory:

# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()

# Or use lower precision
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

Black or noisy images:

# Often caused by dtype mismatches between components;
# ensure a consistent dtype across the pipeline
pipe = pipe.to(dtype=torch.float16)

# The safety checker blacks out flagged images; disable it only
# if appropriate for your use case
pipe.safety_checker = None

Slow generation:

# Use a faster scheduler
from diffusers import DPMSolverMultistepScheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Reduce steps
image = pipe(prompt, num_inference_steps=20).images[0]
