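AudioCraft Audio Generation

A PyTorch library for audio generation, covering text-to-music (MusicGen) and text-to-sound (AudioGen). Use it when you need to generate music from text descriptions, create sound effects, or run melody-conditioned music generation.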

Skill metadata

Source: Bundled (installed by default)
Path: skills/mlops/models/audiocraft
Version: 1.0.0
Author: Orchestra Research
License: MIT
Dependencies: audiocraft, torch>=2.0.0, transformers>=4.30.0
Tags: Multimodal, Audio Generation, Text-to-Music, Text-to-Audio, MusicGen

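Reference: full SKILL.md

Info

Below is the complete skill definition that Hermes loads when this skill is triggered. These are the instructions the agent sees while the skill is active.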

AudioCraft: Audio Generation

A comprehensive guide to text-to-music and text-to-audio generation with Meta's AudioCraft, using MusicGen, AudioGen, and EnCodec.

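When to Use AudioCraft

Use AudioCraft when you:

  • Need to generate music from text descriptions
  • Are creating sound effects and ambient audio
  • Are building music generation applications
  • Need melody-conditioned music generation
  • Want stereo audio output
  • Need controllable music generation with style transfer

Key features:

  • MusicGen: text-to-music generation with melody conditioning
  • AudioGen: text-to-sound-effect generation
  • EnCodec: high-fidelity neural audio codec
  • Multiple model sizes: from small (300M) to large (3.3B)
  • Stereo support: full stereo audio generation
  • Style conditioning: MusicGen-Style for reference-based generation

Use alternatives instead:

  • Stable Audio: for longer commercial music generation
  • Bark: for text-to-speech with music and sound effects
  • Riffusion: for spectrogram-based music generation
  • OpenAI Jukebox: for raw audio generation with lyrics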

Quick Start

Installation

# From PyPI
pip install audiocraft

# From GitHub (latest)
pip install git+https://github.com/facebookresearch/audiocraft.git

# Or use HuggingFace Transformers
pip install transformers torch torchaudio

Basic Text-to-Music (AudioCraft)

import torchaudio
from audiocraft.models import MusicGen

# Load model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Set generation parameters
model.set_generation_params(
    duration=8,  # seconds
    top_k=250,
    temperature=1.0
)

# Generate from text
descriptions = ["happy upbeat electronic dance music with synths"]
wav = model.generate(descriptions)

# Save audio
torchaudio.save("output.wav", wav[0].cpu(), sample_rate=32000)

Using HuggingFace Transformers

from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy.io.wavfile  # a bare "import scipy" does not reliably expose scipy.io.wavfile

# Load model and processor
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
model.to("cuda")

# Generate music
inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt"
).to("cuda")

audio_values = model.generate(
    **inputs,
    do_sample=True,
    guidance_scale=3,
    max_new_tokens=256
)

# Save
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("output.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())
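
The `max_new_tokens=256` setting in the Transformers example controls clip length in decoder steps rather than seconds. As a rough sketch, assuming MusicGen's 32 kHz audio encoder emits 50 token frames per second (Transformers exposes this as `model.config.audio_encoder.frame_rate`), two hypothetical helpers can convert between the two units:

```python
import math

# Assumption: the 32 kHz audio encoder produces 50 token frames per second.
FRAME_RATE_HZ = 50


def seconds_to_max_new_tokens(seconds: float, frame_rate: int = FRAME_RATE_HZ) -> int:
    """Return the number of decoder steps needed for roughly `seconds` of audio."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return math.ceil(seconds * frame_rate)


def max_new_tokens_to_seconds(tokens: int, frame_rate: int = FRAME_RATE_HZ) -> float:
    """Return the approximate clip length produced by `tokens` decoder steps."""
    return tokens / frame_rate


print(seconds_to_max_new_tokens(8))    # 400
print(max_new_tokens_to_seconds(256))  # 5.12
```

Under that assumption, the 256 tokens used in the example correspond to about five seconds of audio.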

Text-to-Sound Generation with AudioGen

import torchaudio
from audiocraft.models import AudioGen

# Load AudioGen
model = AudioGen.get_pretrained('facebook/audiogen-medium')

model.set_generation_params(duration=5)

# Generate sound effects
descriptions = ["dog barking in a park with birds chirping"]
wav = model.generate(descriptions)

torchaudio.save("sound.wav", wav[0].cpu(), sample_rate=16000)

Core Concepts

Architecture Overview

AudioCraft Architecture:
┌──────────────────────────────────────────────────────────────┐
│                    Text Encoder (T5)                         │
│                        │                                     │
│                    Text Embeddings                           │
└────────────────────────┬─────────────────────────────────────┘

┌────────────────────────▼─────────────────────────────────────┐
│                    Transformer Decoder (LM)                  │
│           Auto-regressively generates audio tokens           │
│          Using efficient token interleaving patterns         │
└────────────────────────┬─────────────────────────────────────┘

┌────────────────────────▼─────────────────────────────────────┐
│                    EnCodec Audio Decoder                     │
│            Converts tokens back to audio waveform            │
└──────────────────────────────────────────────────────────────┘
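
The "token interleaving patterns" in the diagram refer to how several EnCodec codebook streams are laid out for a single autoregressive decoder. As an illustrative sketch only (plain Python lists, not AudioCraft's internal code), the delay arrangement described for MusicGen shifts codebook k right by k steps, so each decoder step emits one token per codebook. `PAD`, `apply_delay`, and `undo_delay` are hypothetical names:

```python
PAD = -1  # hypothetical filler for positions that hold no token


def apply_delay(codes):
    """Shift codebook k right by k steps. `codes` is a list of K rows of length T."""
    num_codebooks, num_steps = len(codes), len(codes[0])
    total = num_steps + num_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        delayed.append([PAD] * k + list(row) + [PAD] * (total - num_steps - k))
    return delayed


def undo_delay(delayed, num_steps):
    """Recover the aligned K x T grid from the delayed layout."""
    return [row[k:k + num_steps] for k, row in enumerate(delayed)]


codes = [
    [10, 11, 12],  # codebook 0
    [20, 21, 22],  # codebook 1
    [30, 31, 32],  # codebook 2
]
delayed = apply_delay(codes)
for row in delayed:
    print(row)
# [10, 11, 12, -1, -1]
# [-1, 20, 21, 22, -1]
# [-1, -1, 30, 31, 32]
```

The cost of this layout is K - 1 extra decoder steps per clip, which is small compared with flattening all codebooks into one long sequence.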

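Model Variants

Model | Size | Description | Use case
musicgen-small | 300M | Text-to-music | Fast generation
musicgen-medium | 1.5B | Text-to-music | Balanced
musicgen-large | 3.3B | Text-to-music | Best quality
musicgen-melody | 1.5B | Text + melody | Melody conditioning
musicgen-melody-large | 3.3B | Text + melody | Best melody conditioning
musicgen-stereo-* | Varies | Stereo output | Stereo generation
musicgen-style | 1.5B | Style transfer | Reference-based generation
audiogen-medium | 1.5B | Text-to-sound | Sound effects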
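
The save calls in this guide use 32000 Hz for MusicGen output and 16000 Hz for AudioGen output. A small lookup keeps that choice in one place. `sample_rate_for` is a hypothetical helper, and the rates are the ones used by the snippets in this guide; a loaded AudioCraft model also reports its rate as `model.sample_rate`, which is the safer source when a model object is available.

```python
# Sample rates used by the save calls in this guide.
SAMPLE_RATES = {"musicgen": 32000, "audiogen": 16000}


def sample_rate_for(model_name: str) -> int:
    """Map a checkpoint name such as 'facebook/musicgen-small' to its sample rate."""
    short_name = model_name.split("/")[-1].lower()
    for family, rate in SAMPLE_RATES.items():
        if short_name.startswith(family):
            return rate
    raise ValueError(f"unknown model family: {model_name!r}")


print(sample_rate_for("facebook/musicgen-stereo-medium"))  # 32000
print(sample_rate_for("facebook/audiogen-medium"))         # 16000
```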

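Generation Parameters

Parameter | Default | Description
duration | 8.0 | Length in seconds (1-120)
top_k | 250 | Top-k sampling
top_p | 0.0 | Nucleus sampling (0 = disabled)
temperature | 1.0 | Sampling temperature
cfg_coef | 3.0 | Classifier-free guidance coefficient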
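
To make `cfg_coef`, `temperature`, and `top_k` concrete, here is a minimal pure-Python sketch of one sampling step. It assumes the usual classifier-free guidance mix (unconditional logits plus `cfg_coef` times the conditional difference) followed by temperature scaling and top-k filtering. The function names are hypothetical, and the library itself works on tensors rather than lists.

```python
import math
import random


def apply_cfg(cond_logits, uncond_logits, cfg_coef=3.0):
    """Push logits away from the unconditional prediction toward the text-conditioned one."""
    return [u + cfg_coef * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def top_k_probs(logits, top_k=250, temperature=1.0):
    """Softmax over the top_k largest logits after temperature scaling; the rest get 0."""
    scaled = [x / temperature for x in logits]
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    peak = max(scaled[i] for i in keep)  # subtract the max for numerical stability
    weights = {i: math.exp(scaled[i] - peak) for i in keep}
    total = sum(weights.values())
    return [weights.get(i, 0.0) / total for i in range(len(scaled))]


def sample_token(logits, top_k=250, temperature=1.0, rng=random):
    """Draw one token index from the filtered distribution."""
    probs = top_k_probs(logits, top_k, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


cond = [2.0, 1.0, 0.5, -1.0]    # logits given the text prompt
uncond = [1.0, 1.0, 1.0, 1.0]   # logits with the prompt dropped
guided = apply_cfg(cond, uncond, cfg_coef=3.0)
print(guided)  # [4.0, 1.0, -0.5, -5.0]
print(sample_token(guided, top_k=2))  # index 0 or 1
```

A higher `cfg_coef` widens the gap between tokens the prompt favors and those it does not, which is why it reads as "text adherence". A higher `temperature` flattens the distribution before sampling.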

MusicGen Usage

Text-to-Music Generation

from audiocraft.models import MusicGen
import torchaudio

model = MusicGen.get_pretrained('facebook/musicgen-medium')

# Configure generation
model.set_generation_params(
    duration=30,      # Up to 30 seconds
    top_k=250,        # Sampling diversity
    top_p=0.0,        # 0 = use top_k only
    temperature=1.0,  # Creativity (higher = more varied)
    cfg_coef=3.0      # Text adherence (higher = stricter)
)

# Generate multiple samples
descriptions = [
    "epic orchestral soundtrack with strings and brass",
    "chill lo-fi hip hop beat with jazzy piano",
    "energetic rock song with electric guitar"
]

# Generate (returns [batch, channels, samples])
wav = model.generate(descriptions)

# Save each
for i, audio in enumerate(wav):
    torchaudio.save(f"music_{i}.wav", audio.cpu(), sample_rate=32000)

Melody-Conditioned Generation

from audiocraft.models import MusicGen
import torchaudio

# Load melody model
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)

# Load melody audio
melody, sr = torchaudio.load("melody.wav")

# Generate with melody conditioning
descriptions = ["acoustic guitar folk song"]
wav = model.generate_with_chroma(descriptions, melody, sr)

torchaudio.save("melody_conditioned.wav", wav[0].cpu(), sample_rate=32000)

Stereo Generation

import torchaudio
from audiocraft.models import MusicGen

# Load stereo model
model = MusicGen.get_pretrained('facebook/musicgen-stereo-medium')
model.set_generation_params(duration=15)

descriptions = ["ambient electronic music with wide stereo panning"]
wav = model.generate(descriptions)

# wav shape: [batch, 2, samples] for stereo
print(f"Stereo shape: {wav.shape}")  # [1, 2, 480000]
torchaudio.save("stereo.wav", wav[0].cpu(), sample_rate=32000)
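
The printed shape follows from the settings: the sample count is the duration multiplied by the sample rate. A small sketch of that arithmetic, where `expected_shape` is a hypothetical helper and the 32000 Hz default matches the MusicGen save calls in this guide:

```python
def expected_shape(batch_size: int, channels: int, duration_s: float,
                   sample_rate: int = 32000):
    """Return the (batch, channels, samples) shape for clips of `duration_s` seconds."""
    return (batch_size, channels, int(round(duration_s * sample_rate)))


print(expected_shape(1, 2, 15))  # (1, 2, 480000)
print(expected_shape(3, 1, 30))  # (3, 1, 960000)
```

Comparing this against `wav.shape` is a quick check that a stereo checkpoint was actually loaded, since mono models return a channel dimension of 1.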

Audio Continuation

import torchaudio
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

# Load audio to continue (should already be at the model's 32 kHz rate)
audio, sr = torchaudio.load("intro.wav")

# Downmix to mono so the processor receives a 1-D waveform
audio = audio.mean(dim=0)

# Process with text and audio
inputs = processor(
    audio=audio.numpy(),
    sampling_rate=sr,
    text=["continue with an epic chorus"],
    padding=True,
    return_tensors="pt"
)

# Generate continuation
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=512)

MusicGen-Style Usage

Style-Conditioned Generation

import torchaudio
from audiocraft.models import MusicGen

# Load style model
model = MusicGen.get_pretrained('facebook/musicgen-style')

# Configure generation with style
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=5.0  # Style influence
)

# Configure style conditioner
model.set_style_conditioner_params(
    eval_q=3,           # RVQ quantizers (1-6)
    excerpt_length=3.0  # Style excerpt length in seconds
)

# Load style reference
style_audio, sr = torchaudio.load("reference_style.wav")

# Generate with text + style. The style excerpt is passed through
# generate_with_chroma, the same entry point used for melody conditioning.
descriptions = ["upbeat dance track"]
wav = model.generate_with_chroma(descriptions, style_audio, sr)

Style-Only Generation (No Text)

# Generate matching style without text prompt
# (reuses model, style_audio, and sr from the previous example)
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=None  # Disable double CFG for style-only
)

wav = model.generate_with_chroma([None], style_audio, sr)

AudioGen Usage

Sound Effect Generation

from audiocraft.models import AudioGen
import torchaudio

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=10)

# Generate various sounds
descriptions = [
    "thunderstorm with heavy rain and lightning",
    "busy city traffic with car horns",
    "ocean waves crashing on rocks",
    "crackling campfire in forest"
]

wav = model.generate(descriptions)

for i, audio in enumerate(wav):
    torchaudio.save(f"sound_{i}.wav", audio.cpu(), sample_rate=16000)

EnCodec Usage

Audio Compression

from audiocraft.models import CompressionModel
import torch
import torchaudio

# Load EnCodec
model = CompressionModel.get_pretrained('facebook/encodec_32khz')

# Load audio
wav, sr = torchaudio.load("audio.wav")

# Downmix to mono, since this codec takes single-channel input
wav = wav.mean(dim=0, keepdim=True)

# Ensure correct sample rate
if sr != 32000:
    resampler = torchaudio.transforms.Resample(sr, 32000)
    wav = resampler(wav)

# Encode to tokens (input shape: [batch, channels, samples])
with torch.no_grad():
    encoded = model.encode(wav.unsqueeze(0))
    codes = encoded[0]  # Audio codes

# Decode back to audio
with torch.no_grad():
    decoded = model.decode(codes)

torchaudio.save("reconstructed.wav", decoded[0].cpu(), sample_rate=32000)
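
To see why this counts as compression, compare the bitrate of the token stream with raw PCM. The sketch assumes the 32 kHz EnCodec uses 4 codebooks of 2048 entries at 50 frames per second, the configuration described for MusicGen. Treat those numbers as assumptions and confirm them on the loaded model before relying on them.

```python
import math


def token_bitrate_bps(frame_rate=50, num_codebooks=4, codebook_size=2048):
    """Bits per second needed to store the discrete codes."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate * num_codebooks * bits_per_token


def pcm_bitrate_bps(sample_rate=32000, bits_per_sample=16, channels=1):
    """Bits per second of uncompressed PCM audio."""
    return sample_rate * bits_per_sample * channels


tokens = token_bitrate_bps()  # 2200.0
pcm = pcm_bitrate_bps()       # 512000
print(f"{tokens / 1000:.1f} kbps vs {pcm / 1000:.0f} kbps")  # 2.2 kbps vs 512 kbps
```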

Common Workflows

Workflow 1: Music Generation Pipeline

import torch
import torchaudio
from audiocraft.models import MusicGen

class MusicGenerator:
    def __init__(self, model_name="facebook/musicgen-medium"):
        self.model = MusicGen.get_pretrained(model_name)
        self.sample_rate = 32000

    def generate(self, prompt, duration=30, temperature=1.0, cfg=3.0):
        self.model.set_generation_params(
            duration=duration,
            top_k=250,
            temperature=temperature,
            cfg_coef=cfg
        )

        with torch.no_grad():
            wav = self.model.generate([prompt])

        return wav[0].cpu()

    def generate_batch(self, prompts, duration=30):
        self.model.set_generation_params(duration=duration)

        with torch.no_grad():
            wav = self.model.generate(prompts)

        return wav.cpu()

    def save(self, audio, path):
        torchaudio.save(path, audio, sample_rate=self.sample_rate)

# Usage
generator = MusicGenerator()
audio = generator.generate(
    "epic cinematic orchestral music",
    duration=30,
    temperature=1.0
)
generator.save(audio, "epic_music.wav")

Workflow 2: Sound Design Batch Processing

from pathlib import Path
from audiocraft.models import AudioGen
import torchaudio

def batch_generate_sounds(sound_specs, output_dir):
    """
    Generate multiple sounds from specifications.

    Args:
        sound_specs: list of {"name": str, "description": str, "duration": float}
        output_dir: output directory path
    """
    model = AudioGen.get_pretrained('facebook/audiogen-medium')
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    results = []

    for spec in sound_specs:
        # Fall back to 5 seconds when a spec omits "duration"
        model.set_generation_params(duration=spec.get("duration", 5))

        wav = model.generate([spec["description"]])

        output_path = output_dir / f"{spec['name']}.wav"
        torchaudio.save(str(output_path), wav[0].cpu(), sample_rate=16000)

        results.append({
            "name": spec["name"],
            "path": str(output_path),
            "description": spec["description"]
        })

    return results

# Usage
sounds = [
    {"name": "explosion", "description": "massive explosion with debris", "duration": 3},
    {"name": "footsteps", "description": "footsteps on wooden floor", "duration": 5},
    {"name": "door", "description": "wooden door creaking and closing", "duration": 2}
]

results = batch_generate_sounds(sounds, "sound_effects/")
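
The `results` list returned by `batch_generate_sounds` is plain data, so it can be written next to the audio files as a manifest. A minimal sketch using only the standard library. `write_manifest` and the `manifest.json` file name are hypothetical, and the example data stands in for a real return value:

```python
import json
import tempfile
from pathlib import Path


def write_manifest(results, output_dir, filename="manifest.json"):
    """Write the list of result dicts to output_dir/filename and return the path."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    manifest_path = output_dir / filename
    manifest_path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return manifest_path


# Stand-in for the return value of batch_generate_sounds
example_results = [
    {
        "name": "door",
        "path": "sound_effects/door.wav",
        "description": "wooden door creaking and closing",
    },
]

with tempfile.TemporaryDirectory() as tmp:
    path = write_manifest(example_results, tmp)
    loaded = json.loads(path.read_text(encoding="utf-8"))
    print(loaded[0]["name"])  # door
```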

Workflow 3: Gradio Demo

import gradio as gr
import torch
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')

def generate_music(prompt, duration, temperature, cfg_coef):
    model.set_generation_params(
        duration=duration,
        temperature=temperature,
        cfg_coef=cfg_coef
    )

    with torch.no_grad():
        wav = model.generate([prompt])

    # Save to temp file
    path = "temp_output.wav"
    torchaudio.save(path, wav[0].cpu(), sample_rate=32000)
    return path

demo = gr.Interface(
    fn=generate_music,
    inputs=[
        gr.Textbox(label="Music Description", placeholder="upbeat electronic dance music"),
        gr.Slider(1, 30, value=8, label="Duration (seconds)"),
        gr.Slider(0.5, 2.0, value=1.0, label="Temperature"),
        gr.Slider(1.0, 10.0, value=3.0, label="CFG Coefficient")
    ],
    outputs=gr.Audio(label="Generated Music"),
    title="MusicGen Demo"
)

demo.launch()
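
The demo writes every result to the same `temp_output.wav`, so two simultaneous requests could overwrite each other's audio. One possible adjustment, sketched with the standard library only, is to reserve a unique path per request. `unique_wav_path` is a hypothetical helper that `generate_music` could call in place of the fixed file name:

```python
import os
import tempfile


def unique_wav_path(prefix="musicgen_", directory=None):
    """Create an empty, uniquely named .wav file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".wav", prefix=prefix, dir=directory)
    os.close(fd)  # the audio writer reopens the file by path
    return path


first = unique_wav_path()
second = unique_wav_path()
print(first != second)  # True

# Clean up the placeholder files created for this demonstration
os.remove(first)
os.remove(second)
```

Files created this way are not deleted automatically, so a long-running demo would still need some cleanup policy.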

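Performance Optimization

Memory Optimization

import torch
from audiocraft.models import MusicGen

# Use smaller model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Clear cache between generations
torch.cuda.empty_cache()

# Generate shorter durations
model.set_generation_params(duration=10)  # Instead of 30

# Use half precision. The AudioCraft MusicGen wrapper is not an nn.Module,
# so it has no .half(); with the Transformers API, load the weights in FP16:
from transformers import MusicgenForConditionalGeneration
hf_model = MusicgenForConditionalGeneration.from_pretrained(
    "facebook/musicgen-small", torch_dtype=torch.float16
).to("cuda")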

Batching Efficiency

# Process multiple prompts at once (more efficient)
descriptions = ["prompt1", "prompt2", "prompt3", "prompt4"]
wav = model.generate(descriptions)  # Single batch

# Instead of
for desc in descriptions:
    wav = model.generate([desc])  # Multiple batches (slower)
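
Batching every prompt in one call is fastest only while the batch fits in memory. A common compromise is to split the prompts into fixed-size chunks. The sketch below is plain Python, `chunked` is a hypothetical helper, and `max_batch_size=4` is an arbitrary illustrative value to tune for your GPU:

```python
def chunked(items, max_batch_size=4):
    """Yield consecutive slices of `items` with at most `max_batch_size` elements."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(items), max_batch_size):
        yield items[start:start + max_batch_size]


prompts = [f"prompt{i}" for i in range(1, 11)]
for batch in chunked(prompts, max_batch_size=4):
    # wav = model.generate(batch)  # one generation call per chunk
    print(len(batch), batch[0])
# 4 prompt1
# 4 prompt5
# 2 prompt9
```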

GPU Memory Requirements

Model | FP32 VRAM | FP16 VRAM
musicgen-small | ~4GB | ~2GB
musicgen-medium | ~8GB | ~4GB
musicgen-large | ~16GB | ~8GB
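
The memory table can be turned into a simple selection rule. This sketch hard-codes the approximate figures from the table and is only as accurate as they are. `pick_model` is a hypothetical helper:

```python
# Approximate VRAM needs in GB, copied from the table: (FP32, FP16)
VRAM_GB = {
    "musicgen-small": (4, 2),
    "musicgen-medium": (8, 4),
    "musicgen-large": (16, 8),
}


def pick_model(available_gb, fp16=False):
    """Return the largest listed model whose requirement fits, or None."""
    column = 1 if fp16 else 0
    fitting = [
        (need[column], name)
        for name, need in VRAM_GB.items()
        if need[column] <= available_gb
    ]
    return max(fitting)[1] if fitting else None


print(pick_model(12))             # musicgen-medium
print(pick_model(12, fp16=True))  # musicgen-large
print(pick_model(1))              # None
```

Longer durations and larger batches raise memory use beyond these figures, so leaving some headroom is sensible.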

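Common Issues

Issue | Solution
CUDA OOM | Use a smaller model or reduce the duration
Poor quality | Increase cfg_coef and write more specific prompts
Generated audio is too short | Check the maximum duration setting
Audio artifacts | Try a different temperature
Stereo not working | Use a stereo model variant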
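
The out-of-memory advice (reduce the duration) can be automated as a retry loop. The sketch is generic: `generate_fn` is any callable you supply, `generate_with_fallback` and `fake_generate` are hypothetical names, and `MemoryError` stands in for the real out-of-memory exception (with PyTorch you would likely pass `torch.cuda.OutOfMemoryError` as `oom_error`).

```python
def generate_with_fallback(generate_fn, duration, min_duration=5, oom_error=MemoryError):
    """Call generate_fn(duration), halving the duration after each out-of-memory error."""
    while True:
        try:
            return generate_fn(duration), duration
        except oom_error:
            if duration <= min_duration:
                raise  # nothing shorter left to try
            duration = max(min_duration, duration // 2)


def fake_generate(duration):
    """Stand-in generator that 'runs out of memory' above 10 seconds."""
    if duration > 10:
        raise MemoryError("simulated out-of-memory")
    return f"{duration}s clip"


print(generate_with_fallback(fake_generate, 30))  # ('7s clip', 7)
```

Returning the duration that finally succeeded lets the caller report the shortened clip length instead of silently delivering less audio than requested.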

References

Resources