跳到主要内容
翻译状态

该页面已从 Hermes Agent 官方文档同步,等待运行 pnpm docs:translate 生成简体中文译文。官方原文:https://github.com/NousResearch/hermes-agent/blob/main/website/docs/user-guide/skills/bundled/mlops/mlops-inference-outlines.md

Outlines

Guarantee valid JSON/XML/code structure during generation, use Pydantic models for type-safe outputs, support local models (Transformers, vLLM), and maximize inference speed with Outlines - dottxt.ai's structured generation library

Skill metadata

SourceBundled (installed by default)
Pathskills/mlops/inference/outlines
Version1.0.0
AuthorOrchestra Research
LicenseMIT
Dependenciesoutlines, transformers, vllm, pydantic
TagsPrompt Engineering, Outlines, Structured Generation, JSON Schema, Pydantic, Local Models, Grammar-Based Generation, vLLM, Transformers, Type Safety

Reference: full SKILL.md

信息

The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.

Outlines: Structured Text Generation

When to Use This Skill

Use Outlines when you need to:

  • Guarantee valid JSON/XML/code structure during generation
  • Use Pydantic models for type-safe outputs
  • Support local models (Transformers, llama.cpp, vLLM)
  • Maximize inference speed with zero-overhead structured generation
  • Generate against JSON schemas automatically
  • Control token sampling at the grammar level

GitHub Stars: 8,000+ | From: dottxt.ai (formerly .txt)

Installation

# Base installation
pip install outlines

# With specific backends
pip install outlines transformers # Hugging Face models
pip install outlines llama-cpp-python # llama.cpp
pip install outlines vllm # vLLM for high-throughput

Quick Start

Basic Example: Classification

import outlines
from typing import Literal

# Load model
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Generate with type constraint
prompt = "Sentiment of 'This product is amazing!': "
generator = outlines.generate.choice(model, ["positive", "negative", "neutral"])
sentiment = generator(prompt)

print(sentiment) # "positive" (guaranteed one of these)

With Pydantic Models

from pydantic import BaseModel
import outlines

class User(BaseModel):
name: str
age: int
email: str

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Generate structured output
prompt = "Extract user: John Doe, 30 years old, john@example.com"
generator = outlines.generate.json(model, User)
user = generator(prompt)

print(user.name) # "John Doe"
print(user.age) # 30
print(user.email) # "john@example.com"

Core Concepts

1. Constrained Token Sampling

Outlines uses Finite State Machines (FSM) to constrain token generation at the logit level.

How it works:

  1. Convert schema (JSON/Pydantic/regex) to context-free grammar (CFG)
  2. Transform CFG into Finite State Machine (FSM)
  3. Filter invalid tokens at each step during generation
  4. Fast-forward when only one valid token exists

Benefits:

  • Zero overhead: Filtering happens at token level
  • Speed improvement: Fast-forward through deterministic paths
  • Guaranteed validity: Invalid outputs impossible
import outlines

# Pydantic model -> JSON schema -> CFG -> FSM
class Person(BaseModel):
name: str
age: int

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Behind the scenes:
# 1. Person -> JSON schema
# 2. JSON schema -> CFG
# 3. CFG -> FSM
# 4. FSM filters tokens during generation

generator = outlines.generate.json(model, Person)
result = generator("Generate person: Alice, 25")

2. Structured Generators

Outlines provides specialized generators for different output types.

Choice Generator

# Multiple choice selection
generator = outlines.generate.choice(
model,
["positive", "negative", "neutral"]
)

sentiment = generator("Review: This is great!")
# Result: One of the three choices

JSON Generator

from pydantic import BaseModel

class Product(BaseModel):
name: str
price: float
in_stock: bool

# Generate valid JSON matching schema
generator = outlines.generate.json(model, Product)
product = generator("Extract: iPhone 15, $999, available")

# Guaranteed valid Product instance
print(type(product)) # <class '__main__.Product'>

Regex Generator

# Generate text matching regex
generator = outlines.generate.regex(
model,
r"[0-9]{3}-[0-9]{3}-[0-9]{4}" # Phone number pattern
)

phone = generator("Generate phone number:")
# Result: "555-123-4567" (guaranteed to match pattern)

Integer/Float Generators

# Generate specific numeric types
int_generator = outlines.generate.integer(model)
age = int_generator("Person's age:") # Guaranteed integer

float_generator = outlines.generate.float(model)
price = float_generator("Product price:") # Guaranteed float

3. Model Backends

Outlines supports multiple local and API-based backends.

Transformers (Hugging Face)

import outlines

# Load from Hugging Face
model = outlines.models.transformers(
"microsoft/Phi-3-mini-4k-instruct",
device="cuda" # Or "cpu"
)

# Use with any generator
generator = outlines.generate.json(model, YourModel)

llama.cpp

# Load GGUF model
model = outlines.models.llamacpp(
"./models/llama-3.1-8b-instruct.Q4_K_M.gguf",
n_gpu_layers=35
)

generator = outlines.generate.json(model, YourModel)

vLLM (High Throughput)

# For production deployments
model = outlines.models.vllm(
"meta-llama/Llama-3.1-8B-Instruct",
tensor_parallel_size=2 # Multi-GPU
)

generator = outlines.generate.json(model, YourModel)

OpenAI (Limited Support)

# Basic OpenAI support
model = outlines.models.openai(
"gpt-4o-mini",
api_key="your-api-key"
)

# Note: Some features limited with API models
generator = outlines.generate.json(model, YourModel)

4. Pydantic Integration

Outlines has first-class Pydantic support with automatic schema translation.

Basic Models

from pydantic import BaseModel, Field

class Article(BaseModel):
title: str = Field(description="Article title")
author: str = Field(description="Author name")
word_count: int = Field(description="Number of words", gt=0)
tags: list[str] = Field(description="List of tags")

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, Article)

article = generator("Generate article about AI")
print(article.title)
print(article.word_count) # Guaranteed > 0

Nested Models

class Address(BaseModel):
street: str
city: str
country: str

class Person(BaseModel):
name: str
age: int
address: Address # Nested model

generator = outlines.generate.json(model, Person)
person = generator("Generate person in New York")

print(person.address.city) # "New York"

Enums and Literals

from enum import Enum
from typing import Literal

class Status(str, Enum):
PENDING = "pending"
APPROVED = "approved"
REJECTED = "rejected"

class Application(BaseModel):
applicant: str
status: Status # Must be one of enum values
priority: Literal["low", "medium", "high"] # Must be one of literals

generator = outlines.generate.json(model, Application)
app = generator("Generate application")

print(app.status) # Status.PENDING (or APPROVED/REJECTED)

Common Patterns

Pattern 1: Data Extraction

from pydantic import BaseModel
import outlines

class CompanyInfo(BaseModel):
name: str
founded_year: int
industry: str
employees: int

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, CompanyInfo)

text = """
Apple Inc. was founded in 1976 in the technology industry.
The company employs approximately 164,000 people worldwide.
"""

prompt = f"Extract company information:\n{text}\n\nCompany:"
company = generator(prompt)

print(f"Name: {company.name}")
print(f"Founded: {company.founded_year}")
print(f"Industry: {company.industry}")
print(f"Employees: {company.employees}")

Pattern 2: Classification

from typing import Literal
import outlines

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Binary classification
generator = outlines.generate.choice(model, ["spam", "not_spam"])
result = generator("Email: Buy now! 50% off!")

# Multi-class classification
categories = ["technology", "business", "sports", "entertainment"]
category_gen = outlines.generate.choice(model, categories)
category = category_gen("Article: Apple announces new iPhone...")

# With confidence
class Classification(BaseModel):
label: Literal["positive", "negative", "neutral"]
confidence: float

classifier = outlines.generate.json(model, Classification)
result = classifier("Review: This product is okay, nothing special")

Pattern 3: Structured Forms

class UserProfile(BaseModel):
full_name: str
age: int
email: str
phone: str
country: str
interests: list[str]

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, UserProfile)

prompt = """
Extract user profile from:
Name: Alice Johnson
Age: 28
Email: alice@example.com
Phone: 555-0123
Country: USA
Interests: hiking, photography, cooking
"""

profile = generator(prompt)
print(profile.full_name)
print(profile.interests) # ["hiking", "photography", "cooking"]

Pattern 4: Multi-Entity Extraction

class Entity(BaseModel):
name: str
type: Literal["PERSON", "ORGANIZATION", "LOCATION"]

class DocumentEntities(BaseModel):
entities: list[Entity]

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, DocumentEntities)

text = "Tim Cook met with Satya Nadella at Microsoft headquarters in Redmond."
prompt = f"Extract entities from: {text}"

result = generator(prompt)
for entity in result.entities:
print(f"{entity.name} ({entity.type})")

Pattern 5: Code Generation

class PythonFunction(BaseModel):
function_name: str
parameters: list[str]
docstring: str
body: str

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, PythonFunction)

prompt = "Generate a Python function to calculate factorial"
func = generator(prompt)

print(f"def {func.function_name}({', '.join(func.parameters)}):")
print(f' """{func.docstring}"""')
print(f" {func.body}")

Pattern 6: Batch Processing

def batch_extract(texts: list[str], schema: type[BaseModel]):
"""Extract structured data from multiple texts."""
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, schema)

results = []
for text in texts:
result = generator(f"Extract from: {text}")
results.append(result)

return results

class Person(BaseModel):
name: str
age: int

texts = [
"John is 30 years old",
"Alice is 25 years old",
"Bob is 40 years old"
]

people = batch_extract(texts, Person)
for person in people:
print(f"{person.name}: {person.age}")

Backend Configuration

Transformers

import outlines

# Basic usage
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# GPU configuration
model = outlines.models.transformers(
"microsoft/Phi-3-mini-4k-instruct",
device="cuda",
model_kwargs={"torch_dtype": "float16"}
)

# Popular models
model = outlines.models.transformers("meta-llama/Llama-3.1-8B-Instruct")
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.3")
model = outlines.models.transformers("Qwen/Qwen2.5-7B-Instruct")

llama.cpp

# Load GGUF model
model = outlines.models.llamacpp(
"./models/llama-3.1-8b.Q4_K_M.gguf",
n_ctx=4096, # Context window
n_gpu_layers=35, # GPU layers
n_threads=8 # CPU threads
)

# Full GPU offload
model = outlines.models.llamacpp(
"./models/model.gguf",
n_gpu_layers=-1 # All layers on GPU
)

vLLM (Production)

# Single GPU
model = outlines.models.vllm("meta-llama/Llama-3.1-8B-Instruct")

# Multi-GPU
model = outlines.models.vllm(
"meta-llama/Llama-3.1-70B-Instruct",
tensor_parallel_size=4 # 4 GPUs
)

# With quantization
model = outlines.models.vllm(
"meta-llama/Llama-3.1-8B-Instruct",
quantization="awq" # Or "gptq"
)

Best Practices

1. Use Specific Types

# ✅ Good: Specific types
class Product(BaseModel):
name: str
price: float # Not str
quantity: int # Not str
in_stock: bool # Not str

# ❌ Bad: Everything as string
class Product(BaseModel):
name: str
price: str # Should be float
quantity: str # Should be int

2. Add Constraints

from pydantic import Field

# ✅ Good: With constraints
class User(BaseModel):
name: str = Field(min_length=1, max_length=100)
age: int = Field(ge=0, le=120)
email: str = Field(pattern=r"^[\w\.-]+@[\w\.-]+\.\w+$")

# ❌ Bad: No constraints
class User(BaseModel):
name: str
age: int
email: str

3. Use Enums for Categories

# ✅ Good: Enum for fixed set
class Priority(str, Enum):
LOW = "low"
MEDIUM = "medium"
HIGH = "high"

class Task(BaseModel):
title: str
priority: Priority

# ❌ Bad: Free-form string
class Task(BaseModel):
title: str
priority: str # Can be anything

4. Provide Context in Prompts

# ✅ Good: Clear context
prompt = """
Extract product information from the following text.
Text: iPhone 15 Pro costs $999 and is currently in stock.
Product:
"""

# ❌ Bad: Minimal context
prompt = "iPhone 15 Pro costs $999 and is currently in stock."

5. Handle Optional Fields

from typing import Optional

# ✅ Good: Optional fields for incomplete data
class Article(BaseModel):
title: str # Required
author: Optional[str] = None # Optional
date: Optional[str] = None # Optional
tags: list[str] = [] # Default empty list

# Can succeed even if author/date missing

Comparison to Alternatives

FeatureOutlinesInstructorGuidanceLMQL
Pydantic Support✅ Native✅ Native❌ No❌ No
JSON Schema✅ Yes✅ Yes⚠️ Limited✅ Yes
Regex Constraints✅ Yes❌ No✅ Yes✅ Yes
Local Models✅ Full⚠️ Limited✅ Full✅ Full
API Models⚠️ Limited✅ Full✅ Full✅ Full
Zero Overhead✅ Yes❌ No⚠️ Partial✅ Yes
Automatic Retrying❌ No✅ Yes❌ No❌ No
Learning CurveLowLowLowHigh

When to choose Outlines:

  • Using local models (Transformers, llama.cpp, vLLM)
  • Need maximum inference speed
  • Want Pydantic model support
  • Require zero-overhead structured generation
  • Control token sampling process

When to choose alternatives:

  • Instructor: Need API models with automatic retrying
  • Guidance: Need token healing and complex workflows
  • LMQL: Prefer declarative query syntax

Performance Characteristics

Speed:

  • Zero overhead: Structured generation as fast as unconstrained
  • Fast-forward optimization: Skips deterministic tokens
  • 1.2-2x faster than post-generation validation approaches

Memory:

  • FSM compiled once per schema (cached)
  • Minimal runtime overhead
  • Efficient with vLLM for high throughput

Accuracy:

  • 100% valid outputs (guaranteed by FSM)
  • No retry loops needed
  • Deterministic token filtering

Resources

See Also

  • references/json_generation.md - Comprehensive JSON and Pydantic patterns
  • references/backends.md - Backend-specific configuration
  • references/examples.md - Production-ready examples