Taming LLMs: How to Get Structured Output Every Time (Even for Big Responses)
Shrijith Venkatramana

Hi there! I'm Shrijith Venkatramana, founder of Hexmos. Right now, I'm building LiveAPI, a first-of-its-kind tool that automatically indexes API endpoints across all your repositories. LiveAPI helps you discover, understand, and use APIs in large tech infrastructures with ease.

Large language models (LLMs) are powerful, but getting them to produce structured output—like JSON, specific types, or regex-compliant text—can feel like herding cats. Tools like Outlines make this easier by guaranteeing structured output directly during generation, even for large, multi-part responses. This post dives into how Outlines works, why it’s a game-changer for developers, and how you can use it to avoid parsing nightmares. We’ll explore code examples, key concepts, and practical tips to make your LLM projects more reliable.

Why Structured Output Matters for Developers

LLMs often generate freeform text, which is great for creative tasks but a headache when you need structured data like JSON, integers, or specific formats. Parsing raw LLM output is error-prone—think broken JSON, inconsistent formats, or extra fluff. Outlines solves this by enforcing structure at the generation step, not after. This means:

  • No post-processing hacks to clean up messy output.
  • Guaranteed valid formats, even if the model’s response is truncated and needs to continue.
  • Works with any LLM, from OpenAI to local models like Phi-3.
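
For contrast, here's the kind of brittle post-processing you'd otherwise end up writing. This is a minimal sketch; the raw string is a made-up example of typical chatty, truncated LLM output:

import json

# Typical raw LLM output: chatty preamble plus truncated JSON
raw = 'Sure! Here is the JSON:\n{"name": "Laptop", "price": 999.99'

try:
    # Hope the first "{" starts a complete, well-formed object...
    data = json.loads(raw[raw.index("{"):])
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    data = None  # fall back to retries or regex hacks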

This approach is perfect for tasks like API response formatting, customer support ticket parsing, or extracting structured data from text. Let’s break down how it works.

How Outlines Guarantees Structured Output

Outlines uses a technique called constrained decoding to ensure LLMs produce valid structured output. Instead of letting the model generate any token, Outlines masks invalid tokens during generation, so the output always matches your specified structure. This is powered by Finite State Machines (FSMs), which define the allowed sequence of tokens based on your desired format.

Key mechanics:

  • Token masking (logit biasing): Outlines modifies the model’s probability distribution, setting invalid tokens to zero probability.
  • FSMs for structure: The FSM tracks the valid next tokens based on your defined structure (e.g., JSON schema, regex, or Python type).
  • Resuming on truncation: If an LLM’s output is cut off (e.g., due to token limits), Outlines saves the FSM state and resumes generation, ensuring the final output stays valid.

For example, if you want JSON output, Outlines ensures every token follows JSON syntax, preventing issues like missing braces or invalid keys. This is a big deal for large responses, where truncation is common in streaming or limited-token scenarios.
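
To make token masking concrete, here's a minimal sketch of the idea (not Outlines' actual internals): given the set of token ids the current FSM state allows, every other logit is pushed to negative infinity, so its probability after softmax is zero.

import torch

def mask_logits(logits: torch.Tensor, allowed_token_ids: list[int]) -> torch.Tensor:
    # Disallowed tokens get -inf logits, i.e. zero probability after softmax
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0
    return logits + mask

# At each decoding step, the current FSM state determines allowed_token_ids,
# so sampling can only ever pick a structure-preserving token.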

Example: Generating a JSON object

import outlines
from transformers import AutoTokenizer, AutoModelForCausalLM
from pydantic import BaseModel

# Define a structured output using Pydantic
class Product(BaseModel):
    name: str
    price: float
    in_stock: bool

# Load a model
MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
model = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto"),
    AutoTokenizer.from_pretrained(MODEL_NAME)
)

# Generate structured output (recent Outlines versions return a JSON string)
prompt = "Extract product details: 'Laptop, $999.99, available now'"
result = model(prompt, Product)

# Parse the guaranteed-valid JSON into the Pydantic model
product = Product.model_validate_json(result)
print(product)  # Product(name='Laptop', price=999.99, in_stock=True)

Why this works: Outlines ensures the output matches the Product schema, even if the model tries to go off-script. And because token masking applies at every step, a response cut off mid-generation is still a valid prefix of the structure, so generation can pick up where it left off.

Learn more in the Outlines README.

Handling Truncated Outputs Without Breaking Structure

One of the trickiest problems with LLMs is handling truncation—when the model hits a token limit or streaming cuts off the response. Without careful handling, you might end up with half a JSON object or an invalid regex match. Outlines solves this by tracking the FSM state and resuming generation seamlessly.

How it works:

  • Outlines builds an FSM for your output structure (e.g., JSON schema or regex).
  • If generation stops early, the FSM state is saved.
  • When you call “continue,” Outlines resumes from the last valid state, masking invalid tokens to keep the output consistent.

Example: Resuming a truncated JSON response

import outlines
from transformers import AutoTokenizer, AutoModelForCausalLM
from pydantic import BaseModel

# Define a complex schema
class Order(BaseModel):
    order_id: int
    items: list[str]
    total: float

# Load model
MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
model = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto"),
    AutoTokenizer.from_pretrained(MODEL_NAME)
)

# Simulate a truncated response (max_new_tokens is the transformers kwarg)
prompt = "Customer order: ID 123, items: shirt, shoes, total: $45.50"
partial_result = model(prompt, Order, max_new_tokens=20)  # stops early
print(partial_result)  # A valid prefix of the JSON, e.g. '{"order_id": 123, "items": ["shirt"'

# Resume generation. Note: continue_generation is illustrative pseudocode for
# the FSM-state resumption described above; check the docs for the exact
# resumption API in your Outlines version.
full_result = model.continue_generation(partial_result)
print(full_result)  # '{"order_id": 123, "items": ["shirt", "shoes"], "total": 45.5}'

What’s happening: The initial generation stops early, but Outlines’ FSM guarantees the partial output is a valid prefix of the target structure. When generation resumes from the saved FSM state, the output completes the structure correctly.

This is critical for large responses or streaming applications, where output might come in chunks but still needs to adhere to a strict format.
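
Here's a hedged sketch of the streaming case. It assumes a chunk-yielding model.stream method (streaming support and the exact method name vary by Outlines version and backend; treat it as an assumption) and parses the assembled text once the stream ends:

from pydantic import BaseModel

class Order(BaseModel):
    order_id: int
    items: list[str]
    total: float

# Assumption: model.stream yields text chunks as they are generated, where
# model is the Transformers-backed Outlines model from the example above
chunks = []
for chunk in model.stream("Customer order: ID 123, items: shirt, shoes, total: $45.50", Order):
    chunks.append(chunk)    # every prefix "".join(chunks) is structurally valid
    print("".join(chunks))  # safe to render incrementally in a UI

order = Order.model_validate_json("".join(chunks))  # parses once complete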

Practical Use Cases and Code Examples

Outlines shines in real-world scenarios where structured output is non-negotiable. Here are some common use cases and how to implement them.

Use Case 1: Customer Support Ticket Parsing

You need to extract structured data from a customer complaint. Outlines can enforce a schema like this:

import outlines
from transformers import AutoTokenizer, AutoModelForCausalLM
from pydantic import BaseModel
from typing import Literal

# Define ticket schema
class SupportTicket(BaseModel):
    priority: Literal["Low", "Medium", "High"]
    category: str
    description: str

# Load model
MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
model = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto"),
    AutoTokenizer.from_pretrained(MODEL_NAME)
)

# Parse the ticket (the raw result is a JSON string matching the schema)
prompt = "Urgent: App crashes when I click 'Submit'. It's a payment issue."
ticket = SupportTicket.model_validate_json(model(prompt, SupportTicket))

print(ticket)  # e.g. SupportTicket(priority='High', category='Payment', description='App crashes on submit')

Use Case 2: Regex-Based Extraction

Want to extract a phone number in a specific format? Use a regex pattern:

import outlines
from outlines.types import Regex
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model
MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
model = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto"),
    AutoTokenizer.from_pretrained(MODEL_NAME)
)

# Define a regex for US phone numbers; wrapping the pattern in Regex tells
# Outlines to treat it as a constraint rather than as literal text
phone_regex = Regex(r"\(\d{3}\)\s\d{3}-\d{4}")

# Extract phone number
prompt = "Contact: (123) 456-7890"
phone = model(prompt, phone_regex)

print(phone)  # Output: (123) 456-7890

Other use cases:

  • E-commerce: Categorize products into structured formats (e.g., name, price, category).
  • Function calling: Extract structured arguments for API calls (see the sketch below).
  • Document classification: Enforce outputs like Positive, Negative, or Neutral.
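
To make the function-calling case concrete, here's a hedged sketch. get_weather and its argument schema are invented for illustration, and model is the Transformers-backed model loaded in the earlier examples:

from typing import Literal
from pydantic import BaseModel

# Hypothetical function we want the LLM to supply arguments for
def get_weather(city: str, unit: str) -> str:
    return f"(weather for {city} in {unit})"

# Schema for the function's arguments
class GetWeatherArgs(BaseModel):
    city: str
    unit: Literal["celsius", "fahrenheit"]

prompt = "Extract the function arguments: 'What's the weather in Paris, in celsius?'"
args = GetWeatherArgs.model_validate_json(model(prompt, GetWeatherArgs))

# The arguments are guaranteed to match the schema, so the call is safe
print(get_weather(args.city, args.unit))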

Check out Outlines’ real-world examples for more.

Outlines’ Language Support and Ecosystem

Outlines is Python-only for now, which makes sense given its tight integration with Python’s type system and libraries like Pydantic. It works seamlessly with popular LLM frameworks:

  • HuggingFace Transformers for local models.
  • OpenAI, Ollama, vLLM, and Gemini via simple adapters.
  • llama.cpp and SGLang for optimized inference.

Key limitation: No official support for other languages (e.g., JavaScript, Java). If you’re not a Python developer, you’d need to wrap Outlines in a Python-based API to use it elsewhere.
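
If you're in that situation, a thin HTTP wrapper is usually enough. Here's a minimal sketch using FastAPI; the endpoint shape is my assumption, not an official pattern, and model is an Outlines model loaded at startup as in the earlier examples:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Product(BaseModel):
    name: str
    price: float
    in_stock: bool

class ExtractRequest(BaseModel):
    text: str

@app.post("/extract-product")
def extract_product(req: ExtractRequest) -> Product:
    # model is the Outlines model from the earlier examples, loaded once at startup
    result = model(f"Extract product details: '{req.text}'", Product)
    return Product.model_validate_json(result)

# Run with uvicorn (uvicorn app:app), then call the endpoint from JavaScript,
# Java, or anything else that speaks HTTP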

Example: Using Outlines with OpenAI

import openai
import outlines
from typing import Literal

# Initialize an OpenAI-backed model (recent Outlines versions wrap the
# official client via from_openai)
model = outlines.from_openai(
    openai.OpenAI(api_key="your-api-key"),
    "gpt-3.5-turbo"
)

# Constrain the output to one of three labels
sentiment = model("This movie was amazing!", Literal["Positive", "Negative", "Neutral"])

print(sentiment)  # Output: Positive

This flexibility makes Outlines a go-to for Python developers working with any LLM backend.

Key Features at a Glance

Here’s a quick summary of Outlines’ capabilities:

  • Core mechanism: Constrained decoding with token masking and FSMs
  • Output types: Python types, JSON Schema, regex, context-free grammars
  • Truncation handling: Resumes FSM state for valid continuation
  • Use cases: Ticket parsing, product categorization, regex extraction, function calling
  • Language: Python only; integrates with Transformers, OpenAI, Ollama, etc.
  • Installation: pip install outlines

What’s Next for Structured LLM Output?

Outlines is a powerful tool for developers who need reliable, structured output from LLMs. Its use of constrained decoding and FSMs eliminates the need for fragile post-processing, making it ideal for production-grade applications. Whether you’re parsing customer tickets, extracting data with regex, or generating complex JSON, Outlines ensures your output is valid—every time, even for large or truncated responses.

To get started, install Outlines (pip install outlines) and experiment with the examples above. If you’re working on a project that needs structured LLM output, give Outlines a try and check out the official docs for more details. Want to dive deeper into FSM logic or specific integrations? Let me know in the comments!
