It’s estimated that 80–90% of the world’s data is unstructured, with text files and documents making up a big chunk of it. Invoices are a perfect example of this chaos.
Each vendor uses a different layout, with formats and terminology that vary wildly across industries. Totals might appear in headers, footers, or hidden deep in tables. Then there are smudged scans, odd fonts, and delivery charges mixed with line items. It didn’t take long for us to see why traditional systems built on regexes and static templates struggled to keep up.
About a year ago, we hit a wall with invoice processing.
The automation pipelines we had built for FMCG, healthcare, and logistics worked beautifully, but invoices were a whole different beast.
Standard OCR tools did well with OCR invoice scanning, turning pixels into text. But when it came to mapping that text into structured, reliable outputs for finance systems, they fell short. We kept running into partial extractions, missing fields, and malformed JSON outputs.
Producing clean, schema-perfect data for downstream automation needed more than just text recognition. It required adaptability and contextual reasoning across unpredictable document formats.
That’s when we started thinking: what if we let AI handle the “understanding” part?
Over the past year, we rebuilt our client’s invoice processing system using Large Language Models (LLMs). We started with Claude, experimented rapidly, and finally moved to Google’s Gemini 2.5 Flash. The result is a production-grade workflow that ingests multi-page invoices, adapts to vendor-specific formats, and delivers schema-perfect JSON for invoice data extraction. Accuracy jumped to 97%, and compute costs dropped by 70%.
This post is for:
- Tech teams struggling to scale invoice automation without blowing budgets.
- Anyone curious how LLMs can complement OCR in real-world workflows.
Here’s what we’ll share:
- The problem we ran into.
- Why OCR alone isn’t enough for invoices.
- How we combined LLMs with classic OCR.
- What worked, what didn’t, and how we achieved production-grade results.
For us, this project reinforced something we’ve seen across multiple builds: clean, modular engineering paired with the right AI models can turn complex ideas into production-ready systems faster than you might expect.
This journey into automated invoice scanning is another example of that mindset at work.
The Challenge: Tackling Real-World Invoice Variability
For one of our SaaS startup clients, we were solving a critical problem: enabling AI to manage inventory at scale. The platform helps gas stations and convenience stores automate their inventory workflows, eliminating hours of manual invoice data entry and providing real-time visibility into product availability across locations.
The core challenge was handling the sheer variability in invoices. They came in all forms: computer-generated PDFs, handwritten paper notes, and semi-structured formats. On top of that, the quality of these invoices was far from reliable. Many were poorly scanned, crumpled, or so faded that even humans struggled to read them.
This variability made it clear that a standard OCR system on its own would never deliver the accuracy and resilience we needed for production-grade performance.
What Is Automated Invoice Scanning (and Why OCR Alone Doesn’t Cut It)
If you’ve ever built or used an invoice processing tool, you know the drill: scan a PDF, run OCR, parse the text, and pray the fields map correctly.
On paper, it sounds simple. In reality, invoices are a chaotic mix of:
- Fixed fields (like invoice numbers and dates).
- Semi-structured tables (line items, quantities, taxes).
- Free-form notes and disclaimers scattered across pages.
OCR (Optical Character Recognition) is great at the first step, turning pixels into text. But once you ask it to understand the text or extract relationships (like matching vendors to line items), it falls short.
| OCR Strength | OCR Limitation |
|---|---|
| Converts pixels to text | Lacks semantic understanding |
| Works well with fixed templates | Breaks when layouts vary |
| Fast and inexpensive | Struggles with skewed/noisy scans |
We ran into all these problems while processing thousands of vendor invoices. One vendor’s total was labeled “Amount Due” in the header, another called it “Grand Total” in the footer. Some embedded line items neatly in tables, while others hid them within paragraph text.
This is where Large Language Models (LLMs) came in. Unlike OCR, they don’t just read the text, they reason about it.
Here’s how LLMs elevated our invoice scanning workflow:
- **Structured Data Extraction**: They map raw text to rich JSON schemas with fields like `invoice_number` and `vendor_address`, even nested arrays for line items.
- **Layout Flexibility**: Whether the total is in the header or the footer, LLMs can infer its meaning from context.
- **Imperfection Tolerance**: Smudged or skewed scans? An LLM can deduce "₹12,599.00" from "₹12,59?.00" using surrounding cues.
- **Business Logic**: Need to exclude delivery fees or flag alcohol content? You can encode these rules directly in prompts, with no need for brittle template matching.
By combining OCR for raw text and LLMs for reasoning, you get the best of both worlds: speed at the pixel layer, intelligence at the data layer.
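To make the business-logic point concrete, here is roughly what encoding those rules in a prompt looks like. The wording below is a simplified sketch, not our exact production prompt:

```typescript
// A sketch of how business rules can live in the prompt instead of in
// brittle template-matching code. The wording is illustrative only.
const systemPrompt = `
You are an invoice extraction assistant.
Return ONLY valid JSON matching the provided schema.
Rules:
- Preserve line items in their original order.
- Exclude delivery fees and surcharges from line items.
- Flag any line item containing alcohol with "contains_alcohol": true.
- If a field is unreadable, set it to null instead of guessing.
`.trim();
```

Because the rules are plain text, changing a policy (say, including delivery fees for a specific vendor) is a prompt edit, not a parser rewrite.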
Technical Implementation: How We Combined LLMs with Classic OCR
So how did we actually build this?
Designing a robust pipeline for AI invoice extraction required solving two distinct challenges: converting raw OCR text into structured data and enriching that data with domain-specific intelligence.
The core idea was straightforward: let OCR handle what it does best (turning pixels into text) and delegate the more complex reasoning to an LLM.
But moving from prototype to production meant tackling issues like schema consistency, latency, and compute cost.
A single LLM call wasn’t enough to handle the complexity of multi-page invoices with nested line items and diverse vendor conventions. To overcome this, we split the workflow into two focused stages.
The first stage extracts core fields into a schema-perfect JSON, ensuring clean and consistent output. The second stage applies business logic, validation, and enrichment, mapping product categories, cross-checking UPCs, and embedding metadata.
This separation improved accuracy, reduced latency, and kept each model invocation purpose-built for its task.
1. Two-Stage LLM Pipeline
We designed a two-stage pipeline:
Stage 1: Raw Extraction
We used Gemini 2.5 Flash to convert raw OCR text into a structured JSON object. The model receives the PDF buffer and a strict schema (defined in Zod) to enforce consistent output.
```typescript
import { google } from '@ai-sdk/google';
import { generateObject } from 'ai'; // Vercel AI SDK

const { object: invoice } = await generateObject({
  model: google('gemini-2.5-flash'),
  schema: invoiceSchema, // Zod-defined JSON structure
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: systemPrompt },
      { type: 'file', data: pdfBuffer, mimeType: 'application/pdf' },
    ],
  }],
});
```
Here, we ended up with a clean invoice object that had all the key details like identifiers, dates, totals, and every line item, preserved exactly in the original order.
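For illustration, the extracted object's shape looked roughly like this. The field names below are a sketch, not the exact production schema (which is enforced via Zod at generation time), and the runtime guard is only a minimal stand-in for a full schema parse:

```typescript
// Illustrative shape of the stage-1 output. Field names are a sketch;
// the real pipeline enforces structure with a Zod schema.
interface LineItem {
  description: string;
  quantity: number;
  unit_price: number;
  total: number;
}

interface Invoice {
  invoice_number: string;
  invoice_date: string;        // e.g. "2024-07-01"
  vendor_name: string;
  vendor_address: string | null;
  line_items: LineItem[];      // preserved in original document order
  grand_total: number;
}

// A minimal runtime guard, standing in for a full schema validation.
function isInvoice(x: any): x is Invoice {
  return typeof x?.invoice_number === 'string' &&
         Array.isArray(x?.line_items) &&
         typeof x?.grand_total === 'number';
}
```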
Stage 2: Categorization & Enrichment
The second LLM call adds intelligence:
- Maps product descriptions like "Dove Shampoo 180 ml" to structured categories:

  ```json
  {
    "category": "Personal Care",
    "subcategory": "Hair Care"
  }
  ```

- Flags perishability, verifies alcohol content, and validates size units.
- Cross-checks UPCs and pulls metadata (brand, images) from external APIs for product invoices.
This separation helped us keep the initial extraction fast and fully schema-focused.
Once we had clean, structured data, it became much easier to layer on domain-specific rules later, like mapping product categories, validating UPCs, or flagging edge cases, without overloading the core extraction step.
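Some of those stage-2 rules are cheap and deterministic, so they don't need an LLM at all. A UPC-A check-digit validation, for example, catches most OCR digit misreads before any external lookup. The sketch below shows the standard check-digit algorithm; how it plugs into the pipeline (flagging vs. rejecting) is illustrative:

```typescript
// Validates a 12-digit UPC-A code via its check digit. A failed check
// usually means the OCR layer misread a digit, so the line item can be
// flagged for review instead of trusting the extracted code.
function isValidUpcA(upc: string): boolean {
  if (!/^\d{12}$/.test(upc)) return false;
  const digits = upc.split('').map(Number);
  // Odd positions (1st, 3rd, ...) are weighted 3, even positions 1.
  let sum = 0;
  for (let i = 0; i < 11; i++) {
    sum += digits[i] * (i % 2 === 0 ? 3 : 1);
  }
  const check = (10 - (sum % 10)) % 10;
  return check === digits[11];
}
```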
2. Prompt Engineering Principles
In any LLM-powered workflow, prompt design often determines whether your system feels like a reliable engine or an unpredictable black box.
For automated invoice scanning, getting the prompts right was critical to ensure clean, schema-perfect outputs, reduce parsing errors, and avoid edge-case failures on messy vendor documents.
Here are some of the techniques that made our pipeline production-grade:
| Technique | Impact |
|---|---|
| Schema-Driven (Zod) | Forces well-formed JSON, reduces parsing errors. |
| Explicit Instructions | "Preserve original line-item order" improved fidelity. |
| Focused Extraction | Fuel invoices skipped taxes and delivery charges. |
| Iterative Refinement | Continuous A/B tests slashed false positives. |
Fine-tuned prompts act as the glue between raw OCR output and structured data pipelines.
In high-volume automated invoice scanning systems, even small prompt improvements can save hours of manual QA and reduce integration failures downstream.
Treating prompt engineering as a core part of the system completely changed the game for us.
It turned what started as a fragile prototype into a workflow that’s resilient and ready for production at scale. As LLMs keep evolving, we’ve learned that pairing strong prompt strategies with strict schema validation is the key to building reliable and scalable invoice data extraction pipelines.
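One pattern worth calling out is pairing the prompt with a validation gate and a bounded retry. The sketch below uses illustrative stand-ins: `extract` for the LLM call and `isValid` for the schema parse; neither name comes from our actual codebase:

```typescript
// A sketch of a validate-and-retry loop around an LLM extraction call.
// `extract` stands in for the model invocation, `isValid` for a schema
// check (e.g. a Zod parse). Both are illustrative stand-ins.
async function extractWithRetry<T>(
  extract: () => Promise<unknown>,
  isValid: (x: unknown) => x is T,
  maxAttempts = 3,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await extract();
    if (isValid(result)) return result;
    // On failure we simply re-ask; a refinement is to feed the
    // validation error back into the prompt so the model can self-correct.
  }
  throw new Error(`Extraction failed schema validation after ${maxAttempts} attempts`);
}
```

Bounding the attempts keeps worst-case latency and cost predictable, which matters once the queue holds thousands of invoices.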
Which LLM Performed the Best?
Building a scalable pipeline for automated invoice scanning meant we needed a model that could handle large, multi-page documents without losing context, deliver high accuracy for invoice data extraction, and still keep latency and costs low for high-volume workloads.
We first tried Claude 3.5 and 4. Their 64k token context windows were impressive and worked really well for OCR invoice scanning, especially on multi-page telecom and utility invoices. Out of the box, they supported PDF parsing and felt solid for early prototyping.
But as we moved closer to production, the high per-token cost and fragile PDF support started causing friction. It also meant adding glue code we weren’t keen on maintaining long term.
Next, we tested Gemini 2.5 Pro and Flash. Both came with comparable 65k token context lengths, which made them just as capable of processing long-form invoice data without the need for aggressive chunking.
Gemini 2.5 Pro really stood out for its extraction quality and consistency across diverse vendor layouts.
But under bulk queues, latency became an issue, making it less suitable for real-time AI invoice extraction at scale.
| Model | Pros | Cons |
|---|---|---|
| Claude 3.5 / 4 | Huge context window (64k tokens). | High cost, fragile PDF support. |
| Gemini 2.5 Pro | Strong extraction quality. | Higher latency for bulk queues. |
| Gemini 2.5 Flash | Fast, cheap, and quality matched Pro. | Slightly less robust for rare edge cases. |
In production, Gemini 2.5 Flash became the clear winner.
It delivered schema-perfect JSON reliably, enabling us to automate invoice extraction across thousands of diverse vendor formats.
The model’s speed and efficiency allowed us to process high-volume invoice queues while reducing per-document compute costs by 70% compared to Claude.
For any team building an invoice scanning software or designing pipelines for OCR invoice scanning, Gemini 2.5 Flash offers an excellent balance of speed, cost, and quality.
Combined with a strong prompt-engineering strategy, it’s a powerful foundation for robust, context-aware data pipelines.
Results & Lessons Learned
Looking back, combining OCR invoice scanning with Large Language Models felt like moving from static maps to a fully interactive GPS system.
The shift transformed a brittle, template-driven process into a dynamic pipeline capable of adapting to any vendor layout and handling high volumes with ease.
- **Accuracy:** Line-item recall improved dramatically, rising from 88% with OCR and regex-based extraction to 97% with OCR combined with LLMs. This uplift made the pipeline reliable enough for direct integration with downstream finance systems.
- **Scalability:** Gemini’s 65k token context window enabled us to process large, multi-page telecom invoices without the need for aggressive document chunking. This eliminated complexity in pre-processing and reduced edge-case errors during invoice data extraction.
- **Cost Efficiency:** Migrating to Gemini 2.5 Flash reduced per-document compute costs by approximately 70% compared to Claude. This cost-performance balance made the solution viable for high-volume scenarios like enterprise billing systems or large-scale AI invoice extraction workflows.
Along the way, we uncovered some important engineering insights:
- **Schema Enforcement:** Defining strict JSON schemas using Zod proved essential. It ensured that the output from the LLM consistently adhered to the required structure, eliminating the need for costly post-processing and reducing API error rates.
- **Prompt Tuning:** Achieving production-grade consistency required iterative refinement of prompts. Simple changes, such as explicitly instructing the model to preserve the original line-item order, saved hours of downstream QA and debugging.
- **Hybrid Approach:** LLMs work best as part of a broader toolchain. When paired with classic OCR engines and external APIs, they deliver the reasoning and flexibility that traditional systems lack. This hybrid strategy is the key to successfully automating invoice extraction without sacrificing accuracy or speed.
By combining the strengths of OCR and LLMs, we were able to create a system capable of handling real-world document variability while delivering clean, structured outputs ready for ingestion into financial and operational systems.
Closing Thoughts
Invoice processing has always felt like a battle against edge cases. Vendor-specific templates, inconsistent layouts, and unpredictable data structures often pushed teams like ours into building fragile systems full of hardcoded rules and endless patchwork fixes.
Bringing LLMs into the workflow changed everything for us.
We created a pipeline that’s not just faster but also far more resilient and adaptable to diverse vendor formats. The system blends OCR invoice scanning for precise text extraction with the reasoning power of LLMs to understand context, relationships, and even complex business rules.
If you’re exploring similar solutions, here’s what we’ve learned: don’t try to replace OCR entirely, augment it. Let OCR handle raw text extraction efficiently, then use LLMs for mapping fields, managing layout variability, and applying domain-specific logic.
This hybrid strategy is what makes enterprise-grade invoice data extraction scalable and reliable. It’s also the key to automating invoice extraction workflows without introducing unnecessary complexity or skyrocketing costs.
We truly believe the future of AI invoice extraction lies in combining traditional tools with modern AI models to deliver structured, schema-perfect data that’s ready for operational systems.
What about you?
Are you working on something similar or exploring how LLMs could fit into your workflows?
We’d love to hear your thoughts. Drop a comment below — we’re happy to share more from our experience.
— Rahul Retnan
The author is a senior software engineer at RaftLabs, working extensively with Large Language Models (LLMs) and LangChain to design intelligent automation pipelines, including enterprise-grade systems.