Ollama-OCR for High-Precision OCR with Ollama
Bytefer


Published: Nov 25, 2024

Llama 3.2-Vision is a multimodal large language model available in 11B and 90B sizes, capable of processing both text and image inputs to generate text outputs. The model excels in visual recognition, image reasoning, image description, and answering image-related questions, outperforming existing open-source and closed-source multimodal models across multiple industry benchmarks.


Llama 3.2-Vision Examples

Handwriting

[Image: llama3.2-vision-handwriting]

Optical Character Recognition (OCR)

[Image: llama3.2-vision-ocr]

In this article, I will describe how to call the Llama 3.2-Vision 11B model served by Ollama and implement image text recognition (OCR) using Ollama-OCR.

Features of Ollama-OCR

🚀 High accuracy text recognition using Llama 3.2-Vision model
📝 Preserves original text formatting and structure
🖼️ Supports multiple image formats: JPG, JPEG, PNG
⚡️ Customizable recognition prompts and models
🔍 Markdown output format option
💪 Robust error handling


Installing Ollama

Before you can start using Llama 3.2-Vision, you need to install Ollama, a platform that supports running multimodal models locally. Follow the steps below to install it:

  1. Download Ollama: Visit the official Ollama website and download the installation package for your operating system.
  2. Install Ollama: Run the downloaded installer and follow the prompts to complete the installation. A quick way to verify that the local server is running is shown below.
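
By default, Ollama serves a local HTTP API on port 11434 and answers a plain GET request to the root path with "Ollama is running". Here is a minimal TypeScript check, assuming the default port (adjust it if you changed Ollama's configuration):

// Verify the local Ollama server is reachable before running any OCR.
// Assumes the default port 11434.
async function checkOllama(): Promise<void> {
  try {
    const res = await fetch("http://localhost:11434");
    console.log(await res.text()); // expected: "Ollama is running"
  } catch {
    console.error("Ollama does not appear to be running on localhost:11434");
  }
}

checkOllama();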

Install Llama 3.2-Vision 11B

After installing Ollama, you can pull and run the Llama 3.2-Vision 11B model with the following command (the model weights are downloaded automatically on first run):

ollama run llama3.2-vision

How to use Ollama-OCR

npm install ollama-ocr
# or using pnpm
pnpm add ollama-ocr

OCR

import { ollamaOCR, DEFAULT_OCR_SYSTEM_PROMPT } from "ollama-ocr";

async function runOCR() {
  const text = await ollamaOCR({
    filePath: "./handwriting.jpg",
    systemPrompt: DEFAULT_OCR_SYSTEM_PROMPT,
  });
  console.log(text);
}

Input Image:

[Image: handwriting-for-ollama-ocr]

Output:
The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of instruction-tuned image reasoning generative models in 118 and 908 sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.
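
The feature list above mentions robust error handling, but the exact errors ollama-ocr throws aren't documented here, so below is a generic defensive sketch. The try/catch wrapper is my own, not part of the library's API:

import { ollamaOCR, DEFAULT_OCR_SYSTEM_PROMPT } from "ollama-ocr";

// A defensive wrapper around ollamaOCR. The failure modes listed are
// assumptions: Ollama not running, the model not pulled yet, or an
// unreadable/unsupported image file.
async function runOCRSafely(filePath: string): Promise<string | null> {
  try {
    return await ollamaOCR({
      filePath,
      systemPrompt: DEFAULT_OCR_SYSTEM_PROMPT,
    });
  } catch (err) {
    console.error(`OCR failed for ${filePath}:`, err);
    return null;
  }
}

runOCRSafely("./handwriting.jpg").then((text) => {
  if (text !== null) console.log(text);
});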

Markdown Output

import { ollamaOCR, DEFAULT_MARKDOWN_SYSTEM_PROMPT } from "ollama-ocr";

async function runOCR() {
  const text = await ollamaOCR({
    filePath: "./trader-joes-receipt.jpg",
    systemPrompt: DEFAULT_MARKDOWN_SYSTEM_PROMPT,
  });
  console.log(text);
}

Input Image:

[Image: trader-joes-receipt]

Output:

[Image: markdown-output-of-ollama-ocr]
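
Because systemPrompt accepts any string (the "customizable recognition prompts" feature above), you aren't limited to the two bundled defaults. Here is a sketch with a hand-written prompt; the prompt wording below is illustrative and not shipped with the library:

import { ollamaOCR } from "ollama-ocr";

// Steer the output with a custom prompt. The prompt text is my own
// example, not a constant exported by ollama-ocr.
async function extractReceiptTable() {
  const text = await ollamaOCR({
    filePath: "./trader-joes-receipt.jpg",
    systemPrompt:
      "Extract the line items and the final total from this receipt " +
      "and return them as a markdown table with Item and Price columns.",
  });
  console.log(text);
}

extractReceiptTable();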

Use MiniCPM-V 2.6 Vision Model

import { ollamaOCR, DEFAULT_OCR_SYSTEM_PROMPT } from "ollama-ocr";

async function runOCR() {
  const text = await ollamaOCR({
    model: "minicpm-v",
    filePath: "./handwriting.jpg",
    systemPrompt: DEFAULT_OCR_SYSTEM_PROMPT,
  });
  console.log(text);
}
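Since the model option maps to whatever tag your local Ollama knows, you can run the same image through both models and compare the results. A quick sketch, assuming both models have already been pulled (e.g. with ollama pull minicpm-v):

import { ollamaOCR, DEFAULT_OCR_SYSTEM_PROMPT } from "ollama-ocr";

// Run the same image through both local vision models for a quick
// side-by-side comparison of their OCR output.
async function compareModels(filePath: string) {
  for (const model of ["llama3.2-vision", "minicpm-v"]) {
    const text = await ollamaOCR({
      model,
      filePath,
      systemPrompt: DEFAULT_OCR_SYSTEM_PROMPT,
    });
    console.log(`--- ${model} ---\n${text}\n`);
  }
}

compareModels("./handwriting.jpg");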

ollama-ocr uses a local vision model. If you want to use the online Llama 3.2-Vision model instead, try the llama-ocr library.
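
Under the hood, a local call like this boils down to Ollama's HTTP API: the image is base64-encoded and sent to the /api/chat endpoint. Here is a minimal sketch of that round trip (my own illustration of the Ollama API, not ollama-ocr's actual source):

import { readFile } from "node:fs/promises";

// Send a base64-encoded image to the local Ollama chat endpoint and
// return the model's text response.
async function ocrViaOllamaAPI(filePath: string): Promise<string> {
  const image = (await readFile(filePath)).toString("base64");
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3.2-vision",
      stream: false,
      messages: [
        {
          role: "user",
          content: "Extract all of the text from this image.",
          images: [image],
        },
      ],
    }),
  });
  const data = await res.json();
  return data.message.content;
}

ocrViaOllamaAPI("./handwriting.jpg").then(console.log);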

Comments

  • ogodo olutayo · Nov 27, 2024

    Can you please share what the speed is like for an image-based PDF with over 60 pages?

    • Bytefer · Nov 27, 2024

      Recognition speed depends on the performance of your current device.
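
      A quick way to measure it on your own machine is to time a single page and extrapolate (the file name below is just a placeholder):

        import { ollamaOCR, DEFAULT_OCR_SYSTEM_PROMPT } from "ollama-ocr";

        // Time one page to estimate throughput on your hardware.
        async function timeOnePage() {
          console.time("ocr-one-page");
          await ollamaOCR({
            filePath: "./page-1.jpg",
            systemPrompt: DEFAULT_OCR_SYSTEM_PROMPT,
          });
          console.timeEnd("ocr-one-page");
        }

        timeOnePage();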

  • Tan KC · Nov 27, 2024

    Can the local Ollama minicpm-v model be used?

    • Bytefer · Nov 27, 2024

      The minicpm-v visual model is also supported with the following code:

        const text = await ollamaOCR({
          model: "minicpm-v",
          filePath: "./trader-joes-receipt.jpg",
          systemPrompt: DEFAULT_OCR_SYSTEM_PROMPT,
        });
      

      [Image: minicpm-v-ocr-result]

  • Ramya Menon · Nov 27, 2024

    Damn needed this.

  • Akshay Ballal · Nov 27, 2024

    Does it hallucinate with large tables with lots of numbers?

    • Bytefer · Nov 27, 2024

      You can test the recognition of the Llama-3.2-90B-Vision at the llamaocr website.

  • Nov Piseth · Nov 28, 2024

    How do you add other languages, like Asian languages in UTF-8? Thanks.

  • Danny · Nov 28, 2024

    Possible to use 90B model?


  • jitendra mohite · Feb 27, 2025

    Any update on when PDF OCR will be generally available with this?

  • Laxman · Mar 4, 2025

    Llama 3.2-Vision mentioned in the article is indeed very powerful, especially in multimodal tasks like high-precision OCR, which is truly impressive! However, the command-line operations in Ollama can be quite cumbersome. A more intuitive user interface might be more user-friendly. Does anyone have any suggestions?
