VLM Pipeline with Docling
Alain Airom


Hands-on experience with the VLM pipeline from Docling.


Introduction

Vision-Language Models (VLMs) and Large Language Models (LLMs) are powerful AI tools that, when used together, can unlock a wide range of practical applications by leveraging both visual and textual information. Here are some key use cases:

Enhanced Content Understanding and Analysis:

  • Multimodal Document Understanding: VLMs can analyze documents containing both text and images (e.g., reports, scientific papers, invoices). Paired with LLMs, they can extract structured information, answer complex questions that require reasoning across both modalities, and summarize the content more effectively. For instance, when analyzing a scientific paper with diagrams, the VLM can identify the visual elements and the LLM can interpret the related text, leading to a deeper understanding than either model could achieve alone.
  • Social Media Analysis: By processing both the text content of posts and accompanying images or videos, VLMs and LLMs can provide a more nuanced understanding of sentiment, identify harmful content (e.g., hate symbols in memes), and detect misinformation that relies on a combination of text and visuals. This is crucial for content moderation and brand safety.
  • E-commerce Product Understanding: VLMs can analyze product images, while LLMs process descriptions and customer reviews. Together, they can provide richer insights into product features, customer sentiment related to visual aspects, and improve product recommendations. A user could search for a dress based on a picture, and the LLM could then provide detailed information about the fabric, care instructions, and style based on the product description.

Improved Information Retrieval and Search:

  • Visual Search with Language Understanding: Users can search for information using images, and the LLM can understand follow-up questions or provide detailed textual information related to the visual content. For example, a user could upload a picture of a plant, and the LLM, understanding the visual features identified by the VLM, could provide its species name, care instructions, and related articles.
  • Multimodal Question Answering: Systems can answer questions that require reasoning about both images/videos and text. For instance, given an image of a scene and a question like “What are the people in the image doing?”, the VLM identifies the actions visually, and the LLM formulates a natural language answer.

Enhanced Human-Computer Interaction:

  • AI Assistants with Visual Awareness: Virtual assistants can understand and respond to instructions that involve both language and visual input. For example, a user could say, “Hey assistant, what’s the capital of the country in this picture?” showing an image of the French flag. The VLM identifies the flag, and the LLM provides “Paris.”
  • Accessibility for Visually Impaired Users: VLMs can generate detailed image captions (alt-text), making visual content accessible. LLMs can then use these captions to provide more comprehensive descriptions and answer questions about the images in a natural language format.
  • Robotics and Automation: Robots equipped with VLMs can understand natural language commands that refer to visual elements in their environment, such as “Pick up the blue block on the table.” The VLM identifies the blue block, and the LLM interprets the action to be performed.

Creative Content Generation:

  • Image Captioning and Text Generation from Images: VLMs can generate descriptive captions for images, which can then be used by LLMs to create more elaborate stories, poems, or articles inspired by the visual content.
  • Multimodal Storytelling: Combining VLMs and LLMs allows for the creation of richer narratives where visual elements and textual descriptions are tightly integrated. For example, a system could generate a children’s book with both illustrations and accompanying text based on a high-level prompt.

Industry-Specific Applications:

  • Healthcare: Analyzing medical images (X-rays, MRIs) alongside patient history (textual data) to assist in diagnosis and treatment planning. VLMs can identify anomalies in images, and LLMs can correlate these findings with patient records.
  • Insurance: Processing accident photos and claim descriptions to automate damage assessment and fraud detection. The VLM analyzes the visual damage, and the LLM interprets the textual report.
  • Education: Creating interactive learning materials that combine images, diagrams, and textual explanations. VLMs can analyze visual aids, and LLMs can generate related questions and explanations.

In essence, the synergy between VLMs and LLMs allows AI systems to perceive and understand the world more like humans do, by integrating information from multiple sensory modalities. This leads to more intelligent, context-aware, and versatile applications across various domains. The ongoing advancements in both VLM and LLM technologies promise even more exciting and practical use cases in the future.


VLM Pipeline

What is a VLM pipeline? A VLM pipeline feeds an image into an image encoder, passes the encoder's output through an adapter, and injects the result into the LLM. In parallel, the text input is tokenized and fed to the language model. This allows the system to understand and combine both types of data.
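As a rough illustration of that data flow (a minimal sketch with hypothetical placeholder components, not the actual Docling or granite3.2-vision internals), it can be written as:

import numpy as np

# Conceptual sketch of the VLM data flow described above; every component here
# is a hypothetical stand-in, not a real model.

def image_encoder(image_pixels: np.ndarray) -> np.ndarray:
    """Vision encoder placeholder: image -> visual token embeddings."""
    return np.random.rand(256, 1024)  # e.g. 256 visual tokens of dimension 1024

def adapter(visual_embeddings: np.ndarray) -> np.ndarray:
    """Adapter placeholder: project visual embeddings into the LLM embedding space."""
    projection = np.random.rand(1024, 4096)  # vision dim -> LLM dim
    return visual_embeddings @ projection

def tokenize(text: str) -> list[int]:
    """Tokenizer placeholder: text -> token ids."""
    return [hash(word) % 32000 for word in text.split()]

def llm_generate(visual_tokens: np.ndarray, text_tokens: list[int]) -> str:
    """LLM placeholder: consumes projected visual tokens together with text tokens."""
    return (f"answer conditioned on {visual_tokens.shape[0]} visual tokens "
            f"and {len(text_tokens)} text tokens")

image = np.zeros((896, 896, 3))            # dummy page image
prompt = "OCR the full page to markdown."

visual = adapter(image_encoder(image))     # image branch
tokens = tokenize(prompt)                  # text branch
print(llm_generate(visual, tokens))        # both branches combined in the LLM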

Using Docling to build a VLM Pipeline

Image description

To test and implement a Vision-Language Model (VLM) pipeline, you can start by adapting the straightforward example available on the Docling website. I’ve modified the original code to leverage my local installation of the “granite3.2-vision” model, managed through Ollama. This setup allows for experimentation and integration using locally hosted resources.


To test locally with Ollama and Granite, first pull the model:

ollama run granite3.2-vision     

Once downloaded, test the functionality.

curl http://localhost:11434/v1/chat/completions \
-d '{
    "model": "granite3.2-vision:latest",
    "messages": [
      {
        "role": "user",
        "content": "Hello"
      }
    ]
  }'

### output example
{"id":"chatcmpl-981","object":"chat.completion","created":1747306935,"model":"granite3.2-vision:latest","system_fingerprint":"fp_ollama","choices":[{"index":0,"message":{"role":"assistant","content":"\nHello! How can I assist you today?"},"finish_reason":"stop"}],"usage":{"prompt_tokens":48,"completion_tokens":11,"total_tokens":59}}
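Optionally, you can also confirm that the vision side works before wiring up Docling. The small Python check below is a sketch under assumptions: it uses Ollama's native /api/chat endpoint (which accepts base64-encoded images) and a hypothetical local test image named page.png.

import base64
import requests

# Send a local test image to granite3.2-vision via Ollama's native chat API.
with open("page.png", "rb") as f:  # page.png is a placeholder test image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "granite3.2-vision:latest",
        "messages": [
            {
                "role": "user",
                "content": "Describe this page in one sentence.",
                "images": [image_b64],
            }
        ],
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])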

Use the code provided on the Docling site. I did the following to prepare the environment:

python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip

pip install requests
pip install docling
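The script below also imports python-dotenv (only needed by the optional watsonx.ai variant); if it is not already present as a transitive dependency, install it too, then run a quick import check:

pip install python-dotenv  # only needed for the load_dotenv import below

python -c "from docling.document_converter import DocumentConverter; print('Docling import OK')"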

Beyond simply adapting the Docling example for local execution with “granite3.2-vision” and Ollama, I’ve also refined the code for improved robustness and usability. This involved implementing a more generous “timeout” to accommodate potentially longer processing times and integrating specific debugging statements to facilitate troubleshooting. Moreover, to enhance the output’s utility, I redirected it from the console to a well-structured Markdown file. It’s worth noting that the original sample code demonstrated usage within the watsonx.ai platform, a dependency I’ve entirely bypassed by ensuring all processing occurs locally within my environment.

import logging
import os
from pathlib import Path
import requests
from dotenv import load_dotenv
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    ApiVlmOptions,
    ResponseFormat,
    VlmPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline
from io import BytesIO
from PIL import Image


def ollama_vlm_options(model: str, prompt: str):
    options = ApiVlmOptions(
        url="http://localhost:11434/v1/chat/completions",  # the default Ollama endpoint
        params=dict(
            model=model,
        ),
        prompt=prompt,
        timeout=300,  # Increased timeout to 300 seconds
        scale=1.0,
        response_format=ResponseFormat.MARKDOWN,
    )
    return options


def watsonx_vlm_options(model: str, prompt: str):
    load_dotenv()
    api_key = os.environ.get("WX_API_KEY")
    project_id = os.environ.get("WX_PROJECT_ID")

    def _get_iam_access_token(api_key: str) -> str:
        res = requests.post(
            url="https://iam.cloud.ibm.com/identity/token",
            headers={
                "Content-Type": "application/x-www-form-urlencoded",
            },
            data=f"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={api_key}",
        )
        res.raise_for_status()
        api_out = res.json()
        # Avoid logging api_out here: it contains the bearer access token.
        return api_out["access_token"]

    options = ApiVlmOptions(
        url="https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29",
        params=dict(
            model_id=model,
            project_id=project_id,
            parameters=dict(
                max_new_tokens=400,
            ),
        ),
        headers={
            "Authorization": "Bearer " + _get_iam_access_token(api_key=api_key),
        },
        prompt=prompt,
        timeout=60,
        response_format=ResponseFormat.MARKDOWN,
    )
    return options

logger = logging.getLogger(__name__)  # Get the logger


def api_image_request(url: str, prompt: str, image_data: bytes, timeout: int) -> str:
    """
    Sends an image and prompt to a VLM API and returns the text response.

    Args:
        url: The URL of the VLM API endpoint.
        prompt: The text prompt to send with the image.
        image_data: The image data as bytes.
        timeout: The timeout for the request in seconds.

    Returns:
        The text response from the API.

    Raises:
        requests.exceptions.HTTPError: If the API returns an HTTP error.
        Exception: For other errors during the API call.
    """
    try:
        logger.debug(f"api_image_request: Sending request to URL: {url}")  # Log URL
        logger.debug(f"api_image_request: Prompt: {prompt[:50]}...")  # Log first 50 chars of prompt
        logger.debug(f"api_image_request: Image data length: {len(image_data)} bytes")  # Log image size

        # Let requests build the multipart Content-Type header (including the
        # boundary); setting it manually would produce an invalid request.
        r = requests.post(
            url,
            files={
                "image": ("image.jpg", image_data, "image/jpeg"),
                "prompt": (None, prompt),
            },
            timeout=timeout,
        )

        logger.debug(f"api_image_request: Response status code: {r.status_code}")  # Log status code
        logger.debug(f"api_image_request: Response text: {r.text[:100]}...")  # Log first 100 chars of response

        r.raise_for_status()
        return r.text
    except requests.exceptions.HTTPError as e:
        logger.error(f"api_image_request: HTTPError: {e}, Response text: {e.response.text}")
        raise e
    except Exception as e:
        logger.error(f"api_image_request: Exception: {e}")
        raise e


def main():
    logging.basicConfig(level=logging.INFO)
    input_doc_path = Path("./input/2401.03955v8.pdf")
    output_file_path = Path("output.md")  # Define the output file

    pipeline_options = VlmPipelineOptions(
        enable_remote_services=True
    )

    pipeline_options.vlm_options = ollama_vlm_options(
        model="granite3.2-vision:latest",
        prompt="OCR the full page to markdown.",
    )

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                pipeline_cls=VlmPipeline,
            )
        }
    )
    try:
        result = doc_converter.convert(input_doc_path)
        markdown_content = result.document.export_to_markdown()

        # Write the markdown content to a file
        with open(output_file_path, "w", encoding="utf-8") as f:
            f.write(markdown_content)
        logging.info(f"Markdown output written to {output_file_path}")
    except Exception as e:
        logging.error(f"An error occurred: {e}")  # catch any error
        print(f"An error occurred: {e}")



if __name__ == "__main__":
    main()
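To run it, save the script under a name of your choosing (vlm_pipeline.py is just an example), make sure the PDF referenced by input_doc_path exists, and execute it:

python vlm_pipeline.py
# Expected on success (with the default logging format):
# INFO:root:Markdown output written to output.md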

Et voilà 😉

Conclusion

In conclusion, the sample code from Docling underscores its potential for constructing a practical VLM pipeline. This refined pipeline, capable of generating structured and accessible output (as demonstrated by the Markdown file), can then be seamlessly integrated with a Large Language Model. This powerful combination paves the way for building robust and industrialized applications that effectively leverage both visual and textual understanding. The local execution facilitated by Ollama further highlights the flexibility and adaptability of the Docling framework for various deployment scenarios.
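As a minimal sketch of that last integration step (the model choice and prompt are simply my assumptions, reusing the same Ollama endpoint tested earlier), the generated Markdown could be handed to a local LLM like this:

import requests

# Feed the Markdown produced by the Docling VLM pipeline to a local LLM.
# Assumes output.md was written by the script above and Ollama is still
# serving granite3.2-vision on localhost:11434.
with open("output.md", "r", encoding="utf-8") as f:
    document_markdown = f.read()

resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "granite3.2-vision:latest",
        "messages": [
            {
                "role": "user",
                "content": "Summarize the following document in five bullet points:\n\n"
                + document_markdown,
            }
        ],
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])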
