🧠 Diagram2Graph: Fine-Tuning a Vision Language Model to Extract Knowledge Graphs from Diagrams
Sujith S (@sujiths) · Apr 5

Using Qwen2.5-VL + PEFT + Neo4J

Fine-tuned by Mohammed Safvan | Zackriya Solutions


🚀 TL;DR

Convert images of flowcharts and process diagrams directly into Neo4J-compatible JSON using a fine-tuned Vision Language Model (VLM). We started with Claude 3.5 as a baseline, then fine-tuned Qwen2.5-VL-3B with PEFT (LoRA) and gained roughly 23 percentage points of F1 on edge detection.


🧩 Problem

We often find technical diagrams, flowcharts, and block diagrams sitting in PDFs, whiteboards, or scanned docs. They contain valuable logic and relationships—but they’re not queryable or usable unless manually extracted.

Can we automate this diagram-to-graph extraction?

That’s what Diagram2Graph does.

In this post, we’ll show how to use a fine-tuned Vision Language Model (VLM) to convert technical diagrams and flowcharts into structured JSON. This output can be directly used with Neo4J or other knowledge graph platforms, enabling AI systems to reason over visual information.


🔍 What We Built

A fine-tuned Vision Language Model that:

  • Accepts an image of a diagram
  • Extracts nodes, edges, and metadata
  • Outputs structured JSON
  • Produces output compatible with Neo4J for graph querying (a sample of the shape is shown below)
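
For illustration, here is a plausible shape for that JSON. The post doesn't pin down the exact schema, so the field names below (id, label, type, source, target) are assumptions, not the model's guaranteed output:

# Illustrative output shape (assumed field names), shown as a Python dict:
example_output = {
    "nodes": [
        {"id": "n1", "label": "Start", "type": "terminator"},
        {"id": "n2", "label": "Validate input", "type": "process"},
    ],
    "edges": [
        {"source": "n1", "target": "n2", "label": "next"},
    ],
}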

⚙️ Architecture at a Glance

+-------------------+         +---------------------------+
|   Diagram Image   +-------> |  Fine-Tuned VLM (Qwen2.5) |
+-------------------+         +-------------+-------------+
                                            |
                                            v
                           +------------------------------+
                           |  JSON (Nodes + Edges + Meta) |
                           +------------------------------+
                                            |
                                            v
                              +------------------------+
                              |     Neo4J Integration  |
                              +------------------------+

🤖 Why Not Use GPT-4 or Claude?

They work, but they're:

  • API-bound (privacy concerns)
  • Generic (prone to hallucination)
  • Expensive (token + compute)

We fine-tuned a task-specific Vision Language Model (Qwen2.5-VL 3B) for diagram understanding and knowledge graph extraction.


🛠️ Tech Stack

  • Model: Qwen2.5-VL-3B
  • Fine-tuning: PEFT (LoRA), bf16, PyTorch Lightning (see the adapter sketch after this list)
  • Dataset: 218 labeled diagram images
  • Backend: FastAPI + Neo4J (via Cypher)
  • Inference: Hugging Face Transformers
  • Frontend: NextJS (WIP)
  • Deployment: Kaggle + Lightning.ai
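
For reference, attaching LoRA adapters to Qwen2.5-VL with PEFT looks roughly like the sketch below. The rank, alpha, dropout, and target modules are illustrative assumptions; the post doesn't list the exact values used in training:

import torch
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the base model in bf16, matching the training precision.
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Hyperparameters here are assumptions for illustration only.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights should be trainable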

📊 Results

Task             Base Model (Claude 3.5)   Qwen2.5-VL-3B (Fine-Tuned)
Node Detection   74.9% F1                  89.1% F1 (+14.2 pts)
Edge Detection   46.05% F1                 69.45% F1 (+23.4 pts)

🧠 Training Details

Config        Value
Epochs        10
Images        200 (hand-labeled)
Batch Size    2
Method        LoRA (PEFT)
Precision     bf16
GPU           L40S (48 GB VRAM)
Eval Metric   Edit Distance + F1
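
The post doesn't spell out how F1 is computed over graphs. One common approach, sketched below, is to match predicted nodes/edges against ground truth and score precision/recall from the overlap; the exact-match rule here is an assumption (the "Edit Distance" metric suggests fuzzier label matching may also be involved):

def f1_score(predicted: set, ground_truth: set) -> float:
    """F1 over sets of graph elements, e.g. node labels or
    (source, target) edge pairs. Exact-match comparison is an
    illustrative assumption."""
    if not predicted or not ground_truth:
        return 0.0
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: edges as (source, target) pairs
pred = {("Start", "Validate"), ("Validate", "Process")}
gold = {("Start", "Validate"), ("Validate", "Store")}
print(f1_score(pred, gold))  # 0.5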

🧪 Try It Yourself

Colab Notebook (Inference only):

👉 Open Notebook

pip install -q transformers accelerate qwen-vl-utils[decord]==0.0.8

Inference Snippet:

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor


MODEL_ID = "zackriya/diagram2graph-adapters"  # fine-tuned LoRA adapters on the Hugging Face Hub
MAX_PIXELS = 1280 * 28 * 28  # upper bound on resized image area for the vision encoder
MIN_PIXELS = 256 * 28 * 28   # lower bound; the processor keeps images within this range


model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

processor = Qwen2_5_VLProcessor.from_pretrained(
    MODEL_ID,
    min_pixels=MIN_PIXELS,
    max_pixels=MAX_PIXELS
)
from qwen_vl_utils import process_vision_info

SYSTEM_MESSAGE = """You are a Vision Language Model specialized in extracting structured data from visual representations of process and flow diagrams.
Your task is to analyze the provided image of a diagram and extract the relevant information into a well-structured JSON format.
The diagram includes details such as nodes and edges, each of which has its own attributes.
Focus on identifying key data fields and ensuring the output adheres to the requested JSON structure.
Provide only the JSON output based on the extracted information. Avoid additional explanations or comments."""

def run_inference(image):
  """
  Inference with the Model
  """
  messages= [
      {
          "role": "system",
          "content": [{"type": "text", "text": SYSTEM_MESSAGE}],
      },
      {
          "role": "user",
          "content": [
              {
                  "type": "image",
                  # qwen_vl_utils.process_vision_info accepts PIL images,
                  # file paths, or URLs, so any of these work here.
                  "image": image,
              },
              {
                  "type": "text",
                  "text": "Extract data in JSON format, Only give the JSON",
              },
          ],
      },
  ]

  text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
  image_inputs, _ = process_vision_info(messages)

  inputs = processor(
      text=[text],
      images=image_inputs,
      return_tensors="pt",
  )
  inputs = inputs.to(model.device)  # move tensors to wherever the model was placed

  generated_ids = model.generate(**inputs, max_new_tokens=1024)
  generated_ids_trimmed = [
      out_ids[len(in_ids):]
      for in_ids, out_ids
      in zip(inputs.input_ids, generated_ids)
  ]

  output_text = processor.batch_decode(
      generated_ids_trimmed,
      skip_special_tokens=True,
      clean_up_tokenization_spaces=False
  )
  return output_text

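To turn the decoded text into a Python dict, parse it with json.loads. The fence-stripping below is a defensive assumption (VLMs sometimes wrap JSON in markdown code fences), and the image path is a placeholder:

import json

raw = run_inference("diagram.png")[0]  # batch_decode returns a list of strings

# Defensive cleanup (assumption): strip markdown fences if the model adds them.
raw = raw.strip().removeprefix("```json").removesuffix("```").strip()

graph = json.loads(raw)
print(len(graph.get("nodes", [])), "nodes,", len(graph.get("edges", [])), "edges")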

🧱 What’s Next?

  • [ ] Neo4J Integration via Cypher parser (a minimal loading sketch follows this list)
  • [ ] Quantized model for edge devices
  • [ ] Ollama / Python SDK for plug-and-play
  • [ ] Frontend for uploads + natural queries
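
Neo4J integration is still on the roadmap, but loading the extracted JSON with the official neo4j Python driver could look roughly like this. The connection details, node label, and relationship type are placeholders, and the dict keys assume the illustrative schema shown earlier:

from neo4j import GraphDatabase

# Placeholder connection details; adjust for your instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_graph(graph: dict) -> None:
    """Upsert extracted nodes and edges into Neo4J."""
    with driver.session() as session:
        for node in graph["nodes"]:
            session.run(
                "MERGE (n:Node {id: $id}) SET n.label = $label",
                id=node["id"], label=node["label"],
            )
        for edge in graph["edges"]:
            session.run(
                "MATCH (a:Node {id: $src}), (b:Node {id: $dst}) "
                "MERGE (a)-[r:CONNECTS]->(b) SET r.label = $label",
                src=edge["source"], dst=edge["target"],
                label=edge.get("label", ""),
            )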

🙌 Thanks

Shoutout to:

  • Anthropic for Claude API
  • Hugging Face for open-source infra
  • Lightning.ai for GPUs
  • Roboflow for dataset inspiration

🌟 Final Thoughts

Task-specific VLMs like Diagram2Graph are a great middle ground:

Smaller, faster, cheaper, and surprisingly accurate.

Instead of waiting for foundation models to "get better," let’s teach them our tasks.

Fine-tune the model. Own the workflow.

If this was helpful, give the GitHub Repo a ⭐ and follow for updates!

Reach out to us if you want to collaborate on building AI products: Contact us
