🧠 Diagram2Graph: Fine-Tuning a Vision Language Model to Extract Knowledge Graphs from Diagrams
Sujith S (@sujiths) · Apr 5

Using Qwen2.5-VL + PEFT + Neo4J

Fine-tuned by Mohammed Safvan | Zackriya Solutions


🚀 TL;DR

Convert images of flowcharts and process diagrams directly into Neo4J-compatible JSON using a fine-tuned Vision Language Model (VLM). We started with Claude 3.5 as a baseline, then fine-tuned Qwen2.5-VL-3B with PEFT (LoRA) and gained roughly 23 percentage points of F1 on edge detection.


🧩 Problem

We often find technical diagrams, flowcharts, and block diagrams sitting in PDFs, whiteboards, or scanned docs. They contain valuable logic and relationships—but they’re not queryable or usable unless manually extracted.

Can we automate this diagram-to-graph extraction?

That’s what Diagram2Graph does.

In this post, we’ll show how to use a fine-tuned Vision Language Model (VLM) to convert technical diagrams and flowcharts into structured JSON. This output can be directly used with Neo4J or other knowledge graph platforms, enabling AI systems to reason over visual information.


🔍 What We Built

A fine-tuned Vision Language Model that:

  • Accepts an image of a diagram
  • Extracts nodes, edges, and metadata
  • Outputs structured JSON
  • Produces output compatible with Neo4J for graph querying (a sample of the shape is shown below)
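
For illustration, here is a plausible shape for that JSON. The post doesn't pin down the exact schema, so the field names below (id, label, type, source, target) are assumptions, not the model's guaranteed output:

# Illustrative output shape (assumed field names), shown as a Python dict:
example_output = {
    "nodes": [
        {"id": "n1", "label": "Start", "type": "terminator"},
        {"id": "n2", "label": "Validate input", "type": "process"},
    ],
    "edges": [
        {"source": "n1", "target": "n2", "label": "next"},
    ],
}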

⚙️ Architecture at a Glance

+-------------------+         +---------------------------+
|   Diagram Image   +-------> |  Fine-Tuned VLM (Qwen2.5) |
+-------------------+         +-------------+-------------+
                                            |
                                            v
                           +------------------------------+
                           |  JSON (Nodes + Edges + Meta) |
                           +------------------------------+
                                            |
                                            v
                              +------------------------+
                              |     Neo4J Integration  |
                              +------------------------+

🤖 Why Not Use GPT-4 or Claude?

They work, but they're:

  • API-bound (privacy concerns)
  • Generic (prone to hallucination)
  • Expensive (token + compute)

We fine-tuned a task-specific Vision Language Model (Qwen2.5-VL 3B) for diagram understanding and knowledge graph extraction.


🛠️ Tech Stack

  • Model: Qwen2.5-VL-3B
  • Fine-tuning: PEFT (LoRA), bf16, PyTorch Lightning (see the adapter sketch after this list)
  • Dataset: 218 labeled diagram images
  • Backend: FastAPI + Neo4J (via Cypher)
  • Inference: Hugging Face Transformers
  • Frontend: NextJS (WIP)
  • Deployment: Kaggle + Lightning.ai
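
For reference, attaching LoRA adapters to Qwen2.5-VL with PEFT looks roughly like the sketch below. The rank, alpha, dropout, and target modules are illustrative assumptions; the post doesn't list the exact values used in training:

import torch
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the base model in bf16, matching the training precision.
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Hyperparameters here are assumptions for illustration only.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights should be trainable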

📊 Results

Task             Base Model (Claude 3.5)   Qwen2.5-VL-3B (Fine-Tuned)
Node Detection   74.9% F1                  89.1% F1 (+14.2 pts)
Edge Detection   46.05% F1                 69.45% F1 (+23.4 pts)

🧠 Training Details

Config        Value
Epochs        10
Images        200 (hand-labeled)
Batch Size    2
Method        LoRA (PEFT)
Precision     bf16
GPU           L40S (48 GB VRAM)
Eval Metric   Edit Distance + F1
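
The post doesn't spell out how F1 is computed over graphs. One common approach, sketched below, is to match predicted nodes/edges against ground truth and score precision/recall from the overlap; the exact-match rule here is an assumption (the "Edit Distance" metric suggests fuzzier label matching may also be involved):

def f1_score(predicted: set, ground_truth: set) -> float:
    """F1 over sets of graph elements, e.g. node labels or
    (source, target) edge pairs. Exact-match comparison is an
    illustrative assumption."""
    if not predicted or not ground_truth:
        return 0.0
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: edges as (source, target) pairs
pred = {("Start", "Validate"), ("Validate", "Process")}
gold = {("Start", "Validate"), ("Validate", "Store")}
print(f1_score(pred, gold))  # 0.5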

🧪 Try It Yourself

Colab Notebook (Inference only):

👉 Open Notebook

pip install -q transformers accelerate qwen-vl-utils[decord]==0.0.8

Inference Snippet:

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor


MODEL_ID = "zackriya/diagram2graph-adapters"  # fine-tuned LoRA adapters on the Hugging Face Hub
MAX_PIXELS = 1280 * 28 * 28  # upper bound on resized image area for the vision encoder
MIN_PIXELS = 256 * 28 * 28   # lower bound; the processor keeps images within this range


model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

processor = Qwen2_5_VLProcessor.from_pretrained(
    MODEL_ID,
    min_pixels=MIN_PIXELS,
    max_pixels=MAX_PIXELS
)
from qwen_vl_utils import process_vision_info

SYSTEM_MESSAGE = """You are a Vision Language Model specialized in extracting structured data from visual representations of process and flow diagrams.
Your task is to analyze the provided image of a diagram and extract the relevant information into a well-structured JSON format.
The diagram includes details such as nodes and edges, each of which has its own attributes.
Focus on identifying key data fields and ensuring the output adheres to the requested JSON structure.
Provide only the JSON output based on the extracted information. Avoid additional explanations or comments."""

def run_inference(image):
  """
  Inference with the Model
  """
  messages= [
      {
          "role": "system",
          "content": [{"type": "text", "text": SYSTEM_MESSAGE}],
      },
      {
          "role": "user",
          "content": [
              {
                  "type": "image",
                  # qwen_vl_utils.process_vision_info accepts PIL images,
                  # file paths, or URLs, so any of these work here.
                  "image": image,
              },
              {
                  "type": "text",
                  "text": "Extract data in JSON format, Only give the JSON",
              },
          ],
      },
  ]

  text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
  image_inputs, _ = process_vision_info(messages)

  inputs = processor(
      text=[text],
      images=image_inputs,
      return_tensors="pt",
  )
  inputs = inputs.to(model.device)  # move tensors to wherever the model was placed

  generated_ids = model.generate(**inputs, max_new_tokens=1024)
  generated_ids_trimmed = [
      out_ids[len(in_ids):]
      for in_ids, out_ids
      in zip(inputs.input_ids, generated_ids)
  ]

  output_text = processor.batch_decode(
      generated_ids_trimmed,
      skip_special_tokens=True,
      clean_up_tokenization_spaces=False
  )
  return output_text

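To turn the decoded text into a Python dict, parse it with json.loads. The fence-stripping below is a defensive assumption (VLMs sometimes wrap JSON in markdown code fences), and the image path is a placeholder:

import json

raw = run_inference("diagram.png")[0]  # batch_decode returns a list of strings

# Defensive cleanup (assumption): strip markdown fences if the model adds them.
raw = raw.strip().removeprefix("```json").removesuffix("```").strip()

graph = json.loads(raw)
print(len(graph.get("nodes", [])), "nodes,", len(graph.get("edges", [])), "edges")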

🧱 What’s Next?

  • [ ] Neo4J Integration via Cypher parser (a minimal loading sketch follows this list)
  • [ ] Quantized model for edge devices
  • [ ] Ollama / Python SDK for plug-and-play
  • [ ] Frontend for uploads + natural queries
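
Neo4J integration is still on the roadmap, but loading the extracted JSON with the official neo4j Python driver could look roughly like this. The connection details, node label, and relationship type are placeholders, and the dict keys assume the illustrative schema shown earlier:

from neo4j import GraphDatabase

# Placeholder connection details; adjust for your instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_graph(graph: dict) -> None:
    """Upsert extracted nodes and edges into Neo4J."""
    with driver.session() as session:
        for node in graph["nodes"]:
            session.run(
                "MERGE (n:Node {id: $id}) SET n.label = $label",
                id=node["id"], label=node["label"],
            )
        for edge in graph["edges"]:
            session.run(
                "MATCH (a:Node {id: $src}), (b:Node {id: $dst}) "
                "MERGE (a)-[r:CONNECTS]->(b) SET r.label = $label",
                src=edge["source"], dst=edge["target"],
                label=edge.get("label", ""),
            )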

🙌 Thanks

Shoutout to:

  • Anthropic for Claude API
  • Hugging Face for open-source infra
  • Lightning.ai for GPUs
  • Roboflow for dataset inspiration

🌟 Final Thoughts

Task-specific VLMs like Diagram2Graph are a great middle ground:

Smaller, faster, cheaper, and surprisingly accurate.

Instead of waiting for foundation models to "get better," let’s teach them our tasks.

Fine-tune the model. Own the workflow.

If this was helpful, give the GitHub Repo a ⭐ and follow for updates!

Reach out to us if you want to collaborate on building AI products: Contact us
