Using Qwen2.5-VL + PEFT + Neo4J
By Mohammed Safvan | Zackriya Solutions
🚀 TL;DR
Convert images of flowcharts or process diagrams directly into Neo4J-compatible JSON using a fine-tuned Vision Language Model (VLM). We benchmarked Claude 3.5 as a baseline, then fine-tuned a Qwen2.5-VL-3B model using PEFT (LoRA) and got a +23% F1 improvement in edge detection.
🧩 Problem
We often find technical diagrams, flowcharts, and block diagrams sitting in PDFs, whiteboards, or scanned docs. They contain valuable logic and relationships—but they’re not queryable or usable unless manually extracted.
Can we automate this diagram-to-graph extraction?
That’s what Diagram2Graph does.
In this post, we’ll show how to use a fine-tuned Vision Language Model (VLM) to convert technical diagrams and flowcharts into structured JSON. This output can be directly used with Neo4J or other knowledge graph platforms, enabling AI systems to reason over visual information.
🔍 What We Built
A fine-tuned Vision Language Model that:
- Accepts an image of a diagram
- Extracts nodes, edges, and metadata
- Outputs structured JSON (see the example below)
- Compatible with Neo4J for graph querying
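For a small two-step flowchart, the output looks roughly like this. The field names here are illustrative and simplified, so treat them as an assumption rather than the exact dataset format:

```python
# Illustrative output shape only; the exact keys in the diagramJSON dataset may differ
example_output = {
    "nodes": [
        {"id": "n1", "label": "Start", "type": "start"},
        {"id": "n2", "label": "Validate input", "type": "process"},
    ],
    "edges": [
        {"source": "n1", "target": "n2", "label": "next"},
    ],
}
```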
🔗 Resources
- 🧠 Model: Hugging Face - diagram2graph
- 📦 Dataset: diagramJSON
- 📹 Demo: YouTube Video
- 💻 Code: GitHub Repo
⚙️ Architecture at a Glance
```
+-------------------+         +---------------------------+
|   Diagram Image   +-------->|  Fine-Tuned VLM (Qwen2.5) |
+-------------------+         +-------------+-------------+
                                            |
                                            v
                             +------------------------------+
                             |  JSON (Nodes + Edges + Meta) |
                             +------------------------------+
                                            |
                                            v
                                +------------------------+
                                |   Neo4J Integration    |
                                +------------------------+
```
🤖 Why Not Use GPT-4 or Claude?
They work—but they're:
- API-bound (privacy concerns)
- Generic (prone to hallucination)
- Expensive (token + compute)
We fine-tuned a task-specific Vision Language Model (Qwen2.5-VL 3B) for diagram understanding and knowledge graph extraction.
🛠️ Tech Stack
- Model: Qwen2.5-VL-3B
- Fine-tuning: PEFT (LoRA), bf16, PyTorch Lightning
- Dataset: 218 labeled diagram images
- Backend: FastAPI + Neo4J (via Cypher); see the minimal endpoint sketch after this list
- Inference: Hugging Face Transformers
- Frontend: NextJS (WIP)
- Deployment: Kaggle + Lightning.ai
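As a rough idea of how the backend pieces connect, here is a minimal FastAPI sketch. The endpoint name is hypothetical, and it assumes the `run_inference` helper from the inference snippet later in this post is in scope:

```python
import io
import json

from fastapi import FastAPI, UploadFile
from PIL import Image

# Assumes run_inference (defined in the inference snippet below) has been imported
app = FastAPI()

@app.post("/extract")
async def extract(file: UploadFile):
    """Turn an uploaded diagram image into graph JSON."""
    image = Image.open(io.BytesIO(await file.read()))
    raw = run_inference(image)[0]  # batch_decode returns a list of strings
    return json.loads(raw)         # may need cleanup if the model adds markdown fences
```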
📊 Results
| Task | Base Model (Claude 3.5) | Qwen2.5-VL-3B (Fine-Tuned) |
|---|---|---|
| Node Detection | 74.9% F1 | 89.1% F1 (+14.2%) |
| Edge Detection | 46.05% F1 | 69.45% F1 (+23.4%) |
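For context, the F1 scores come from treating extraction as set matching between predicted and ground-truth nodes/edges. A simplified sketch of that computation follows; the exact matching rules we used, and the edit-distance part of the evaluation, are not shown here:

```python
def set_f1(pred: set, gold: set) -> float:
    """Set-based F1: a predicted element only counts on an exact match (simplified)."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Example: edges as (source_label, target_label) pairs -> F1 = 0.5
print(set_f1({("Start", "Validate"), ("Validate", "Save")},
             {("Start", "Validate"), ("Validate", "Notify")}))
```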
🧠 Training Details
| Config | Value |
|---|---|
| Epochs | 10 |
| Images | 200 (hand-labeled) |
| Batch Size | 2 |
| Method | LoRA (PEFT) |
| Precision | bf16 |
| GPU | L40S (48GB VRAM) |
| Eval Metric | Edit Distance + F1 |
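For anyone reproducing the setup, attaching LoRA adapters with PEFT looks roughly like this. The rank, alpha, dropout, and target modules below are illustrative defaults, not necessarily the exact values we trained with:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

# Load the base model to attach adapters to
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16
)

# Illustrative hyperparameters; tune rank/alpha/dropout for your data
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()  # only the adapter weights are trainable
```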
🧪 Try It Yourself
Colab Notebook (Inference only):
👉 Open Notebook
```bash
pip install -q transformers accelerate qwen-vl-utils[decord]==0.0.8
```
Inference Snippet:
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "zackriya/diagram2graph-adapters"
MAX_PIXELS = 1280 * 28 * 28
MIN_PIXELS = 256 * 28 * 28

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

processor = Qwen2_5_VLProcessor.from_pretrained(
    MODEL_ID,
    min_pixels=MIN_PIXELS,
    max_pixels=MAX_PIXELS,
)

SYSTEM_MESSAGE = """You are a Vision Language Model specialized in extracting structured data from visual representations of process and flow diagrams.
Your task is to analyze the provided image of a diagram and extract the relevant information into a well-structured JSON format.
The diagram includes details such as nodes and edges. Each of them has its own attributes.
Focus on identifying key data fields and ensuring the output adheres to the requested JSON structure.
Provide only the JSON output based on the extracted information. Avoid additional explanations or comments."""


def run_inference(image):
    """Run the fine-tuned model on a single diagram image and return the decoded JSON text."""
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": SYSTEM_MESSAGE}],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    # qwen_vl_utils's process_vision_info accepts either a PIL image or a path
                    "image": image,
                },
                {
                    "type": "text",
                    "text": "Extract data in JSON format, Only give the JSON",
                },
            ],
        },
    ]

    # Build the chat prompt and preprocess the image
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        return_tensors="pt",
    )
    inputs = inputs.to("cuda")

    # Generate, then strip the prompt tokens from the generated sequence
    generated_ids = model.generate(**inputs, max_new_tokens=1024)
    generated_ids_trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    return output_text
```
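A quick usage example (the image path is hypothetical, and the top-level keys are assumed from the example schema shown earlier):

```python
import json
from PIL import Image

raw = run_inference(Image.open("flowchart.png"))[0]  # batch_decode returns a list
graph = json.loads(raw)  # strip markdown fences first if the model wraps the JSON
print(len(graph["nodes"]), "nodes,", len(graph["edges"]), "edges")
```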
🧱 What’s Next?
- [ ] Neo4J Integration via Cypher parser
- [ ] Quantized model for edge devices
- [ ] Ollama / Python SDK for plug-and-play
- [ ] Frontend for uploads + natural queries
🙌 Thanks
Shoutout to:
- Anthropic for Claude API
- Hugging Face for open-source infra
- Lightning.ai for GPUs
- Roboflow for dataset inspiration
🌟 Final Thoughts
Task-specific VLMs like Diagram2Graph are a great middle-ground:
Smaller, faster, cheaper, and surprisingly accurate.
Instead of waiting for foundation models to "get better," let’s teach them our tasks.
Fine-tune the model. Own the workflow.
If this was helpful, give the GitHub Repo a ⭐ and follow for updates!
Reach out to us if you want to collaborate on building AI products - Contact us