Taming Unstructured Data: From PDFs to JSON with Quarkus and Docling

Enterprise data rarely arrives in clean, structured formats. In the real world, valuable information is buried in PDFs, Word documents, ELAN files, transcribed field notes, or multilingual glossed text. If you’re building AI-infused applications, such as a chatbot that explains policy documents, a RAG pipeline that pulls context from user manuals, or a smart index for legal archives, you need a reliable, scalable way to convert these formats into structured, machine-consumable data.

In this hands-on tutorial, we’ll use Quarkus and the Docling extension to build a REST API that transforms unstructured documents into clean JSON, TSV, or XML formats. This data is ready for downstream use: embedding, chunking, vector search, or even linguistic analysis.

Docling and Docling Serve

The Docling Serve project is a lightweight server implementation for the Docling document transformation engine, designed to expose Docling’s powerful format conversion capabilities over a simple HTTP API. It acts as the backend service behind client libraries like the Quarkus Docling extension, enabling developers to convert complex linguistic, annotated, or unstructured document formats, such as ELAN (.eaf), Toolbox, DOCX, or PDF, into structured outputs like JSON, TSV, or XML. Docling Serve is ideal for embedding into NLP pipelines, AI backend services, or digital humanities tools where text segmentation, speaker identification, or gloss extraction are required. It runs as a stateless container and is optimized for easy integration and scalability.

Why This Problem Matters

Most enterprise AI projects don’t fail at the model layer. They fail in the messy middle: data preparation. Consider:

  • Business knowledge lives in PDFs. Contracts, datasheets, manuals, and policies are almost always stored as .pdf.

  • Word documents dominate collaboration. Internal playbooks, meeting notes, and feedback are often .docx.

  • Legacy and linguistic projects use ELAN, Toolbox, or FLEx. Parsing these formats reliably is non-trivial.

If you try to shove these formats directly into a vector DB or LLM pipeline, you’ll get garbage in, garbage out. You need semantic segmentation, metadata extraction, and structure preservation before anything else.

That’s what Docling offers. It understands structured annotations, interlinear glossed text, speaker metadata, and other linguistic features. It also handles common business formats like PDF and DOCX and emits clean, chunkable outputs.

Let’s build an application that wraps all of that in a fast Quarkus service.

Prerequisites

To follow along, make sure you have the following installed:

  • Java 17+

  • Maven

  • Podman (with a running Podman Machine)

  • An IDE (e.g., IntelliJ IDEA or VS Code)

There is no need to manually install Docling; Quarkus takes care of that for you when it starts the Dev Service. The complete project is in my GitHub repository if you prefer to start from there.

Bootstrap Your Quarkus Project

mvn io.quarkus.platform:quarkus-maven-plugin:create \
    -DprojectGroupId=com.ibm.developer \
    -DprojectArtifactId=quarkus-docling-converter \
    -Dextensions="rest-jackson,quarkus-docling"
cd quarkus-docling-converter

This sets up your project with the Docling extension and Jackson-based JSON support.

Implement the Core Docling Service

Currently, the quarkus-docling extension is a set of wrappers around the Docling Serve project, which exposes Docling as a REST API. It also provides a Dev Service and a Dev UI integration.

Create a class Docling.java in com.ibm.developer:

@ApplicationScoped
public class Docling {

    @Inject
    DoclingApi doclingApi;

    /**
     * Converts a remote document. Docling Serve fetches the URL itself.
     */
    public ConvertDocumentResponse convertFromUrl(URI uri, OutputFormat outputFormat) {
        HttpSource source = new HttpSource();
        source.setUrl(uri);

        ConversionRequest request = new ConversionRequest()
            .addHttpSourcesItem(source)
            .options(new ConvertDocumentsOptions().toFormats(List.of(outputFormat)));

        return doclingApi.processUrlV1alphaConvertSourcePost(request);
    }

    /**
     * Converts raw document bytes by base64-encoding them for the wire.
     */
    public ConvertDocumentResponse convertFromBytes(byte[] content, String filename, OutputFormat outputFormat) {
        String base64 = Base64.getEncoder().encodeToString(content);
        return convertFromBase64(base64, filename, outputFormat);
    }

    /**
     * Converts an already base64-encoded document to the requested output format.
     */
    public ConvertDocumentResponse convertFromBase64(String base64, String filename, OutputFormat outputFormat) {
        FileSource source = new FileSource()
            .base64String(base64)
            .filename(filename);

        ConversionRequest request = new ConversionRequest()
            .addFileSourcesItem(source)
            .options(new ConvertDocumentsOptions().toFormats(List.of(outputFormat)));

        return doclingApi.processUrlV1alphaConvertSourcePost(request);
    }
}


This service handles both direct file uploads and remote URLs, sending them to Docling Serve and converting any supported input format into structured output.
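
The REST endpoint below only exercises the byte-based path, but the URL-based path is just as easy to expose. As a sketch of my own (not part of the tutorial repository), reusing only the types shown above:

@Path("/convert-url")
public class UrlConverterResource {

    @Inject
    Docling docling;

    @POST
    @Consumes(MediaType.TEXT_PLAIN)
    @Produces(MediaType.TEXT_PLAIN)
    public Response convert(String url) {
        // Docling Serve fetches the document from the URL itself
        ConvertDocumentResponse result =
                docling.convertFromUrl(URI.create(url), OutputFormat.TEXT);
        return Response.ok(result.getDocument().getTextContent()).build();
    }
}

Posting a plain-text URL to /convert-url would then return the extracted text, with no upload needed.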

Create a REST Endpoint

Rename GreetingResource.java to ConverterResource.java and replace its content with:

@Path("/convert")
@Consumes(MediaType.MULTIPART_FORM_DATA)
@Produces(MediaType.TEXT_PLAIN)
public class ConverterResource {

    @Inject
    Docling docling;

    String textContent = "";

    @POST
    public Response convert(@RestForm("file") FileUpload file) {
        if (file == null) {
            return Response.status(Response.Status.BAD_REQUEST)
                    .entity("Error: No file uploaded.").build();
        }

        try {

            byte[] imageBytes = Files.readAllBytes(file.uploadedFile());

            ConvertDocumentResponse result = docling.convertFromBytes(
                    imageBytes,
                    file.fileName(),
                    OutputFormat.TEXT);

            this.textContent = result.getDocument().getTextContent();

        } catch (java.io.IOException e) {
            return Response.status(Response.Status.BAD_REQUEST)
                    .entity("Failed to read uploaded file: " + e.getMessage())
                    .build();
        }

        return Response.ok(textContent).build();
    }
}
The endpoint does four things:

  1. File upload: Accepts a file via multipart form data

  2. Validation: Checks that a file was actually uploaded

  3. Conversion: Uses the Docling service to convert the document bytes to plain text

  4. Response: Returns the extracted text
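
To verify the wiring end to end, a quick integration test can upload a sample document. This is a minimal sketch of my own, assuming the quarkus-junit5 and rest-assured test dependencies are present and a small sample.pdf sits under src/test/resources:

import io.quarkus.test.junit.QuarkusTest;
import org.junit.jupiter.api.Test;

import java.io.File;

import static io.restassured.RestAssured.given;
import static org.hamcrest.Matchers.emptyString;
import static org.hamcrest.Matchers.not;

@QuarkusTest
class ConverterResourceTest {

    @Test
    void convertsUploadedPdfToText() {
        // sample.pdf is assumed to live under src/test/resources
        File sample = new File("src/test/resources/sample.pdf");

        given()
            .multiPart("file", sample, "application/pdf")
        .when()
            .post("/convert")
        .then()
            .statusCode(200)
            .body(not(emptyString()));
    }
}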

Try It Out

Grab a sample PDF that is simple enough for your local Docling runtime but complex enough to show some of Docling's power. I just grabbed a random Red Hat whitepaper from redhat.com as an example.

Now send this to the endpoint:

curl -F "file=@sample.pdf" \
     -F "outputFormat=json" \
     http://localhost:8080/convert
Enter fullscreen mode Exit fullscreen mode

You will see the complete text output, including any base64-encoded images, in the response.
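
If you would rather get Docling's structured JSON than plain text, the only change in the resource is the requested output format. A minimal sketch, assuming the generated client exposes a JSON constant on OutputFormat and a matching accessor on the response document (both are assumptions on my part):

// In ConverterResource.convert(...), request JSON instead of plain text
ConvertDocumentResponse result = docling.convertFromBytes(
        fileBytes,
        file.fileName(),
        OutputFormat.JSON); // assumption: JSON is a supported OutputFormat constant

// assumption: the generated document type exposes the JSON payload via its own getter
Object structured = result.getDocument().getJsonContent();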

Going Further

There’s a lot more that can be done; for now, I will leave you here to take this further with your own experimentation. Keep in mind that this is just the very beginning of a Docling integration with Quarkus. The eventual goal is to unify the DoclingDocument format with LangChain4j's Document abstraction so that Docling can be used in a LangChain4j RAG pipeline for ingesting data.

What you could do today if you like:

  • Add a /formats endpoint to expose available input/output formats

  • Support bulk conversion from ZIP files

  • Add integration with LangChain4j to process the output directly (see the sketch after this list)

  • Store converted chunks in a vector DB like Weaviate or pgvector
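
For the LangChain4j item, the text output can already be wrapped and chunked today, even before the planned DoclingDocument unification lands. A minimal sketch, assuming the langchain4j core artifact is on the classpath:

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.segment.TextSegment;

import java.util.List;

public class ChunkingExample {

    public static List<TextSegment> chunk(String doclingTextOutput) {
        // Wrap Docling's text output in a LangChain4j Document
        Document document = Document.from(doclingTextOutput);

        // Split recursively: at most 500 characters per segment, 50 characters overlap
        DocumentSplitter splitter = DocumentSplitters.recursive(500, 50);

        return splitter.split(document);
    }
}

Each resulting TextSegment is then ready for embedding and storage in a vector database.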

With Docling and Quarkus, you now have a scalable, flexible foundation for turning unstructured documents into structured inputs for AI. Your models are only as good as the data they see, so make that data count.
