Building a Privacy-First Document QA System with Gaia and Qdrant
Harish Kotra (he/him)



Publish Date: Aug 20

Here's an example project that builds a RAG pipeline over your PDFs, so you can query them locally through a local Gaia node.

What and Why?

Many of us work with PDFs daily - technical documentation, research papers, legal documents, and more. While tools like ChatGPT can help understand these documents, they require uploading potentially sensitive information to external servers. Additionally, the responses aren't always grounded in the source material, leading to potential hallucinations.

How?

Gaia PDF RAG addresses these challenges by combining several powerful technologies:

  1. Local LLM processing using Gaia nodes
  2. Efficient vector search with Qdrant
  3. Smart reranking using cross-encoders
  4. Privacy-first architecture

Let's dive into how it works and how you can use it.

Code Overview

1. Document Processing

The first step is processing PDF documents into manageable chunks. Here's how we do it:

import os
import tempfile
from typing import List

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from streamlit.runtime.uploaded_file_manager import UploadedFile


def process_document(uploaded_file: UploadedFile) -> List[Document]:
    """Process an uploaded PDF file into overlapping text chunks."""
    # PyMuPDFLoader needs a file on disk, so write the upload to a temp file first
    temp_file = tempfile.NamedTemporaryFile("wb", suffix=".pdf", delete=False)
    temp_file.write(uploaded_file.read())
    temp_file.close()

    loader = PyMuPDFLoader(temp_file.name)
    docs = loader.load()
    os.unlink(temp_file.name)  # clean up the temp file once the PDF is loaded

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=100,
        separators=["\n\n", "\n", ".", "?", "!", " ", ""],
    )
    return text_splitter.split_documents(docs)

This code:

  • Handles PDF uploads
  • Splits documents into semantic chunks
  • Preserves context through overlap
  • Cleans up temporary files
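The overlap is what preserves context across chunk boundaries: the last 100 characters of one chunk reappear at the start of the next. Here's a minimal sketch of that idea using a naive fixed-size splitter (`RecursiveCharacterTextSplitter` additionally tries to break on the separators listed above rather than mid-word):

```python
def split_text(text: str, chunk_size: int = 400, chunk_overlap: int = 100) -> list:
    """Naive fixed-size splitter with overlap -- a simplified stand-in for
    RecursiveCharacterTextSplitter, which also respects semantic separators."""
    step = chunk_size - chunk_overlap  # each chunk starts `step` chars after the last
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

With these defaults, consecutive chunks share a 100-character window, so a sentence cut off at one boundary is still fully visible in the next chunk.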

2. Vector Storage with Qdrant

We use Qdrant for efficient vector storage and retrieval:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams


def init_collection(client: QdrantClient):
    """Initialize the Qdrant collection if it doesn't exist or has wrong dimensions."""
    try:
        collection_info = client.get_collection(COLLECTION_NAME)
        current_size = collection_info.config.params.vectors.size
        if current_size != VECTOR_SIZE:
            # Dimensions changed (e.g. a new embedding model): drop and recreate.
            client.delete_collection(COLLECTION_NAME)
            raise Exception("Collection deleted due to dimension mismatch")
    except Exception:
        # Reached when the collection is missing or was just deleted above.
        client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
        )

This ensures:

  • Proper vector dimensions
  • Cosine similarity search
  • Efficient storage and retrieval
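Under the hood, retrieval is nearest-neighbour search over embedding vectors. Qdrant does this at scale with proper indexing; here's a minimal sketch of the cosine-similarity scoring it performs, using toy 2-D vectors in place of real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(query_vec, stored, top_k=3):
    """Score every stored (id, vector) pair against the query; return the top-k ids."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in stored]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:top_k]]
```

This brute-force loop is O(n) per query; Qdrant's HNSW index is what makes the same operation fast over millions of vectors.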

3. Smart Reranking

One key innovation is the use of cross-encoders for reranking:

from typing import List, Tuple

from sentence_transformers import CrossEncoder


def re_rank_cross_encoders(prompt: str, documents: List[str]) -> Tuple[str, List[int]]:
    """Re-rank candidate documents with a cross-encoder and keep the top 3."""
    encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    ranks = encoder.rank(prompt, documents, top_k=3)

    relevant_text = ""
    relevant_text_ids = []
    for rank in ranks:
        relevant_text += documents[rank["corpus_id"]] + "\n\n"  # keep passages separated
        relevant_text_ids.append(rank["corpus_id"])

    return relevant_text, relevant_text_ids

This improves accuracy by:

  • Re-scoring candidate passages
  • Considering full context
  • Filtering irrelevant results
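The retrieve-then-rerank pattern is the key idea: a cheap first stage casts a wide net, and a more expensive second stage re-scores the short candidate list. A toy sketch with made-up scoring functions (bag-of-words overlap standing in for vector search, a phrase-match bonus standing in for the cross-encoder):

```python
def retrieve(query: str, docs: list, top_k: int = 3) -> list:
    """First stage: cheap bag-of-words overlap (stands in for vector search)."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:top_k]

def rerank(query: str, candidates: list, top_k: int = 2) -> list:
    """Second stage: a finer score over the short candidate list.
    The real system runs a cross-encoder model here, which reads the
    query and passage together instead of comparing precomputed vectors."""
    def score(d: str) -> int:
        overlap = len(set(query.lower().split()) & set(d.lower().split()))
        phrase_bonus = 2 if query.lower() in d.lower() else 0
        return overlap + phrase_bonus
    return sorted(candidates, key=score, reverse=True)[:top_k]
```

The point of the split is cost: the first stage scores every chunk, so it must be cheap; the cross-encoder only ever sees the handful of survivors, so it can afford to read query and passage jointly.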

4. Integration with Gaia

The local LLM integration happens through the Gaia node:

import requests


def call_gaia_llm(context: str, prompt: str):
    """Call the local Gaia node (OpenAI-compatible API) for a streamed chat completion."""
    messages = [
        {
            "role": "system",
            "content": system_prompt  # module-level instructions to answer from context only
        },
        {
            "role": "user",
            "content": f"Context: {context}\nQuestion: {prompt}"
        }
    ]

    response = requests.post(
        f"{GAIA_NODE_URL}/chat/completions",
        json={
            "messages": messages,
            "stream": True
        },
        stream=True
    )
    return response  # iterate over this response to stream tokens to the UI
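Because Gaia nodes expose an OpenAI-compatible API, the streamed body arrives as server-sent-event lines (`data: {...}`), each carrying a small JSON "delta" with the next piece of text. A minimal parser for pulling the text out of such a stream (assuming the standard OpenAI chat-completions chunk format):

```python
import json

def parse_sse_chunks(lines):
    """Yield text deltas from OpenAI-style streaming lines ('data: {...}').

    In the app you'd feed this response.iter_lines(decode_unicode=True);
    here it accepts any iterable of strings so it's easy to test.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank separators
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]
```

Streaming the deltas into the UI as they arrive is what makes a local model feel responsive, since the first tokens appear long before the full answer is generated.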

Results and Benefits

The combination of these technologies provides several advantages:

  1. Privacy: All processing happens locally
  2. Accuracy: Cross-encoder reranking ensures relevant results
  3. Speed: Local processing means fast responses
  4. Cost: No API fees or usage limits
  5. Flexibility: Easy to customize and extend

Getting Started

Want to try it yourself? Here's how:

  1. Set up your environment:
git clone https://github.com/harishkotra/gaia-pdf-rag.git
cd gaia-pdf-rag
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
  2. Start the required services:
# Start Gaia node
gaianet init
gaianet start

# Start Qdrant
docker run -d -p 6333:6333 qdrant/qdrant
  3. Run the application:
streamlit run app.py

Future Developments

This project is just the beginning. Future plans include:

  • Multi-document support
  • Additional file formats
  • Custom embedding models
  • Enhanced reranking strategies
  • Document summarization

Contribute

Gaia PDF RAG demonstrates that we can have powerful AI capabilities without compromising on privacy. By leveraging local LLMs, efficient vector search, and smart reranking, we can build tools that are both powerful and privacy-respecting.

The project is open source and welcomes contributions. Check it out on GitHub and give it a try!

Credits

Inspired by this example.
