Building a Privacy-First Document QA System with Gaia and Qdrant
Harish Kotra (he/him)



Publish Date: Aug 20

Here's an example project that builds a RAG pipeline over your PDFs, so you can query them locally through a local Gaia node.

What and Why?

Many of us work with PDFs daily - technical documentation, research papers, legal documents, and more. While tools like ChatGPT can help understand these documents, they require uploading potentially sensitive information to external servers. Additionally, the responses aren't always grounded in the source material, leading to potential hallucinations.

How?

Gaia PDF RAG addresses these challenges by combining several powerful technologies:

  1. Local LLM processing using Gaia nodes
  2. Efficient vector search with Qdrant
  3. Smart reranking using cross-encoders
  4. Privacy-first architecture

Let's dive into how it works and how you can use it.

Code Overview

1. Document Processing

The first step is processing PDF documents into manageable chunks. Here's how we do it:

import os
import tempfile
from typing import List

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from streamlit.runtime.uploaded_file_manager import UploadedFile


def process_document(uploaded_file: UploadedFile) -> List[Document]:
    """Process an uploaded PDF file into overlapping text chunks."""
    # PyMuPDFLoader needs a file on disk, so write the upload to a temp file first
    temp_file = tempfile.NamedTemporaryFile("wb", suffix=".pdf", delete=False)
    temp_file.write(uploaded_file.read())
    temp_file.close()

    loader = PyMuPDFLoader(temp_file.name)
    docs = loader.load()
    os.unlink(temp_file.name)  # clean up the temp file once the PDF is loaded

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=100,
        separators=["\n\n", "\n", ".", "?", "!", " ", ""],
    )
    return text_splitter.split_documents(docs)

This code:

  • Handles PDF uploads
  • Splits documents into semantic chunks
  • Preserves context through overlap
  • Cleans up temporary files
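The overlap is what preserves context across chunk boundaries: the last 100 characters of one chunk reappear at the start of the next. Here's a minimal sketch of that idea using a naive fixed-size splitter (`RecursiveCharacterTextSplitter` additionally tries to break on the separators listed above rather than mid-word):

```python
def split_text(text: str, chunk_size: int = 400, chunk_overlap: int = 100) -> list:
    """Naive fixed-size splitter with overlap -- a simplified stand-in for
    RecursiveCharacterTextSplitter, which also respects semantic separators."""
    step = chunk_size - chunk_overlap  # each chunk starts `step` chars after the last
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

With these defaults, consecutive chunks share a 100-character window, so a sentence cut off at one boundary is still fully visible in the next chunk.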

2. Vector Storage with Qdrant

We use Qdrant for efficient vector storage and retrieval:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams


def init_collection(client: QdrantClient):
    """Initialize the Qdrant collection if it doesn't exist or has wrong dimensions."""
    try:
        collection_info = client.get_collection(COLLECTION_NAME)
        current_size = collection_info.config.params.vectors.size
        if current_size != VECTOR_SIZE:
            # Dimensions changed (e.g. a new embedding model): drop and recreate.
            client.delete_collection(COLLECTION_NAME)
            raise Exception("Collection deleted due to dimension mismatch")
    except Exception:
        # Reached when the collection is missing or was just deleted above.
        client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
        )

This ensures:

  • Proper vector dimensions
  • Cosine similarity search
  • Efficient storage and retrieval
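Under the hood, retrieval is nearest-neighbour search over embedding vectors. Qdrant does this at scale with proper indexing; here's a minimal sketch of the cosine-similarity scoring it performs, using toy 2-D vectors in place of real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(query_vec, stored, top_k=3):
    """Score every stored (id, vector) pair against the query; return the top-k ids."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in stored]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:top_k]]
```

This brute-force loop is O(n) per query; Qdrant's HNSW index is what makes the same operation fast over millions of vectors.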

3. Smart Reranking

One key innovation is the use of cross-encoders for reranking:

from typing import List, Tuple

from sentence_transformers import CrossEncoder


def re_rank_cross_encoders(prompt: str, documents: List[str]) -> Tuple[str, List[int]]:
    """Re-rank candidate documents with a cross-encoder and keep the top 3."""
    encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    ranks = encoder.rank(prompt, documents, top_k=3)

    relevant_text = ""
    relevant_text_ids = []
    for rank in ranks:
        relevant_text += documents[rank["corpus_id"]] + "\n\n"  # keep passages separated
        relevant_text_ids.append(rank["corpus_id"])

    return relevant_text, relevant_text_ids

This improves accuracy by:

  • Re-scoring candidate passages
  • Considering full context
  • Filtering irrelevant results
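The retrieve-then-rerank pattern is the key idea: a cheap first stage casts a wide net, and a more expensive second stage re-scores the short candidate list. A toy sketch with made-up scoring functions (bag-of-words overlap standing in for vector search, a phrase-match bonus standing in for the cross-encoder):

```python
def retrieve(query: str, docs: list, top_k: int = 3) -> list:
    """First stage: cheap bag-of-words overlap (stands in for vector search)."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:top_k]

def rerank(query: str, candidates: list, top_k: int = 2) -> list:
    """Second stage: a finer score over the short candidate list.
    The real system runs a cross-encoder model here, which reads the
    query and passage together instead of comparing precomputed vectors."""
    def score(d: str) -> int:
        overlap = len(set(query.lower().split()) & set(d.lower().split()))
        phrase_bonus = 2 if query.lower() in d.lower() else 0
        return overlap + phrase_bonus
    return sorted(candidates, key=score, reverse=True)[:top_k]
```

The point of the split is cost: the first stage scores every chunk, so it must be cheap; the cross-encoder only ever sees the handful of survivors, so it can afford to read query and passage jointly.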

4. Integration with Gaia

The local LLM integration happens through the Gaia node:

import requests


def call_gaia_llm(context: str, prompt: str):
    """Call the local Gaia node (OpenAI-compatible API) for a streamed chat completion."""
    messages = [
        {
            "role": "system",
            "content": system_prompt  # module-level instructions to answer from context only
        },
        {
            "role": "user",
            "content": f"Context: {context}\nQuestion: {prompt}"
        }
    ]

    response = requests.post(
        f"{GAIA_NODE_URL}/chat/completions",
        json={
            "messages": messages,
            "stream": True
        },
        stream=True
    )
    return response  # iterate over this response to stream tokens to the UI
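Because Gaia nodes expose an OpenAI-compatible API, the streamed body arrives as server-sent-event lines (`data: {...}`), each carrying a small JSON "delta" with the next piece of text. A minimal parser for pulling the text out of such a stream (assuming the standard OpenAI chat-completions chunk format):

```python
import json

def parse_sse_chunks(lines):
    """Yield text deltas from OpenAI-style streaming lines ('data: {...}').

    In the app you'd feed this response.iter_lines(decode_unicode=True);
    here it accepts any iterable of strings so it's easy to test.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank separators
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]
```

Streaming the deltas into the UI as they arrive is what makes a local model feel responsive, since the first tokens appear long before the full answer is generated.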

Results and Benefits

The combination of these technologies provides several advantages:

  1. Privacy: All processing happens locally
  2. Accuracy: Cross-encoder reranking ensures relevant results
  3. Speed: Local processing means fast responses
  4. Cost: No API fees or usage limits
  5. Flexibility: Easy to customize and extend

Getting Started

Want to try it yourself? Here's how:

  1. Set up your environment:
git clone https://github.com/harishkotra/gaia-pdf-rag.git
cd gaia-pdf-rag
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
  2. Start the required services:
# Start Gaia node
gaianet init
gaianet start

# Start Qdrant
docker run -d -p 6333:6333 qdrant/qdrant
  3. Run the application:
streamlit run app.py

Future Developments

This project is just the beginning. Future plans include:

  • Multi-document support
  • Additional file formats
  • Custom embedding models
  • Enhanced reranking strategies
  • Document summarization

Contribute

Gaia PDF RAG demonstrates that we can have powerful AI capabilities without compromising on privacy. By leveraging local LLMs, efficient vector search, and smart reranking, we can build tools that are both powerful and privacy-respecting.

The project is open source and welcomes contributions. Check it out on GitHub and give it a try!

Credits

Inspired by this example.
