Here's an example project that turns your PDFs into a RAG pipeline so you can query them locally using a local Gaia node.
What and Why?
Many of us work with PDFs daily - technical documentation, research papers, legal documents, and more. While tools like ChatGPT can help you understand these documents, they require uploading potentially sensitive information to external servers. Moreover, the responses aren't always grounded in the source material, which can lead to hallucinations.
How?
Gaia PDF RAG addresses these challenges by combining several powerful technologies:
- Local LLM processing using Gaia nodes
- Efficient vector search with Qdrant
- Smart reranking using cross-encoders
- Privacy-first architecture
Let's dive into how it works and how you can use it.
Code Overview
1. Document Processing
The first step is processing PDF documents into manageable chunks. Here's how we do it:
```python
import os
import tempfile
from typing import List

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from streamlit.runtime.uploaded_file_manager import UploadedFile

def process_document(uploaded_file: UploadedFile) -> List[Document]:
    """Process uploaded PDF file into text chunks."""
    # Write the upload to a temporary file so PyMuPDF can open it by path
    temp_file = tempfile.NamedTemporaryFile("wb", suffix=".pdf", delete=False)
    temp_file.write(uploaded_file.read())
    temp_file.close()

    loader = PyMuPDFLoader(temp_file.name)
    docs = loader.load()
    os.unlink(temp_file.name)  # clean up the temporary file

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=100,
        separators=["\n\n", "\n", ".", "?", "!", " ", ""],
    )
    return text_splitter.split_documents(docs)
```
This code:
- Handles PDF uploads
- Splits documents into semantic chunks
- Preserves context through overlap
- Cleans up temporary files
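To see why the overlap matters, here's a minimal pure-Python sketch of overlapping chunking. It's illustrative only: the real splitter also respects the separator hierarchy, and the sizes and sample text below are made up.

```python
def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size chunks, where each chunk repeats the
    last `overlap` characters of the previous one."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "Gaia PDF RAG splits each document into small overlapping pieces."
chunks = chunk_text(sample, chunk_size=30, overlap=10)
# Adjacent chunks share a 10-character window, so text cut at a chunk
# boundary still appears intact in one of the two neighboring chunks.
```

Because each chunk repeats the tail of the previous one, a sentence that straddles a boundary is never lost entirely, which is exactly what `chunk_overlap=100` buys in the real splitter.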
2. Vector Storage with Qdrant
We use Qdrant for efficient vector storage and retrieval:
```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# COLLECTION_NAME and VECTOR_SIZE are defined in the app's configuration

def init_collection(client: QdrantClient):
    """Initialize Qdrant collection if it doesn't exist or has wrong dimensions"""
    try:
        collection_info = client.get_collection(COLLECTION_NAME)
        current_size = collection_info.config.params.vectors.size
        if current_size != VECTOR_SIZE:
            client.delete_collection(COLLECTION_NAME)
            # Fall through to the except branch so the collection is recreated
            raise Exception("Collection deleted due to dimension mismatch")
    except Exception:
        client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
        )
```
This ensures:
- Proper vector dimensions
- Cosine similarity search
- Efficient storage and retrieval
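Cosine distance compares vectors by angle rather than magnitude, so two chunks about the same topic score as similar even if their embedding vectors differ in length. A quick pure-Python sketch of the metric (the vector values here are made up for illustration; Qdrant computes this internally):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|), ranging over [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.2, 0.8, 0.1]
doc_a = [0.4, 1.6, 0.2]   # same direction, twice the length: similarity 1.0
doc_b = [0.9, -0.1, 0.3]  # points elsewhere: much lower similarity
```

A search against the collection returns the chunks whose stored vectors make the smallest angle with the query vector.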
3. Smart Reranking
One key innovation is the use of cross-encoders for reranking:
```python
from typing import List, Tuple

from sentence_transformers import CrossEncoder

def re_rank_cross_encoders(prompt: str, documents: List[str]) -> Tuple[str, List[int]]:
    """Re-rank documents using cross-encoder model."""
    encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    ranks = encoder.rank(prompt, documents, top_k=3)
    relevant_text = ""
    relevant_text_ids = []
    for rank in ranks:
        relevant_text += documents[rank["corpus_id"]]
        relevant_text_ids.append(rank["corpus_id"])
    return relevant_text, relevant_text_ids
```
This improves accuracy by:
- Re-scoring candidate passages
- Considering full context
- Filtering irrelevant results
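The idea is that every (query, document) pair gets scored jointly and only the top-scoring candidates survive. Here's a toy sketch of that flow, where a simple word-overlap score stands in for the actual cross-encoder model (the documents and query are invented for illustration):

```python
def toy_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words found in doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

def rerank(query: str, documents: list[str], top_k: int = 2) -> list[int]:
    """Score every (query, document) pair and return indices of the top_k."""
    order = sorted(range(len(documents)),
                   key=lambda i: toy_score(query, documents[i]),
                   reverse=True)
    return order[:top_k]

docs = [
    "Qdrant stores embedding vectors.",
    "Gaia nodes run the LLM locally.",
    "Cross-encoders rescore candidate passages against the query.",
]
order = rerank("how do cross-encoders rescore passages", docs, top_k=2)
```

A real cross-encoder replaces `toy_score` with a transformer that reads the query and passage together, which is what lets it weigh full context instead of comparing two independent embeddings.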
4. Integration with Gaia
The local LLM integration happens through the Gaia node:
```python
import requests

# system_prompt and GAIA_NODE_URL are defined elsewhere in the app

def call_gaia_llm(context: str, prompt: str):
    """Call local Gaia node for a streamed chat completion."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Context: {context}\nQuestion: {prompt}"},
    ]
    response = requests.post(
        f"{GAIA_NODE_URL}/chat/completions",
        json={"messages": messages, "stream": True},
        stream=True,
    )
    return response
Results and Benefits
The combination of these technologies provides several advantages:
- Privacy: All processing happens locally
- Accuracy: Cross-encoder reranking ensures relevant results
- Speed: Local processing means fast responses
- Cost: No API fees or usage limits
- Flexibility: Easy to customize and extend
Getting Started
Want to try it yourself? Here's how:
- Set up your environment:

```shell
git clone https://github.com/harishkotra/gaia-pdf-rag.git
cd gaia-pdf-rag
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
- Start the required services:

```shell
# Start Gaia node
gaianet init
gaianet start

# Start Qdrant
docker run -d -p 6333:6333 qdrant/qdrant
```
- Run the application:

```shell
streamlit run app.py
```
Future Developments
This project is just the beginning. Future plans include:
- Multi-document support
- Additional file formats
- Custom embedding models
- Enhanced reranking strategies
- Document summarization
Contribute
Gaia PDF RAG demonstrates that we can have powerful AI capabilities without compromising on privacy. By leveraging local LLMs, efficient vector search, and smart reranking, we can build tools that are both powerful and privacy-respecting.
The project is open source and welcomes contributions. Check it out on GitHub and give it a try!
Credits
Inspired by this example.