TL;DR
This post demonstrates how to build a production-ready Retrieval Augmented Generation (RAG) system using:
- 🌐 Gaia: Decentralized AI infrastructure with OpenAI-compatible APIs
- 🗄️ Weaviate: Advanced vector database replacing traditional solutions
- 📊 Real-World Data: Live integration with Wikipedia, ArXiv, GitHub, and news sources
Key Result: A complete RAG pipeline that processes 50+ documents, performs semantic search, and generates responses using decentralized AI infrastructure.
🎯 Why This Matters
Traditional RAG systems rely on centralized providers like OpenAI, creating single points of failure and vendor lock-in. This architecture demonstrates:
- Decentralization: Use public Gaia nodes instead of centralized APIs
- Flexibility: Replace built-in vector stores with specialized solutions
- Real-World Data: Process live data from multiple internet sources
- Production Ready: Environment configuration, health monitoring, error handling
🧠 Understanding the Platforms
Gaia: Decentralized AI Infrastructure
What is Gaia?
Gaia is a decentralized infrastructure for AI agents that provides OpenAI-compatible APIs while running on distributed nodes.
Key Features:
- OpenAI Compatibility: Drop-in replacement for OpenAI APIs
- Decentralized: No single point of failure
- Model Flexibility: Support for Llama, Qwen, Gemma, and other open models
Example Gaia Node:
https://0x299eae67ba6bbae8d61faad2d70115dc5a6855c8.gaia.domains/v1
Example Gaia Node Config:
{
  "address": "",
  "chat": "https://huggingface.co/gaianet/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-Q5_K_M.gguf",
  "chat_batch_size": "128",
  "chat_ctx_size": "8192",
  "chat_name": "Gemma-3.4B-IT",
  "chat_ubatch_size": "128",
  "context_window": "1",
  "description": "Gaia node running with Gemma-3.4B-IT model without any knowledgebase.",
  "domain": "gaia.domains",
  "embedding": "https://huggingface.co/gaianet/gte-Qwen2-1.5B-instruct-GGUF/resolve/main/gte-Qwen2-1.5B-instruct-f16.gguf",
  "embedding_batch_size": "8192",
  "embedding_collection_name": "default",
  "embedding_ctx_size": "8192",
  "embedding_name": "gte-Qwen2-1.5B-instruct-f16",
  "embedding_ubatch_size": "8192",
  "llamaedge_chat_port": "9075",
  "llamaedge_embedding_port": "9076",
  "llamaedge_port": "8086",
  "prompt_template": "gemma-3",
  "qdrant_limit": "1",
  "qdrant_score_threshold": "0.5",
  "rag_policy": "system-message",
  "rag_prompt": "Use the following information to answer the question.\n----------------\n",
  "reverse_prompt": "",
  "snapshot": "",
  "system_prompt": "You're a helpful assistant"
}
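Because the node speaks the OpenAI protocol, you can talk to it with the standard openai Python client. A minimal sketch, assuming the public node above and the placeholder test-key API key used later in this post:
from openai import OpenAI

# Point the standard OpenAI client at the Gaia node (placeholder API key)
client = OpenAI(
    base_url="https://0x299eae67ba6bbae8d61faad2d70115dc5a6855c8.gaia.domains/v1",
    api_key="test-key",
)

# List the models the node serves (chat + embedding)
print([m.id for m in client.models.list().data])

# Ask the chat model a question
response = client.chat.completions.create(
    model="Gemma-3.4B-IT",
    messages=[{"role": "user", "content": "What is retrieval augmented generation?"}],
    max_tokens=300,
)
print(response.choices[0].message.content)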
Weaviate: Advanced Vector Database
What is Weaviate?
Weaviate is an open-source vector database designed for AI applications, offering advanced features beyond simple vector storage.
Why Choose Weaviate Over Qdrant?
Gaia nodes already ship with an embedded Qdrant store (note the qdrant_* settings in the node config above). Weaviate adds pluggable vectorizer modules, nested object properties for rich metadata, and built-in hybrid (vector + BM25) search, which is why it backs retrieval in this pipeline.
Weaviate Vectorizer Options:
# Local embeddings (no API key needed)
VECTORIZER_MODULE=text2vec-transformers
# OpenAI embeddings
VECTORIZER_MODULE=text2vec-openai
OPENAI_API_KEY=your-key
# Cohere embeddings
VECTORIZER_MODULE=text2vec-cohere
COHERE_API_KEY=your-key
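For orientation, connecting to a local Weaviate instance with the v4 Python client is a few lines; a minimal sketch, assuming the weaviate-client package and the Docker setup shown in the next section:
import weaviate

# Connect to the Weaviate instance started via Docker Compose
client = weaviate.connect_to_local(host="localhost", port=8080)
print(client.is_ready())  # True once the instance is up
client.close()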
🏗️ System Architecture
At a high level, the pipeline fetches documents from Wikipedia, ArXiv, GitHub, and RSS feeds, chunks them, and stores them in Weaviate (vectorized with text2vec-transformers). At query time it retrieves the most similar chunks and passes them as context to a Gaia node for generation.
🛠️ Implementation Deep Dive
1. Start Weaviate with Docker Compose
Create a docker-compose.yml file with this production-ready configuration:
---
services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: cr.weaviate.io/semitechnologies/weaviate:1.30.0
    ports:
    - 8080:8080
    - 50051:50051
    restart: on-failure:0
    environment:
      TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080'
      QNA_INFERENCE_API: 'http://qna-transformers:8080'
      OPENAI_APIKEY: $OPENAI_APIKEY
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      ENABLE_MODULES: 'text2vec-transformers,qna-transformers,generative-openai'
      CLUSTER_HOSTNAME: 'node1'
  t2v-transformers:
    image: cr.weaviate.io/semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
    environment:
      ENABLE_CUDA: '0'
  qna-transformers:
    image: cr.weaviate.io/semitechnologies/qna-transformers:distilbert-base-uncased-distilled-squad
    environment:
      ENABLE_CUDA: '0'
Start Weaviate:
docker compose up -d
This configuration provides:
- Latest Weaviate: Version 1.30.0 with latest features
- Multiple Vectorizers: text2vec-transformers + QnA transformers
- Production Ready: Proper restart policies and persistence
- GPU Support: Set ENABLE_CUDA: '1' if you have an NVIDIA GPU
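Once the containers are running, a quick readiness probe confirms the instance is up; a small sketch using requests against Weaviate's standard readiness endpoint:
import requests

# Weaviate exposes a readiness endpoint on its REST port
resp = requests.get("http://localhost:8080/v1/.well-known/ready", timeout=5)
print("Weaviate ready" if resp.status_code == 200 else f"Not ready: {resp.status_code}")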
2. Environment Configuration
The system uses a comprehensive .env configuration for production readiness:
# Gaia Node Configuration
GAIA_BASE_URL=https://0x299eae67ba6bbae8d61faad2d70115dc5a6855c8.gaia.domains/v1
GAIA_API_KEY=test-key
GAIA_MODEL_NAME=Gemma-3.4B-IT
# Weaviate Configuration
WEAVIATE_HOST=localhost
WEAVIATE_PORT=8080
WEAVIATE_USE_AUTH=false
# Vector Configuration
VECTORIZER_MODULE=text2vec-transformers
DEFAULT_COLLECTION_NAME=RealWorldKnowledgeBase
# Generation Parameters
MAX_TOKENS=300
TEMPERATURE=0.7
SEARCH_LIMIT=3
# Performance Tuning
BATCH_SIZE=100
CONNECTION_TIMEOUT=30
DEBUG=true
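One way to consume this file is a small config loader built on python-dotenv; a sketch in which the variable names match the .env above and the Config class itself is illustrative:
import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the working directory

class Config:
    GAIA_BASE_URL = os.getenv("GAIA_BASE_URL")
    GAIA_API_KEY = os.getenv("GAIA_API_KEY", "test-key")
    GAIA_MODEL_NAME = os.getenv("GAIA_MODEL_NAME", "Gemma-3.4B-IT")
    WEAVIATE_HOST = os.getenv("WEAVIATE_HOST", "localhost")
    WEAVIATE_PORT = int(os.getenv("WEAVIATE_PORT", "8080"))
    MAX_TOKENS = int(os.getenv("MAX_TOKENS", "300"))
    TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
    SEARCH_LIMIT = int(os.getenv("SEARCH_LIMIT", "3"))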
3. Data Source Integration
The system fetches real-world data from multiple sources:
Wikipedia Integration
class WikipediaSource(DataSource):
    def fetch_data(self, topics: List[str]) -> List[Dict[str, Any]]:
        for topic in topics:
            # Fetch full article content via the MediaWiki API
            params = {
                'action': 'query',
                'format': 'json',
                'titles': topic,
                'prop': 'extracts',
                'explaintext': True
            }
            # Process and chunk content
            chunks = self.chunk_text(content, max_length=1500)
ArXiv Research Papers
class ArXivSource(DataSource):
    def fetch_data(self, search_terms: List[str]) -> List[Dict[str, Any]]:
        for term in search_terms:
            params = {
                'search_query': f'all:{term}',
                'sortBy': 'submittedDate',
                'sortOrder': 'descending'
            }
            # Parse XML response and extract metadata
GitHub Documentation
class GitHubSource(DataSource):
    def fetch_data(self, repos: List[str]) -> List[Dict[str, Any]]:
        for repo in repos:
            # Fetch README via GitHub API
            readme_url = f"https://api.github.com/repos/{repo}/readme"
            # Decode base64 content and process
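For context, the README endpoint returns its content base64-encoded; decoding it looks roughly like this (a sketch with requests, using a hypothetical repository name):
import base64
import requests

# Fetch a README through the GitHub API and decode its base64 payload
resp = requests.get("https://api.github.com/repos/weaviate/weaviate/readme", timeout=30)
readme_text = base64.b64decode(resp.json()["content"]).decode("utf-8")
print(readme_text[:200])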
4. Weaviate Schema Design
Advanced schema with nested properties for rich metadata:
from weaviate.classes.config import Configure, DataType, Property

collection = weaviate_client.collections.create(
    name="RealWorldKnowledgeBase",
    vectorizer_config=Configure.Vectorizer.text2vec_transformers(),
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="content", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT),
        Property(
            name="metadata",
            data_type=DataType.OBJECT,
            nested_properties=[
                Property(name="url", data_type=DataType.TEXT),
                Property(name="author", data_type=DataType.TEXT),
                Property(name="published", data_type=DataType.TEXT),
                Property(name="difficulty", data_type=DataType.TEXT),
                Property(name="topic", data_type=DataType.TEXT),
                Property(name="tags", data_type=DataType.TEXT_ARRAY),
                Property(name="fetched_at", data_type=DataType.TEXT),
                Property(name="chunk_index", data_type=DataType.INT),
                Property(name="total_chunks", data_type=DataType.INT),
            ]
        ),
    ]
)
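With the schema in place, the fetched chunks can be written through the client's batching helper. A minimal sketch, assuming documents is a list of dicts shaped like the properties above:
# Insert fetched documents; Weaviate vectorizes them via text2vec-transformers
with collection.batch.dynamic() as batch:
    for doc in documents:
        batch.add_object(properties={
            "title": doc["title"],
            "content": doc["content"],
            "source": doc["source"],
            "category": doc["category"],
            "metadata": doc["metadata"],
        })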
5. RAG Pipeline Implementation
Complete RAG flow with context integration:
def rag_query(self, query: str, collection_name: str = None) -> Dict[str, Any]:
    # Step 1: Vector search in Weaviate
    relevant_docs = self.search_knowledge(query, collection_name)
    # Step 2: Prepare context for LLM
    context_parts = []
    for doc in relevant_docs:
        context_parts.append(f"Title: {doc['title']}\nContent: {doc['content']}")
    context = "\n\n".join(context_parts)
    # Step 3: Generate response with Gaia node
    response = self.llm_client.chat.completions.create(
        model="Gemma-3.4B-IT",
        messages=[
            {"role": "system", "content": f"Use this context: {context}"},
            {"role": "user", "content": query}
        ],
        max_tokens=self.config.MAX_TOKENS,
        temperature=self.config.TEMPERATURE
    )
    return {
        "query": query,
        "response": response.choices[0].message.content,
        "sources": relevant_docs,
        "model_used": "Gemma-3.4B-IT"
    }
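Calling the pipeline end to end then looks like this (assuming rag is an instance of the class exposing rag_query):
result = rag.rag_query("How does retrieval augmented generation work?")
print(result["response"])
for doc in result["sources"]:
    print(f"- {doc['title']}")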
📊 Real-World Results
Performance Metrics
Here are the actual results from our demo run:
🎉 Total documents collected: 57
- 📖 Wikipedia: 7 articles (chunked into multiple docs)
- 🔬 ArXiv: 6 research papers
- 📂 GitHub: 37 documentation chunks
- 📰 News: 6 recent articles
⏱️ Processing Time: 92.0 seconds (quick mode)
🔍 Search Performance: ~50ms per query
🤖 Generation Speed: ~2-5 seconds per response
💾 Storage: 63,479 characters across 57 documents
📊 Average document size: 1,113 characters
Sample Query Results
Here are real responses from our system:
Query: "What are the latest developments in large language models?"
Response: Based on the provided context, here's a summary of recent developments in large language models:
- GPT Models: The largest and most capable LLMs are generative pretrained transformers (GPTs), like ChatGPT, Gemini, and Claude, which are widely used in generative chatbots.
- Fine-tuning & Prompt Engineering: LLMs can be adapted for specific tasks through fine-tuning or guided by prompt engineering.
- Integration with other tools: Semantic Kernel integrates with LLMs from OpenAI, Azure OpenAI, Hugging Face, and Nvidia. Weaviate also supports Cohere Multilingual ML Models.
The documents don't provide a comprehensive list of all the latest developments, but this captures the key trends highlighted in the given information.
Sources Used:
- Large language model (Wikipedia) - Category: encyclopedia
- weaviate Documentation (Part 6) - Category: documentation
- semantic-kernel Documentation (Part 2) - Category: documentation
Query: "How does retrieval augmented generation work?"
Response: Here's how retrieval-augmented generation (RAG) works, based on the provided context:
- Collect and Embed Documents: You gather text documents related to a specific topic or domain.
- Create Feature Vectors: For each document (or section of a document), you use a deep learning network to create a "feature vector" – often called an "embedding." This vector represents the meaning of the text in a numerical format.
- Store in a Vector Database: These feature vectors are stored in a special database designed for quickly searching through large amounts of data based on similarity.
- User Prompt: When a user asks a question or provides a prompt, the same process is applied to that prompt – it's converted into a feature vector.
- Retrieve Relevant Documents: The vector database searches for documents whose feature vectors are most similar to the prompt's feature vector. This identifies the most relevant pieces of information.
- Add to Context: These retrieved documents are then added to what's called the "context window" – essentially, they become part of the input provided to a large language model (LLM).
- Generate Response: The LLM uses this expanded context to generate a response to the user's original prompt.
Sources Used:
- Vector database (Part 2) - Category: encyclopedia
- FADRM: Fast and Accurate Data Residual Matching for Dataset Distillation - Category: research
- Controlling the false discovery rate under a non-parametric graphical dependence model - Category: research
Query: "How do I use the OpenAI API?"
Response: Here's how you can use the OpenAI API with the Python library, based on the provided documentation:
1. Installation:
pip install openai
2. Basic Usage (Text Generation):
import openai
client = openai.OpenAI()
response = client.chat.completions.create(model="gpt-3.5-turbo")
print(response.choices[0].message.content)
3. Handling Errors:
import openai
client = openai.OpenAI()
try:
client.fine_tuning.jobs.create(model="gpt-4o", training_file="file-abc123")
except openai.APIConnectionError as e:
print("The server could not be reached")
print(e.__cause__)
except openai.APIStatusError as e:
print(f"API Error: {e.status_code}")
print(e.response)
Sources Used:
- openai-python Documentation (Part 1) - Category: documentation
- openai-python Documentation (Part 11) - Category: documentation
- openai-python Documentation (Part 19) - Category: documentation
Data Source Statistics
📋 Categories:
documentation: 37 documents
encyclopedia: 7 documents
metadata: 1 documents
research: 6 documents
tech_news: 6 documents
🌐 Sources:
arxiv: 6 documents
collection: 1 documents
github: 37 documents
rss: 6 documents
wikipedia: 7 documents
💾 Weaviate Collection:
Collection name: RealWorldKnowledgeBase
Documents in collection: 57
Vectorizer: text2vec-transformers
🎯 Production Use Cases
1. AI Research Assistant
Scenario: Researchers need up-to-date information about AI developments
Data Sources: ArXiv papers, Wikipedia articles, GitHub repositories
Query Examples:
- "What are the latest developments in retrieval augmented generation?"
- "How do transformer architectures work?"
- "What are the current challenges in LLM training?"
2. Technical Documentation Helper
Scenario: Developers need help with API integration and implementation
Data Sources: GitHub READMEs, API documentation, technical guides
Query Examples:
- "How do I integrate OpenAI API with my application?"
- "What are the best practices for vector database setup?"
- "How to implement RAG with Weaviate?"
3. News and Trends Analyzer
Scenario: Businesses need insights into industry developments and market trends
Data Sources: TechCrunch, Hacker News, AI News feeds, industry reports
Query Examples:
- "What are the recent AI funding rounds and acquisitions?"
- "What companies are leading in AI innovation?"
- "What are the current regulatory challenges for AI?"
4. Educational Content Generator
Scenario: Educators and content creators need accurate, well-sourced explanations
Data Sources: Wikipedia, academic papers, documentation, tutorials
Query Examples:
- "Explain machine learning to beginners"
- "What is the difference between supervised and unsupervised learning?"
- "How do neural networks process information?"
🔧 Technical Implementation Details
Available Models on Our Node:
- Gemma-3.4B-IT: Google's instruction-tuned chat model (3.4B parameters)
- gte-Qwen2-1.5B-instruct-f16: Qwen2-based embedding model served at 16-bit precision (1.5B parameters)
Actual Configuration from Our Demo:
🔧 Current Configuration:
Gaia URL: https://0x299eae67ba6bbae8d61faad2d70115dc5a6855c8.gaia.domains/v1
Weaviate: localhost:8080
Collection: MyKnowledgeBase
Vectorizer: text2vec-transformers
Max Tokens: 300
Temperature: 0.7
📋 Available models: ['Gemma-3.4B-IT', 'gte-Qwen2-1.5B-instruct-f16']
Model Performance Analysis
Our demo showcases two different models available on the Gaia node:
Gemma-3.4B-IT (Google)
- Size: 3.4 billion parameters
- Type: Instruction-tuned model
- Strengths: Excellent for conversational AI and instruction following
- Performance: ~2-5 seconds per response (observed in demo)
- Use Cases: General Q&A, educational content, technical explanations
- Quality: Provides detailed, well-structured responses as seen in our examples
gte-Qwen2-1.5B-instruct-f16 (Alibaba)
- Size: 1.5 billion parameters
- Type: Instruction-tuned embedding model served at 16-bit precision
- Role: The node's embedding model (see the embedding entry in the node config), powering retrieval rather than chat
- Strengths: Fast inference, good multilingual support
- Use Cases: Embedding generation for search, batch processing, resource-constrained environments
Data Processing Pipeline
Text Chunking Strategy:
def chunk_text(self, text: str, max_length: int = 1500) -> List[str]:
    # Split by sentences to maintain context
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) <= max_length:
            current_chunk += " " + sentence if current_chunk else sentence
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence
    # Don't drop the final partial chunk
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
Metadata Extraction:
- Source tracking: Wikipedia, ArXiv, GitHub, RSS
- Category classification: encyclopedia, research, documentation, news
- Timestamp tracking: When content was fetched
- Author information: Where available
- Difficulty levels: beginner, intermediate, advanced
🚀 Production Deployment Considerations
Scaling the System
Horizontal Scaling Options:
- Multiple Gaia Nodes: Load balance across different nodes (see the round-robin sketch after this list)
gaia_nodes = [
    "https://node1.gaia.domains/v1",
    "https://node2.gaia.domains/v1",
    "https://node3.gaia.domains/v1"
]
# Implement round-robin or weighted distribution
- Weaviate Clustering: Scale vector operations
# docker-compose.yml for cluster
services:
  weaviate-node-1:
    image: semitechnologies/weaviate:1.23.7
    environment:
      CLUSTER_HOSTNAME: 'node1'
  weaviate-node-2:
    image: semitechnologies/weaviate:1.23.7
    environment:
      CLUSTER_HOSTNAME: 'node2'
- Data Source Distribution: Parallel fetching
# Async data fetching
async def fetch_all_sources():
    tasks = [
        fetch_wikipedia_async(topics),
        fetch_arxiv_async(search_terms),
        fetch_github_async(repos),
        fetch_rss_async(feeds)
    ]
    results = await asyncio.gather(*tasks)
    return flatten(results)
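As referenced above, a simple round-robin distribution over several Gaia node clients can be sketched like this (it reuses the gaia_nodes list from earlier; the node URLs are placeholders):
import itertools
from openai import OpenAI

# One OpenAI-compatible client per Gaia node, cycled round-robin
clients = itertools.cycle(
    OpenAI(base_url=url, api_key="test-key") for url in gaia_nodes
)

def chat(messages, model="Gemma-3.4B-IT", max_tokens=300):
    client = next(clients)  # pick the next node in rotation
    return client.chat.completions.create(
        model=model, messages=messages, max_tokens=max_tokens
    )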
Security and Authentication
Production Security Checklist:
✅ Environment Variables: Never commit API keys
✅ Weaviate Authentication: Enable for production
✅ Rate Limiting: Implement client-side throttling
✅ Input Validation: Sanitize user queries
✅ Network Security: Use HTTPS/TLS encryption
✅ Access Control: Implement user permissions
# Production Weaviate with auth
WEAVIATE_USE_AUTH=true
WEAVIATE_API_KEY=your-secure-production-key
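With authentication enabled on the server, the client connection carries the API key; a sketch with the v4 Python client (the Auth helper lives in weaviate.classes.init):
import weaviate
from weaviate.classes.init import Auth

# Authenticated connection for production deployments
client = weaviate.connect_to_local(
    host="localhost",
    port=8080,
    auth_credentials=Auth.api_key("your-secure-production-key"),
)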
Monitoring and Observability
Health Check Implementation:
def health_check(self) -> Dict[str, Any]:
    health = {
        "timestamp": time.time(),
        "gaia": {"status": "unknown", "models": []},
        "weaviate": {"status": "unknown", "collections": []},
        "overall": "unknown"
    }
    # Test Gaia connection
    try:
        models = self.llm_client.models.list()
        health["gaia"]["status"] = "healthy"
        health["gaia"]["models"] = [m.id for m in models.data]
    except Exception as e:
        health["gaia"]["status"] = f"error: {e}"
    # Test Weaviate connection
    try:
        is_ready = self.weaviate_client.is_ready()
        if is_ready:
            collections = self.weaviate_client.collections.list_all()
            health["weaviate"]["status"] = "healthy"
            health["weaviate"]["collections"] = list(collections.keys())
    except Exception as e:
        health["weaviate"]["status"] = f"error: {e}"
    # Overall status is healthy only if both services are healthy
    if health["gaia"]["status"] == "healthy" and health["weaviate"]["status"] == "healthy":
        health["overall"] = "healthy"
    else:
        health["overall"] = "degraded"
    return health
📈 Performance Optimization Tips
1. Vector Search Optimization
Batch Processing:
# Run several queries against the same collection
# (sequential here; wrap in a thread pool or asyncio for true parallelism)
queries = ["query1", "query2", "query3"]
results = []
for query in queries:
    result = collection.query.near_text(query=query, limit=5)
    results.append(result)
Index Tuning:
# Configure HNSW index parameters for better performance
vector_index_config = Configure.VectorIndex.hnsw(
    ef_construction=256,  # Higher = better recall, slower index build
    max_connections=32,   # Higher = better recall, more memory
)
# Pass this as vector_index_config= alongside vectorizer_config in collections.create()
2. LLM Response Optimization
Context Window Management:
def optimize_context(self, docs: List[Dict], max_chars: int = 2000) -> str:
    # Greedily pack the highest-scoring documents into a character budget
    # (a rough proxy for the model's token limit)
    context_parts = []
    current_length = 0
    for doc in sorted(docs, key=lambda x: x['score'], reverse=True):
        doc_length = len(doc['content'])
        if current_length + doc_length <= max_chars:
            context_parts.append(f"Title: {doc['title']}\n{doc['content']}")
            current_length += doc_length
        else:
            break
    return "\n\n".join(context_parts)
Prompt Engineering:
system_prompt = """You are an AI assistant specializing in technical documentation and research.
Use the provided context to answer questions accurately and cite your sources when possible.
If the context doesn't contain relevant information, say so clearly.
Context:
{context}
Guidelines:
- Be concise but comprehensive
- Use bullet points for lists
- Cite sources when referencing specific information
- If uncertain, acknowledge limitations
"""
3. Data Ingestion Optimization
Smart Caching:
from datetime import datetime, timedelta

def should_refresh_source(source_name: str, max_age_hours: int = 24) -> bool:
    cache_file = f"cache/{source_name}_last_update.txt"
    try:
        with open(cache_file, 'r') as f:
            last_update = datetime.fromisoformat(f.read().strip())
        age = datetime.now() - last_update
        return age > timedelta(hours=max_age_hours)
    except FileNotFoundError:
        return True
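The counterpart that records a successful refresh is equally small (a sketch; mark_source_refreshed is a hypothetical helper):
import os
from datetime import datetime

def mark_source_refreshed(source_name: str) -> None:
    # Record the refresh time so should_refresh_source can skip fresh sources
    os.makedirs("cache", exist_ok=True)
    with open(f"cache/{source_name}_last_update.txt", "w") as f:
        f.write(datetime.now().isoformat())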
Incremental Updates:
def get_new_documents_only(self, source: str, since: datetime) -> List[Dict]:
    # Only fetch documents newer than the timestamp
    # Implement based on source API capabilities
    pass
🔮 Future Enhancements
1. Advanced Retrieval Strategies
Hybrid Search Implementation:
# Combine vector search with keyword search
def hybrid_search(self, query: str, alpha: float = 0.7):
    # Vector search (semantic similarity)
    vector_results = collection.query.near_text(query=query, limit=10)
    # BM25 search (keyword matching)
    bm25_results = collection.query.bm25(query=query, limit=10)
    # Combine results with weighted scoring
    combined_results = self.combine_results(vector_results, bm25_results, alpha)
    return combined_results
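Recent Weaviate Python client versions also expose hybrid search directly, so the manual merge above is optional; a minimal sketch:
# Weaviate's built-in hybrid query blends vector and BM25 scores via alpha
results = collection.query.hybrid(query="retrieval augmented generation", alpha=0.7, limit=10)
for obj in results.objects:
    print(obj.properties["title"])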
Re-ranking with Cross-Encoders:
from sentence_transformers import CrossEncoder

def rerank_results(self, query: str, documents: List[Dict]) -> List[Dict]:
    reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    pairs = [(query, doc['content']) for doc in documents]
    scores = reranker.predict(pairs)
    # Re-order documents by cross-encoder scores
    for doc, score in zip(documents, scores):
        doc['rerank_score'] = score
    return sorted(documents, key=lambda x: x['rerank_score'], reverse=True)
2. Multi-Modal Capabilities
Image and Document Processing:
# Future: Add support for PDFs, images, videos
class MultiModalSource(DataSource):
    def process_pdf(self, pdf_path: str) -> List[Dict]:
        # Extract text, images, tables from PDFs
        pass

    def process_image(self, image_path: str) -> Dict:
        # OCR + image description
        pass
3. Advanced Analytics
Query Performance Tracking:
import time
from collections import defaultdict

class AnalyticsTracker:
    def __init__(self):
        self.query_times = defaultdict(list)
        self.popular_queries = defaultdict(int)
        self.source_usage = defaultdict(int)

    def track_query(self, query: str, response_time: float, sources: List[str]):
        self.query_times[query].append(response_time)
        self.popular_queries[query] += 1
        for source in sources:
            self.source_usage[source] += 1
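Wiring the tracker into the pipeline is a thin wrapper around each query (a sketch; rag is the RAG pipeline instance from earlier):
tracker = AnalyticsTracker()

start = time.time()
result = rag.rag_query("What is a vector database?")
tracker.track_query(
    query="What is a vector database?",
    response_time=time.time() - start,
    sources=[doc["title"] for doc in result["sources"]],
)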
🏁 Conclusion
This implementation demonstrates that building production-ready RAG systems with decentralized infrastructure is not only possible but practical. The combination of Gaia and Weaviate provides:
Key Achievements
✅ Decentralized AI: Successfully replaced OpenAI with public Gaia nodes
✅ Advanced Vector Operations: Weaviate's capabilities exceed basic vector storage
✅ Real-World Data: Live integration with multiple internet sources
✅ Production Features: Configuration management, health monitoring, error handling
✅ Performance: Sub-second search, 2-5 second generation times
✅ Scalability: Architecture supports horizontal scaling
Business Impact
- Cost Reduction: No API fees for LLM inference
- Vendor Independence: Avoid lock-in with centralized providers
- Data Privacy: Keep sensitive data within your infrastructure
- Customization: Full control over models and vectorization
- Reliability: Distributed infrastructure reduces single points of failure
Technical Benefits
- Modern Architecture: Microservices-ready with clean separation of concerns
- Flexibility: Easy to swap models, vectorizers, or data sources
- Observability: Built-in health checks and performance monitoring
- Developer Experience: Environment-based configuration, comprehensive logging
Getting Started
Ready to build your own decentralized RAG system? The complete implementation is available on GitHub with:
- 📋 Step-by-step setup instructions
- 🧪 Interactive demo with real data
- 📊 Performance benchmarks and optimization tips
- 🛠️ Production deployment guidelines
- 🔧 Troubleshooting and debugging tools
Repository: https://github.com/GaiaNet-AI/gaia-cookbook/tree/main/python/gaia-weaviate
Demo Video: https://youtu.be/zf9_WFhySho