TL;DR
This post demonstrates how to build a production-ready Retrieval Augmented Generation (RAG) system using:
- 🌐 Gaia: Decentralized AI infrastructure with OpenAI-compatible APIs
- 🗄️ Weaviate: Advanced vector database replacing traditional solutions
- 📊 Real-World Data: Live integration with Wikipedia, ArXiv, GitHub, and news sources
Key Result: A complete RAG pipeline that processes 50+ documents, performs semantic search, and generates responses using decentralized AI infrastructure.
🎯 Why This Matters
Traditional RAG systems rely on centralized providers like OpenAI, creating single points of failure and vendor lock-in. This architecture demonstrates:
- Decentralization: Use public Gaia nodes instead of centralized APIs
- Flexibility: Replace built-in vector stores with specialized solutions
- Real-World Data: Process live data from multiple internet sources
- Production Ready: Environment configuration, health monitoring, error handling
🧠 Understanding the Platforms
Gaia: Decentralized AI Infrastructure
What is Gaia?
Gaia is a decentralized infrastructure for AI agents that provides OpenAI-compatible APIs while running on distributed nodes.
Key Features:
- OpenAI Compatibility: Drop-in replacement for OpenAI APIs
- Decentralized: No single point of failure
- Model Flexibility: Support for Llama, Qwen, Gemma, and other open models
Example Gaia Node:
https://0x299eae67ba6bbae8d61faad2d70115dc5a6855c8.gaia.domains/v1
Example Gaia Node Config:
{
  "address": "",
  "chat": "https://huggingface.co/gaianet/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-Q5_K_M.gguf",
  "chat_batch_size": "128",
  "chat_ctx_size": "8192",
  "chat_name": "Gemma-3.4B-IT",
  "chat_ubatch_size": "128",
  "context_window": "1",
  "description": "Gaia node running with Gemma-3.4B-IT model without any knowledgebase.",
  "domain": "gaia.domains",
  "embedding": "https://huggingface.co/gaianet/gte-Qwen2-1.5B-instruct-GGUF/resolve/main/gte-Qwen2-1.5B-instruct-f16.gguf",
  "embedding_batch_size": "8192",
  "embedding_collection_name": "default",
  "embedding_ctx_size": "8192",
  "embedding_name": "gte-Qwen2-1.5B-instruct-f16",
  "embedding_ubatch_size": "8192",
  "llamaedge_chat_port": "9075",
  "llamaedge_embedding_port": "9076",
  "llamaedge_port": "8086",
  "prompt_template": "gemma-3",
  "qdrant_limit": "1",
  "qdrant_score_threshold": "0.5",
  "rag_policy": "system-message",
  "rag_prompt": "Use the following information to answer the question.\n----------------\n",
  "reverse_prompt": "",
  "snapshot": "",
  "system_prompt": "You're a helpful assistant"
}
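Because the node speaks the OpenAI protocol, you can talk to it with the standard openai Python client. A minimal sketch, assuming the public node above and the placeholder test-key API key used later in this post:
from openai import OpenAI

# Point the standard OpenAI client at the Gaia node (placeholder API key)
client = OpenAI(
    base_url="https://0x299eae67ba6bbae8d61faad2d70115dc5a6855c8.gaia.domains/v1",
    api_key="test-key",
)

# List the models the node serves (chat + embedding)
print([m.id for m in client.models.list().data])

# Ask the chat model a question
response = client.chat.completions.create(
    model="Gemma-3.4B-IT",
    messages=[{"role": "user", "content": "What is retrieval augmented generation?"}],
    max_tokens=300,
)
print(response.choices[0].message.content)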
Weaviate: Advanced Vector Database
What is Weaviate?
Weaviate is an open-source vector database designed for AI applications, offering advanced features beyond simple vector storage.
Why Choose Weaviate Over Qdrant?
Gaia nodes already ship with an embedded Qdrant store (note the qdrant_* settings in the node config above). Weaviate adds pluggable vectorizer modules, nested object properties for rich metadata, and built-in hybrid (vector + BM25) search, which is why it backs retrieval in this pipeline.
Weaviate Vectorizer Options:
# Local embeddings (no API key needed)
VECTORIZER_MODULE=text2vec-transformers
# OpenAI embeddings
VECTORIZER_MODULE=text2vec-openai
OPENAI_API_KEY=your-key
# Cohere embeddings
VECTORIZER_MODULE=text2vec-cohere
COHERE_API_KEY=your-key
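For orientation, connecting to a local Weaviate instance with the v4 Python client is a few lines; a minimal sketch, assuming the weaviate-client package and the Docker setup shown in the next section:
import weaviate

# Connect to the Weaviate instance started via Docker Compose
client = weaviate.connect_to_local(host="localhost", port=8080)
print(client.is_ready())  # True once the instance is up
client.close()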
🏗️ System Architecture
At a high level, the pipeline fetches documents from Wikipedia, ArXiv, GitHub, and RSS feeds, chunks them, and stores them in Weaviate (vectorized with text2vec-transformers). At query time it retrieves the most similar chunks and passes them as context to a Gaia node for generation.
🛠️ Implementation Deep Dive
1. Start Weaviate with Docker Compose
Create a docker-compose.yml file with this production-ready configuration:
---
services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: cr.weaviate.io/semitechnologies/weaviate:1.30.0
    ports:
    - 8080:8080
    - 50051:50051
    restart: on-failure:0
    environment:
      TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080'
      QNA_INFERENCE_API: 'http://qna-transformers:8080'
      OPENAI_APIKEY: $OPENAI_APIKEY
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      ENABLE_MODULES: 'text2vec-transformers,qna-transformers,generative-openai'
      CLUSTER_HOSTNAME: 'node1'
  t2v-transformers:
    image: cr.weaviate.io/semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
    environment:
      ENABLE_CUDA: '0'
  qna-transformers:
    image: cr.weaviate.io/semitechnologies/qna-transformers:distilbert-base-uncased-distilled-squad
    environment:
      ENABLE_CUDA: '0'
Start Weaviate:
docker compose up -d
This configuration provides:
- Latest Weaviate: Version 1.30.0 with latest features
- Multiple Vectorizers: text2vec-transformers + QnA transformers
- Production Ready: Proper restart policies and persistence
- GPU Support: Set ENABLE_CUDA: '1' if you have an NVIDIA GPU
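Once the containers are running, a quick readiness probe confirms the instance is up; a small sketch using requests against Weaviate's standard readiness endpoint:
import requests

# Weaviate exposes a readiness endpoint on its REST port
resp = requests.get("http://localhost:8080/v1/.well-known/ready", timeout=5)
print("Weaviate ready" if resp.status_code == 200 else f"Not ready: {resp.status_code}")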
2. Environment Configuration
The system uses a comprehensive .env configuration for production readiness:
# Gaia Node Configuration
GAIA_BASE_URL=https://0x299eae67ba6bbae8d61faad2d70115dc5a6855c8.gaia.domains/v1
GAIA_API_KEY=test-key
GAIA_MODEL_NAME=Gemma-3.4B-IT
# Weaviate Configuration
WEAVIATE_HOST=localhost
WEAVIATE_PORT=8080
WEAVIATE_USE_AUTH=false
# Vector Configuration
VECTORIZER_MODULE=text2vec-transformers
DEFAULT_COLLECTION_NAME=RealWorldKnowledgeBase
# Generation Parameters
MAX_TOKENS=300
TEMPERATURE=0.7
SEARCH_LIMIT=3
# Performance Tuning
BATCH_SIZE=100
CONNECTION_TIMEOUT=30
DEBUG=true
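One way to consume this file is a small config loader built on python-dotenv; a sketch in which the variable names match the .env above and the Config class itself is illustrative:
import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the working directory

class Config:
    GAIA_BASE_URL = os.getenv("GAIA_BASE_URL")
    GAIA_API_KEY = os.getenv("GAIA_API_KEY", "test-key")
    GAIA_MODEL_NAME = os.getenv("GAIA_MODEL_NAME", "Gemma-3.4B-IT")
    WEAVIATE_HOST = os.getenv("WEAVIATE_HOST", "localhost")
    WEAVIATE_PORT = int(os.getenv("WEAVIATE_PORT", "8080"))
    MAX_TOKENS = int(os.getenv("MAX_TOKENS", "300"))
    TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
    SEARCH_LIMIT = int(os.getenv("SEARCH_LIMIT", "3"))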
3. Data Source Integration
The system fetches real-world data from multiple sources:
Wikipedia Integration
class WikipediaSource(DataSource):
    def fetch_data(self, topics: List[str]) -> List[Dict[str, Any]]:
        for topic in topics:
            # Fetch full article content via the MediaWiki API
            params = {
                'action': 'query',
                'format': 'json',
                'titles': topic,
                'prop': 'extracts',
                'explaintext': True
            }
            # Process and chunk content
            chunks = self.chunk_text(content, max_length=1500)
ArXiv Research Papers
class ArXivSource(DataSource):
    def fetch_data(self, search_terms: List[str]) -> List[Dict[str, Any]]:
        for term in search_terms:
            params = {
                'search_query': f'all:{term}',
                'sortBy': 'submittedDate',
                'sortOrder': 'descending'
            }
            # Parse XML response and extract metadata
GitHub Documentation
class GitHubSource(DataSource):
    def fetch_data(self, repos: List[str]) -> List[Dict[str, Any]]:
        for repo in repos:
            # Fetch README via GitHub API
            readme_url = f"https://api.github.com/repos/{repo}/readme"
            # Decode base64 content and process
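For context, the README endpoint returns its content base64-encoded; decoding it looks roughly like this (a sketch with requests, using a hypothetical repository name):
import base64
import requests

# Fetch a README through the GitHub API and decode its base64 payload
resp = requests.get("https://api.github.com/repos/weaviate/weaviate/readme", timeout=30)
readme_text = base64.b64decode(resp.json()["content"]).decode("utf-8")
print(readme_text[:200])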
4. Weaviate Schema Design
Advanced schema with nested properties for rich metadata:
from weaviate.classes.config import Configure, DataType, Property

collection = weaviate_client.collections.create(
    name="RealWorldKnowledgeBase",
    vectorizer_config=Configure.Vectorizer.text2vec_transformers(),
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="content", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT),
        Property(
            name="metadata",
            data_type=DataType.OBJECT,
            nested_properties=[
                Property(name="url", data_type=DataType.TEXT),
                Property(name="author", data_type=DataType.TEXT),
                Property(name="published", data_type=DataType.TEXT),
                Property(name="difficulty", data_type=DataType.TEXT),
                Property(name="topic", data_type=DataType.TEXT),
                Property(name="tags", data_type=DataType.TEXT_ARRAY),
                Property(name="fetched_at", data_type=DataType.TEXT),
                Property(name="chunk_index", data_type=DataType.INT),
                Property(name="total_chunks", data_type=DataType.INT),
            ]
        ),
    ]
)
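With the schema in place, the fetched chunks can be written through the client's batching helper. A minimal sketch, assuming documents is a list of dicts shaped like the properties above:
# Insert fetched documents; Weaviate vectorizes them via text2vec-transformers
with collection.batch.dynamic() as batch:
    for doc in documents:
        batch.add_object(properties={
            "title": doc["title"],
            "content": doc["content"],
            "source": doc["source"],
            "category": doc["category"],
            "metadata": doc["metadata"],
        })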
5. RAG Pipeline Implementation
Complete RAG flow with context integration:
def rag_query(self, query: str, collection_name: str = None) -> Dict[str, Any]:
    # Step 1: Vector search in Weaviate
    relevant_docs = self.search_knowledge(query, collection_name)
    # Step 2: Prepare context for LLM
    context_parts = []
    for doc in relevant_docs:
        context_parts.append(f"Title: {doc['title']}\nContent: {doc['content']}")
    context = "\n\n".join(context_parts)
    # Step 3: Generate response with Gaia node
    response = self.llm_client.chat.completions.create(
        model="Gemma-3.4B-IT",
        messages=[
            {"role": "system", "content": f"Use this context: {context}"},
            {"role": "user", "content": query}
        ],
        max_tokens=self.config.MAX_TOKENS,
        temperature=self.config.TEMPERATURE
    )
    return {
        "query": query,
        "response": response.choices[0].message.content,
        "sources": relevant_docs,
        "model_used": "Gemma-3.4B-IT"
    }
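Calling the pipeline end to end then looks like this (assuming rag is an instance of the class exposing rag_query):
result = rag.rag_query("How does retrieval augmented generation work?")
print(result["response"])
for doc in result["sources"]:
    print(f"- {doc['title']}")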
📊 Real-World Results
Performance Metrics
Here are the actual results from our demo run:
🎉 Total documents collected: 57
- 📖 Wikipedia: 7 articles (chunked into multiple docs)
- 🔬 ArXiv: 6 research papers
- 📂 GitHub: 37 documentation chunks
- 📰 News: 6 recent articles
⏱️ Processing Time: 92.0 seconds (quick mode)
🔍 Search Performance: ~50ms per query
🤖 Generation Speed: ~2-5 seconds per response
💾 Storage: 63,479 characters across 57 documents
📊 Average document size: 1,113 characters
Sample Query Results
Here are real responses from our system:
Query: "What are the latest developments in large language models?"
Response: Based on the provided context, here's a summary of recent developments in large language models:
- GPT Models: The largest and most capable LLMs are generative pretrained transformers (GPTs), like ChatGPT, Gemini, and Claude, which are widely used in generative chatbots.
- Fine-tuning & Prompt Engineering: LLMs can be adapted for specific tasks through fine-tuning or guided by prompt engineering.
- Integration with other tools: Semantic Kernel integrates with LLMs from OpenAI, Azure OpenAI, Hugging Face, and Nvidia. Weaviate also supports Cohere Multilingual ML Models.
The documents don't provide a comprehensive list of all the latest developments, but this captures the key trends highlighted in the given information.
Sources Used:
- Large language model (Wikipedia) - Category: encyclopedia
- weaviate Documentation (Part 6) - Category: documentation
- semantic-kernel Documentation (Part 2) - Category: documentation
Query: "How does retrieval augmented generation work?"
Response: Here's how retrieval-augmented generation (RAG) works, based on the provided context:
- Collect and Embed Documents: You gather text documents related to a specific topic or domain.
- Create Feature Vectors: For each document (or section of a document), you use a deep learning network to create a "feature vector" – often called an "embedding." This vector represents the meaning of the text in a numerical format.
- Store in a Vector Database: These feature vectors are stored in a special database designed for quickly searching through large amounts of data based on similarity.
- User Prompt: When a user asks a question or provides a prompt, the same process is applied to that prompt – it's converted into a feature vector.
- Retrieve Relevant Documents: The vector database searches for documents whose feature vectors are most similar to the prompt's feature vector. This identifies the most relevant pieces of information.
- Add to Context: These retrieved documents are then added to what's called the "context window" – essentially, they become part of the input provided to a large language model (LLM).
- Generate Response: The LLM uses this expanded context to generate a response to the user's original prompt.
Sources Used:
- Vector database (Part 2) - Category: encyclopedia
- FADRM: Fast and Accurate Data Residual Matching for Dataset Distillation - Category: research
- Controlling the false discovery rate under a non-parametric graphical dependence model - Category: research
Query: "How do I use the OpenAI API?"
Response: Here's how you can use the OpenAI API with the Python library, based on the provided documentation:
1. Installation:
pip install openai
2. Basic Usage (Text Generation):
import openai
client = openai.OpenAI()
response = client.chat.completions.create(model="gpt-3.5-turbo")
print(response.choices[0].message.content)
3. Handling Errors:
import openai
client = openai.OpenAI()
try:
client.fine_tuning.jobs.create(model="gpt-4o", training_file="file-abc123")
except openai.APIConnectionError as e:
print("The server could not be reached")
print(e.__cause__)
except openai.APIStatusError as e:
print(f"API Error: {e.status_code}")
print(e.response)
Sources Used:
- openai-python Documentation (Part 1) - Category: documentation
- openai-python Documentation (Part 11) - Category: documentation
- openai-python Documentation (Part 19) - Category: documentation
Data Source Statistics
📋 Categories:
documentation: 37 documents
encyclopedia: 7 documents
metadata: 1 documents
research: 6 documents
tech_news: 6 documents
🌐 Sources:
arxiv: 6 documents
collection: 1 documents
github: 37 documents
rss: 6 documents
wikipedia: 7 documents
💾 Weaviate Collection:
Collection name: RealWorldKnowledgeBase
Documents in collection: 57
Vectorizer: text2vec-transformers
🎯 Production Use Cases
1. AI Research Assistant
Scenario: Researchers need up-to-date information about AI developments
Data Sources: ArXiv papers, Wikipedia articles, GitHub repositories
Query Examples:
- "What are the latest developments in retrieval augmented generation?"
- "How do transformer architectures work?"
- "What are the current challenges in LLM training?"
2. Technical Documentation Helper
Scenario: Developers need help with API integration and implementation
Data Sources: GitHub READMEs, API documentation, technical guides
Query Examples:
- "How do I integrate OpenAI API with my application?"
- "What are the best practices for vector database setup?"
- "How to implement RAG with Weaviate?"
3. News and Trends Analyzer
Scenario: Businesses need insights into industry developments and market trends
Data Sources: TechCrunch, Hacker News, AI News feeds, industry reports
Query Examples:
- "What are the recent AI funding rounds and acquisitions?"
- "What companies are leading in AI innovation?"
- "What are the current regulatory challenges for AI?"
4. Educational Content Generator
Scenario: Educators and content creators need accurate, well-sourced explanations
Data Sources: Wikipedia, academic papers, documentation, tutorials
Query Examples:
- "Explain machine learning to beginners"
- "What is the difference between supervised and unsupervised learning?"
- "How do neural networks process information?"
🔧 Technical Implementation Details
Available Models on Our Node:
- Gemma-3.4B-IT: Google's instruction-tuned chat model (3.4B parameters)
- gte-Qwen2-1.5B-instruct-f16: Qwen2-based embedding model served at 16-bit precision (1.5B parameters)
Actual Configuration from Our Demo:
🔧 Current Configuration:
Gaia URL: https://0x299eae67ba6bbae8d61faad2d70115dc5a6855c8.gaia.domains/v1
Weaviate: localhost:8080
Collection: MyKnowledgeBase
Vectorizer: text2vec-transformers
Max Tokens: 300
Temperature: 0.7
📋 Available models: ['Gemma-3.4B-IT', 'gte-Qwen2-1.5B-instruct-f16']
Model Performance Analysis
Our demo showcases two different models available on the Gaia node:
Gemma-3.4B-IT (Google)
- Size: 3.4 billion parameters
- Type: Instruction-tuned model
- Strengths: Excellent for conversational AI and instruction following
- Performance: ~2-5 seconds per response (observed in demo)
- Use Cases: General Q&A, educational content, technical explanations
- Quality: Provides detailed, well-structured responses as seen in our examples
gte-Qwen2-1.5B-instruct-f16 (Alibaba)
- Size: 1.5 billion parameters
- Type: Instruction-tuned embedding model served at 16-bit precision
- Role: The node's embedding model (see the embedding entry in the node config), powering retrieval rather than chat
- Strengths: Fast inference, good multilingual support
- Use Cases: Embedding generation for search, batch processing, resource-constrained environments
Data Processing Pipeline
Text Chunking Strategy:
def chunk_text(self, text: str, max_length: int = 1500) -> List[str]:
    # Split by sentences to maintain context
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) <= max_length:
            current_chunk += " " + sentence if current_chunk else sentence
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence
    # Don't drop the final partial chunk
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
Metadata Extraction:
- Source tracking: Wikipedia, ArXiv, GitHub, RSS
- Category classification: encyclopedia, research, documentation, news
- Timestamp tracking: When content was fetched
- Author information: Where available
- Difficulty levels: beginner, intermediate, advanced
🚀 Production Deployment Considerations
Scaling the System
Horizontal Scaling Options:
- Multiple Gaia Nodes: Load balance across different nodes (see the round-robin sketch after this list)
gaia_nodes = [
    "https://node1.gaia.domains/v1",
    "https://node2.gaia.domains/v1",
    "https://node3.gaia.domains/v1"
]
# Implement round-robin or weighted distribution
- Weaviate Clustering: Scale vector operations
# docker-compose.yml for cluster
services:
  weaviate-node-1:
    image: semitechnologies/weaviate:1.23.7
    environment:
      CLUSTER_HOSTNAME: 'node1'
  weaviate-node-2:
    image: semitechnologies/weaviate:1.23.7
    environment:
      CLUSTER_HOSTNAME: 'node2'
- Data Source Distribution: Parallel fetching
# Async data fetching
async def fetch_all_sources():
    tasks = [
        fetch_wikipedia_async(topics),
        fetch_arxiv_async(search_terms),
        fetch_github_async(repos),
        fetch_rss_async(feeds)
    ]
    results = await asyncio.gather(*tasks)
    return flatten(results)
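As referenced above, a simple round-robin distribution over several Gaia node clients can be sketched like this (it reuses the gaia_nodes list from earlier; the node URLs are placeholders):
import itertools
from openai import OpenAI

# One OpenAI-compatible client per Gaia node, cycled round-robin
clients = itertools.cycle(
    OpenAI(base_url=url, api_key="test-key") for url in gaia_nodes
)

def chat(messages, model="Gemma-3.4B-IT", max_tokens=300):
    client = next(clients)  # pick the next node in rotation
    return client.chat.completions.create(
        model=model, messages=messages, max_tokens=max_tokens
    )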
Security and Authentication
Production Security Checklist:
✅ Environment Variables: Never commit API keys
✅ Weaviate Authentication: Enable for production
✅ Rate Limiting: Implement client-side throttling
✅ Input Validation: Sanitize user queries
✅ Network Security: Use HTTPS/TLS encryption
✅ Access Control: Implement user permissions
# Production Weaviate with auth
WEAVIATE_USE_AUTH=true
WEAVIATE_API_KEY=your-secure-production-key
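With authentication enabled on the server, the client connection carries the API key; a sketch with the v4 Python client (the Auth helper lives in weaviate.classes.init):
import weaviate
from weaviate.classes.init import Auth

# Authenticated connection for production deployments
client = weaviate.connect_to_local(
    host="localhost",
    port=8080,
    auth_credentials=Auth.api_key("your-secure-production-key"),
)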
Monitoring and Observability
Health Check Implementation:
def health_check(self) -> Dict[str, Any]:
    health = {
        "timestamp": time.time(),
        "gaia": {"status": "unknown", "models": []},
        "weaviate": {"status": "unknown", "collections": []},
        "overall": "unknown"
    }
    # Test Gaia connection
    try:
        models = self.llm_client.models.list()
        health["gaia"]["status"] = "healthy"
        health["gaia"]["models"] = [m.id for m in models.data]
    except Exception as e:
        health["gaia"]["status"] = f"error: {e}"
    # Test Weaviate connection
    try:
        is_ready = self.weaviate_client.is_ready()
        if is_ready:
            collections = self.weaviate_client.collections.list_all()
            health["weaviate"]["status"] = "healthy"
            health["weaviate"]["collections"] = list(collections.keys())
    except Exception as e:
        health["weaviate"]["status"] = f"error: {e}"
    # Overall status is healthy only if both services are healthy
    if health["gaia"]["status"] == "healthy" and health["weaviate"]["status"] == "healthy":
        health["overall"] = "healthy"
    else:
        health["overall"] = "degraded"
    return health
📈 Performance Optimization Tips
1. Vector Search Optimization
Batch Processing:
# Run several queries against the same collection
# (sequential here; wrap in a thread pool or asyncio for true parallelism)
queries = ["query1", "query2", "query3"]
results = []
for query in queries:
    result = collection.query.near_text(query=query, limit=5)
    results.append(result)
Index Tuning:
# Configure HNSW index parameters for better performance
vector_index_config = Configure.VectorIndex.hnsw(
    ef_construction=256,  # Higher = better recall, slower index build
    max_connections=32,   # Higher = better recall, more memory
)
# Pass this as vector_index_config= alongside vectorizer_config in collections.create()
2. LLM Response Optimization
Context Window Management:
def optimize_context(self, docs: List[Dict], max_chars: int = 2000) -> str:
    # Greedily pack the highest-scoring documents into a character budget
    # (a rough proxy for the model's token limit)
    context_parts = []
    current_length = 0
    for doc in sorted(docs, key=lambda x: x['score'], reverse=True):
        doc_length = len(doc['content'])
        if current_length + doc_length <= max_chars:
            context_parts.append(f"Title: {doc['title']}\n{doc['content']}")
            current_length += doc_length
        else:
            break
    return "\n\n".join(context_parts)
Prompt Engineering:
system_prompt = """You are an AI assistant specializing in technical documentation and research.
Use the provided context to answer questions accurately and cite your sources when possible.
If the context doesn't contain relevant information, say so clearly.
Context:
{context}
Guidelines:
- Be concise but comprehensive
- Use bullet points for lists
- Cite sources when referencing specific information
- If uncertain, acknowledge limitations
"""
3. Data Ingestion Optimization
Smart Caching:
from datetime import datetime, timedelta

def should_refresh_source(source_name: str, max_age_hours: int = 24) -> bool:
    cache_file = f"cache/{source_name}_last_update.txt"
    try:
        with open(cache_file, 'r') as f:
            last_update = datetime.fromisoformat(f.read().strip())
        age = datetime.now() - last_update
        return age > timedelta(hours=max_age_hours)
    except FileNotFoundError:
        return True
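The counterpart that records a successful refresh is equally small (a sketch; mark_source_refreshed is a hypothetical helper):
import os
from datetime import datetime

def mark_source_refreshed(source_name: str) -> None:
    # Record the refresh time so should_refresh_source can skip fresh sources
    os.makedirs("cache", exist_ok=True)
    with open(f"cache/{source_name}_last_update.txt", "w") as f:
        f.write(datetime.now().isoformat())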
Incremental Updates:
def get_new_documents_only(self, source: str, since: datetime) -> List[Dict]:
    # Only fetch documents newer than the timestamp
    # Implement based on source API capabilities
    pass
🔮 Future Enhancements
1. Advanced Retrieval Strategies
Hybrid Search Implementation:
# Combine vector search with keyword search
def hybrid_search(self, query: str, alpha: float = 0.7):
    # Vector search (semantic similarity)
    vector_results = collection.query.near_text(query=query, limit=10)
    # BM25 search (keyword matching)
    bm25_results = collection.query.bm25(query=query, limit=10)
    # Combine results with weighted scoring
    combined_results = self.combine_results(vector_results, bm25_results, alpha)
    return combined_results
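Recent Weaviate Python client versions also expose hybrid search directly, so the manual merge above is optional; a minimal sketch:
# Weaviate's built-in hybrid query blends vector and BM25 scores via alpha
results = collection.query.hybrid(query="retrieval augmented generation", alpha=0.7, limit=10)
for obj in results.objects:
    print(obj.properties["title"])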
Re-ranking with Cross-Encoders:
from sentence_transformers import CrossEncoder

def rerank_results(self, query: str, documents: List[Dict]) -> List[Dict]:
    reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    pairs = [(query, doc['content']) for doc in documents]
    scores = reranker.predict(pairs)
    # Re-order documents by cross-encoder scores
    for doc, score in zip(documents, scores):
        doc['rerank_score'] = score
    return sorted(documents, key=lambda x: x['rerank_score'], reverse=True)
2. Multi-Modal Capabilities
Image and Document Processing:
# Future: Add support for PDFs, images, videos
class MultiModalSource(DataSource):
    def process_pdf(self, pdf_path: str) -> List[Dict]:
        # Extract text, images, tables from PDFs
        pass

    def process_image(self, image_path: str) -> Dict:
        # OCR + image description
        pass
3. Advanced Analytics
Query Performance Tracking:
import time
from collections import defaultdict

class AnalyticsTracker:
    def __init__(self):
        self.query_times = defaultdict(list)
        self.popular_queries = defaultdict(int)
        self.source_usage = defaultdict(int)

    def track_query(self, query: str, response_time: float, sources: List[str]):
        self.query_times[query].append(response_time)
        self.popular_queries[query] += 1
        for source in sources:
            self.source_usage[source] += 1
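Wiring the tracker into the pipeline is a thin wrapper around each query (a sketch; rag is the RAG pipeline instance from earlier):
tracker = AnalyticsTracker()

start = time.time()
result = rag.rag_query("What is a vector database?")
tracker.track_query(
    query="What is a vector database?",
    response_time=time.time() - start,
    sources=[doc["title"] for doc in result["sources"]],
)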
🏁 Conclusion
This implementation demonstrates that building production-ready RAG systems with decentralized infrastructure is not only possible but practical. The combination of Gaia and Weaviate provides:
Key Achievements
✅ Decentralized AI: Successfully replaced OpenAI with public Gaia nodes
✅ Advanced Vector Operations: Weaviate's capabilities exceed basic vector storage
✅ Real-World Data: Live integration with multiple internet sources
✅ Production Features: Configuration management, health monitoring, error handling
✅ Performance: Sub-second search, 2-5 second generation times
✅ Scalability: Architecture supports horizontal scaling
Business Impact
- Cost Reduction: No API fees for LLM inference
- Vendor Independence: Avoid lock-in with centralized providers
- Data Privacy: Keep sensitive data within your infrastructure
- Customization: Full control over models and vectorization
- Reliability: Distributed infrastructure reduces single points of failure
Technical Benefits
- Modern Architecture: Microservices-ready with clean separation of concerns
- Flexibility: Easy to swap models, vectorizers, or data sources
- Observability: Built-in health checks and performance monitoring
- Developer Experience: Environment-based configuration, comprehensive logging
Getting Started
Ready to build your own decentralized RAG system? The complete implementation is available on GitHub with:
- 📋 Step-by-step setup instructions
- 🧪 Interactive demo with real data
- 📊 Performance benchmarks and optimization tips
- 🛠️ Production deployment guidelines
- 🔧 Troubleshooting and debugging tools
Repository: https://github.com/GaiaNet-AI/gaia-cookbook/tree/main/python/gaia-weaviate
Demo Video: https://youtu.be/zf9_WFhySho