Real-Time Voice Meets RAG: Building a Domain-Specific AI Chatbot

#devchallenge #assemblyaichallenge #ai #api

K Om Senapati @k0msenapati

Publish Date: Jul 23 '25

This is a submission for the AssemblyAI Voice Agents Challenge

What I Built

I recently built a small side project: Sociopal, a voice-based chatbot that answers sociology questions using a RAG agent backed by curated sociology documents.

It’s powered by LangGraph, does corrective RAG, and can also search the web when it doesn’t have the answer. AssemblyAI handles speech-to-text, and ElevenLabs takes care of the speech output.

You ask a sociology-related question using your voice. The app transcribes your speech to text, sends the text to a backend agent that retrieves from indexed sociology docs, and returns a response. If the answer isn’t found in the vector DB, the agent falls back to web search and tries again.

The final response is both displayed and spoken aloud using ElevenLabs.

Demo

Not deployed yet, but here’s a short demo video:

GitHub Repository

⭐ GitHub 👇

k0msenapati / sociopal

Sociopal

A domain expert AI voice agent for sociology.

Learn more about Sociopal

Sociopal is a Corrective RAG (CRAG) agent powered by a vectorDB containing curated sociology information and web search. It is designed to answer questions and provide detailed explanations related to sociology.

Technology Stack

Frontend:

Next.js
AssemblyAI (speech-to-text)
ElevenLabs (text-to-speech)

Backend:

FastAPI
LangGraph
Groq
ChromaDB
DuckDuckGo (web search)

Getting Started

1. Clone the Repository

git clone https://github.com/k0msenapati/sociopal.git

2. Navigate to the Project Directory

cd sociopal

Frontend Setup

cd ui
bun i
cp .env.example .env.local

Fill in your ElevenLabs and AssemblyAI API keys in .env.local.

Start the development server:

bun dev

Backend Setup

cd ../agent-py
uv sync
source .venv/bin/activate
cp .env.example .env

Fill in your Groq API key in .env.

Index the Data

uv run --active -m sociology_agent.index

Run the Server

uv run --active uvicorn sociology_agent.server:app --reload

Installation steps are included in the README.

Technical Implementation

AssemblyAI Integration

I’m using AssemblyAI’s Universal-Streaming API to handle real-time voice input. Here’s the rough flow:

1. Getting a Temporary Token

There's an API route (/api/token) that fetches a temporary token:

const url = `https://streaming.assemblyai.com/v3/token?expires_in_seconds=60`
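For illustration, the request behind that route might be assembled as below. This is a sketch, not the project's actual code: `buildTokenRequest` and `TokenRequest` are hypothetical names, and it assumes the API key is sent in the `Authorization` header. Only the URL itself comes from the snippet above.

```typescript
// Hypothetical shape describing the token request (not part of any SDK).
interface TokenRequest {
  url: string;
  headers: Record<string, string>;
}

// Describe the temporary-token request without sending it.
// Assumption: the AssemblyAI API key goes in the Authorization header.
function buildTokenRequest(apiKey: string, expiresInSeconds: number = 60): TokenRequest {
  if (apiKey.length === 0) {
    throw new Error("missing AssemblyAI API key");
  }
  if (expiresInSeconds <= 0 || Math.floor(expiresInSeconds) !== expiresInSeconds) {
    throw new Error("expiresInSeconds must be a positive integer");
  }
  return {
    url: `https://streaming.assemblyai.com/v3/token?expires_in_seconds=${expiresInSeconds}`,
    headers: { Authorization: apiKey },
  };
}
```

In a server-side route, the result would be passed to `fetch(url, { headers })` so the API key never reaches the browser; the client only ever sees the short-lived token.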

2. Connecting via WebSocket

Once the token is ready, a WebSocket connection is opened to stream audio:

wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&formatted_finals=true&token=${token}

On the frontend, I use getUserMedia() to access the mic, then convert the audio to 16-bit PCM and send it over the socket. AssemblyAI returns transcripts in real time, which I display as the user speaks.
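The conversion step can be sketched as follows. This is a minimal version under stated assumptions, not the code from the repository: `floatTo16BitPCM` and `buildStreamingUrl` are hypothetical helper names, and the input is assumed to be Web Audio samples (floats in the range -1 to 1) already captured at 16 kHz.

```typescript
// Convert Web Audio samples (floats in [-1, 1]) to signed 16-bit PCM.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range samples saturate instead of wrapping around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // The 16-bit range is asymmetric: -32768..32767.
    pcm[i] = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
  }
  return pcm;
}

// Build the streaming URL shown above for a given temporary token.
function buildStreamingUrl(token: string, sampleRate: number = 16000): string {
  return (
    "wss://streaming.assemblyai.com/v3/ws" +
    `?sample_rate=${sampleRate}&formatted_finals=true&token=${encodeURIComponent(token)}`
  );
}
```

Each converted chunk could then be sent as a binary frame, for example `socket.send(floatTo16BitPCM(chunk).buffer)`, on a socket opened with `new WebSocket(buildStreamingUrl(token))`.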

It works smoothly with low latency, and transcripts are surprisingly accurate even with casual speech.

Backend Agent

The backend runs a FastAPI app with a /query route. It accepts user queries, passes them to the LangGraph agent, and returns the response.
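From the frontend, a call to that route might look roughly like this. The post does not show the request format, so everything here is an assumption: a `POST` with a JSON body of the form `{ "query": "..." }`, and the hypothetical helper name `buildQueryRequest`. Check the repository for the actual schema.

```typescript
// Hypothetical description of a request to the backend's /query route.
interface QueryRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Assumption: the route accepts POST with a JSON body { query: string }.
function buildQueryRequest(baseUrl: string, question: string): QueryRequest {
  const trimmed = question.trim();
  if (trimmed.length === 0) {
    throw new Error("question must not be empty");
  }
  return {
    // Strip trailing slashes so the path is not doubled up.
    url: baseUrl.replace(/\/+$/, "") + "/query",
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query: trimmed }),
    },
  };
}
```

With uvicorn's default port, usage would be along the lines of `const { url, init } = buildQueryRequest("http://localhost:8000", transcript)` followed by `fetch(url, init)`.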

The agent uses corrective RAG, so if the first answer is incomplete or irrelevant, it will retry with a refined query. It’s also hooked up to a web search tool in case the answer isn’t in the vectorDB.
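The control flow described above can be sketched as a simplified loop. The real agent is a LangGraph graph in Python; this TypeScript version only illustrates the retry-then-fallback logic, and every name in it (`CragDeps`, `answerWithCorrectiveRag`, `maxRetries`) is hypothetical. The steps are synchronous here for brevity, whereas real retrieval and generation calls would be asynchronous.

```typescript
// Hypothetical dependencies; each would wrap a real component (vector DB, LLM, search).
interface CragDeps {
  retrieve: (query: string) => string[];
  isRelevant: (question: string, doc: string) => boolean;
  rewrite: (query: string) => string;
  webSearch: (query: string) => string[];
  generate: (question: string, context: string[]) => string;
}

interface CragResult {
  answer: string;
  source: "vectordb" | "web";
}

function answerWithCorrectiveRag(question: string, deps: CragDeps, maxRetries: number = 1): CragResult {
  let query = question;
  for (let attempt = 0; ; attempt++) {
    // Grade retrieved documents and keep only those judged relevant.
    const relevant = deps.retrieve(query).filter((doc) => deps.isRelevant(question, doc));
    if (relevant.length > 0) {
      return { answer: deps.generate(question, relevant), source: "vectordb" };
    }
    if (attempt >= maxRetries) {
      break;
    }
    // Nothing useful came back: refine the query and try the vector DB again.
    query = deps.rewrite(query);
  }
  // The vector DB had no relevant context, so fall back to web search.
  return { answer: deps.generate(question, deps.webSearch(query)), source: "web" };
}
```

Passing the steps in as functions keeps the loop testable with stubs, which is also roughly how graph nodes separate retrieval, grading, rewriting, and generation.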

Final Thoughts

Building this was a fun way to explore how voice can enhance AI agents. Using real-time transcription with AssemblyAI and natural-sounding speech with ElevenLabs made the voice interface smooth to implement.

While this one is indexed on sociology data, the setup is domain-agnostic. You can replace the contents of the vector database with documents from any other domain, and the same agent pipeline still applies.

Definitely worth trying if you're into voice UIs or building smarter assistants.

Thanks for reading, and I look forward to connecting with you again soon!
