I created LinguaBridge, a real-time bidirectional voice translation app. It uses AssemblyAI's Universal-Streaming API for speech-to-text (STT), Google Gemini for translation, and Cartesia for text-to-speech (TTS), targeting a round-trip latency under 300 ms.
With LinguaBridge, conversations across language barriers become natural and effortless, ideal for real-time interactions in professional, personal, and educational contexts.
This submission addresses the Real-Time Performance Voice Agent prompt with:
Multi-language Support: Seamless bidirectional translation across 12 languages, including English, Spanish, French, German, Chinese, and Arabic.
Core Problem Addressed
Effective communication across language barriers remains challenging in professional, educational, and personal contexts. LinguaBridge solves this by providing immediate, natural, and seamless voice translation, enabling effortless multilingual conversations in real-time.
Real-time cross-language voice translation with ultra-low latency.
Overview
LinguaBridge is a browser-based voice app that performs live bidirectional speech translation. Users select two languages (Speaker A and Speaker B). When a speaker talks, the app:
Transcribes speech with AssemblyAI's Universal-Streaming STT
Sends partial transcripts to Google Gemini 2.5 Flash for fast translation
Streams the translated output through Cartesia Sonic 2 or Sonic Turbo for ultra-fast TTS playback in the listener's language
Every stage is streamed, with a target of under 300 ms end to end, so that cross-language voice conversations stay fluid.
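To make the 300 ms target concrete, here is a minimal sketch of how the three stages above could be checked against a shared latency budget. The `StageTimings` shape and the `checkLatencyBudget` helper are hypothetical names introduced for illustration; they are not part of LinguaBridge.

```typescript
// Per-stage timings for one utterance, in milliseconds (illustrative shape).
interface StageTimings {
  sttMs: number;       // speech-to-text
  translateMs: number; // translation
  ttsMs: number;       // time until synthesized audio starts
}

type Stage = keyof StageTimings;

interface BudgetReport {
  totalMs: number;
  withinBudget: boolean;
  slowestStage: Stage;
}

// Target taken from the text: a round trip under 300 ms.
const LATENCY_BUDGET_MS = 300;

function checkLatencyBudget(
  timings: StageTimings,
  budgetMs: number = LATENCY_BUDGET_MS,
): BudgetReport {
  const stages: Stage[] = ["sttMs", "translateMs", "ttsMs"];
  // Sum the three stages to get the end-to-end time.
  const totalMs = stages.reduce((sum, stage) => sum + timings[stage], 0);
  // Find the stage that takes the largest share of the total.
  const slowestStage = stages.reduce((worst, stage) =>
    timings[stage] > timings[worst] ? stage : worst,
  );
  return { totalMs, withinBudget: totalMs < budgetMs, slowestStage };
}
```

With made-up timings of 90 ms, 120 ms, and 60 ms, the total is 270 ms, which fits the budget, and translation is reported as the slowest stage. These numbers are placeholders, not measurements from the app.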
Setup
1. Environment Variables
Create a .env.local file in the root directory with the following variables:
# AssemblyAI API Key
# Get your API key from https://www.assemblyai.com/app/account
ASSEMBLYAI_API_KEY=your_assemblyai_key
# Google Gemini API Key
# Get your API key from https://aistudio.google.com/app/apikey
GEMINI_API_KEY=your_gemini_key
# Cartesia API Key
# Get your API key from https://cartesia.ai
CARTESIA_API_KEY=your_cartesia_key
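Since all three keys are required, it can help to fail fast at startup when one is missing. The following is a minimal sketch of such a check; `loadApiKeys` is a hypothetical helper, not part of the project, and it takes the environment as a plain object so it does not depend on any particular loader.

```typescript
// The three variables listed in .env.local above.
const REQUIRED_KEYS = [
  "ASSEMBLYAI_API_KEY",
  "GEMINI_API_KEY",
  "CARTESIA_API_KEY",
] as const;

type RequiredKey = (typeof REQUIRED_KEYS)[number];
type ApiKeys = Record<RequiredKey, string>;

// Returns the trimmed keys, or throws one error naming every missing variable.
function loadApiKeys(env: Record<string, string | undefined>): ApiKeys {
  const missing: string[] = [];
  const keys: Partial<ApiKeys> = {};
  for (const name of REQUIRED_KEYS) {
    const value = env[name]?.trim();
    if (value) {
      keys[name] = value;
    } else {
      missing.push(name); // unset, empty, or whitespace only
    }
  }
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return keys as ApiKeys;
}
```

On the server this would typically be called once as `loadApiKeys(process.env)`, after whatever mechanism the project uses to load `.env.local`.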
1. Real-Time Speech Processing with AssemblyAI WebSocket API
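LinguaBridge uses AssemblyAI's WebSocket API for real-time speech-to-text transcription. The client-side service below opens a connection through the WebSocket proxy and delivers partial transcripts to a callback as they arrive: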
// lib/services/assemblyai-streaming.ts
export class AssemblyAIStreamingService {
  private ws: WebSocket | null = null;
  private currentLanguage = '';
  private onPartialCallback: ((text: string) => void) | null = null;

  async connect(
    language: string,
    onPartialTranscript: (text: string) => void,
  ): Promise<void> {
    this.onPartialCallback = onPartialTranscript;
    this.currentLanguage = language;
    this.disconnect(); // ensure clean connection
    await this.createWebSocketConnection(); // connect to AssemblyAI WebSocket proxy
  }
}
This implementation provides:
Real-time transcription with partial results.
Automatic reconnection handling.
Language-specific configurations.
Robust error handling.
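The excerpt above does not show how reconnection is scheduled. One common approach is exponential backoff with a cap, sketched below. The `reconnectDelayMs` helper and its default values are assumptions made for illustration, not LinguaBridge's actual settings.

```typescript
interface BackoffOptions {
  baseMs: number;      // delay before the first retry
  maxMs: number;       // upper bound for any single delay
  maxAttempts: number; // give up after this many retries
}

// Illustrative defaults, not values taken from the app.
const DEFAULT_BACKOFF: BackoffOptions = { baseMs: 250, maxMs: 4000, maxAttempts: 5 };

// Returns the delay before retry number `attempt` (0-based),
// or null when the caller should stop retrying and surface an error.
function reconnectDelayMs(
  attempt: number,
  options: BackoffOptions = DEFAULT_BACKOFF,
): number | null {
  if (attempt >= options.maxAttempts) {
    return null;
  }
  // Double the delay on each attempt, but never exceed the cap.
  return Math.min(options.baseMs * Math.pow(2, attempt), options.maxMs);
}
```

A service like the one above could call this from its socket `close` handler, wait for the returned delay, then call `connect` again with the stored language and callback, resetting the attempt counter once a connection succeeds.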
2. Secure WebSocket Proxy for AssemblyAI Communication
To securely manage API keys and optimize performance, LinguaBridge implements a custom WebSocket proxy:
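// server.js
const { WebSocketServer } = require('ws');
const WebSocket = require('ws');

// Assumed to come from the environment (see Setup); never sent to the browser
const ASSEMBLYAI_API_KEY = process.env.ASSEMBLYAI_API_KEY;

// WebSocket server setup (creation of `wss` is not shown in this excerpt)
wss.on('connection', (client) => {
  const upstream = new WebSocket(
    'wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&format_turns=true',
    { headers: { authorization: ASSEMBLYAI_API_KEY } }
  );

  // Forward messages from AssemblyAI to client
  upstream.on('message', (data) => {
    if (client.readyState === WebSocket.OPEN) {
      client.send(data.toString());
    }
  });

  // Buffer audio data until the upstream socket is open, then flush it in order
  const pending = [];
  upstream.on('open', () => {
    for (const chunk of pending) upstream.send(chunk);
    pending.length = 0;
  });

  client.on('message', (data) => {
    if (upstream.readyState === WebSocket.OPEN) {
      upstream.send(data);
    } else {
      pending.push(data);
    }
  });
});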
This proxy ensures:
Secure server-side API key management.
Audio buffering during connection setup.
Reliable and scalable communication.
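The buffering behavior listed above can be isolated from the socket code so it is easy to test, and capped so a stalled upstream connection cannot grow memory without bound. This is a minimal sketch under those assumptions; `BufferedForwarder`, `Sendable`, and the default cap of 200 chunks are hypothetical and not taken from the project.

```typescript
// Minimal stand-in for the part of a WebSocket this sketch needs.
interface Sendable<T> {
  send(chunk: T): void;
}

class BufferedForwarder<T> {
  private readonly upstream: Sendable<T>;
  private readonly maxPending: number;
  private pending: T[] = [];
  private ready = false;

  constructor(upstream: Sendable<T>, maxPending: number = 200) {
    this.upstream = upstream;
    this.maxPending = maxPending;
  }

  // Call for every audio chunk received from the browser.
  forward(chunk: T): void {
    if (this.ready) {
      this.upstream.send(chunk);
      return;
    }
    if (this.pending.length >= this.maxPending) {
      this.pending.shift(); // queue is full: drop the oldest chunk
    }
    this.pending.push(chunk);
  }

  // Call once when the upstream connection opens; flushes in arrival order.
  markReady(): void {
    this.ready = true;
    for (const chunk of this.pending) {
      this.upstream.send(chunk);
    }
    this.pending = [];
  }
}
```

Dropping the oldest chunk favors recent speech over completeness, which suits a live conversation. A transcription tool that must not lose audio would more likely reject new chunks or close the connection instead.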
3. Optimized Audio Capture with Web Audio API
LinguaBridge uses the Web Audio API and AudioWorklet for high-quality audio processing optimized for speech recognition.
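The capture code is not included here, so the following is only a sketch of one step such a pipeline usually needs. The Web Audio API delivers samples as 32-bit floats in the range -1 to 1, and streaming STT services commonly expect 16-bit signed PCM. The `floatTo16BitPcm` helper is hypothetical, and it assumes the audio is already at the 16 kHz rate given by `sample_rate=16000` in the proxy URL.

```typescript
// Convert Web Audio float samples (-1..1) to 16-bit signed PCM.
function floatTo16BitPcm(samples: Float32Array): Int16Array {
  return Int16Array.from(samples, (sample) => {
    // Clamp first so out-of-range input cannot wrap around.
    const clamped = Math.max(-1, Math.min(1, sample));
    // Negative values scale to -32768, positive values to 32767.
    return Math.round(clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff);
  });
}
```

Inside an `AudioWorkletProcessor`, `process()` receives each channel as a `Float32Array`, so a helper like this could run on every block before the resulting buffer is posted to the main thread and sent over the WebSocket.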