Supercharge Real-Time Multi-Language Translation with AssemblyAI
Abraham Dahunsi (@abraham_root)

About: I am a software developer who loves solving problems by writing code. I also break down complex processes by writing technical guides.

Publish Date: Jul 28

This is a submission for the AssemblyAI Voice Agents Challenge

What I Built

I created LinguaBridge, a real-time bidirectional voice translation app. It utilizes AssemblyAI’s Universal-Streaming API for speech-to-text (STT), Google Gemini for instant translations, and Cartesia's high-performance text-to-speech (TTS) to deliver ultra-low-latency translations, targeting sub-300ms round-trip latency.

With LinguaBridge, conversations across language barriers become natural and effortless, ideal for real-time interactions in professional, personal, and educational contexts.

This submission addresses the Real-Time Performance Voice Agent prompt with:

  • Ultra-Low Latency: Sub-500ms round-trip voice translation latency.
  • Streaming Speech Recognition: Instantaneous processing of spoken input using AssemblyAI's Universal-Streaming API.
  • Immediate Translation: Real-time language translation via Google's Gemini Flash model.
  • Natural Voice Output: Instant text-to-speech synthesis powered by Cartesia TTS, ensuring natural conversational flow.
  • Multi-language Support: Seamless bidirectional translation across 12 languages, including English, Spanish, French, German, Chinese, and Arabic.

Core Problem Addressed

Effective communication across language barriers remains challenging in professional, educational, and personal contexts. LinguaBridge solves this by providing immediate, natural, and seamless voice translation, enabling effortless multilingual conversations in real-time.

Demo

Check out LinguaBridge live here:

LinguaBridge

Real-time cross-language voice translation with ultra-low latency.

Overview

LinguaBridge is a browser-based voice app that performs live bidirectional speech translation. Users select two languages (Speaker A and Speaker B). When a speaker talks, the app:

  1. Transcribes speech with AssemblyAI's Universal-Streaming STT
  2. Sends partial transcripts to Google Gemini 2.5 Flash for fast translation
  3. Streams the translated output through Cartesia Sonic 2 or Sonic Turbo for ultra-fast TTS playback in the listener's language

All interactions are streamed with sub-300ms latency to enable fluid cross-language voice conversations.

Setup

1. Environment Variables

Create a .env.local file in the root directory with the following variables:

```
# AssemblyAI API Key
# Get your API key from https://www.assemblyai.com/app/account
ASSEMBLYAI_API_KEY=your_assemblyai_key

# Google Gemini API Key
# Get your API key from https://aistudio.google.com/app/apikey
GEMINI_API_KEY=your_gemini_key

# Cartesia API Key
# Get your API key from https://cartesia.ai
CARTESIA_API_KEY=your_cartesia_key
```

2. Install Dependencies

```
npm install
```

Running the Application

LinguaBridge requires two processes…

Demo Video

Screenshots

  • Landing Area

Landing area 1

Landing Area 2

  • Language Selection

Language selection 1

Language selection 2

  • Voice Selection

Voice Selection

  • Live Transcription Area

Live Transcription Area

Technical Implementation & AssemblyAI Integration

1. Real-Time Speech Processing with AssemblyAI WebSocket API

LinguaBridge leverages AssemblyAI's WebSocket API for real-time speech-to-text transcription, ensuring ultra-low latency:

```typescript
// lib/services/assemblyai-streaming.ts
export class AssemblyAIStreamingService {
  private ws: WebSocket | null = null;
  private currentLanguage = '';
  private onPartialCallback: ((text: string) => void) | null = null;

  async connect(
    language: string,
    onPartialTranscript: (text: string) => void,
  ): Promise<void> {
    this.onPartialCallback = onPartialTranscript;
    this.currentLanguage = language;

    this.disconnect(); // ensure a clean connection before reconnecting
    await this.createWebSocketConnection(); // connect to the AssemblyAI WebSocket proxy
  }

  disconnect(): void {
    this.ws?.close();
    this.ws = null;
  }

  private async createWebSocketConnection(): Promise<void> {
    // Opens the socket to the proxy and routes incoming partial
    // transcripts to onPartialCallback (omitted here for brevity).
  }
}
```

This implementation provides:

  • Real-time transcription with partial results.
  • Automatic reconnection handling.
  • Language-specific configurations.
  • Robust error handling.
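
The messages coming back over that socket need to be turned into transcript updates before they can feed the translation step. Here's a minimal sketch of that parsing, assuming AssemblyAI's v3 streaming "Turn" message shape (`type`, `transcript`, `end_of_turn`); the helper name is illustrative, not from the actual codebase:

```typescript
// Hypothetical helper: convert a raw proxy frame into a transcript update.
// Field names follow AssemblyAI's v3 streaming "Turn" messages.
interface TurnMessage {
  type: string;
  transcript?: string;
  end_of_turn?: boolean;
}

export function extractTranscript(
  raw: string,
): { text: string; final: boolean } | null {
  let msg: TurnMessage;
  try {
    msg = JSON.parse(raw);
  } catch {
    return null; // ignore non-JSON frames
  }
  // Only "Turn" messages with text are useful to the translation step
  if (msg.type !== 'Turn' || !msg.transcript) return null;
  return { text: msg.transcript, final: Boolean(msg.end_of_turn) };
}
```

Partial (non-final) results are what make the pipeline feel instantaneous: translation can start before the speaker finishes the sentence.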

2. Secure WebSocket Proxy for AssemblyAI Communication

To securely manage API keys and optimize performance, LinguaBridge implements a custom WebSocket proxy:

```javascript
// server.js
const { WebSocketServer } = require('ws');
const WebSocket = require('ws');

// API key stays server-side, loaded from the environment
const ASSEMBLYAI_API_KEY = process.env.ASSEMBLYAI_API_KEY;

// WebSocket proxy server setup (port is illustrative)
const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (client) => {
  const upstream = new WebSocket(
    'wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&format_turns=true',
    { headers: { authorization: ASSEMBLYAI_API_KEY } }
  );

  // Forward messages from AssemblyAI to the client
  upstream.on('message', (data) => {
    if (client.readyState === WebSocket.OPEN) {
      client.send(data.toString());
    }
  });

  // Buffer audio from the client until the upstream socket is open
  const pending = [];
  client.on('message', (data) => {
    if (upstream.readyState === WebSocket.OPEN) {
      upstream.send(data);
    } else {
      pending.push(data);
    }
  });
  upstream.on('open', () => {
    pending.forEach((chunk) => upstream.send(chunk));
    pending.length = 0;
  });
});
```

This proxy ensures:

  • Secure server-side API key management.
  • Audio buffering during connection setup.
  • Reliable and scalable communication.

3. Optimized Audio Capture with Web Audio API

LinguaBridge uses the Web Audio API and AudioWorklet for high-quality audio processing optimized for speech recognition:

```typescript
// lib/services/audio-processor.ts
export class AudioProcessor {
  private audioContext: AudioContext | null = null;

  async startCapture(onAudioData: (data: ArrayBuffer) => void): Promise<void> {
    // 16 kHz mono matches the sample rate the proxy advertises to AssemblyAI
    const audioContext = new AudioContext({ sampleRate: 16000 });
    this.audioContext = audioContext;

    const stream = await navigator.mediaDevices.getUserMedia({
      audio: { channelCount: 1, echoCancellation: true, noiseSuppression: true }
    });

    const source = audioContext.createMediaStreamSource(stream);
    await audioContext.audioWorklet.addModule('/audio-processor.js');

    const workletNode = new AudioWorkletNode(audioContext, 'audio-processor');
    workletNode.port.onmessage = (event) => {
      if (event.data.audioData) {
        onAudioData(event.data.audioData);
      }
    };
    source.connect(workletNode);
  }
}
```

This ensures:

  • Efficient audio capture at 16kHz mono.
  • Built-in noise suppression and echo cancellation.
  • Cross-browser compatibility.
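
Inside the worklet, the main job is converting the browser's Float32 samples into the 16-bit PCM that streaming STT APIs expect. A minimal sketch of that conversion (the function name is illustrative):

```typescript
// Float32 samples in [-1, 1] become 16-bit signed PCM.
// Clamping guards against samples slightly outside the nominal range.
export function float32ToPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Negative range is one step wider (-32768) than positive (32767)
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

The worklet posts the resulting buffer back through its `port`, which is what the `onmessage` handler above receives as `event.data.audioData`.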

4. Real-Time Translation via Google Gemini

LinguaBridge integrates Google's Gemini 2.5 Flash for ultra-fast translations:

```typescript
// lib/services/gemini-translation.ts
import { GoogleGenerativeAI, GenerativeModel } from '@google/generative-ai';

export class GeminiTranslationService {
  private model: GenerativeModel;

  constructor(apiKey: string) {
    const genAI = new GoogleGenerativeAI(apiKey);
    this.model = genAI.getGenerativeModel({
      model: 'gemini-2.5-flash',
      // Low temperature keeps translations deterministic
      generationConfig: { temperature: 0.1, maxOutputTokens: 1024 }
    });
  }

  async translateStream(text: string, sourceLang: string, targetLang: string): Promise<string> {
    // optimized translation logic
  }
}
```

This translation approach provides:

  • <150ms latency.
  • Intelligent caching and optimized prompts.
  • Robust error handling.
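
The caching and prompt details are elided above; here's a minimal sketch of what they could look like. All names here are illustrative, and the cache-in-front-of-the-model pattern is one way to avoid re-translating repeated partial transcripts:

```typescript
// Normalized cache key: repeated partials with trivial differences
// (case, surrounding whitespace) hit the same entry.
export function cacheKey(text: string, src: string, tgt: string): string {
  return `${src}:${tgt}:${text.trim().toLowerCase()}`;
}

// Terse prompt keeps the model's output to the translation alone.
export function buildPrompt(text: string, src: string, tgt: string): string {
  return `Translate from ${src} to ${tgt}. Output only the translation.\n\n${text}`;
}

const cache = new Map<string, string>();

export async function cachedTranslate(
  text: string,
  src: string,
  tgt: string,
  translate: (prompt: string) => Promise<string>,
): Promise<string> {
  const key = cacheKey(text, src, tgt);
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // skip the model call entirely
  const result = await translate(buildPrompt(text, src, tgt));
  cache.set(key, result);
  return result;
}
```

Because partial transcripts arrive many times per second, even a simple in-memory cache like this can eliminate a large share of model calls.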

5. High-Speed Text-to-Speech with Cartesia

Cartesia TTS is integrated for fast, natural speech synthesis:

```typescript
// lib/services/cartesia-tts.ts
export class CartesiaTTSService {
  private ws: WebSocket | null = null;

  async streamText(text: string, voiceId: string): Promise<void> {
    if (!this.ws || this.ws.readyState !== WebSocket.OPEN) {
      await this.connectWebSocket(voiceId); // opens the Cartesia socket
    }

    this.ws!.send(JSON.stringify({
      model_id: 'sonic-turbo',
      voice: { mode: 'id', id: voiceId },
      transcript: text,
      output_format: { encoding: 'pcm_s16le', sample_rate: 16000 }
    }));
  }
}
```

This implementation ensures:

  • Low-latency speech synthesis.
  • Seamless audio playback.
  • Multiple voice and language support.
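
On the receiving end, Cartesia's `pcm_s16le` frames have to be decoded back into Float32 samples before they can fill a Web Audio buffer for playback. A minimal sketch of that decoding (function name is illustrative):

```typescript
// Little-endian 16-bit PCM back to Float32 samples in [-1, 1],
// ready to copy into an AudioBuffer channel for playback.
export function pcm16ToFloat32(buffer: ArrayBuffer): Float32Array {
  const view = new DataView(buffer);
  const out = new Float32Array(buffer.byteLength / 2);
  for (let i = 0; i < out.length; i++) {
    const s = view.getInt16(i * 2, true); // true = little-endian
    // Mirror of the capture-side scaling: divide by the range width
    out[i] = s < 0 ? s / 0x8000 : s / 0x7fff;
  }
  return out;
}
```

This is the inverse of the capture-side Float32-to-PCM conversion, so a sample survives a round trip through both unchanged.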

6. End-to-End Real-Time Translation Pipeline

LinguaBridge orchestrates all services into a seamless, real-time pipeline:

```typescript
// hooks/use-translation.ts
export const useTranslation = () => {
  const startTranslation = async (
    sourceLang: string,
    targetLang: string,
    voice: string,
  ) => {
    // Connect STT first so transcripts flow as soon as audio arrives
    assemblyAI.connect(sourceLang, (transcript) => {
      gemini.translateStream(transcript, sourceLang, targetLang)
        .then((translation) => cartesia.streamText(translation, voice));
    });

    await audioProcessor.startCapture((audioData) => {
      assemblyAI.sendAudioData(audioData);
    });
  };

  return { startTranslation };
};
```

This coordination ensures:

  • Seamless real-time interaction.
  • Efficient resource management.
  • Dynamic error handling.

Integration Architecture

Complete LinguaBridge workflow:

```
Audio Capture → WebSocket Proxy → AssemblyAI STT → Gemini Translation → Cartesia TTS → Audio Playback
```

This architecture achieves:

  • Sub-500ms end-to-end latency.
  • High-performance scalability.
  • Robust and maintainable integration.
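
Hitting a latency budget means knowing where the milliseconds go. One way to attribute end-to-end time to individual stages is a small timing wrapper like this sketch (the helper and stage names are illustrative, not from the actual codebase):

```typescript
// Wrap any async pipeline stage and record its wall-clock duration,
// so the end-to-end budget can be split across STT, translation, and TTS.
export async function timeStage<T>(
  name: string,
  stage: () => Promise<T>,
  timings: Record<string, number>,
): Promise<T> {
  const start = Date.now();
  const result = await stage();
  timings[name] = Date.now() - start; // milliseconds spent in this stage
  return result;
}
```

Running each stage through `timeStage` yields a per-utterance breakdown (e.g. `{ stt: 120, translate: 140, tts: 90 }`) that makes regressions in any one service immediately visible.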

7. Multi-Language Real-Time Processing

LinguaBridge supports 12 languages with streamlined switching:

```typescript
// app/page.tsx
const SUPPORTED_LANGUAGES = [
  { code: 'en', name: 'English' },
  { code: 'es', name: 'Spanish' },
  { code: 'fr', name: 'French' },
  { code: 'de', name: 'German' },
  { code: 'it', name: 'Italian' },
  { code: 'pt', name: 'Portuguese' },
  { code: 'ru', name: 'Russian' },
  { code: 'ja', name: 'Japanese' },
  { code: 'ko', name: 'Korean' },
  { code: 'zh', name: 'Chinese (Mandarin)' },
  { code: 'ar', name: 'Arabic' },
  { code: 'hi', name: 'Hindi' },
];
```

This ensures:

  • Bidirectional multilingual translation.
  • Dynamic language selection.
  • Integrated voice profile management.
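
The bidirectional switching itself reduces to a small decision: whichever speaker is talking defines the source language, and the other speaker's language is the target. A minimal sketch of that helper (name is illustrative):

```typescript
type Speaker = 'A' | 'B';

// Given the two selected languages and the active speaker,
// return the translation direction for this utterance.
export function translationDirection(
  langA: string,
  langB: string,
  active: Speaker,
): { source: string; target: string } {
  return active === 'A'
    ? { source: langA, target: langB }
    : { source: langB, target: langA };
}
```

Each time the active speaker flips, the same pipeline runs with the direction reversed, which is what keeps the conversation symmetric.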
