Who Said What? Build a Smart Transcriber Agent with AWS & LangChain

Amazon Transcribe provides automatic speech recognition (ASR) with support for speaker diarization—the process of labeling individual speakers in audio recordings.

🛠️ Prerequisites

✅ AWS Account
✅ AWS CLI or SDK installed and configured
✅ An S3 bucket to store audio files
✅ Audio file in supported format (e.g., .wav, .mp3, .flac)

📤 Step 1: Upload Audio to Amazon S3

aws s3 cp your_audio_file.wav s3://your-bucket-name/
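
If you would rather stay in Python than shell out to the CLI, the same upload can be sketched with boto3. The helper name `upload_audio` and its parameters are made up for this post; `s3_client` is assumed to be whatever `boto3.client("s3")` returns.

```python
from pathlib import Path


def upload_audio(s3_client, local_path, bucket, key=None):
    """Upload a local audio file and return its s3:// URI.

    s3_client is assumed to behave like boto3.client("s3").
    upload_audio is a hypothetical helper, not part of boto3.
    """
    path = Path(local_path)
    # Default the object key to the file name, like the CLI command above does.
    object_key = key or path.name
    s3_client.upload_file(str(path), bucket, object_key)
    return f"s3://{bucket}/{object_key}"
```

Calling `upload_audio(boto3.client("s3"), "your_audio_file.wav", "your-bucket-name")` should hand back the `s3://` URI you need for `MediaFileUri` in Step 2.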

🧠 Step 2: Start Transcription Job with Speaker Diarization Enabled

aws transcribe start-transcription-job \
  --transcription-job-name "diarization-job-001" \
  --language-code "en-US" \
  --media MediaFileUri=s3://your-bucket-name/your_audio_file.wav \
  --output-bucket-name your-output-bucket \
  --settings ShowSpeakerLabels=true,MaxSpeakerLabels=5

📌 ShowSpeakerLabels=true enables speaker diarization

📌 MaxSpeakerLabels=5 sets an upper limit on the number of speakers; if the recording has more voices than this, several of them end up sharing one label

⏳ Step 3: Check Transcription Job Status

aws transcribe get-transcription-job \
  --transcription-job-name "diarization-job-001"

Once the job status becomes COMPLETED, the transcription JSON is available in your S3 output bucket.
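
Running the status command by hand gets old quickly, so here is a minimal polling sketch. `wait_for_job`, `poll_seconds`, and `timeout_seconds` are names invented for this example; `transcribe_client` is assumed to be `boto3.client("transcribe")`.

```python
import time


def wait_for_job(transcribe_client, job_name, poll_seconds=10, timeout_seconds=1800):
    """Poll get_transcription_job until the job completes, fails, or times out.

    wait_for_job is a hypothetical helper; the default intervals are arbitrary.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = transcribe_client.get_transcription_job(
            TranscriptionJobName=job_name
        )
        job = response["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            # The finished job reports where the transcript JSON was written.
            return job["Transcript"]["TranscriptFileUri"]
        if status == "FAILED":
            reason = job.get("FailureReason", "unknown reason")
            raise RuntimeError(f"{job_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{job_name} still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

For the job above you would call `wait_for_job(boto3.client("transcribe"), "diarization-job-001")` and get the transcript URI back once the status flips to COMPLETED.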

📄 Step 4: View Diarized Transcription Output

Sample excerpt from the output JSON:

{
  "results": {
    "speaker_labels": {
      "segments": [
        {
          "speaker_label": "spk_0",
          "start_time": "0.0",
          "end_time": "2.5"
        }
      ]
    },
    "items": [
      {
        "start_time": "0.0",
        "end_time": "0.7",
        "alternatives": [
          {
            "confidence": "1.0",
            "content": "Hello"
          }
        ],
        "type": "pronunciation",
        "speaker_label": "spk_0"
      }
    ]
  }
}
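
To get a feel for the `speaker_labels` section, here is a small sketch that adds up how long each speaker talked. The function name `talk_time_by_speaker` is made up for this post, and it assumes the JSON has already been loaded into a Python dict shaped like the excerpt above.

```python
from collections import defaultdict


def talk_time_by_speaker(transcript):
    """Sum segment durations (in seconds) per speaker label.

    Hypothetical helper; expects the structure shown in the sample excerpt.
    """
    totals = defaultdict(float)
    for segment in transcript["results"]["speaker_labels"]["segments"]:
        # Times are stored as strings in the JSON, so convert before subtracting.
        duration = float(segment["end_time"]) - float(segment["start_time"])
        totals[segment["speaker_label"]] += duration
    return dict(totals)


sample = {
    "results": {
        "speaker_labels": {
            "segments": [
                {"speaker_label": "spk_0", "start_time": "0.0", "end_time": "2.5"}
            ]
        }
    }
}
print(talk_time_by_speaker(sample))  # {'spk_0': 2.5}
```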

🐍 Optional: Python Script to Start Job

import boto3

transcribe = boto3.client('transcribe')

transcribe.start_transcription_job(
    TranscriptionJobName='diarization-job-001',
    LanguageCode='en-US',
    Media={'MediaFileUri': 's3://your-bucket-name/your_audio_file.wav'},
    OutputBucketName='your-output-bucket',
    Settings={
        'ShowSpeakerLabels': True,
        'MaxSpeakerLabels': 5
    }
)

📝 Optional: Convert Output to Readable Text

Example post-processed output:

Speaker 1: Hello, how are you?
Speaker 2: I'm doing well, thanks. And you?
Speaker 1: I'm great!

You can write a script to process the JSON and reformat it into readable dialogue using speaker labels and timestamps.
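
A minimal version of such a script might look like the sketch below. It assumes every word in `items` carries a `speaker_label`, as in the sample excerpt from Step 4; if your output only has labels under `speaker_labels.segments`, you would need to map words to speakers by timestamp instead. The name `to_dialogue` is made up for this post, and timestamps are left out to keep it short.

```python
def to_dialogue(transcript):
    """Group consecutive words by speaker and return readable dialogue lines.

    Hypothetical helper; assumes items carry "speaker_label" as in the excerpt.
    """
    turns = []  # each entry is [speaker_label, text_so_far]
    for item in transcript["results"]["items"]:
        content = item["alternatives"][0]["content"]
        if item["type"] == "punctuation":
            # Punctuation has no timestamps; glue it onto the previous word.
            if turns:
                turns[-1][1] += content
            continue
        # If a word has no label, keep it with the current speaker.
        speaker = item.get("speaker_label") or (turns[-1][0] if turns else "spk_0")
        if turns and turns[-1][0] == speaker:
            turns[-1][1] += " " + content
        else:
            turns.append([speaker, content])

    lines = []
    for speaker, text in turns:
        # "spk_0" becomes "Speaker 1" to match the example output.
        number = int(speaker.split("_")[1]) + 1
        lines.append(f"Speaker {number}: {text}")
    return "\n".join(lines)
```

Load the transcript with `json.load`, pass the resulting dict to `to_dialogue`, and print the result to get lines in the "Speaker 1: ..." style shown above.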

🧩 Notes

Speaker diarization is shown here with batch transcription jobs; Amazon Transcribe streaming also supports it through its own ShowSpeakerLabel setting.
The accuracy depends on the quality of the audio and clarity of speaker voices.
Diarization is supported for select languages (e.g., English).

🤖 Bonus: Create a Transcriber Agent using LangChain and AWS

You can automate the transcription and diarization process using a LangChain agent!

🧩 Requirements

langchain
boto3
openai (for natural language post-processing or QA)

📦 Install Dependencies

pip install langchain boto3 openai

🤖 Sample LangChain Agent Setup

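# Legacy import paths; newer LangChain releases move OpenAI to the langchain_openai package
from langchain.agents import initialize_agent, Tool
from langchain.llms import OpenAI
import boto3
import time

# Tool to trigger transcription job
def start_transcription_job(file_uri):
    # The agent passes raw text, so trim whitespace and stray quotes from the URI
    file_uri = file_uri.strip().strip('"\'')
    # Job names must be unique, so add a timestamp suffix to avoid a conflict on reruns
    job_name = f"LangChainDiarizationJob-{int(time.time())}"
    transcribe = boto3.client('transcribe')
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        LanguageCode="en-US",
        Media={'MediaFileUri': file_uri},
        OutputBucketName='your-output-bucket',
        Settings={
            'ShowSpeakerLabels': True,
            'MaxSpeakerLabels': 5
        }
    )
    # Return the job name so the agent (and you) can look the job up later
    return f"Started transcription job: {job_name}"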

# Register tool with LangChain
tools = [
    Tool(
        name="AWSTranscribeDiarizer",
        func=start_transcription_job,
        description="Start a diarization transcription job using Amazon Transcribe given the S3 URI of an audio file (s3://bucket/key)"
    )
]

# Initialize agent with OpenAI and tools
llm = OpenAI(temperature=0)
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)

# Run agent with a prompt
agent.run("Transcribe the file at s3://your-bucket-name/your_audio.wav with speaker labels")

🧠 What This Agent Does

Accepts a natural-language prompt and decides to call the Amazon Transcribe tool
Starts a diarization job for the S3 URI in the prompt (it does not wait for the job to finish)
Can be extended to fetch and format output, or even generate summaries!
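
As one possible extension, here is a sketch of a helper that pulls the finished transcript back out of S3. It assumes the job was started with `OutputBucketName` and no `OutputKey`, so the result sits at `<job-name>.json` in the root of the output bucket. `fetch_transcript_text` is a name made up for this post, and `s3_client` is assumed to be `boto3.client("s3")`.

```python
import json


def fetch_transcript_text(s3_client, bucket, job_name):
    """Download the job's output JSON and return the plain transcript text.

    Hypothetical helper; assumes the object key is "<job_name>.json"
    (no OutputKey was set when the job was started).
    """
    obj = s3_client.get_object(Bucket=bucket, Key=f"{job_name}.json")
    transcript = json.loads(obj["Body"].read())
    # The full text lives under results.transcripts in the output JSON.
    return transcript["results"]["transcripts"][0]["transcript"]
```

To expose it to the agent, you could wrap it in a second `Tool` whose `func` takes a job name and calls `fetch_transcript_text(boto3.client("s3"), "your-output-bucket", job_name)`, then let the LLM summarize whatever text comes back.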

Chandrani Mukherjee @moni121189