How to create video transcription with ffmpeg and whisper
Mark Kop

Publish Date: Apr 9

Requirements

  • ffmpeg
  • whisper
  • Python 3.10+ (for Whisper)
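Before going further, it can help to confirm that the requirements are actually reachable from your shell. The following is a minimal sketch, assuming the tools are installed under the command names `ffmpeg` and `whisper`; the helper names `missing_tools`, `python_ok`, and `report` are made up for this example and are not part of either tool.

```python
import shutil
import sys

# Command names used in this article (assumed to match your installation).
REQUIRED_TOOLS = ("ffmpeg", "whisper")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_ok(minimum=(3, 10)):
    """True if the running interpreter meets the article's stated minimum."""
    return sys.version_info >= minimum


def report():
    """Print a short readiness summary (hypothetical helper)."""
    missing = missing_tools()
    print("Python version OK:", python_ok())
    print("Missing tools:", ", ".join(missing) if missing else "none")
```

Calling `report()` after installation should list no missing tools; anything it does list still needs to be installed or added to PATH.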

Installation

macOS

# Install Homebrew if you don't have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install ffmpeg
brew install ffmpeg

# Install Python (if needed)
brew install python

# Install Whisper
pip3 install --upgrade pip
pip3 install git+https://github.com/openai/whisper.git

Windows

# Install Chocolatey if you don't have it
# Run in PowerShell as administrator:
Set-ExecutionPolicy Bypass -Scope Process -Force
[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072
iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))

# Install ffmpeg
choco install ffmpeg

# Install Python (from python.org)
# Make sure to check "Add Python to PATH" during installation

# Install Whisper
pip install -U openai-whisper

Linux

# Install ffmpeg
sudo apt update && sudo apt install ffmpeg

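# Install Python, pip, and venv support
sudo apt install python3 python3-pip python3-venv

# Install Whisper inside a virtual environment
# (recent Debian/Ubuntu releases refuse system-wide pip installs;
# ~/whisper-env is only an example path)
python3 -m venv ~/whisper-env
source ~/whisper-env/bin/activate
pip install git+https://github.com/openai/whisper.git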

Transcription Steps

  1. Extract audio from video using ffmpeg
   ffmpeg -i input_video.mp4 -vn -acodec mp3 output.mp3
  2. Transcribe audio with Whisper
   whisper output.mp3 --language English --model small --output_format txt
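If you run these two steps often, they can be chained from a small script. This is only a sketch under a few assumptions: both commands are on PATH, the audio file is named after the video (rather than the fixed `output.mp3` used above), and the function names are invented for illustration. The `run` parameter exists so the commands can be inspected without executing them.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(video, audio):
    # Same flags as step 1: drop the video stream (-vn) and encode the audio as MP3.
    return ["ffmpeg", "-i", str(video), "-vn", "-acodec", "mp3", str(audio)]


def whisper_command(audio):
    # Same flags as step 2: English audio, "small" model, plain-text output.
    return ["whisper", str(audio), "--language", "English",
            "--model", "small", "--output_format", "txt"]


def transcribe_video(video, run=subprocess.run):
    """Extract the audio, then transcribe it. Returns the audio path."""
    video = Path(video)
    audio = video.with_suffix(".mp3")  # assumption: audio sits next to the video
    for command in (ffmpeg_command(video, audio), whisper_command(audio)):
        run(command, check=True)  # check=True stops the pipeline if a step fails
    return audio
```

For example, `transcribe_video("input_video.mp4")` would run ffmpeg first and only start Whisper if the extraction succeeded.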

Model Options

  • tiny: Fastest, lowest accuracy (~1 GB VRAM)
  • base: Fast, decent accuracy (~1 GB VRAM)
  • small: Balanced speed/accuracy (~2 GB VRAM)
  • medium: Good accuracy (~5 GB VRAM)
  • large: Best accuracy (~10 GB VRAM)
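One way to use this list is to pick the largest model that fits your memory budget. The sketch below simply copies the approximate figures from the list above; `pick_model` is a hypothetical helper, and real memory use may differ with your hardware and audio length.

```python
# Approximate memory needs in GB, copied from the list above (smallest to largest).
MODEL_MEMORY_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def pick_model(available_gb):
    """Return the largest listed model that fits in available_gb, or None."""
    best = None
    # Dicts preserve insertion order, so later matches are larger models.
    for name, needed_gb in MODEL_MEMORY_GB.items():
        if needed_gb <= available_gb:
            best = name
    return best
```

With 4 GB available this returns `"small"`, the value you would then pass to `--model`.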

Output Formats

  • txt: Plain text transcript
  • srt: Standard subtitle format
  • vtt: Web Video Text Tracks format
  • json: Detailed JSON with timestamps
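To show how the timestamped formats relate, here is a rough sketch that turns a list of segments into SRT text. It assumes each segment is a dict with `start` and `end` times in seconds and a `text` string, which is how the segments in Whisper's JSON output are generally laid out; the function names are made up for this example, and Whisper's own `--output_format srt` already does this for you.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm, the timestamp style used by SRT."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Build SRT text: a counter, a time range, then the subtitle line."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = srt_timestamp(segment["start"])
        end = srt_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)  # a blank line separates consecutive subtitles
```

The same segment data carries everything the txt, srt, and vtt outputs need; the formats mostly differ in how much of the timing they keep and how they print it.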

Additional Options

  • --task translate: Translates non-English audio to English
  • --language en: Specifies the language spoken in the audio, as a code or name (en or English); skipping auto-detection is faster and avoids misdetection
  • --model: Selects the model size (tiny/base/small/medium/large)
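These options can be combined in a single command. The sketch below assembles the argument list from Python; `whisper_args` is a hypothetical helper, `talk.mp3` is a placeholder file name, and the validation lists only mirror the values named in this article, so the real CLI may accept more.

```python
# Values listed in this article; the Whisper CLI may accept others.
MODELS = ("tiny", "base", "small", "medium", "large")
FORMATS = ("txt", "srt", "vtt", "json")


def whisper_args(audio, model="small", language=None,
                 translate=False, output_format="txt"):
    """Build a whisper command line as a list suitable for subprocess.run."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if output_format not in FORMATS:
        raise ValueError(f"unknown output format: {output_format}")
    args = ["whisper", audio, "--model", model, "--output_format", output_format]
    if language:
        args += ["--language", language]  # omit to let Whisper detect the language
    if translate:
        args += ["--task", "translate"]  # translate the speech into English
    return args
```

For instance, `whisper_args("talk.mp3", language="pt", translate=True, output_format="srt")` describes a run that reads Portuguese audio and writes English subtitles.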

Source: macos.gadgethacks.com
Source: dev.to

Comments 1 total

  • Danny S, Jun 18, 2025

    When using that Whisper, no need to input API Key?
