OpenAI Whisper is an open-source speech recognition model that runs locally on your own hardware. It transcribes audio and video in 99 languages with no cloud subscription, no audio leaving your machine, and no ongoing costs. OpenAI released the code and model weights openly, and Whisper powers many commercial transcription services, but you can run it yourself for free.

Why Run Whisper Locally?

Privacy: Audio stays on your machine — critical for sensitive meetings, medical notes, or confidential content
No cost per minute: Cloud transcription services charge per minute of audio; local Whisper costs nothing beyond the hardware and electricity
Accuracy: Whisper large-v3 is competitive with commercial services for English, and performs well in many other languages
Offline: Works without an internet connection once the model has been downloaded
Batch processing: Transcribe hundreds of files overnight without API rate limits

Model Sizes and Hardware Requirements

Whisper comes in several sizes with different speed/accuracy tradeoffs. Speed is relative to the large model, not to realtime; actual throughput depends on your hardware:

Model	Parameters	VRAM (GPU)	RAM (CPU)	Relative Speed (vs. large)	Accuracy
tiny	39M	~1 GB	~1 GB	~32x	Good
base	74M	~1 GB	~1 GB	~16x	Better
small	244M	~2 GB	~2 GB	~6x	Very Good
medium	769M	~5 GB	~5 GB	~2x	Excellent
large-v3	1550M	~10 GB	~10 GB	1x	Best

For most users, medium provides excellent accuracy with manageable resource use. Choose large-v3 for maximum accuracy if you have a GPU with 10 GB or more of VRAM.
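The table can be turned into a small lookup for choosing a model size. This is only a sketch: the function name pick_model is made up for illustration, the VRAM figures are the approximate ones from the table, and real memory use also depends on audio length and decoding settings.

```python
# Approximate VRAM needs in GB, copied from the table above (largest first).
MODEL_VRAM_GB = [
    ("large-v3", 10),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
    ("tiny", 1),
]


def pick_model(available_gb: float) -> str:
    """Return the largest model whose approximate VRAM need fits in available_gb."""
    for name, needed_gb in MODEL_VRAM_GB:
        if available_gb >= needed_gb:
            return name
    # Nothing fits comfortably; fall back to the smallest model.
    return "tiny"


if __name__ == "__main__":
    for gb in (16, 8, 4):
        print(f"{gb} GB -> {pick_model(gb)}")
```

With 8 GB available this picks medium, which matches the recommendation above.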

Installation

Python/pip Method

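# Install Whisper, ideally inside a virtual environment (python3 -m venv .venv)
# --break-system-packages is only needed for a system-managed Python (PEP 668)
pip install openai-whisper --break-system-packages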

# Install ffmpeg (required for audio processing)
# Ubuntu/Debian:
sudo apt install ffmpeg
# macOS:
brew install ffmpeg
# Windows: download from ffmpeg.org and add to PATH
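
A missing ffmpeg is a common reason a fresh install fails on the first file, so it can help to check for it before transcribing. The sketch below uses only the Python standard library; the helper name find_ffmpeg is an assumption for illustration.

```python
import shutil


def find_ffmpeg(executable: str = "ffmpeg"):
    """Return the full path to ffmpeg if it is on PATH, otherwise None."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_ffmpeg()
    if path:
        print(f"ffmpeg found at {path}")
    else:
        print("ffmpeg not found; install it before running Whisper")
```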

Faster-Whisper (Recommended for Speed)

faster-whisper reimplements Whisper on top of CTranslate2 and is up to 4x faster than the original at the same accuracy, while using less memory:

pip install faster-whisper --break-system-packages

WhisperX (Speaker Diarization)

For transcription with multiple speakers identified:

pip install whisperx --break-system-packages

Basic Transcription

# Transcribe an audio file (auto-detects language)
whisper audio.mp3 --model medium

# Specify language explicitly (faster)
whisper meeting.mp4 --model medium --language en

# Output formats
whisper audio.mp3 --model medium --output_format txt
whisper audio.mp3 --model medium --output_format srt  # Subtitles
whisper audio.mp3 --model medium --output_format vtt  # Web subtitles
whisper audio.mp3 --model medium --output_format json  # Full JSON with timestamps

# Translate to English from another language
whisper spanish_audio.mp3 --model medium --task translate
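
To drive the same commands from a script, one option is to build the argument list in Python and hand it to subprocess. This is a minimal sketch that uses only the flags shown above (--model, --language, --output_format, --task); the helper name build_whisper_command is hypothetical.

```python
import subprocess


def build_whisper_command(audio_path, model="medium", language=None,
                          output_format=None, translate=False):
    """Return the whisper CLI call as an argument list for subprocess.run."""
    command = ["whisper", str(audio_path), "--model", model]
    if language:
        command += ["--language", language]
    if output_format:
        command += ["--output_format", output_format]
    if translate:
        command += ["--task", "translate"]
    return command


def run_whisper(audio_path, **options):
    """Run the CLI and raise CalledProcessError if whisper exits with an error."""
    subprocess.run(build_whisper_command(audio_path, **options), check=True)


# Usage (requires the whisper CLI to be installed):
# run_whisper("meeting.mp4", language="en", output_format="srt")
```

Passing a list rather than a single shell string avoids quoting problems with file names that contain spaces.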

Python API Usage

import whisper

# Load model (downloads on first run, cached afterward)
model = whisper.load_model("medium")

# Transcribe
result = model.transcribe("audio.mp3", language="en")

print(result["text"])

# Access segments with timestamps
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"]
    print(f"[{start:.1f}s - {end:.1f}s] {text}")
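
The segment list above carries everything needed to write subtitles by hand. As an illustration, the sketch below turns segments into SRT text; it assumes each segment is a dict with "start", "end" and "text" keys as in the loop above, and the function names are made up for this example.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from dicts with "start", "end" and "text" keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


# Usage with a Whisper result:
# Path("audio.srt").write_text(segments_to_srt(result["segments"]), encoding="utf-8")
```

In practice the CLI's --output_format srt does this for you; the sketch is mainly useful when you post-process segments in Python first.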

Faster-Whisper Python API

from faster_whisper import WhisperModel

# Initialize model (cpu or cuda)
model = WhisperModel("medium", device="cuda", compute_type="float16")

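# Transcribe (language is omitted here so that it is auto-detected)
# segments is a generator: the audio is processed as you iterate over it
segments, info = model.transcribe("audio.mp3", beam_size=5)

print(f"Detected language: {info.language}")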
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

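compute_type="float16" runs the model in half precision, which uses the tensor cores on NVIDIA RTX cards for faster processing. On a CPU, use device="cpu" with compute_type="int8" instead.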
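
The device and precision choice can be kept in one place so the same script runs on GPU and CPU machines. The sketch below is an assumption-laden illustration: the helper name faster_whisper_settings is invented here, and int8 on CPU is a common pairing rather than the only valid one.

```python
def faster_whisper_settings(use_gpu: bool) -> dict:
    """Return device/compute_type keyword arguments for WhisperModel."""
    if use_gpu:
        # float16 on the GPU, as in the example above.
        return {"device": "cuda", "compute_type": "float16"}
    # int8 quantization keeps CPU inference lighter on memory and time.
    return {"device": "cpu", "compute_type": "int8"}


# Usage (requires faster-whisper to be installed):
# from faster_whisper import WhisperModel
# model = WhisperModel("medium", **faster_whisper_settings(use_gpu=False))
```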

GPU Acceleration

With an NVIDIA GPU (CUDA):

# Install CUDA-enabled torch (if not already installed)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Whisper uses the GPU automatically when PyTorch can see one; --device cuda makes it explicit
whisper audio.mp3 --model large-v3 --device cuda

Verify GPU usage:

import torch
print(torch.cuda.is_available())  # Should print True

On an RTX 4080 (16GB VRAM), large-v3 processes audio ~8-10x faster than realtime.
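
A figure like "8-10x faster than realtime" is simply audio duration divided by processing time, and you can measure it on your own hardware. The helper names below are hypothetical and use only the standard library; the audio duration has to come from elsewhere (for example ffprobe or your editor).

```python
import time


def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    started = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - started


# Usage with a loaded Whisper model and a 10-minute (600 s) file:
# result, elapsed = timed(model.transcribe, "audio.mp3")
# print(f"{realtime_factor(600, elapsed):.1f}x realtime")
```

A 600-second file that takes 60 seconds to transcribe gives a factor of 10. Time a second run if you want to exclude one-off model loading and download costs.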

Batch Processing Script

Transcribe all audio/video files in a directory:

import whisper
from pathlib import Path

model = whisper.load_model("medium")
input_dir = Path("./audio_files")
output_dir = Path("./transcriptions")
output_dir.mkdir(exist_ok=True)

extensions = {".mp3", ".mp4", ".wav", ".m4a", ".flac", ".ogg", ".mkv"}

for audio_file in input_dir.iterdir():
    if audio_file.suffix.lower() in extensions:
        print(f"Transcribing: {audio_file.name}")
        result = model.transcribe(str(audio_file), language="en")

        output_file = output_dir / (audio_file.stem + ".txt")
        # Strip the leading space Whisper emits and write UTF-8 explicitly
        output_file.write_text(result["text"].strip(), encoding="utf-8")
        print(f"Saved: {output_file.name}")

print("Batch complete.")
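
For overnight runs it helps if the script can be restarted without redoing finished files. One way to do that, sketched here under the same directory layout as the script above, is to list only the inputs that have no matching .txt yet; the function name pending_files is an assumption.

```python
from pathlib import Path

AUDIO_EXTENSIONS = {".mp3", ".mp4", ".wav", ".m4a", ".flac", ".ogg", ".mkv"}


def pending_files(input_dir: Path, output_dir: Path) -> list:
    """List media files in input_dir that have no matching .txt in output_dir."""
    pending = []
    for path in sorted(input_dir.iterdir()):
        if path.suffix.lower() not in AUDIO_EXTENSIONS:
            continue  # Not a supported audio/video file.
        if (output_dir / (path.stem + ".txt")).exists():
            continue  # Already transcribed on an earlier run.
        pending.append(path)
    return pending


# Usage: replace the directory loop in the batch script with
# for audio_file in pending_files(input_dir, output_dir): ...
```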

WhisperX: Speaker Diarization

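Identify who is speaking when in a multi-person recording. Diarization relies on pyannote models hosted on Hugging Face, so you need a Hugging Face access token and must accept the terms of those models on their pages first: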

whisperx meeting.mp4 --model medium --diarize --hf_token YOUR_HUGGINGFACE_TOKEN

Output identifies speakers:

SPEAKER_00: Hello, welcome to the meeting.
SPEAKER_01: Thanks for having me.
SPEAKER_00: Let's start with the agenda.
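
If you work with diarized segments in Python instead of the CLI output, consecutive segments from the same speaker can be merged into turns like the ones above. The sketch assumes each segment is a dict with "speaker" and "text" keys, modeled on WhisperX's diarized output; the function name and the "UNKNOWN" fallback label are placeholders of my own.

```python
def format_speaker_turns(segments) -> list:
    """Merge consecutive segments by the same speaker into "SPEAKER: text" lines."""
    turns = []  # Each entry is a [speaker, text] pair.
    for seg in segments:
        # Segments without an assigned speaker get a placeholder label.
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if turns and turns[-1][0] == speaker:
            turns[-1][1] += " " + text
        else:
            turns.append([speaker, text])
    return [f"{speaker}: {text}" for speaker, text in turns]


if __name__ == "__main__":
    demo = [
        {"speaker": "SPEAKER_00", "text": " Hello,"},
        {"speaker": "SPEAKER_00", "text": " welcome to the meeting."},
        {"speaker": "SPEAKER_01", "text": " Thanks for having me."},
    ]
    print("\n".join(format_speaker_turns(demo)))
```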

GUI Frontends

For non-command-line users:

Whisper Desktop (Windows app): GUI wrapper for Whisper
MacWhisper (macOS): Native macOS app using Whisper
Subtitle Edit (Windows): Video subtitle editor with Whisper integration

These wrap the same models in user-friendly interfaces with progress indicators and format export options.

Accuracy Tips

Specify language: Adding --language en (or another language code) skips language detection, which saves time and avoids misdetection on short or noisy clips
Preprocess audio: Remove background noise with Audacity or ffmpeg; this often improves accuracy noticeably on noisy recordings
Larger models: large-v3 is meaningfully more accurate than medium for non-native accents, technical vocabulary, and lower-quality audio
Custom vocabulary: Listing domain-specific terms in the initial_prompt parameter nudges Whisper toward the right spellings; it is a hint rather than a guaranteed vocabulary
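
The initial_prompt tip can be sketched as a small helper that turns a term list into a prompt string. The "Glossary:" wording, the function name, and the 600-character cap are all assumptions for illustration; Whisper only conditions on a limited amount of prompt text, so shorter lists are safer.

```python
def build_initial_prompt(terms, max_chars: int = 600) -> str:
    """Join domain terms into a short glossary-style prompt."""
    prefix = "Glossary: "
    kept = []
    for term in terms:
        candidate = prefix + ", ".join(kept + [term]) + "."
        if len(candidate) > max_chars:
            break  # Stop before the prompt grows past the chosen cap.
        kept.append(term)
    if not kept:
        return ""
    return prefix + ", ".join(kept) + "."


# Usage with the Python API shown earlier:
# prompt = build_initial_prompt(["CTranslate2", "diarization", "WhisperX"])
# result = model.transcribe("audio.mp3", language="en", initial_prompt=prompt)
```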

Whisper is one of the most useful local AI models available. Run it once on a large batch of recordings and the time savings over manual transcription or paid services are substantial.