OpenAI Whisper is a powerful open-source speech recognition model that runs locally on your hardware — providing accurate transcription of audio and video in 99 languages with no cloud subscription, no privacy concerns, and no ongoing costs. Released openly by OpenAI, Whisper powers many commercial transcription services, but you can run it yourself for free.
Why Run Whisper Locally?
- Privacy: Audio stays on your machine — critical for sensitive meetings, medical notes, or confidential content
- No cost per minute: Cloud transcription services charge per minute of audio; local Whisper is free after hardware
- Accuracy: Whisper large-v3 matches or beats commercial services for English, and excels in many other languages
- Offline: Works without internet connection
- Batch processing: Transcribe hundreds of files overnight without API rate limits
Model Sizes and Hardware Requirements
Whisper comes in several sizes with different speed/accuracy tradeoffs:
| Model | Parameters | VRAM (GPU) | RAM (CPU) | Relative Speed (vs. large) | Accuracy |
|---|---|---|---|---|---|
| tiny | 39M | ~1 GB | ~1 GB | ~32x | Good |
| base | 74M | ~1 GB | ~1 GB | ~16x | Better |
| small | 244M | ~2 GB | ~2 GB | ~6x | Very Good |
| medium | 769M | ~5 GB | ~5 GB | ~2x | Excellent |
| large-v3 | 1550M | ~10 GB | ~10 GB | 1x | Best |
For most users, medium provides excellent accuracy with manageable resource use. Choose large-v3 for maximum accuracy if you have a GPU with 10GB+ VRAM.
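The sizing guidance above can be expressed as a small lookup. This is a minimal sketch that assumes the approximate VRAM figures from the table; `pick_model` and its `headroom_gb` parameter are illustrative names, not part of Whisper.

```python
# Approximate VRAM needs in GB, copied from the table above, smallest first.
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large-v3", 10),
]


def pick_model(available_vram_gb: float, headroom_gb: float = 0.0) -> str:
    """Return the largest model whose approximate VRAM need fits the budget.

    Hypothetical helper for illustration. Falls back to "tiny" if nothing fits.
    """
    budget = available_vram_gb - headroom_gb
    choice = "tiny"
    for name, needed_gb in MODEL_VRAM_GB:
        if needed_gb <= budget:
            choice = name
    return choice


print(pick_model(8))    # medium
print(pick_model(16))   # large-v3
```

Reserving some headroom (for example `pick_model(10, headroom_gb=1)`) is a cautious choice when other applications share the GPU.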
Installation
Python/pip Method
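# Install Whisper
# (--break-system-packages overrides pip's protection of a system-managed Python;
# inside a virtual environment you can omit the flag)
pip install openai-whisper --break-system-packages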
# Install ffmpeg (required for audio processing)
# Ubuntu/Debian:
sudo apt install ffmpeg
# macOS:
brew install ffmpeg
# Windows: download from ffmpeg.org and add to PATH
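Because Whisper relies on ffmpeg to decode audio, it can help to confirm that ffmpeg is on your PATH before starting a long job. A minimal sketch using only the Python standard library; `missing_tools` is a hypothetical helper name.

```python
import shutil


def missing_tools(tools=("ffmpeg",)) -> list[str]:
    """Return the names of required command-line tools not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("ffmpeg is available.")
```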
Faster-Whisper (Recommended for Speed)
faster-whisper uses CTranslate2 for 2-4x faster inference than the original:
pip install faster-whisper --break-system-packages
WhisperX (Speaker Diarization)
For transcription with multiple speakers identified:
pip install whisperx --break-system-packages
Basic Transcription
# Transcribe an audio file (auto-detects language)
whisper audio.mp3 --model medium
# Specify language explicitly (faster)
whisper meeting.mp4 --model medium --language en
# Output formats
whisper audio.mp3 --model medium --output_format txt
whisper audio.mp3 --model medium --output_format srt # Subtitles
whisper audio.mp3 --model medium --output_format vtt # Web subtitles
whisper audio.mp3 --model medium --output_format json # Full JSON with timestamps
# Translate to English from another language
whisper spanish_audio.mp3 --model medium --task translate
Python API Usage
import whisper
# Load model (downloads on first run, cached afterward)
model = whisper.load_model("medium")
# Transcribe
result = model.transcribe("audio.mp3", language="en")
print(result["text"])
# Access segments with timestamps
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"]
    print(f"[{start:.1f}s - {end:.1f}s] {text}")
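The command-line tool can write subtitles directly with --output_format srt. If you are working from the Python API instead, the segment list can be turned into SRT text by hand. The sketch below assumes each segment is a dict with "start", "end" and "text" keys, as in the loop above; the helper names are illustrative and the sample data is hand-written.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from dicts with "start", "end" and "text" keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


# Hand-written sample in the same shape as result["segments"].
sample = [
    {"start": 0.0, "end": 2.5, "text": " Hello, welcome to the meeting."},
    {"start": 2.5, "end": 4.0, "text": " Thanks for having me."},
]
print(segments_to_srt(sample))
```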
Faster-Whisper Python API
from faster_whisper import WhisperModel
# Initialize model (cpu or cuda)
model = WhisperModel("medium", device="cuda", compute_type="float16")
# Transcribe (segments is a generator, so decoding runs as you iterate over it)
segments, info = model.transcribe("audio.mp3", beam_size=5, language="en")
print(f"Language: {info.language}")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
compute_type="float16" uses GPU tensor cores for faster processing on NVIDIA RTX cards.
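On a machine without an NVIDIA GPU, float16 is not the right setting. A common pairing, and the one shown in the faster-whisper README, is int8 on CPU. The sketch below wraps that choice in a hypothetical helper, `faster_whisper_settings`, so the same script can run on either kind of machine.

```python
def faster_whisper_settings(has_cuda: bool) -> dict:
    """Return keyword arguments for WhisperModel based on available hardware.

    Hypothetical helper: float16 on an NVIDIA GPU, int8 on CPU.
    """
    if has_cuda:
        return {"device": "cuda", "compute_type": "float16"}
    return {"device": "cpu", "compute_type": "int8"}


print(faster_whisper_settings(has_cuda=False))

# Usage, assuming faster-whisper is installed:
#   from faster_whisper import WhisperModel
#   model = WhisperModel("medium", **faster_whisper_settings(has_cuda=True))
```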
GPU Acceleration
With an NVIDIA GPU (CUDA):
# Install CUDA-enabled torch (if not already installed)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Whisper automatically uses GPU if available
whisper audio.mp3 --model large-v3 --device cuda
Verify GPU usage:
import torch
print(torch.cuda.is_available())  # Prints True when PyTorch can use a CUDA GPU
On an RTX 4080 (16GB VRAM), large-v3 processes audio ~8-10x faster than realtime.
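A speed figure like this translates directly into a time estimate for a backlog of recordings: divide the audio duration by the speed multiple. A minimal sketch of that arithmetic, using the 8-10x range quoted above and a made-up 40-hour backlog; `estimated_hours` is an illustrative name.

```python
def estimated_hours(audio_hours: float, speed_vs_realtime: float) -> float:
    """Estimate processing time for a given amount of audio.

    speed_vs_realtime is how many hours of audio are processed per hour,
    for example 8 for "8x faster than realtime".
    """
    if speed_vs_realtime <= 0:
        raise ValueError("speed_vs_realtime must be positive")
    return audio_hours / speed_vs_realtime


# 40 hours of recordings (hypothetical backlog) at 8x and 10x realtime.
print(estimated_hours(40, 8))   # 5.0
print(estimated_hours(40, 10))  # 4.0
```

Actual throughput depends on the audio, the model and the decoding settings, so treat the result as a rough planning number.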
Batch Processing Script
Transcribe all audio/video files in a directory:
import whisper
from pathlib import Path

# Load the model once and reuse it for every file
model = whisper.load_model("medium")
input_dir = Path("./audio_files")
output_dir = Path("./transcriptions")
output_dir.mkdir(exist_ok=True)
# Audio and video containers to pick up from the input directory
extensions = {".mp3", ".mp4", ".wav", ".m4a", ".flac", ".ogg", ".mkv"}
for audio_file in sorted(input_dir.iterdir()):
    if audio_file.suffix.lower() in extensions:
        print(f"Transcribing: {audio_file.name}")
        result = model.transcribe(str(audio_file), language="en")
        output_file = output_dir / (audio_file.stem + ".txt")
        output_file.write_text(result["text"], encoding="utf-8")
        print(f"Saved: {output_file.name}")
print("Batch complete.")
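For overnight runs it is useful if an interrupted batch can resume without redoing finished files. One possible approach, sketched here under the same directory layout as the script above, is to skip any input that already has a matching .txt file in the output directory. `pending_files` is a hypothetical helper, not part of Whisper.

```python
from pathlib import Path

EXTENSIONS = {".mp3", ".mp4", ".wav", ".m4a", ".flac", ".ogg", ".mkv"}


def pending_files(input_dir: Path, output_dir: Path) -> list[Path]:
    """List media files in input_dir with no transcript in output_dir yet."""
    pending = []
    for path in sorted(input_dir.iterdir()):
        if path.suffix.lower() not in EXTENSIONS:
            continue  # not an audio/video file we handle
        if (output_dir / (path.stem + ".txt")).exists():
            continue  # already transcribed in an earlier run
        pending.append(path)
    return pending
```

In the batch script, the loop would then iterate over `pending_files(input_dir, output_dir)` instead of `input_dir.iterdir()`. Note that two inputs sharing a stem (for example talk.mp3 and talk.mp4) map to the same transcript name, so only the first would be processed.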
WhisperX: Speaker Diarization
Identify who is speaking when in a multi-person recording:
whisperx meeting.mp4 --model medium --diarize --hf_token YOUR_HUGGINGFACE_TOKEN
Output identifies speakers:
SPEAKER_00: Hello, welcome to the meeting.
SPEAKER_01: Thanks for having me.
SPEAKER_00: Let's start with the agenda.
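Diarization needs a free Hugging Face access token, which you can create at huggingface.co. You also need to accept the user conditions of the pyannote diarization models on Hugging Face before they can be downloaded.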
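Once you have speaker-labelled text, a little post-processing makes it easier to review. The sketch below groups lines by speaker, assuming the plain "SPEAKER_XX: text" layout shown in the sample output above; WhisperX's actual output files may be formatted differently depending on the output format, so adjust the parsing to match. `group_by_speaker` is an illustrative name.

```python
from collections import defaultdict


def group_by_speaker(transcript: str) -> dict[str, list[str]]:
    """Collect each speaker's lines from "SPEAKER_XX: text" formatted text."""
    turns = defaultdict(list)
    for line in transcript.splitlines():
        speaker, sep, text = line.partition(":")
        if not sep:
            continue  # skip lines that are not in "SPEAKER: text" form
        turns[speaker.strip()].append(text.strip())
    return dict(turns)


sample = """SPEAKER_00: Hello, welcome to the meeting.
SPEAKER_01: Thanks for having me.
SPEAKER_00: Let's start with the agenda."""

for speaker, lines in group_by_speaker(sample).items():
    print(speaker, len(lines))
```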
GUI Frontends
For non-command-line users:
- Whisper Desktop (Windows app): GUI wrapper for Whisper
- MacWhisper (macOS): Native macOS app using Whisper
- Subtitle Edit (Windows): Video subtitle editor with Whisper integration
These wrap the same models in user-friendly interfaces with progress indicators and format export options.
Accuracy Tips
- Specify language: Adding --language en (or another language code) skips language detection and avoids misdetection on short or noisy clips
- Preprocess audio: Remove background noise with Audacity or ffmpeg, which can dramatically improve accuracy on noisy recordings
- Larger models: large-v3 is meaningfully more accurate than medium for non-native accents, technical vocabulary, and lower-quality audio
- Vocabulary hints: Whisper has no dedicated word-boost feature, but listing domain-specific terms in the initial_prompt parameter nudges the model toward their correct spelling
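As an illustration of the initial_prompt tip, the sketch below joins a list of domain terms into a short prompt and passes it to model.transcribe. The helper names and the "Glossary:" wording are assumptions made for this example; the prompt is free text, and only a limited number of prompt tokens are used, so keep it short.

```python
def build_initial_prompt(terms: list[str]) -> str:
    """Join domain terms into a short prompt string (hypothetical helper)."""
    cleaned = [term.strip() for term in terms if term.strip()]
    if not cleaned:
        return ""
    return "Glossary: " + ", ".join(cleaned) + "."


def transcribe_with_vocabulary(audio_path: str, terms: list[str],
                               model_name: str = "medium") -> str:
    """Transcribe a file, biasing the model toward the given terms."""
    import whisper  # imported here so build_initial_prompt works without it

    model = whisper.load_model(model_name)
    prompt = build_initial_prompt(terms)
    # Pass None rather than an empty string when there are no terms.
    result = model.transcribe(audio_path, language="en",
                              initial_prompt=prompt or None)
    return result["text"]


print(build_initial_prompt(["Kubernetes", "CTranslate2", "pyannote"]))
# Glossary: Kubernetes, CTranslate2, pyannote.
```

The prompt biases the decoder but does not guarantee any spelling, so spot-check the output for the terms you care about.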
Whisper is one of the most useful local AI models available — run it once on a large batch of recordings and the time savings over manual transcription or paid services are substantial.