
OpenAI Whisper: Local AI Transcription Guide for Privacy and Accuracy

Run OpenAI Whisper locally for private, accurate speech-to-text transcription of audio and video files — no cloud required.


OpenAI Whisper is an open-source speech recognition model that runs locally on your own hardware. It transcribes audio and video in 99 languages with no cloud subscription, no audio leaving your machine, and no per-minute fees. OpenAI released the code and model weights openly, so although Whisper underpins many commercial transcription services, you can run it yourself for free.

Why Run Whisper Locally?

  • Privacy: Audio stays on your machine, which matters for sensitive meetings, medical notes, or confidential content
  • No cost per minute: Cloud transcription services charge per minute of audio; local Whisper costs nothing beyond the hardware and electricity
  • Accuracy: Whisper large-v3 is competitive with commercial services for English and performs well in many other languages
  • Offline: Works without an internet connection once the model has been downloaded
  • Batch processing: Transcribe hundreds of files overnight without API rate limits

Model Sizes and Hardware Requirements

Whisper comes in several sizes with different speed/accuracy tradeoffs:

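| Model    | Parameters | VRAM (GPU) | RAM (CPU) | Relative speed (vs. large) | Accuracy  |
|----------|------------|------------|-----------|----------------------------|-----------|
| tiny     | 39M        | ~1 GB      | ~1 GB     | ~32x                       | Good      |
| base     | 74M        | ~1 GB      | ~1 GB     | ~16x                       | Better    |
| small    | 244M       | ~2 GB      | ~2 GB     | ~6x                        | Very Good |
| medium   | 769M       | ~5 GB      | ~5 GB     | ~2x                        | Excellent |
| large-v3 | 1550M      | ~10 GB     | ~10 GB    | 1x                         | Best      |

Relative speed compares each model with the large model on the same hardware; it is not a multiple of realtime. Actual throughput depends on your CPU or GPU.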

For most users, medium offers excellent accuracy with manageable resource use. Choose large-v3 for maximum accuracy if your GPU has 10 GB or more of VRAM.
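The sizing guidance above can be expressed as a small lookup. This is a minimal sketch, assuming the approximate VRAM figures from the table; `pick_model` is a hypothetical helper for illustration and is not part of Whisper.

```python
# Approximate VRAM needs (GB) from the table above, largest model first.
MODEL_VRAM_GB = [
    ("large-v3", 10),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
    ("tiny", 1),
]


def pick_model(available_vram_gb: float) -> str:
    """Return the largest model whose approximate VRAM need fits."""
    for name, needed in MODEL_VRAM_GB:
        if available_vram_gb >= needed:
            return name
    return "tiny"  # fall back to the smallest model


print(pick_model(8))   # medium
print(pick_model(16))  # large-v3
```

The thresholds are rough; leave headroom if other applications share the GPU.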

Installation

Python/pip Method

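# Install Whisper. --break-system-packages is only needed when installing into a
# distro-managed system Python; inside a virtual environment, drop the flag.
pip install openai-whisper --break-system-packages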

# Install ffmpeg (required for audio processing)
# Ubuntu/Debian:
sudo apt install ffmpeg
# macOS:
brew install ffmpeg
# Windows: download from ffmpeg.org and add to PATH

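faster-whisper (Faster Inference)

faster-whisper reimplements Whisper on the CTranslate2 inference engine, typically giving 2-4x faster inference than the original implementation: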

pip install faster-whisper --break-system-packages

WhisperX (Speaker Diarization)

For transcription with multiple speakers identified:

pip install whisperx --break-system-packages

Basic Transcription

# Transcribe an audio file (auto-detects language)
whisper audio.mp3 --model medium

# Specify language explicitly (faster)
whisper meeting.mp4 --model medium --language en

# Output formats
whisper audio.mp3 --model medium --output_format txt
whisper audio.mp3 --model medium --output_format srt  # Subtitles
whisper audio.mp3 --model medium --output_format vtt  # Web subtitles
whisper audio.mp3 --model medium --output_format json  # Full JSON with timestamps

# Translate to English from another language
whisper spanish_audio.mp3 --model medium --task translate
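The JSON output is the easiest format to post-process. The sketch below reads such a file and returns its timed segments. It assumes the file contains a top-level "segments" list whose entries have "start", "end", and "text" keys, as in the result dictionary shown in the Python API section; `load_cli_transcript` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_cli_transcript(json_path):
    """Read a JSON file written by `whisper ... --output_format json`."""
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    # Each segment carries start/end times in seconds plus the recognized text.
    return [(seg["start"], seg["end"], seg["text"].strip()) for seg in data["segments"]]


if __name__ == "__main__":
    # Assumes an earlier CLI run already produced audio.json in this directory.
    path = Path("audio.json")
    if path.exists():
        for start, end, text in load_cli_transcript(path):
            print(f"{start:7.1f}s  {text}")
```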

Python API Usage

import whisper

# Load model (downloads on first run, cached afterward)
model = whisper.load_model("medium")

# Transcribe
result = model.transcribe("audio.mp3", language="en")

print(result["text"])

# Access segments with timestamps
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"]
    print(f"[{start:.1f}s - {end:.1f}s] {text}")
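Because each segment carries start and end times, subtitle files can also be built from the Python API. The following is a rough sketch of an SRT writer. The helper names `srt_timestamp` and `segments_to_srt` are made up for this example, and the demo list stands in for result["segments"] so the code runs without a model.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from dicts with 'start', 'end', and 'text' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        times = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{times}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


# Stand-in data shaped like result["segments"].
demo = [
    {"start": 0.0, "end": 2.5, "text": " Hello, welcome to the meeting."},
    {"start": 2.5, "end": 4.0, "text": " Thanks for having me."},
]
print(segments_to_srt(demo))
```

With a real transcription, pass result["segments"] instead of demo and write the returned string to a .srt file.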

Faster-Whisper Python API

from faster_whisper import WhisperModel

# Initialize model (cpu or cuda)
model = WhisperModel("medium", device="cuda", compute_type="float16")

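# Transcribe; segments is a lazy generator, so decoding runs as you iterate over it
segments, info = model.transcribe("audio.mp3", beam_size=5, language="en")

# info.language echoes the language passed above; omit language= to auto-detect
print(f"Language: {info.language}")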
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

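compute_type="float16" runs the model in half precision, which NVIDIA GPUs with tensor cores (including RTX cards) accelerate. On a CPU-only machine, use device="cpu" with compute_type="int8" instead.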
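The device and compute type can be chosen at runtime rather than hard-coded. This is a small sketch under stated assumptions: `choose_device_and_compute_type` and `count_cuda_devices` are hypothetical helpers, and the int8 fallback for CPU is a common choice rather than the only valid one. It relies on `ctranslate2.get_cuda_device_count()`, which is installed alongside faster-whisper.

```python
def choose_device_and_compute_type(cuda_devices: int):
    """Map the number of visible CUDA devices to faster-whisper settings."""
    if cuda_devices > 0:
        return "cuda", "float16"  # half precision on the GPU
    return "cpu", "int8"          # quantized weights keep CPU inference practical


def count_cuda_devices() -> int:
    """Ask CTranslate2 how many CUDA devices it sees; 0 if it is not installed."""
    try:
        import ctranslate2
    except ImportError:
        return 0
    return ctranslate2.get_cuda_device_count()


device, compute_type = choose_device_and_compute_type(count_cuda_devices())
print(device, compute_type)
# Then: WhisperModel("medium", device=device, compute_type=compute_type)
```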

GPU Acceleration

With an NVIDIA GPU (CUDA):

# Install a CUDA-enabled PyTorch build if you do not have one
# (cu121 targets CUDA 12.1; pick the index URL that matches your setup)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Whisper uses the GPU automatically when PyTorch can see one; --device cuda makes it explicit
whisper audio.mp3 --model large-v3 --device cuda

Verify GPU usage:

import torch
print(torch.cuda.is_available())  # Should return True

On an RTX 4080 (16GB VRAM), large-v3 processes audio ~8-10x faster than realtime.
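Throughput figures like this vary with hardware, model, and audio, so it is worth measuring your own. The sketch below computes a realtime factor as audio duration divided by wall-clock processing time. `realtime_factor` is a hypothetical helper, and the sleep call is a stand-in workload so the example runs without a model.

```python
import time


def realtime_factor(audio_seconds: float, transcribe) -> float:
    """Return audio duration divided by wall-clock time spent in `transcribe`."""
    started = time.perf_counter()
    transcribe()
    elapsed = time.perf_counter() - started
    return audio_seconds / elapsed


# Stand-in workload; real use would pass something like:
#   lambda: model.transcribe("audio.mp3")
factor = realtime_factor(60.0, lambda: time.sleep(0.05))
print(f"{factor:.0f}x realtime")
```

A factor above 1 means the model transcribes faster than the audio plays. Exclude model loading time from the measurement, since it is paid only once per run.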

Batch Processing Script

Transcribe all audio/video files in a directory:

import whisper
from pathlib import Path

# Load the model once and reuse it for every file
model = whisper.load_model("medium")
input_dir = Path("./audio_files")
output_dir = Path("./transcriptions")
output_dir.mkdir(exist_ok=True)

extensions = {".mp3", ".mp4", ".wav", ".m4a", ".flac", ".ogg", ".mkv"}

# Sort so files are processed in a predictable order
for audio_file in sorted(input_dir.iterdir()):
    if audio_file.suffix.lower() in extensions:
        print(f"Transcribing: {audio_file.name}")
        result = model.transcribe(str(audio_file), language="en")

        output_file = output_dir / (audio_file.stem + ".txt")
        # Write UTF-8 explicitly so non-ASCII text survives on every platform
        output_file.write_text(result["text"].strip(), encoding="utf-8")
        print(f"Saved: {output_file.name}")

print("Batch complete.")
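For long overnight runs, it helps to make the batch resumable so an interrupted job does not start over. One possible approach, sketched below, is to skip any file that already has a transcript. `pending_files` is a hypothetical helper, and the sketch assumes the same directory layout and .txt naming as the script above.

```python
from pathlib import Path

EXTENSIONS = {".mp3", ".mp4", ".wav", ".m4a", ".flac", ".ogg", ".mkv"}


def pending_files(input_dir: Path, output_dir: Path) -> list:
    """List media files that do not yet have a matching .txt transcript."""
    todo = []
    for media in sorted(input_dir.iterdir()):
        if media.suffix.lower() not in EXTENSIONS:
            continue
        if (output_dir / (media.stem + ".txt")).exists():
            continue  # already transcribed in an earlier run
        todo.append(media)
    return todo


if __name__ == "__main__":
    inp = Path("./audio_files")
    if inp.is_dir():
        for media in pending_files(inp, Path("./transcriptions")):
            print(f"Still to do: {media.name}")
```

In the batch script, looping over pending_files(input_dir, output_dir) instead of input_dir.iterdir() would give the same results while skipping finished work.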

WhisperX: Speaker Diarization

Identify who is speaking when in a multi-person recording:

whisperx meeting.mp4 --model medium --diarize --hf_token YOUR_HUGGINGFACE_TOKEN

Output identifies speakers:

SPEAKER_00: Hello, welcome to the meeting.
SPEAKER_01: Thanks for having me.
SPEAKER_00: Let's start with the agenda.

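Diarization relies on pyannote models hosted on Hugging Face. Create a free access token at huggingface.co and accept the user conditions of the pyannote models that WhisperX downloads.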
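Diarized output often splits one speaker's turn across several short segments. The sketch below merges consecutive segments from the same speaker into a single line. It assumes each segment is a dict with "speaker" and "text" keys, which is how diarized segments are labeled in the example output above; `merge_speaker_turns` is a hypothetical helper, and the demo data stands in for real WhisperX output.

```python
def merge_speaker_turns(segments):
    """Join consecutive segments from the same speaker into one line each."""
    turns = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")  # some segments may lack a label
        text = seg["text"].strip()
        if turns and turns[-1][0] == speaker:
            turns[-1][1].append(text)
        else:
            turns.append((speaker, [text]))
    return [f"{speaker}: {' '.join(parts)}" for speaker, parts in turns]


# Stand-in data shaped like diarized segments.
demo = [
    {"speaker": "SPEAKER_00", "text": " Hello,"},
    {"speaker": "SPEAKER_00", "text": " welcome to the meeting."},
    {"speaker": "SPEAKER_01", "text": " Thanks for having me."},
]
for line in merge_speaker_turns(demo):
    print(line)
```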

GUI Frontends

For non-command-line users:

  • Whisper Desktop (Windows app): GUI wrapper for Whisper
  • MacWhisper (macOS): Native macOS app using Whisper
  • Subtitle Edit (Windows): Video subtitle editor with Whisper integration

These wrap the same models in user-friendly interfaces with progress indicators and format export options.

Accuracy Tips

  • Specify language: Adding --language en (or another language code) skips language detection and avoids misdetection on short or noisy clips
  • Preprocess audio: Reducing background noise with Audacity or ffmpeg can noticeably improve accuracy on noisy recordings
  • Larger models: large-v3 is meaningfully more accurate than medium for non-native accents, technical vocabulary, and lower-quality audio
  • Vocabulary hints: Whisper has no dedicated word-boost feature, but listing domain-specific terms in the initial_prompt parameter biases the decoder toward those spellings
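The vocabulary tip can be put into practice by assembling a short glossary prompt. This is an illustrative sketch: `build_initial_prompt` and its `max_chars` cap are assumptions, not part of Whisper. The cap is a rough character-based stand-in, since Whisper only keeps a limited number of prompt tokens.

```python
def build_initial_prompt(terms, max_chars: int = 600) -> str:
    """Join unique glossary terms into one prompt string, capped at max_chars."""
    prompt = "Glossary:"
    seen = set()
    for term in terms:
        term = term.strip()
        if not term or term.lower() in seen:
            continue  # skip blanks and case-insensitive duplicates
        candidate = f"{prompt} {term},"
        if len(candidate) > max_chars:
            break  # stop before the prompt grows past the cap
        prompt = candidate
        seen.add(term.lower())
    if not seen:
        return ""
    return prompt.rstrip(",") + "."


prompt = build_initial_prompt(["CTranslate2", "pyannote", "ctranslate2", "diarization"])
print(prompt)  # Glossary: CTranslate2, pyannote, diarization.
# Then: model.transcribe("audio.mp3", initial_prompt=prompt)
```

The prompt is a hint rather than a guarantee, so spot-check the transcript for the terms that matter most.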

Whisper is one of the most useful local AI models available. Run it once on a large batch of recordings, and the time savings over manual transcription or paid services are substantial.
