AI text-to-speech has crossed a quality threshold in 2026 where synthesized voices are difficult to distinguish from real recordings for casual listening. The landscape now splits cleanly into two camps: local models that run on your hardware with full privacy, and cloud APIs that deliver slightly higher quality with per-character pricing and zero hardware requirements.
Local TTS Options
Kokoro TTS
Kokoro is arguably the best local TTS model for quality-per-parameter in 2026. At only 82 million parameters, it produces studio-quality audio that rivals much larger models.
Key specs:
- 82M parameters, ~326MB model size
- 54 built-in voices across 8 languages in the v1.0 release, including American English, British English, Spanish, and French
- 50-200x real-time speed on modern hardware (CPU is fast enough)
- Apache 2.0 licensed, so free for commercial use
Installation:
pip install "kokoro>=0.9.4" soundfile
For phoneme support (recommended for better accuracy):
# On Debian/Ubuntu
apt-get install espeak-ng
# On macOS
brew install espeak-ng
Basic usage:
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' = American English

# Generate audio
generator = pipeline(
    "The future of AI text-to-speech is here, and it runs on your laptop.",
    voice='af_heart',  # American Female
    speed=1.0
)

# Save to file: each item is (graphemes, phonemes, audio), one per text segment
for i, (gs, ps, audio) in enumerate(generator):
    sf.write(f'output_{i}.wav', audio, 24000)  # Kokoro outputs 24 kHz audio
    print(f"Segment {i} saved")
Available English voices include af_heart, af_bella, af_sarah, af_nicole, am_adam, am_michael, bf_emma, bf_isabella, bm_george, and bm_lewis.
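The voice IDs above appear to follow a two-letter prefix convention: the first letter matches the `lang_code` passed to `KPipeline` (`a` for American, `b` for British English) and the second marks a female or male voice. The sketch below parses an ID on that assumption. `describe_voice` and `VoiceInfo` are hypothetical helpers for illustration, not part of the kokoro package, and the mapping only covers the English prefixes listed here.

```python
from typing import NamedTuple

# Prefix meanings inferred from the voice names listed above; treat as an assumption.
LANGUAGES = {"a": "American English", "b": "British English"}
GENDERS = {"f": "female", "m": "male"}


class VoiceInfo(NamedTuple):
    lang_code: str  # value to pass as KPipeline(lang_code=...)
    language: str
    gender: str
    name: str


def describe_voice(voice_id: str) -> VoiceInfo:
    """Split an ID such as 'af_heart' into language, gender, and name parts."""
    prefix, sep, name = voice_id.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID format: {voice_id!r}")
    lang_code, gender_code = prefix[0], prefix[1]
    if lang_code not in LANGUAGES or gender_code not in GENDERS:
        raise ValueError(f"unknown prefix in voice ID: {voice_id!r}")
    return VoiceInfo(lang_code, LANGUAGES[lang_code], GENDERS[gender_code], name)


if __name__ == "__main__":
    for voice_id in ("af_heart", "bm_george"):
        print(describe_voice(voice_id))
```

Deriving `lang_code` from the voice ID this way keeps the pipeline language and the voice from drifting apart when you switch voices.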
Coqui TTS (XTTS v2) — Voice Cloning
Coqui TTS with the XTTS v2 model supports zero-shot voice cloning — provide a 3-10 second audio reference clip, and it synthesizes new speech in that voice.
pip install TTS
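Synthesize with a built-in voice (XTTS v2 requires a speaker; "Ana Florence" is one of its bundled speaker names):
tts --text "Hello, this is a test of XTTS voice synthesis." \
    --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
    --speaker_idx "Ana Florence" \
    --language_idx en \
    --out_path output.wav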
Voice cloning from a reference audio:
from TTS.api import TTS

# Initialize XTTS v2
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone voice from reference clip (3-10 seconds recommended)
tts.tts_to_file(
    text="This voice was cloned from a short audio sample using XTTS v2.",
    speaker_wav="reference_voice.wav",  # Your reference audio file
    language="en",
    file_path="cloned_output.wav"
)
Voice cloning requirements:
- Reference audio: 3-30 seconds of clean speech (3-10 seconds is usually enough)
- Noise-free recording strongly preferred
- WAV format, 22050Hz or 44100Hz sample rate
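The length and sample-rate requirements above can be checked before a clip is handed to the model. This is a minimal sketch using only the standard library `wave` module, so it handles uncompressed PCM WAV files only. `check_reference_clip` is a hypothetical helper, and the thresholds are copied from the list above rather than taken from the XTTS code. Background noise is not checked.

```python
import wave

MIN_SECONDS = 3.0   # lower bound from the requirements above
MAX_SECONDS = 30.0  # upper bound from the requirements above
ACCEPTED_RATES = (22050, 44100)


def check_reference_clip(path: str) -> list[str]:
    """Return a list of problems found; an empty list means the clip looks usable."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(
            f"duration {duration:.1f}s is outside {MIN_SECONDS:.0f}-{MAX_SECONDS:.0f}s"
        )
    if rate not in ACCEPTED_RATES:
        problems.append(f"sample rate {rate} Hz is not one of {ACCEPTED_RATES}")
    return problems


if __name__ == "__main__":
    import sys

    for issue in check_reference_clip(sys.argv[1]):
        print("warning:", issue)
```

Returning a list rather than raising lets a caller decide whether a mismatch is fatal or just worth a warning.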
XTTS v2 is multilingual and supports 17 languages: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, and Hindi.
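Hardware: XTTS v2 runs on CPU or GPU, with the GPU substantially faster. The model is autoregressive and much heavier than Kokoro or Piper, so throughput varies widely with hardware and clip length; measure it on your own machine rather than assuming a fixed multiple of real-time.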
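Speed figures quoted as "Nx real-time" in this guide are the ratio of audio duration produced to wall-clock time spent generating it. The sketch below shows one way to measure that ratio for any engine. Both function names are hypothetical, and `synthesize` stands in for whatever call your engine exposes, wrapped so that it returns the number of samples it generated.

```python
import time
from typing import Callable


def speedup_over_real_time(audio_seconds: float, elapsed_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock time (the 'Nx' figure)."""
    if audio_seconds < 0 or elapsed_seconds <= 0:
        raise ValueError("audio_seconds must be >= 0 and elapsed_seconds must be > 0")
    return audio_seconds / elapsed_seconds


def benchmark(synthesize: Callable[[str], int], text: str, sample_rate: int) -> float:
    """Time one call to `synthesize`, which must return the sample count it produced."""
    start = time.perf_counter()
    num_samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return speedup_over_real_time(num_samples / sample_rate, elapsed)
```

A value above 1.0 means the engine generates audio faster than it plays back. Run the benchmark a few times and discard the first call, since model loading and warm-up usually dominate it.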
Piper TTS — Fast and Fully Offline
Piper is the fastest local TTS option — optimized for speed and minimal resource consumption. It’s the engine used by Home Assistant and many embedded applications.
pip install piper-tts
Download a voice model:
# Download English (US) Ryan voice
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/ryan/high/en_US-ryan-high.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/ryan/high/en_US-ryan-high.onnx.json
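The two download URLs above share a regular layout: language, locale, voice name, and quality level, followed by a file name built from the same parts. Assuming other voices in the repository follow that layout (an inference from the Ryan URLs, not something verified for every voice), a hypothetical helper can build both URLs:

```python
BASE_URL = "https://huggingface.co/rhasspy/piper-voices/resolve/main"


def piper_voice_urls(locale: str, name: str, quality: str) -> tuple[str, str]:
    """Return (model_url, config_url) for a voice such as ('en_US', 'ryan', 'high')."""
    language = locale.split("_")[0]      # 'en_US' -> 'en'
    stem = f"{locale}-{name}-{quality}"  # 'en_US-ryan-high'
    model_url = f"{BASE_URL}/{language}/{locale}/{name}/{quality}/{stem}.onnx"
    # The JSON config sits next to the model and must be downloaded with it.
    return model_url, model_url + ".json"


if __name__ == "__main__":
    for url in piper_voice_urls("en_US", "ryan", "high"):
        print(url)
```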
Generate audio:
echo "Piper TTS runs entirely offline at lightning speed." | \
piper --model en_US-ryan-high.onnx --output_file output.wav
From Python:
import wave
from piper.voice import PiperVoice

voice = PiperVoice.load("en_US-ryan-high.onnx")
with wave.open("output.wav", "wb") as wav_file:
    # piper-tts 1.2 API; newer releases may name this method synthesize_wav
    voice.synthesize("Fast offline TTS with Piper.", wav_file)
Piper voices are available in multiple quality levels: low, medium, high. The high models are ~65MB and produce good quality. Generation is real-time or faster on CPU — ideal for edge devices, Raspberry Pi, or applications requiring minimal latency.
Chatterbox TTS
Chatterbox from Resemble AI is a newer open-source entry notable for its emotion control. It was released in 2025 under the MIT license:
pip install chatterbox-tts
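import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
wav = model.generate(
    "This is a high-quality voice synthesis with natural prosody.",
    exaggeration=0.5,  # 0.0 = calm, 1.0 = expressive
    cfg_weight=0.5     # Classifier-free guidance weight
)
# generate() returns a tensor; save it at the model's native sample rate
ta.save("chatterbox_output.wav", wav, model.sr)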
Cloud TTS Options
ElevenLabs
ElevenLabs remains the quality leader for cloud TTS in 2026, particularly for voice cloning from short samples.
Pricing tiers (approximate 2026 rates):
- Free: 10,000 characters/month
- Starter: $5/mo, 30,000 characters
- Creator: $22/mo, 100,000 characters
- Pro: $99/mo, 500,000 characters
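Picking a tier comes down to finding the cheapest plan whose monthly quota covers your volume. The sketch below encodes the approximate figures from the list above. `cheapest_tier` is a hypothetical helper, and the numbers should be checked against current ElevenLabs pricing before you rely on them.

```python
from typing import Optional

# (name, monthly price in USD, included characters), copied from the list above.
TIERS = [
    ("Free", 0, 10_000),
    ("Starter", 5, 30_000),
    ("Creator", 22, 100_000),
    ("Pro", 99, 500_000),
]


def cheapest_tier(chars_per_month: int) -> Optional[tuple[str, int]]:
    """Return (name, price) of the cheapest tier covering the volume, or None."""
    for name, price, quota in TIERS:  # TIERS is ordered from cheapest to priciest
        if chars_per_month <= quota:
            return name, price
    return None  # volume exceeds every tier listed here


if __name__ == "__main__":
    print(cheapest_tier(25_000))
```

A result of `None` means the volume exceeds the tiers listed here, which is the point where a local model or a larger plan becomes worth pricing out.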
from elevenlabs import ElevenLabs

client = ElevenLabs(api_key="your-api-key")
audio = client.text_to_speech.convert(
    voice_id="pNInz6obpgDQGcFmaJgB",  # Adam voice
    text="ElevenLabs produces the most natural-sounding AI voices available.",
    model_id="eleven_turbo_v2_5",
    output_format="mp3_44100_128"
)
with open("output.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)
OpenAI TTS
OpenAI’s TTS API offers 6 voices at competitive quality and pricing:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
# Stream the response to disk instead of buffering it in memory
with client.audio.speech.with_streaming_response.create(
    model="tts-1-hd",  # or "tts-1" for faster/cheaper
    voice="alloy",     # alloy, echo, fable, onyx, nova, shimmer
    input="OpenAI's text-to-speech is simple, fast, and sounds great."
) as response:
    response.stream_to_file("output.mp3")
Pricing: tts-1 at $15/million characters, tts-1-hd at $30/million characters.
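Per-character pricing makes cost easy to estimate before sending anything to the API. This sketch uses the two rates quoted above; `estimate_cost` is a hypothetical helper, not part of the OpenAI SDK, and it assumes billing is by character count of the input text.

```python
# USD per million input characters, from the pricing line above.
PRICE_PER_MILLION_CHARS = {"tts-1": 15.00, "tts-1-hd": 30.00}


def estimate_cost(text: str, model: str = "tts-1") -> float:
    """Estimated USD cost of synthesizing `text` once with the given model."""
    return len(text) / 1_000_000 * PRICE_PER_MILLION_CHARS[model]


if __name__ == "__main__":
    script = "word " * 96_000  # stand-in for a 480,000-character manuscript
    for model in PRICE_PER_MILLION_CHARS:
        print(f"{model}: ${estimate_cost(script, model):.2f}")
```

At these rates a 480,000-character manuscript costs about $7.20 with tts-1 and about $14.40 with tts-1-hd.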
Audio Quality and Hardware Comparison
| Tool | Quality | Latency | Voice Cloning | Privacy | License |
|---|---|---|---|---|---|
| Kokoro TTS | Excellent | Very fast | No | Full | Apache 2.0 |
| XTTS v2 | Very Good | Moderate | Yes | Full | CPML* |
| Piper TTS | Good | Real-time | No | Full | MIT |
| Chatterbox | Very Good | Moderate | Yes | Full | MIT |
| ElevenLabs | Best | Cloud latency | Yes | Cloud | Proprietary |
| OpenAI TTS | Very Good | Cloud latency | No | Cloud | Proprietary |
*Coqui Public Model License — free for non-commercial, paid for commercial use
Local vs Cloud: Privacy Considerations
Local TTS means your text never leaves your machine. This is non-negotiable for:
- Healthcare or legal content with PII
- Confidential business communications
- Any jurisdiction with strict data residency requirements
Cloud TTS sends your text to the provider’s servers. Read privacy policies carefully: ElevenLabs and OpenAI both log API requests by default (with opt-out options for higher tiers).
For most developers, Kokoro is the right default: Apache 2.0 licensed, fast on CPU, excellent quality, and no text leaves your machine. Add XTTS v2 when you need voice cloning, keeping its non-commercial license in mind. Graduate to ElevenLabs when you need the absolute highest quality for customer-facing audio production.
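That recommendation reduces to a small decision rule, sketched below. The function name and its three flags are hypothetical; `must_stay_local` comes from the privacy section rather than from the recommendation itself, and licensing is left out and still needs checking per project.

```python
def recommend_engine(
    needs_cloning: bool = False,
    needs_top_quality: bool = False,
    must_stay_local: bool = False,
) -> str:
    """Map a few project requirements onto the engines recommended above."""
    # Cloud quality only applies when the text is allowed to leave the machine.
    if needs_top_quality and not must_stay_local:
        return "ElevenLabs"
    if needs_cloning:
        return "XTTS v2"
    return "Kokoro"  # the default for most projects


if __name__ == "__main__":
    print(recommend_engine(needs_cloning=True, must_stay_local=True))
```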