DeepSeek R1 arrived in early 2025 and made waves by matching or approaching OpenAI’s o1 on several reasoning benchmarks, at a reportedly much lower training cost and with open weights. Running the distilled variants locally is straightforward, and they bring serious reasoning capability to consumer hardware.
The DeepSeek R1 Model Family
DeepSeek released the R1 series in multiple configurations:
| Model | Parameters | VRAM Required (Q4) | Notes |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | ~2GB | Runs on integrated graphics |
| DeepSeek-R1-Distill-Qwen-7B | 7B | ~5GB | Good quality, any modern GPU |
| DeepSeek-R1-Distill-Llama-8B | 8B | ~6GB | Built on Llama 3.1 8B |
| DeepSeek-R1-Distill-Qwen-14B | 14B | ~10GB | Best distill for 12GB VRAM |
| DeepSeek-R1-Distill-Qwen-32B | 32B | ~22GB | High quality, 24GB VRAM |
| DeepSeek-R1-Distill-Llama-70B | 70B | ~45GB | Strongest distill, built on Llama 3.3 70B |
| DeepSeek-R1 (full) | 671B | ~400GB+ | Requires data center hardware |
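The Q4 column can be roughly reproduced from the parameter count. The sketch below assumes about 4.85 bits per weight for a Q4_K_M-style quantization and a flat 1 GB allowance for the KV cache and runtime buffers. Both figures are illustrative assumptions, not numbers published by DeepSeek, and `estimate_vram_gb` is a hypothetical helper.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.85,  # assumed average for Q4_K_M-style quants
                     overhead_gb: float = 1.0) -> float:  # assumed KV cache + buffers
    """Rough VRAM estimate: quantized weights plus a flat overhead allowance."""
    # 1e9 parameters * bits per weight / 8 bits per byte = gigabytes of weights
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


for name, size in [("7B", 7), ("14B", 14), ("32B", 32), ("70B", 70)]:
    print(f"{name}: ~{estimate_vram_gb(size):.1f} GB")
```

The estimates land within a couple of gigabytes of the table. Longer context windows grow the KV cache, so treat the overhead term as a floor rather than a fixed cost.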
Distilled models are smaller Qwen and Llama models fine-tuned on reasoning traces generated by DeepSeek R1. The R1-Distill-Qwen-14B in particular punches far above its weight, delivering reasoning quality that arguably would have required a 70B model just a year earlier.
R1 vs R1-Zero
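- DeepSeek-R1-Zero: trained with large-scale reinforcement learning applied directly to the DeepSeek-V3 base model, with no supervised fine-tuning step first. Reasoning behavior emerged during RL, but the output formatting is quirky (poor readability, language mixing).
- DeepSeek-R1: adds a small supervised "cold-start" stage before reinforcement learning, followed by further fine-tuning and RL rounds. Cleaner outputs, better instruction following.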
For practical use, always prefer R1 over R1-Zero. R1-Zero is primarily interesting for research into emergent reasoning.
Running with Ollama
Ollama is the easiest path to local DeepSeek R1. Install from ollama.com if you haven’t:
# macOS/Linux install
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download installer from ollama.com
Pull and run DeepSeek R1 distilled models:
# 8B model — good for most hardware (6GB+ VRAM)
ollama pull deepseek-r1:8b
ollama run deepseek-r1:8b
# 14B model — excellent quality/size ratio
ollama pull deepseek-r1:14b
ollama run deepseek-r1:14b
# 32B model — for 24GB VRAM GPUs
ollama pull deepseek-r1:32b
ollama run deepseek-r1:32b
# 70B model — requires 48GB+ VRAM or multi-GPU
ollama pull deepseek-r1:70b
Ollama automatically uses GPU acceleration when a supported GPU is available and falls back to CPU inference otherwise.
Querying via Ollama API
curl http://localhost:11434/api/chat -d '{
"model": "deepseek-r1:14b",
"messages": [
{
"role": "user",
"content": "Solve this step by step: A train leaves Chicago at 60mph. Another leaves New York at 80mph. Cities are 800 miles apart. When do they meet?"
}
]
}'
# Same API from Python (requires: pip install ollama)
import ollama
response = ollama.chat(
model='deepseek-r1:14b',
messages=[{
'role': 'user',
'content': 'Write a Python function to find all prime numbers up to n using the Sieve of Eratosthenes'
}]
)
print(response['message']['content'])
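By default the `/api/chat` endpoint used in the curl call streams its reply as newline-delimited JSON, one chunk per line, rather than returning a single object. Below is a minimal sketch of stitching those chunks back together. `join_stream` is a hypothetical helper and the sample lines are made up to mirror the shape of the stream.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[str]) -> str:
    """Concatenate message content from Ollama-style NDJSON chat chunks."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk is marked done; nothing follows it
    return "".join(parts)


# Illustrative chunks (made up), shaped like the streaming response
sample = [
    '{"message": {"role": "assistant", "content": "<think>"}, "done": false}',
    '{"message": {"role": "assistant", "content": "140 mph closing speed"}, "done": false}',
    '{"message": {"role": "assistant", "content": "</think>About 5.7 hours."}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(join_stream(sample))
```

If you would rather receive one JSON object, add `"stream": false` to the request body.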
Understanding the Reasoning Chain Output
DeepSeek R1’s signature feature is its chain-of-thought reasoning, enclosed in <think> tags. The model reasons step-by-step before giving its final answer:
<think>
Let me work through this problem systematically.
The trains are traveling toward each other, so their speeds add: 60 + 80 = 140 mph combined closing speed.
Distance = 800 miles.
Time = Distance / Speed = 800 / 140 = 5.71 hours.
That's 5 hours and approximately 43 minutes.
Let me verify: in 5.71 hours, train 1 travels 60 × 5.71 = 342.6 miles. Train 2 travels 80 × 5.71 = 456.8 miles. Total: 342.6 + 456.8 = 799.4 ≈ 800 miles. ✓
</think>
The trains will meet approximately **5 hours and 43 minutes** after departing.
This reasoning chain is what makes R1 excel at math, coding, and logic. The model literally shows its work.
Some tools (like Open WebUI) can collapse the <think> block by default so you only see the final answer, with an option to expand the reasoning.
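If you are consuming the raw output yourself, the same collapse can be done with a few lines of code. This is a minimal sketch that assumes the reasoning arrives inline as a single `<think>...</think>` block; `split_reasoning` is a hypothetical helper, not part of any library.

```python
import re
from typing import Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from a response with an optional <think> block."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()  # no reasoning block present
    reasoning = match.group(1).strip()
    # Everything outside the matched block is treated as the final answer
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>\n800 / 140 = 5.71 hours\n</think>\nThe trains meet after about 5 hours 43 minutes."
reasoning, answer = split_reasoning(raw)
print(answer)
```

Keeping the reasoning as a separate string lets you log it for debugging while showing users only the answer.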
Running in LM Studio
LM Studio provides a GUI for running GGUF models locally:
- Download LM Studio from lmstudio.ai
- Open the Discover tab
- Search for deepseek-r1
- Select your preferred quantization:
  - Q4_K_M: best balance of quality and VRAM
  - Q5_K_M: higher quality, more VRAM
  - Q8_0: near-lossless, double the VRAM
- Click Download and wait for the model to download
- Switch to the Chat tab and load the model
LM Studio also provides a local server with OpenAI-compatible endpoints at http://localhost:1234/v1.
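Because the server speaks the OpenAI chat format, any HTTP client can talk to it. The sketch below uses only the Python standard library and assumes the usual `/chat/completions` route under that base URL. The helper names and the model identifier in the usage comment are placeholders, so substitute whatever name LM Studio shows for your loaded model.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio local server address


def build_chat_request(prompt: str, model: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (needs the LM Studio server running; the model name is a placeholder):
# print(ask("Is 221 prime?", model="deepseek-r1-distill-qwen-14b"))
```

Splitting request construction from sending keeps the payload easy to inspect without a running server.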
Manual GGUF Download
For fine-grained control over quantization, download directly from Hugging Face:
# bartowski's quantizations are community-recommended
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF \
DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf \
--local-dir ./models
# Run with llama.cpp
# Note: R1 distills use DeepSeek's own chat template (<｜User｜> / <｜Assistant｜>), not ChatML
./llama-cli \
-m ./models/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf \
-ngl 999 \
-p "<｜User｜>Explain the P vs NP problem<｜Assistant｜>" \
-n 1000
Hardware Requirements Per Model Size
| Model | Minimum VRAM | Recommended Setup | Tokens/Sec (approx) |
|---|---|---|---|
| R1-Distill 1.5B | 2GB | Any GPU | 60-100 tok/s |
| R1-Distill 7B | 6GB | RTX 3060 12GB | 30-60 tok/s |
| R1-Distill 8B | 6GB | RTX 3060 12GB | 25-50 tok/s |
| R1-Distill 14B | 10GB | RTX 4070 12GB | 20-40 tok/s |
| R1-Distill 32B | 22GB | RTX 3090/4090 | 10-25 tok/s |
| R1-Distill 70B | 45GB | 2x RTX 4090 | 5-12 tok/s |
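To turn the table into a quick lookup, the sketch below picks the largest distill whose minimum VRAM fits a given budget. The figures are copied from the table; `largest_model_for` is a hypothetical helper and ignores extra headroom for long contexts.

```python
from typing import Optional

# Minimum VRAM in GB, copied from the table above (ordered small to large)
MIN_VRAM_GB = {
    "R1-Distill 1.5B": 2,
    "R1-Distill 7B": 6,
    "R1-Distill 8B": 6,
    "R1-Distill 14B": 10,
    "R1-Distill 32B": 22,
    "R1-Distill 70B": 45,
}


def largest_model_for(vram_gb: float) -> Optional[str]:
    """Return the largest table entry whose minimum VRAM fits, or None."""
    fitting = [name for name, need in MIN_VRAM_GB.items() if need <= vram_gb]
    # Dicts preserve insertion order, so the last fitting entry is the largest
    return fitting[-1] if fitting else None


for vram in (4, 8, 12, 24, 48):
    print(f"{vram} GB -> {largest_model_for(vram)}")
```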
Best Use Cases for DeepSeek R1
DeepSeek R1’s reasoning-first training makes it exceptional at:
Coding tasks:
- Debugging complex logic errors
- Writing algorithms with correctness requirements
- Code review and refactoring suggestions
Mathematics:
- Step-by-step problem solving
- Proof verification
- Statistical analysis and formula derivation
Logic and reasoning:
- Multi-step deduction problems
- Constraint satisfaction
- Argument analysis and critique
Compared to Llama 3 and Mistral models of similar size, R1 distills generally score higher on reasoning-heavy benchmarks such as MATH, HumanEval, and GPQA. For creative writing or casual chat, the difference is minimal. For anything requiring careful reasoning, R1 is the better choice.