DeepSeek R1 arrived in January 2025 and made waves by matching or exceeding OpenAI’s o1 on several reasoning benchmarks, at a fraction of the reported training cost and with open weights. The distilled variants are straightforward to run locally and bring serious reasoning capability to consumer hardware.

The DeepSeek R1 Model Family

DeepSeek released the R1 series in multiple configurations:

Model	Parameters	VRAM Required (Q4)	Notes
DeepSeek-R1-Distill-Qwen-1.5B	1.5B	~2GB	Runs on integrated graphics
DeepSeek-R1-Distill-Qwen-7B	7B	~5GB	Good quality, any modern GPU
DeepSeek-R1-Distill-Llama-8B	8B	~6GB	Llama 3.1 8B base
DeepSeek-R1-Distill-Qwen-14B	14B	~10GB	Best distill for 12GB VRAM
DeepSeek-R1-Distill-Qwen-32B	32B	~22GB	High quality, 24GB VRAM
DeepSeek-R1-Distill-Llama-70B	70B	~45GB	Near-full R1 quality
DeepSeek-R1 (full)	671B	~400GB+	Requires data center hardware
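
The Q4 figures in the table can be approximated from the parameter count alone. The sketch below is a rough rule of thumb, not a measurement: the default of 4.85 bits per weight (an approximate figure for Q4_K_M) and the 10% allowance for runtime buffers are both assumptions, and real usage grows with context length.

```python
def estimate_vram_gb(params_billions: float,
                     bits_per_weight: float = 4.85,
                     overhead_fraction: float = 0.10) -> float:
    """Rough VRAM estimate in GB: quantized weights plus a proportional allowance.

    bits_per_weight and overhead_fraction are illustrative assumptions.
    """
    # 1e9 parameters * bits per weight / 8 bits per byte = size in GB
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * (1 + overhead_fraction)


for name, size in [("1.5B", 1.5), ("7B", 7), ("8B", 8), ("14B", 14), ("32B", 32), ("70B", 70)]:
    print(f"{name}: ~{estimate_vram_gb(size):.1f} GB")
```

Passing a higher `bits_per_weight` (for example 8 for an 8-bit quantization) shows why larger quantizations need roughly proportionally more memory.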

The distilled models are existing Qwen and Llama models fine-tuned on reasoning traces generated by DeepSeek R1. The R1-Distill-Qwen-14B in particular punches far above its weight, delivering reasoning quality that would have required a 70B model just a year earlier.

R1 vs R1-Zero

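DeepSeek-R1-Zero: trained with large-scale reinforcement learning applied directly to the DeepSeek-V3 base model, with no supervised fine-tuning beforehand. Reasoning behavior emerged on its own, but the output suffers from poor readability and language mixing.
DeepSeek-R1: starts from the same base model but adds a small supervised "cold start" dataset before reinforcement learning, followed by further fine-tuning stages. Cleaner outputs, better instruction following.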

For practical use, always prefer R1 over R1-Zero. R1-Zero is primarily interesting for research into emergent reasoning.

Running with Ollama

Ollama is the easiest path to local DeepSeek R1. Install from ollama.com if you haven’t:

# Linux install
curl -fsSL https://ollama.com/install.sh | sh

# macOS and Windows: download the installer from ollama.com

Pull and run DeepSeek R1 distilled models:

# 8B model — good for most hardware (6GB+ VRAM)
ollama pull deepseek-r1:8b
ollama run deepseek-r1:8b

# 14B model — excellent quality/size ratio
ollama pull deepseek-r1:14b
ollama run deepseek-r1:14b

# 32B model — for 24GB VRAM GPUs
ollama pull deepseek-r1:32b
ollama run deepseek-r1:32b

# 70B model — requires 48GB+ VRAM or multi-GPU
ollama pull deepseek-r1:70b
ollama run deepseek-r1:70b

Ollama automatically uses GPU acceleration when a supported GPU is available and falls back to CPU inference otherwise.

Querying via Ollama API

curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1:14b",
  "messages": [
    {
      "role": "user",
      "content": "Solve this step by step: A train leaves Chicago at 60mph. Another leaves New York at 80mph. Cities are 800 miles apart. When do they meet?"
    }
  ]
}'

# Python client (pip install ollama)
import ollama

response = ollama.chat(
    model='deepseek-r1:14b',
    messages=[{
        'role': 'user',
        'content': 'Write a Python function to find all prime numbers up to n using the Sieve of Eratosthenes'
    }]
)
print(response['message']['content'])
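
When a raw HTTP request to /api/chat does not set "stream" to false, Ollama replies with newline-delimited JSON: each line carries a fragment of the reply in message.content, and the last line has done set to true. A minimal sketch of reassembling such a stream follows; the sample lines are invented for illustration and only mimic the shape of real chunks.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[str]) -> str:
    """Concatenate message fragments from a newline-delimited JSON stream."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk marks the end of the reply
    return "".join(parts)


# Hypothetical sample chunks, shaped like streaming responses
sample = [
    '{"message": {"role": "assistant", "content": "The trains"}, "done": false}',
    '{"message": {"role": "assistant", "content": " meet."}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(join_stream(sample))
```

The same function could be fed the lines of a live HTTP response body, since it only needs an iterable of strings.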

Understanding the Reasoning Chain Output

DeepSeek R1’s signature feature is its chain-of-thought reasoning, enclosed in <think> tags. The model reasons step-by-step before giving its final answer:

<think>
Let me work through this problem systematically.

The trains are traveling toward each other, so their speeds add: 60 + 80 = 140 mph combined closing speed.
Distance = 800 miles.
Time = Distance / Speed = 800 / 140 = 5.71 hours.
That's 5 hours and approximately 43 minutes.

Let me verify: in 5.71 hours, train 1 travels 60 × 5.71 = 342.6 miles. Train 2 travels 80 × 5.71 = 456.8 miles. Total: 342.6 + 456.8 = 799.4 ≈ 800 miles. ✓
</think>

The trains will meet approximately **5 hours and 43 minutes** after departing.

This reasoning chain is what makes R1 excel at math, coding, and logic. The model literally shows its work.

Some tools (like Open WebUI) can collapse the <think> block by default so you only see the final answer, with an option to expand the reasoning.
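
If you are handling raw responses yourself, the same collapse can be done in a few lines. The sketch below assumes the reasoning arrives inline in the response text, wrapped in a single <think> block as in the example above; some clients may expose the reasoning separately, in which case this step is unnecessary. The function name is made up for this example.

```python
import re

# Non-greedy match across newlines for the first <think>...</think> block
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a response containing a <think> block."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()  # no reasoning block: everything is the answer
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = (
    "<think>\n60 + 80 = 140 mph, 800 / 140 = 5.71 hours.\n</think>\n\n"
    "The trains meet after about 5 hours and 43 minutes."
)
reasoning, answer = split_reasoning(raw)
print(answer)
```

Keeping the reasoning around rather than discarding it is useful for debugging wrong answers, since the faulty step is usually visible in the trace.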

Running in LM Studio

LM Studio provides a GUI for running GGUF models locally:

1. Download LM Studio from lmstudio.ai
2. Open the Discover tab
3. Search for deepseek-r1
4. Select your preferred quantization:
   - Q4_K_M: best balance of quality and VRAM
   - Q5_K_M: higher quality, more VRAM
   - Q8_0: near-lossless, roughly double the VRAM of Q4_K_M
5. Click Download and wait for it to finish
6. Switch to the Chat tab and load the model

LM Studio also provides a local server with OpenAI-compatible endpoints at http://localhost:1234/v1.
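
Because the server speaks the OpenAI chat-completions format, any HTTP client can talk to it. Here is a minimal standard-library sketch, assuming the server is running on the default address above. The model name is a placeholder; use whatever identifier LM Studio shows for the model you loaded.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's local server address


def build_chat_request(prompt: str, model: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Placeholder model identifier, not a guaranteed name
req = build_chat_request("Explain the P vs NP problem", "deepseek-r1-distill-qwen-14b")
print(req.full_url)
# With the server running: print(send(req))
```

Building the request separately from sending it makes the payload easy to inspect before anything touches the network.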

Manual GGUF Download

For fine-grained control over quantization, download directly from Hugging Face:

# Requires the Hugging Face CLI: pip install -U "huggingface_hub[cli]"
# bartowski's GGUF quantizations are widely used in the community
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF \
  DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf \
  --local-dir ./models

# Run with llama.cpp (-ngl 999 offloads all layers to the GPU).
# The R1 distills use DeepSeek's chat template, not ChatML.
./llama-cli \
  -m ./models/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf \
  -ngl 999 \
  -p "<｜User｜>Explain the P vs NP problem<｜Assistant｜>" \
  -n 1000

Hardware Requirements Per Model Size

Model	Minimum VRAM	Recommended Setup	Tokens/Sec (approx)
R1-Distill 1.5B	2GB	Any GPU	60-100 tok/s
R1-Distill 7B	6GB	RTX 3060 12GB	30-60 tok/s
R1-Distill 8B	6GB	RTX 3060 12GB	25-50 tok/s
R1-Distill 14B	10GB	RTX 4070 12GB	20-40 tok/s
R1-Distill 32B	22GB	RTX 3090/4090	10-25 tok/s
R1-Distill 70B	45GB	2x RTX 4090	5-12 tok/s
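
The table can be turned into a small lookup for picking a model. The sketch below uses the minimum-VRAM column as-is; the 1.5b and 7b tags are assumed to follow the same naming pattern as the Ollama tags used earlier, and the function name is invented for this example.

```python
from typing import Optional

# (Ollama tag, minimum VRAM in GB) from the table above, smallest first
MODELS = [
    ("deepseek-r1:1.5b", 2),
    ("deepseek-r1:7b", 6),
    ("deepseek-r1:8b", 6),
    ("deepseek-r1:14b", 10),
    ("deepseek-r1:32b", 22),
    ("deepseek-r1:70b", 45),
]


def largest_model_for(vram_gb: float) -> Optional[str]:
    """Return the largest listed model whose minimum VRAM fits, or None."""
    fitting = [tag for tag, min_gb in MODELS if min_gb <= vram_gb]
    return fitting[-1] if fitting else None


for vram in (4, 8, 12, 24, 48):
    print(f"{vram} GB -> {largest_model_for(vram)}")
```

This picks by the minimum figure only; leaving a few gigabytes of headroom for longer contexts is a sensible adjustment.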

Best Use Cases for DeepSeek R1

DeepSeek R1’s reasoning-first training makes it exceptional at:

Coding tasks:

Debugging complex logic errors
Writing algorithms with correctness requirements
Code review and refactoring suggestions

Mathematics:

Step-by-step problem solving
Proof verification
Statistical analysis and formula derivation

Logic and reasoning:

Multi-step deduction problems
Constraint satisfaction
Argument analysis and critique

Compared to Llama 3 and Mistral models of similar size, the R1 distills generally score higher on reasoning-heavy benchmarks such as MATH, HumanEval, and GPQA. For creative writing or casual chat, the difference is minimal. For anything requiring careful reasoning, R1 is the better choice.