How to Run Llama 3.1 and 3.2 Locally on Your Machine
Meta’s Llama 3.1 and 3.2 models represent some of the most capable open-weight LLMs you can run entirely on your own hardware — no API keys, no usage fees, and full data privacy. This guide covers the two most popular methods: Ollama for terminal users and LM Studio for those who prefer a GUI.
Why Run Llama Locally?
- Privacy: Your prompts never leave your machine
- Cost: No per-token fees after the initial hardware investment
- Customization: Fine-tune, modify system prompts, or integrate into local workflows
- Offline use: Works without an internet connection
Hardware Requirements
The most important factor is VRAM (for GPU inference) or RAM (for CPU inference). Here’s a practical guide:
VRAM Requirements by Model Size
| Model | Quant | VRAM Required | Recommended GPU |
|---|---|---|---|
| Llama 3.2 1B | Q4_K_M | 1.5 GB | Any modern GPU |
| Llama 3.2 3B | Q4_K_M | 2.5 GB | GTX 1060+ |
| Llama 3.1 8B | Q4_K_M | 5 GB | RTX 3060+ |
| Llama 3.1 8B | Q8_0 | 9 GB | RTX 3080+ |
| Llama 3.1 70B | Q4_K_M | 40 GB | 2x RTX 3090 or A100 (80 GB) |
| Llama 3.1 405B | Q4_K_M | 220 GB | Multi-GPU cluster |
If your GPU doesn’t have enough VRAM, some or all of the model’s layers run on the CPU from system RAM. This works but is significantly slower (roughly 2–5 tokens/sec on CPU vs 30–80 tokens/sec on a GPU for an 8B model).
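As a quick way to read the table, the sketch below filters its rows by available VRAM. The `headroom_gb` margin is an assumption added here (room for the KV cache and other GPU users), not a figure from the table.

```python
# Rows copied from the VRAM table above: (model, quantization, VRAM in GB).
VRAM_TABLE = [
    ("Llama 3.2 1B", "Q4_K_M", 1.5),
    ("Llama 3.2 3B", "Q4_K_M", 2.5),
    ("Llama 3.1 8B", "Q4_K_M", 5.0),
    ("Llama 3.1 8B", "Q8_0", 9.0),
    ("Llama 3.1 70B", "Q4_K_M", 40.0),
    ("Llama 3.1 405B", "Q4_K_M", 220.0),
]


def models_that_fit(vram_gb, headroom_gb=1.0):
    """Return (model, quant) pairs whose listed VRAM fits within vram_gb.

    headroom_gb is an assumed safety margin, not a value from the table.
    """
    budget = vram_gb - headroom_gb
    return [(model, quant) for model, quant, need in VRAM_TABLE if need <= budget]


if __name__ == "__main__":
    for vram in (6, 12, 24):
        print(f"{vram} GB:", models_that_fit(vram))
```

With the default margin, a 6 GB card gets the 1B, 3B, and 8B Q4_K_M rows; a 12 GB card adds the 8B Q8_0 row.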
Method 1: Ollama (Recommended for Developers)
Ollama is the easiest way to run LLMs locally. It handles model downloading and quantization selection, and it exposes an OpenAI-compatible API.
Installation
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# macOS and Windows: download the installer from ollama.com
# Or, on Windows, via winget:
winget install Ollama.Ollama
Pulling and Running Llama Models
# Pull Llama 3.1 8B (best all-around choice for most hardware)
ollama pull llama3.1:8b
# Pull the smaller 3.2 3B model for limited VRAM
ollama pull llama3.2:3b
# Pull the vision-capable 3.2 11B multimodal model (published under its own name)
ollama pull llama3.2-vision:11b
# Run interactively in the terminal
ollama run llama3.1:8b
Specifying Quantization
Ollama’s default pull uses Q4_K_M quantization. To use a specific quantization level:
# Higher quality, more VRAM
ollama pull llama3.1:8b-instruct-q8_0
# Lower VRAM, slightly lower quality
ollama pull llama3.1:8b-instruct-q4_0
Using the Ollama API
Ollama runs a local server at http://localhost:11434 with an OpenAI-compatible endpoint:
from openai import OpenAI
# The client requires an api_key, but Ollama ignores its value
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.1:8b",  # must match a model you have already pulled
    messages=[{"role": "user", "content": "Explain gradient descent simply."}],
)
print(response.choices[0].message.content)
This means any tool that supports OpenAI’s API can point to Ollama instead, including VS Code extensions, Python apps, and LangChain.
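To make "OpenAI-compatible" concrete, here is a minimal sketch of the same call using only the Python standard library. The helper names `build_chat_request` and `ask` are hypothetical, chosen for illustration; the sketch assumes Ollama is serving on its default port and that the model has already been pulled.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(prompt, model="llama3.1:8b", url=OLLAMA_CHAT_URL):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Requires a running Ollama server; raises URLError if none is reachable.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("Explain gradient descent simply.")` should return the same kind of text as the client example above. Splitting request construction from sending keeps the payload easy to inspect without a server running.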
Useful Ollama Commands
ollama list # List downloaded models
ollama ps # Show running models
ollama rm llama3.1:8b # Remove a model
ollama show llama3.1:8b # Show model details
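The list of downloaded models is also available over HTTP from Ollama's native `/api/tags` endpoint, which can be handy in scripts. The sketch below is one possible way to summarize that response; the function names are hypothetical, and it relies only on the `name` and `size` (bytes) fields of each entry.

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def summarize_models(tags_response):
    """Turn an /api/tags response dict into (name, size in GB) pairs, largest first."""
    rows = [
        (m["name"], round(m["size"] / 1e9, 1))
        for m in tags_response.get("models", [])
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_models(url=OLLAMA_TAGS_URL):
    """Query a running Ollama server for its downloaded models."""
    with urllib.request.urlopen(url) as resp:
        return summarize_models(json.load(resp))
```

`fetch_models()` needs the Ollama server to be running; `summarize_models` can be exercised on any saved response.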
Method 2: LM Studio (GUI Approach)
LM Studio provides a polished desktop application for downloading, managing, and chatting with local models. It’s ideal for non-developers or those who prefer a visual interface.
Installation
Download from lmstudio.ai — available for Windows, macOS (Apple Silicon optimized), and Linux.
Finding and Downloading Llama 3.1
- Open LM Studio and go to the Search tab
- Type llama-3.1 in the search bar
- Filter by Publisher: Meta to find official releases
- Choose your quantization (Q4_K_M is the default recommendation)
- Click Download; models are stored in ~/.lmstudio/models/
Running a Model in LM Studio
- Go to the Chat tab
- Click the model selector at the top and choose your downloaded model
- Adjust context length, temperature, and GPU layers in the sidebar
- Start chatting
LM Studio also has a built-in local API server (OpenAI-compatible) you can enable under the Local Server tab, making it a drop-in replacement for the OpenAI API.
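In practice, "drop-in replacement" means only the base URL changes. The sketch below captures that as a small configuration object. The `Backend` class and `make_client` helper are hypothetical, and the LM Studio address assumes the server is listening on port 1234; check the Local Server tab for the address your installation actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    base_url: str
    api_key: str  # placeholder; local servers typically ignore it

    def client_kwargs(self):
        """Keyword arguments to pass to openai.OpenAI(...)."""
        return {"base_url": self.base_url, "api_key": self.api_key}


OLLAMA = Backend("ollama", "http://localhost:11434/v1", "ollama")
# Assumption: LM Studio's server is on port 1234; confirm in the Local Server tab.
LM_STUDIO = Backend("lm-studio", "http://localhost:1234/v1", "lm-studio")


def make_client(backend):
    """Create an OpenAI client for the given local backend."""
    # Imported lazily so the configuration above works without the package installed.
    from openai import OpenAI

    return OpenAI(**backend.client_kwargs())
```

Model identifiers differ between the two tools, so pass the identifier LM Studio shows for the loaded model rather than an Ollama tag such as `llama3.1:8b`.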
Understanding Quantization
Quantization reduces model file size and VRAM requirements by compressing the model’s weights. The tradeoff is slight quality loss.
Common Quantization Formats
| Format | Size vs FP16 | Quality Loss | Best Use Case |
|---|---|---|---|
| Q8_0 | ~50% | Minimal | High VRAM, max quality |
| Q6_K | ~40% | Very low | Good balance |
| Q4_K_M | ~30% | Low | Most common, best balance |
| Q4_0 | ~25% | Moderate | Low VRAM systems |
| Q2_K | ~15% | High | Extreme VRAM constraints |
Q4_K_M is the sweet spot for most users — it fits 8B models on 6GB VRAM GPUs while retaining roughly 99% of the full-precision model’s benchmark scores.
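The ratios in the table translate into rough file sizes with simple arithmetic: an FP16 model stores 2 bytes per parameter, so an 8B model is about 16 GB before quantization. The following is a back-of-the-envelope sketch using the table's approximate ratios; real files vary by a few percent, and the figure covers weights only, not the KV cache.

```python
# Approximate file size relative to FP16, taken from the table above.
SIZE_RATIO = {"Q8_0": 0.50, "Q6_K": 0.40, "Q4_K_M": 0.30, "Q4_0": 0.25, "Q2_K": 0.15}

BYTES_PER_FP16_WEIGHT = 2


def estimate_weights_gb(params_billion, quant):
    """Estimate the size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    # One billion parameters at 2 bytes each is 2 GB in FP16.
    fp16_gb = params_billion * BYTES_PER_FP16_WEIGHT
    return fp16_gb * SIZE_RATIO[quant]


if __name__ == "__main__":
    for quant in SIZE_RATIO:
        print(f"8B {quant}: ~{estimate_weights_gb(8, quant):.1f} GB")
```

For an 8B model at Q4_K_M this gives about 4.8 GB, which is why it fits on a 6 GB card with a little room left for context.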
Performance Benchmarks
Real-world performance on common hardware (tokens per second, 8B Q4_K_M):
| Hardware | Tokens/sec (8B Q4_K_M) |
|---|---|
| M3 Pro (18GB unified) | ~55 t/s |
| RTX 4090 (24GB) | ~120 t/s |
| RTX 3080 (10GB) | ~45 t/s |
| RTX 3060 (12GB) | ~35 t/s |
| CPU only (Ryzen 9) | ~5 t/s |
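To check these numbers on your own hardware, Ollama's native `/api/generate` endpoint reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds) in its non-streaming response. The sketch below turns those two fields into tokens per second; the function names and the default prompt are illustrative choices, and a single run is only a rough measurement.

```python
import json
import urllib.request

OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response):
    """Compute generation speed from an Ollama /api/generate response dict.

    eval_count is the number of generated tokens and eval_duration is the
    generation time in nanoseconds.
    """
    return response["eval_count"] / response["eval_duration"] * 1e9


def benchmark(model="llama3.1:8b", prompt="Explain gradient descent simply."):
    """Run one non-streaming generation against a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_GENERATE_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return tokens_per_second(json.load(resp))
```

`benchmark()` requires a running server with the model pulled. Running it a few times and discarding the first result avoids counting model load time in your impression of speed.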
Llama 3.1 vs 3.2: Which to Use?
- Llama 3.1 8B/70B: Best for general text tasks, coding, reasoning. The 8B punches above its weight and is the go-to for most local setups.
- Llama 3.2 1B/3B: Designed for on-device use — smartphones, embedded systems, edge deployments. Very fast but less capable on complex tasks.
- Llama 3.2 11B/90B Vision: Adds multimodal (image understanding) capability. The 11B is practical for most GPU setups.
Tips for Getting the Best Results
- Use a system prompt: Tell the model its role and expected output format
- Adjust context length: Larger context = more VRAM; default 4096 is fine for most tasks
- Enable GPU offloading: In LM Studio, set GPU layers to max for your VRAM budget
- Use Modelfiles in Ollama to create custom model configurations with default parameters
# Example Ollama Modelfile
cat > Modelfile << 'EOF'
FROM llama3.1:8b
SYSTEM "You are a helpful coding assistant. Always provide working code examples."
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
EOF
ollama create my-coder -f Modelfile
ollama run my-coder
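The "larger context = more VRAM" tip comes down to the key/value cache, which grows linearly with context length. The estimate below is a sketch that assumes the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) and 16-bit cache values; runtimes also allocate extra buffers, so treat the result as a lower bound.

```python
def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate KV cache size in bytes for one sequence.

    Defaults assume the Llama 3.1 8B architecture with a 16-bit cache.
    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


if __name__ == "__main__":
    for ctx in (4096, 8192, 32768):
        print(f"num_ctx={ctx}: {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
```

Under these assumptions, raising `num_ctx` from 4096 to 8192, as the Modelfile above does, doubles the cache from 0.5 GiB to 1 GiB on top of the model weights.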
Running Llama locally has never been more accessible. With Ollama or LM Studio, you can have a powerful AI assistant running on your own hardware in under 10 minutes.