How to Run Llama 3.1 and 3.2 Locally on Your Machine

Meta’s Llama 3.1 and 3.2 models are among the most capable open-weight LLMs you can run entirely on your own hardware, with no API keys, no usage fees, and full data privacy. This guide covers the two most popular methods: Ollama for terminal users and LM Studio for those who prefer a GUI.

Why Run Llama Locally?

Privacy: Your prompts never leave your machine
Cost: No per-token fees after the initial hardware investment
Customization: Fine-tune, modify system prompts, or integrate into local workflows
Offline use: Works without an internet connection

Hardware Requirements

The most important factor is VRAM (for GPU inference) or RAM (for CPU inference). Here’s a practical guide:

VRAM Requirements by Model Size

Model	Quant	VRAM Required	Recommended GPU
Llama 3.2 1B	Q4_K_M	1.5 GB	Any modern GPU
Llama 3.2 3B	Q4_K_M	2.5 GB	GTX 1060+
Llama 3.1 8B	Q4_K_M	5 GB	RTX 3060+
Llama 3.1 8B	Q8_0	9 GB	RTX 3080+
Llama 3.1 70B	Q4_K_M	43 GB	2x RTX 3090 or A100 80GB
Llama 3.1 405B	Q4_K_M	245 GB	Multi-GPU cluster

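If the model doesn’t fit in your GPU’s VRAM, Ollama and LM Studio can keep only some layers on the GPU and run the rest on the CPU from system RAM, or run entirely on the CPU. This works but is significantly slower (2–5 tokens/sec on CPU vs 30–80 tokens/sec on GPU).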
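As a rough cross-check on the table, memory use can be estimated from the parameter count, the average bits per weight, and the context length. The sketch below is an approximation, not a measurement: the bits-per-weight values are assumed averages, and the function names are made up for this example. The layer and head counts are those of Llama 3.1 8B.

```python
# Approximate average bits per weight for a few GGUF quantization types.
# These are assumed values for illustration, not exact figures.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "FP16": 16.0}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimate the size of the model weights in GB (10^9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * bits / 8  # billions of params * bytes per param


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: int = 2) -> float:
    """Estimate the KV cache size in GB for a given context length.

    The leading factor 2 accounts for storing both keys and values;
    bytes_per_value=2 assumes a 16-bit cache.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_len / 1e9


# Llama 3.1 8B: 32 layers, 8 KV heads, head dimension 128.
weights = weights_gb(8.0, "Q4_K_M")
cache = kv_cache_gb(32, 8, 128, context_len=8192)
print(f"weights ~{weights:.1f} GB, KV cache ~{cache:.1f} GB")
```

Real usage is somewhat higher than the sum of the two numbers because the runtime also needs scratch buffers, so treat the result as a lower bound.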

Method 1: Ollama (Recommended for Developers)

Ollama is the easiest way to run LLMs locally. It handles model downloads and quantization selection, and it exposes an OpenAI-compatible API.

Installation

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# macOS and Windows: download the installer from ollama.com
# Or, on Windows, via winget:
winget install Ollama.Ollama
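After installing, it can be handy to confirm that the Ollama server is actually listening before pulling any models. The helper below is a minimal sketch using only the standard library; the function name is hypothetical, and it assumes the default address http://localhost:11434.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling `ollama_is_running()` should return True once the server has started; if it returns False, start the Ollama app or run `ollama serve` and try again.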

Pulling and Running Llama Models

# Pull Llama 3.1 8B (best all-around choice for most hardware)
ollama pull llama3.1:8b

# Pull the smaller 3.2 3B model for limited VRAM
ollama pull llama3.2:3b

# Pull the vision-capable 3.2 11B multimodal model
ollama pull llama3.2-vision:11b

# Run interactively in the terminal
ollama run llama3.1:8b

Specifying Quantization

The default tags for these models pull a Q4_K_M build. To choose a different quantization level, use a tag that names it explicitly:

# Higher quality, more VRAM
ollama pull llama3.1:8b-instruct-q8_0

# Lower VRAM, slightly less quality
ollama pull llama3.1:8b-instruct-q4_0

Using the Ollama API

Ollama runs a local server at http://localhost:11434 with an OpenAI-compatible endpoint:

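# Requires the openai package: pip install openai
from openai import OpenAI

# The client requires an API key, but Ollama ignores its value.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")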

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain gradient descent simply."}]
)
print(response.choices[0].message.content)

This means any tool that speaks OpenAI’s API and lets you set a custom base URL can point to Ollama instead, including VS Code extensions, Python apps, and LangChain.
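For interactive use you will usually want tokens to appear as they are generated. OpenAI-compatible servers stream responses as server-sent events: lines of the form `data: {...}`, ending with `data: [DONE]`. The sketch below reads that stream with only the standard library; the function names are illustrative, and it assumes the default Ollama address and the llama3.1:8b model.

```python
import json
import urllib.request
from typing import Iterator, Optional


def parse_sse_line(line: str) -> Optional[str]:
    """Extract the text fragment from one 'data: ...' line, if any."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator lines and anything else
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream marker
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return None
    return (choices[0].get("delta") or {}).get("content")


def stream_chat(prompt: str, model: str = "llama3.1:8b",
                base_url: str = "http://localhost:11434/v1") -> Iterator[str]:
    """Yield response text fragments as the server produces them."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # the response body is iterable line by line
            text = parse_sse_line(raw.decode("utf-8"))
            if text:
                yield text
```

With the server running, `for piece in stream_chat("Explain gradient descent simply."): print(piece, end="", flush=True)` prints the answer incrementally. The openai client offers the same behavior through `stream=True`.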

Useful Ollama Commands

ollama list              # List downloaded models
ollama ps                # Show running models
ollama rm llama3.1:8b    # Remove a model
ollama show llama3.1:8b  # Show model details
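The same information is available programmatically. Ollama’s native REST API exposes the downloaded models at `/api/tags`, which is roughly what `ollama list` prints. The following is a small sketch with hypothetical helper names, assuming the default server address.

```python
import json
import urllib.request


def summarize_models(tags_response: dict) -> list[tuple[str, float]]:
    """Turn an /api/tags response into (name, size in GB) pairs."""
    return [
        (m["name"], round(m["size"] / 1e9, 1))
        for m in tags_response.get("models", [])
    ]


def list_local_models(base_url: str = "http://localhost:11434") -> list[tuple[str, float]]:
    """Fetch the locally downloaded models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return summarize_models(json.load(resp))
```

Calling `list_local_models()` against a running server returns one pair per downloaded model, which is convenient for scripts that need to check whether a model is present before using it.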

Method 2: LM Studio (GUI Approach)

LM Studio provides a polished desktop application for downloading, managing, and chatting with local models. It’s ideal for non-developers or those who prefer a visual interface.

Installation

Download LM Studio from lmstudio.ai. It is available for Windows, macOS (optimized for Apple Silicon), and Linux.

Finding and Downloading Llama 3.1

Open LM Studio and go to the Search tab
Type llama-3.1 in the search bar
Filter by Publisher: Meta to find official releases
Choose your quantization (Q4_K_M is the default recommendation)
Click Download; by default, recent versions store models in ~/.lmstudio/models/

Running a Model in LM Studio

Go to the Chat tab
Click the model selector at the top and choose your downloaded model
Adjust context length, temperature, and GPU layers in the sidebar
Start chatting

LM Studio also has a built-in local API server (OpenAI-compatible) you can enable under the Local Server tab, making it a drop-in replacement for the OpenAI API.
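Because both servers speak the same protocol, switching between them is mostly a matter of changing the base URL. The sketch below assumes LM Studio’s usual default port of 1234 (check the server panel in the app if yours differs); the dictionary and function names are made up for this example.

```python
import json
import urllib.request

# Default local endpoints. The LM Studio port is an assumption (its usual
# default); adjust it if the app shows a different one.
BASE_URLS = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
}


def models_url(backend: str) -> str:
    """Return the OpenAI-style model listing URL for a backend."""
    return f"{BASE_URLS[backend]}/models"


def available_model_ids(backend: str) -> list[str]:
    """Ask the running server which model IDs it exposes."""
    with urllib.request.urlopen(models_url(backend)) as resp:
        data = json.load(resp)
    return [m["id"] for m in data.get("data", [])]
```

The IDs returned by `available_model_ids("lmstudio")` are the values to pass as `model` in chat requests. With the openai client, the equivalent is `OpenAI(base_url=BASE_URLS["lmstudio"], api_key="lm-studio")`, where the key can be any placeholder string.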

Understanding Quantization

Quantization reduces model file size and VRAM requirements by storing the model’s weights at lower numerical precision. The tradeoff is some quality loss, which grows as the bit width shrinks.

Common Quantization Formats

Format	Size vs FP16	Quality Loss	Best Use Case
Q8_0	~50%	Minimal	High VRAM, max quality
Q6_K	~40%	Very low	Good balance
Q4_K_M	~30%	Low	Most common, best balance
Q4_0	~25%	Moderate	Low VRAM systems
Q2_K	~15%	High	Extreme VRAM constraints

Q4_K_M is the sweet spot for most users: it fits 8B models on GPUs with 6 GB of VRAM while retaining roughly 99% of the full-precision model’s benchmark scores, though the exact loss varies by model and benchmark.
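To make the idea concrete, here is a toy version of block-wise quantization: each block of weights is stored as small integers plus one shared scale. This is a deliberately simplified scheme for illustration only. It is not the actual Q4_K_M layout, and the function names and sample weights are invented for the example.

```python
def quantize_block(weights: list[float], bits: int = 4) -> tuple[float, list[int]]:
    """Quantize one block of weights to signed integers plus a shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit values
    peak = max(abs(w) for w in weights)
    scale = peak / qmax if peak else 1.0
    quants = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, quants


def dequantize_block(scale: float, quants: list[int]) -> list[float]:
    """Recover approximate weights from the scale and the integers."""
    return [q * scale for q in quants]


block = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.4f} quants={quants} max_error={max_error:.4f}")
```

Each restored weight differs from the original by at most half the scale. Fewer bits mean a coarser grid and larger errors, which is the quality loss shown in the table above.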

Performance Benchmarks

Approximate performance on common hardware (tokens per second, 8B Q4_K_M). Actual speed varies with context length, drivers, and runtime version:

Hardware	Tokens/sec (8B Q4_K_M)
M3 Pro (18GB unified)	~55 t/s
RTX 4090 (24GB)	~120 t/s
RTX 3080 (10GB)	~45 t/s
RTX 3060 (12GB)	~35 t/s
CPU only (Ryzen 9)	~5 t/s
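You can measure your own hardware instead of relying on published figures. A non-streaming call to Ollama’s `/api/generate` endpoint returns `eval_count` (generated tokens) and `eval_duration` (generation time in nanoseconds), from which tokens per second follows directly. The helper names below are hypothetical, and the sketch assumes the default server address.

```python
import json
import urllib.request


def tokens_per_second(result: dict) -> float:
    """Compute generation speed from an Ollama /api/generate response.

    eval_count is the number of generated tokens and eval_duration is
    the time spent generating them, in nanoseconds.
    """
    return result["eval_count"] / result["eval_duration"] * 1e9


def benchmark(prompt: str, model: str = "llama3.1:8b",
              base_url: str = "http://localhost:11434") -> float:
    """Run one non-streaming generation and return tokens per second."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return tokens_per_second(json.load(resp))
```

Run `benchmark("Write a short story about a robot.")` a few times and average the results, since the first call also includes the time needed to load the model into memory.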

Llama 3.1 vs 3.2: Which to Use?

Llama 3.1 8B/70B: Best for general text tasks, coding, reasoning. The 8B punches above its weight and is the go-to for most local setups.
Llama 3.2 1B/3B: Designed for on-device use — smartphones, embedded systems, edge deployments. Very fast but less capable on complex tasks.
Llama 3.2 11B/90B Vision: Adds multimodal (image understanding) capability. The 11B is practical for most GPU setups.
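For the vision models, Ollama’s native `/api/chat` endpoint accepts images as base64-encoded strings attached to a message. The sketch below only builds the request body; it assumes the tag `llama3.2-vision:11b`, which is the name used for the vision model in Ollama’s library, and the function name is made up for this example.

```python
import base64


def vision_chat_payload(image_bytes: bytes, question: str,
                        model: str = "llama3.2-vision:11b") -> dict:
    """Build a request body for Ollama's /api/chat with one attached image."""
    return {
        "model": model,
        "stream": False,
        "messages": [{
            "role": "user",
            "content": question,
            # Images are sent as base64-encoded strings.
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
    }
```

To use it, read an image with `open(path, "rb").read()`, pass the bytes to `vision_chat_payload`, and POST the JSON-encoded result to http://localhost:11434/api/chat. The answer is in the `message.content` field of the response.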

Tips for Getting the Best Results

Use a system prompt: Tell the model its role and expected output format
Adjust context length: A larger context uses more VRAM; a 4096-token context is fine for most tasks
Enable GPU offloading: In LM Studio, set GPU layers to max for your VRAM budget
Use Modelfiles: In Ollama, create custom model configurations with default parameters

# Example Ollama Modelfile
cat > Modelfile << 'EOF'
FROM llama3.1:8b
SYSTEM "You are a helpful coding assistant. Always provide working code examples."
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
EOF

ollama create my-coder -f Modelfile
ollama run my-coder
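If you would rather not create a separate model, the same settings can be supplied per request. Ollama’s native `/api/chat` endpoint accepts a system message and an `options` object with parameters such as `temperature` and `num_ctx`. The sketch below mirrors the Modelfile above; the constant and function names are illustrative.

```python
SYSTEM_PROMPT = ("You are a helpful coding assistant. "
                 "Always provide working code examples.")


def chat_payload(user_message: str,
                 system_prompt: str = SYSTEM_PROMPT,
                 temperature: float = 0.3,
                 num_ctx: int = 8192,
                 model: str = "llama3.1:8b") -> dict:
    """Build an Ollama /api/chat body mirroring the Modelfile settings."""
    return {
        "model": model,
        "stream": False,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        # Per-request overrides of model parameters.
        "options": {"temperature": temperature, "num_ctx": num_ctx},
    }
```

POST the JSON-encoded payload to http://localhost:11434/api/chat to use it. A Modelfile is more convenient when the same configuration is reused across tools, while per-request options suit experiments where the values change often.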

Running Llama locally has never been more accessible. With Ollama or LM Studio, you can have a powerful AI assistant running on your own hardware in under 10 minutes.