How to Run Llama 3.1 and 3.2 Locally on Your Machine

Complete guide to running Llama 3.1 and 3.2 locally using Ollama and LM Studio. Includes VRAM tables, quantization options, and benchmark comparisons.

Meta’s Llama 3.1 and 3.2 models represent some of the most capable open-weight LLMs you can run entirely on your own hardware — no API keys, no usage fees, and full data privacy. This guide covers the two most popular methods: Ollama for terminal users and LM Studio for those who prefer a GUI.

Why Run Llama Locally?

  • Privacy: Your prompts never leave your machine
  • Cost: No per-token fees after the initial hardware investment
  • Customization: Fine-tune, modify system prompts, or integrate into local workflows
  • Offline use: Works without an internet connection

Hardware Requirements

The most important factor is VRAM (for GPU inference) or RAM (for CPU inference). Here’s a practical guide:

VRAM Requirements by Model Size

| Model | Quant | VRAM Required | Recommended GPU |
| --- | --- | --- | --- |
| Llama 3.2 1B | Q4_K_M | 1.5 GB | Any modern GPU |
| Llama 3.2 3B | Q4_K_M | 2.5 GB | GTX 1060+ |
| Llama 3.1 8B | Q4_K_M | 5 GB | RTX 3060+ |
| Llama 3.1 8B | Q8_0 | 9 GB | RTX 3080+ |
| Llama 3.1 70B | Q4_K_M | 40 GB | 2x RTX 3090 or A100 |
| Llama 3.1 405B | Q4_K_M | 220 GB | Multi-GPU cluster |

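If your GPU doesn’t have enough VRAM, the layers that don’t fit are run on the CPU from system RAM, and with no usable GPU the whole model runs on the CPU. This works but is significantly slower: expect roughly 2–5 tokens/sec for CPU-only inference versus 30–80 tokens/sec or more when the model fits entirely on the GPU.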
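To make the table easier to apply, here is a minimal sketch that picks the most demanding entry that still fits a given amount of VRAM. The figures are copied from the table above; the `largest_fit` helper and its 1 GB headroom for context and runtime buffers are illustrative assumptions, not measured values.

```python
from typing import Optional

# (model, quantization, approximate VRAM in GB), copied from the table above.
VRAM_TABLE = [
    ("Llama 3.2 1B", "Q4_K_M", 1.5),
    ("Llama 3.2 3B", "Q4_K_M", 2.5),
    ("Llama 3.1 8B", "Q4_K_M", 5.0),
    ("Llama 3.1 8B", "Q8_0", 9.0),
    ("Llama 3.1 70B", "Q4_K_M", 40.0),
    ("Llama 3.1 405B", "Q4_K_M", 220.0),
]


def largest_fit(vram_gb: float, headroom_gb: float = 1.0) -> Optional[tuple]:
    """Return the most demanding table entry that fits, or None if nothing does.

    headroom_gb is an assumed reserve for the context cache and runtime buffers.
    """
    budget = vram_gb - headroom_gb
    candidates = [row for row in VRAM_TABLE if row[2] <= budget]
    return max(candidates, key=lambda row: row[2]) if candidates else None


if __name__ == "__main__":
    for vram in (4, 8, 12, 24, 48):
        print(f"{vram} GB ->", largest_fit(vram))
```

If nothing fits, the function returns `None`, which corresponds to the CPU fallback described above.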

Method 1: Ollama (Terminal Approach)

Ollama is the easiest way to run LLMs locally. It downloads models, picks a default quantization, and exposes an OpenAI-compatible API.

Installation

# Linux/macOS
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from ollama.com
# Or via winget:
winget install Ollama.Ollama

Pulling and Running Llama Models

# Pull Llama 3.1 8B (best all-around choice for most hardware)
ollama pull llama3.1:8b

# Pull the smaller 3.2 3B model for limited VRAM
ollama pull llama3.2:3b

# Pull the vision-capable 3.2 11B multimodal model (published under the separate llama3.2-vision name)
ollama pull llama3.2-vision:11b

# Run interactively in the terminal
ollama run llama3.1:8b

Specifying Quantization

Ollama’s default tags for these models use Q4_K_M quantization. To choose a different level, pull a tag that names it explicitly:

# Higher quality, more VRAM
ollama pull llama3.1:8b-instruct-q8_0

# Lower VRAM, slightly less quality
ollama pull llama3.1:8b-instruct-q4_0

Using the Ollama API

Ollama runs a local server at http://localhost:11434 with an OpenAI-compatible endpoint:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain gradient descent simply."}]
)
print(response.choices[0].message.content)

This means any tool that supports OpenAI’s API can point to Ollama instead, including VS Code extensions, Python apps, and LangChain.
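Besides the OpenAI-compatible endpoint, Ollama has a native `/api/chat` endpoint that streams the reply as newline-delimited JSON. The sketch below uses only the standard library and assumes the default port and the `llama3.1:8b` model from earlier; `iter_chunks` and `stream_chat` are illustrative names, not part of any Ollama client.

```python
import json
import urllib.request
from typing import Iterable, Iterator

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # default local address


def iter_chunks(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text fragment carried by each newline-delimited JSON line."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        event = json.loads(raw)
        yield event.get("message", {}).get("content", "")
        if event.get("done"):
            break  # the final event marks the end of the reply


def stream_chat(prompt: str, model: str = "llama3.1:8b") -> Iterator[str]:
    """POST a single-turn chat and yield the reply as it is generated."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_CHAT_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        yield from iter_chunks(response)


# Usage (requires a running Ollama server with the model pulled):
#   for piece in stream_chat("Explain gradient descent simply."):
#       print(piece, end="", flush=True)
```

Streaming is worth the extra code on slower hardware, where waiting for a complete response can take many seconds.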

Useful Ollama Commands

ollama list               # List downloaded models
ollama ps                 # Show running models
ollama rm llama3.1:8b     # Remove a model
ollama show llama3.1:8b   # Show model details
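The same inventory is available programmatically. As a rough sketch, the native `/api/tags` endpoint returns the downloaded models as JSON; the helper names and the output format below are assumptions made for illustration.

```python
import json
import urllib.request


def format_models(tags_response: dict) -> list[str]:
    """Turn an /api/tags response into 'name  size' lines, largest first."""
    models = sorted(
        tags_response.get("models", []),
        key=lambda m: m.get("size", 0),
        reverse=True,
    )
    return [f"{m['name']}  {m.get('size', 0) / 1e9:.1f} GB" for m in models]


def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Fetch and format the models known to a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as response:
        return format_models(json.load(response))


# Usage (requires a running Ollama server):
#   print("\n".join(list_local_models()))
```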

Method 2: LM Studio (GUI Approach)

LM Studio provides a polished desktop application for downloading, managing, and chatting with local models. It’s ideal for non-developers or those who prefer a visual interface.

Installation

Download from lmstudio.ai — available for Windows, macOS (Apple Silicon optimized), and Linux.

Finding and Downloading Llama 3.1

  1. Open LM Studio and go to the Search tab
  2. Type llama-3.1 in the search bar
  3. Filter by Publisher: Meta to find official releases
  4. Choose your quantization (Q4_K_M is the default recommendation)
  5. Click Download — models are stored in ~/.lmstudio/models/

Running a Model in LM Studio

  1. Go to the Chat tab
  2. Click the model selector at the top and choose your downloaded model
  3. Adjust context length, temperature, and GPU layers in the sidebar
  4. Start chatting

LM Studio also has a built-in local API server (OpenAI-compatible) you can enable under the Local Server tab, making it a drop-in replacement for the OpenAI API.
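Because both tools speak the same protocol, switching between them can come down to changing the base URL. The sketch below assumes LM Studio's server is enabled on its default port 1234 (configurable in the app) and uses a placeholder model identifier; copy the real identifier from LM Studio's interface. `BACKENDS`, `build_chat_request`, and `ask` are illustrative names.

```python
import json
import urllib.request

# Default local endpoints; adjust if you changed either port.
BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
}


def build_chat_request(backend: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the chosen backend."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BACKENDS[backend]}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(backend: str, model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(backend, model, prompt)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Usage (requires the corresponding server to be running):
#   print(ask("ollama", "llama3.1:8b", "Say hello."))
#   print(ask("lmstudio", "your-model-identifier", "Say hello."))  # placeholder name
```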

Understanding Quantization

Quantization reduces model file size and VRAM requirements by compressing the model’s weights. The tradeoff is slight quality loss.

Common Quantization Formats

| Format | Size vs FP16 | Quality Loss | Best Use Case |
| --- | --- | --- | --- |
| Q8_0 | ~50% | Minimal | High VRAM, max quality |
| Q6_K | ~40% | Very low | Good balance |
| Q4_K_M | ~30% | Low | Most common, best balance |
| Q4_0 | ~25% | Moderate | Low VRAM systems |
| Q2_K | ~15% | High | Extreme VRAM constraints |

Q4_K_M is the sweet spot for most users — it fits 8B models on 6GB VRAM GPUs while retaining roughly 99% of the full-precision model’s benchmark scores.
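The size column follows from simple arithmetic: the size of the weights is roughly the parameter count times the bits stored per weight. The sketch below uses approximate bits-per-weight figures, which are assumptions (K-quants mix block types, so real files vary by model), so the results land close to the rounded table values rather than matching them exactly. It covers the weights only, not the context cache or runtime overhead.

```python
# Approximate bits per weight for common GGUF formats (assumed, not exact).
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q4_K_M": 4.85,
    "Q4_0": 4.5,
}


def weights_gb(params_billions: float, quant: str) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def ratio_to_fp16(quant: str) -> float:
    """Size of a quantized model relative to its FP16 version."""
    return BITS_PER_WEIGHT[quant] / BITS_PER_WEIGHT["FP16"]


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"{quant:7s} 8B ~ {weights_gb(8, quant):5.2f} GB "
              f"({ratio_to_fp16(quant):.0%} of FP16)")
```

For an 8B model at Q4_K_M this gives a little under 5 GB of weights, which is why the format fits on a 6 GB card with some room left for context.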

Performance Benchmarks

Real-world performance on common hardware (tokens per second, 8B Q4_K_M):

| Hardware | Tokens/sec (8B Q4_K_M) |
| --- | --- |
| M3 Pro (18GB unified) | ~55 t/s |
| RTX 4090 (24GB) | ~120 t/s |
| RTX 3080 (10GB) | ~45 t/s |
| RTX 3060 (12GB) | ~35 t/s |
| CPU only (Ryzen 9) | ~5 t/s |
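Numbers like these vary with drivers, context length, and prompt, so it is worth measuring your own machine. One way, sketched below, is to read the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama's native `/api/generate` endpoint returns with a non-streaming response. The prompt and the helper names are arbitrary choices for illustration.

```python
import json
import urllib.request


def tokens_per_second(result: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return result["eval_count"] / result["eval_duration"] * 1e9


def benchmark(model: str = "llama3.1:8b",
              prompt: str = "Write a short paragraph about GPUs.") -> float:
    """Run one non-streaming generation and return its tokens per second."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return tokens_per_second(json.load(response))


# Usage (requires a running Ollama server with the model pulled):
#   print(f"{benchmark():.1f} tokens/sec")
```

A single run is noisy, and the first call also includes model load time outside the measured generation, so averaging a few runs gives a steadier figure.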

Llama 3.1 vs 3.2: Which to Use?

  • Llama 3.1 8B/70B: Best for general text tasks, coding, reasoning. The 8B punches above its weight and is the go-to for most local setups.
  • Llama 3.2 1B/3B: Designed for on-device use — smartphones, embedded systems, edge deployments. Very fast but less capable on complex tasks.
  • Llama 3.2 11B/90B Vision: Adds multimodal (image understanding) capability. The 11B is practical for most GPU setups.

Tips for Getting the Best Results

  • Use a system prompt: Tell the model its role and expected output format
  • Adjust context length: Larger context = more VRAM; default 4096 is fine for most tasks
  • Enable GPU offloading: In LM Studio, set GPU layers to max for your VRAM budget
  • Use Modelfiles: In Ollama, create custom model configurations with default parameters

# Example Ollama Modelfile
cat > Modelfile << 'EOF'
FROM llama3.1:8b
SYSTEM "You are a helpful coding assistant. Always provide working code examples."
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
EOF

ollama create my-coder -f Modelfile
ollama run my-coder
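The Modelfile above raises `num_ctx` to 8192, and that extra context costs memory on top of the weights. As a back-of-the-envelope sketch, the key/value cache grows linearly with context length. The defaults below assume Llama 3.1 8B's published architecture (32 layers, 8 key/value heads, head dimension 128) and a 16-bit cache; a runtime may quantize the cache or allocate additional buffers, so treat the result as an estimate.

```python
def kv_cache_bytes(num_ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Memory for cached keys and values at a given context length."""
    # Factor 2 accounts for storing both a key and a value per head per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return num_ctx * per_token


if __name__ == "__main__":
    for ctx in (4096, 8192, 32768):
        print(f"num_ctx={ctx:6d} -> {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
```

Under these assumptions, doubling the context from 4096 to 8192 doubles the cache from about 0.5 GiB to about 1 GiB, which matters on cards where the weights already fill most of the VRAM.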

Running Llama locally has never been more accessible. With Ollama or LM Studio, you can have a powerful AI assistant running on your own hardware in under 10 minutes.
