How to Run Local AI on Your PC: Complete Ollama Guide 2026

Running AI models locally on your own hardware used to require a PhD and a server rack. In 2026, you can have a capable AI assistant running privately on a mid-range gaming PC in under 10 minutes using Ollama.

No API keys. No usage fees. No data sent to third-party servers. Your conversations stay on your machine.

Why Run AI Locally?

Before diving in, here’s why local models beat cloud services for many use cases:

Privacy: Your prompts, documents, and queries never leave your machine
No rate limits: Run as many queries as your hardware allows
Offline access: Works without internet — useful for travel, air-gapped environments, or unstable connections
Cost: Free after the initial hardware investment
Customisation: Fine-tune models on your own data, create custom system prompts, build local tools

The tradeoff: local models require decent hardware and aren’t quite as capable as GPT-4o or Claude 3.5 Sonnet for complex reasoning. For most everyday tasks — summarisation, code review, writing assistance, Q&A — they’re excellent.

Hardware Requirements

Tier	GPU	VRAM	What You Can Run
Budget	RTX 3060 / RX 6700	8–12 GB	7B and 8B parameter models
Mid-range	RTX 4070 / RX 7800 XT	12–16 GB	13B models; quantised 34B with partial CPU offload
High-end	RTX 4090 / RX 7900 XTX	24 GB	34B models comfortably
Workstation	2× RTX 4090 / A6000	48 GB+	70B models
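The tiers above follow from simple arithmetic: weight memory is roughly the parameter count times the bits per weight, divided by 8. The sketch below is a rough estimate, not a measurement. The 4.5 bits per weight (approximating a 4-bit quantised model) and the 20% overhead for context cache and runtime buffers are assumptions chosen for illustration, and estimate_vram_gb is a hypothetical helper, not part of Ollama.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.5,
                     overhead: float = 0.20) -> float:
    """Rough memory estimate in GB for a quantised model.

    bits_per_weight and overhead are illustrative assumptions,
    not values reported by Ollama.
    """
    # Billions of parameters x bits per weight / 8 bits per byte gives gigabytes.
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb * (1 + overhead), 1)


for size in (8, 13, 34, 70):
    print(f"{size}B model: ~{estimate_vram_gb(size)} GB")
```

Under these assumptions an 8B model needs a little over 5 GB, a 34B model about 23 GB, and a 70B model about 47 GB, which is why 34B is a tight fit on a 24 GB card and 70B calls for 48 GB.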

No GPU? Ollama also runs on CPU, just much slower. An 8B model on a modern CPU (Ryzen 7, Core i7) takes roughly 5–15 seconds for a short response: usable, not fast.

Apple Silicon Macs (M1 and later) use unified memory shared between the CPU and GPU, so they run these models well without a discrete GPU. The largest model you can load depends on how much memory the Mac has.

Step 1 — Install Ollama

Windows

Download the installer from ollama.com and run it. Ollama installs as a background service.

Alternatively, via winget:

winget install Ollama.Ollama

Linux

curl -fsSL https://ollama.com/install.sh | sh

This installs the ollama binary and configures a systemd service.

macOS

brew install ollama

Or download the .dmg from the website.

Step 2 — Start Ollama

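On Windows and macOS, the desktop app starts Ollama automatically after installation. You’ll see its icon in the system tray (Windows) or the menu bar (macOS). If you installed the command-line version with Homebrew, start it yourself with brew services start ollama or ollama serve.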

On Linux, start the service:

sudo systemctl start ollama
sudo systemctl enable ollama  # Auto-start on boot

Verify it’s running:

ollama --version
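To check from a script that the server itself is answering, not just that the binary is installed, you can query the API's version endpoint. This is a minimal sketch assuming the default address localhost:11434; ollama_version is a hypothetical helper name.

```python
import json
import urllib.error
import urllib.request
from typing import Optional

OLLAMA_URL = "http://localhost:11434"  # default address; change if you moved it


def ollama_version(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> Optional[str]:
    """Return the server's version string, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a reply that is not valid JSON.
        return None


if __name__ == "__main__":
    version = ollama_version()
    print(f"Ollama {version} is running" if version else "Ollama is not reachable")
```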

Step 3 — Pull Your First Model

The ollama pull command downloads a model from the Ollama library. Models are stored in ~/.ollama/models on macOS, C:\Users\<you>\.ollama\models on Windows, and /usr/share/ollama/.ollama/models on Linux when Ollama runs as the systemd service.

# Fast, compact all-rounder (3B parameters, about 2 GB)
ollama pull llama3.2

# Excellent for coding (3.8 GB)
ollama pull codellama

# Very fast, great for quick tasks (4.1 GB)
ollama pull mistral

# Strong reasoning, 27B parameters (about 16 GB)
ollama pull gemma2:27b
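Once a few models are pulled, ollama list shows what is on disk. The same information is available from the API's /api/tags endpoint, which is handy in scripts. The sketch below assumes the default address; summarise_models is a hypothetical helper, and the sample data is hand-written in the shape that endpoint returns, with illustrative sizes.

```python
import json
import urllib.request


def fetch_tags(base_url: str = "http://localhost:11434", timeout: float = 5.0) -> dict:
    """Fetch the list of locally stored models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def summarise_models(tags: dict) -> list:
    """Turn an /api/tags reply into (name, size in GB) pairs, largest first."""
    rows = [(m["name"], round(m["size"] / 1e9, 1)) for m in tags.get("models", [])]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hand-written sample so the sketch runs without a server; sizes are illustrative.
sample = {"models": [{"name": "llama3.2:latest", "size": 2_000_000_000},
                     {"name": "mistral:latest", "size": 4_100_000_000}]}

for name, size_gb in summarise_models(sample):
    print(f"{name:<20} {size_gb} GB")
```

With a server running, summarise_models(fetch_tags()) gives the same view of your real models.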

Step 4 — Run a Model

Interactive chat

ollama run llama3.2

This drops you into a terminal chat interface. Type your prompt and press Enter. Type /bye to exit.

Single query

ollama run mistral "Explain the difference between TCP and UDP in simple terms"

Pipe input

cat suspicious_script.py | ollama run codellama "Review this code for security vulnerabilities"

This is incredibly useful for security research — pipe log files, code, or config files directly into the model for analysis.

Step 5 — Use the API

Ollama exposes a REST API on localhost:11434. Its native endpoints live under /api, and it also serves OpenAI-compatible endpoints under /v1. This means many tools built for OpenAI (LangChain, Open WebUI, the Continue IDE plugin, etc.) work with Ollama after little more than a change of base URL.

# Basic completion (native Ollama API)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "What is SQL injection?",
  "stream": false
}'

# Chat format (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "Explain XSS in 3 sentences"}
    ]
  }'
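The same two calls can be made from a script. The following is a minimal sketch using only the Python standard library, assuming the default address and a model you have already pulled. The helper names (post_json, generate, chat, chat_text) are hypothetical, and error handling is left out for brevity.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # default Ollama address


def post_json(path: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST a JSON body to the local Ollama server and decode the JSON reply."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


def generate(model: str, prompt: str) -> str:
    """Native endpoint: with streaming off, the text arrives in the "response" field."""
    reply = post_json("/api/generate", {"model": model, "prompt": prompt, "stream": False})
    return reply["response"]


def chat_text(reply: dict) -> str:
    """Extract the assistant text from an OpenAI-style chat completion reply."""
    return reply["choices"][0]["message"]["content"]


def chat(model: str, user_message: str) -> str:
    """OpenAI-compatible endpoint under /v1."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_message}]}
    return chat_text(post_json("/v1/chat/completions", payload))


# Offline check of the parsing step, using a hand-written reply in the OpenAI chat shape.
sample_reply = {"choices": [{"message": {"role": "assistant", "content": "Example answer"}}]}
print(chat_text(sample_reply))
```

With the server running, generate("llama3.2", "What is SQL injection?") and chat("llama3.2", "Explain XSS in 3 sentences") mirror the two curl commands above.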

Best Models for Each Use Case

General use / writing

ollama pull llama3.2        # Meta's Llama 3.2 (3B by default), a compact all-rounder
ollama pull gemma2          # Google's Gemma 2 9B — excellent instruction following

Coding and security research

ollama pull codellama       # Meta's CodeLlama — code generation and review
ollama pull deepseek-coder  # DeepSeek Coder — strong at low-level and exploit code
ollama pull qwen2.5-coder   # Qwen 2.5 Coder — surprisingly capable at 7B

Fast responses / low VRAM

ollama pull mistral         # Mistral 7B — fast and punches above its weight
ollama pull phi3:mini       # Microsoft Phi-3 Mini 3.8B — tiny but surprisingly smart

Long context (documents, large codebases)

ollama pull llama3.3:70b    # Needs 40+ GB VRAM, but handles 128K context
ollama pull qwen2.5:72b     # Alibaba's Qwen 2.5 72B — strong long-context reasoning

Add a Web UI — Open WebUI

The terminal is fine, but most people prefer a chat interface. Open WebUI is a self-hosted, ChatGPT-style interface for Ollama.

Install with Docker

docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 in your browser.

Install without Docker (Python)

pip install open-webui
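open-webui serve

This route serves the interface on http://localhost:8080 by default. Open WebUI's documentation recommends Python 3.11 for the pip install.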

Open WebUI gives you:

Multi-model chat with conversation history
File upload and document Q&A (RAG)
Image generation (with Stable Diffusion integration)
Model management from the browser

Privacy Use Cases

Local AI is especially powerful for tasks you’d never want to send to a cloud service:

Security research:

cat malware_sample.js | ollama run codellama "What does this script do? Is it malicious?"

Log analysis:

tail -n 500 /var/log/auth.log | ollama run mistral "Summarise any suspicious authentication events"

Document review:

cat confidential_report.txt | ollama run llama3.2 "Summarise the key points in 5 bullet points"

Code review before committing:

git diff | ollama run codellama "Review this diff for bugs, security issues, and code quality problems"

Performance Tips

Use quantised models: Most default tags in the Ollama library are 4-bit quantised (typically Q4_K_M or Q4_0), which need roughly 70% less VRAM than 16-bit weights with minimal quality loss.

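Keep models in VRAM: The first query after loading a model is slow because the weights are read from disk; subsequent queries are fast. By default Ollama unloads a model after about five minutes of inactivity, so set the OLLAMA_KEEP_ALIVE environment variable (for example, to 1h) if you want your most-used model to stay loaded.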
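One way to keep a model resident from a script is the keep_alive field that the API accepts on each request. Sending a request with no prompt asks the server only to load the model. The sketch below assumes the default address; preload and preload_payload are hypothetical helper names, and the 30-minute value is an arbitrary example.

```python
import json
import urllib.request


def preload_payload(model: str, keep_alive: str = "30m") -> dict:
    """Request body that loads `model` without generating any text.

    Leaving out "prompt" asks the server only to load the model;
    keep_alive sets how long it stays in memory after the last request.
    """
    return {"model": model, "keep_alive": keep_alive}


def preload(model: str, keep_alive: str = "30m",
            base_url: str = "http://localhost:11434") -> None:
    """Load a model ahead of time so the first real query is fast."""
    body = json.dumps(preload_payload(model, keep_alive)).encode("utf-8")
    request = urllib.request.Request(f"{base_url}/api/generate", data=body,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=300) as resp:
        resp.read()  # returns once the server has handled the load request


# Show the body that would be sent; calling preload("mistral") needs a running server.
print(json.dumps(preload_payload("mistral")))
```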

Use GPU offloading: If a model is slightly too large for your VRAM, Ollama will automatically offload some layers to CPU. It’s slower but works.

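Set concurrency: Ollama can serve several requests to one model at once. How many is controlled by OLLAMA_NUM_PARALLEL, whose default varies with the Ollama version and available memory. Set OLLAMA_NUM_PARALLEL=2 explicitly to allow two concurrent requests on a high-VRAM GPU; each parallel slot needs extra memory for its own context.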

# Windows (PowerShell): quit the Ollama tray app first, then set the variable for this session
$env:OLLAMA_NUM_PARALLEL = "2"
ollama serve
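The server-side setting only helps if the client actually sends requests at the same time. A small sketch of the client half, using a thread pool: run_concurrently is a hypothetical helper, and send stands for whatever function you use to call the API. A stand-in is used here so the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_concurrently(prompts: List[str], send: Callable[[str], str],
                     max_workers: int = 2) -> List[str]:
    """Send prompts in parallel and return replies in the same order as the prompts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


def fake_send(prompt: str) -> str:
    """Stand-in for a real API call, so the sketch runs offline."""
    return prompt.upper()


print(run_concurrently(["what is xss?", "what is csrf?"], fake_send))
# prints ['WHAT IS XSS?', 'WHAT IS CSRF?']
```

Matching max_workers to OLLAMA_NUM_PARALLEL avoids queueing more requests than the server will process at once.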

Troubleshooting

Model downloads slowly: Ollama downloads from Cloudflare-hosted mirrors. If your ISP throttles large downloads, try at a different time or use a VPN.

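GPU not detected: Ensure you have up-to-date GPU drivers. For NVIDIA, Ollama ships its own CUDA libraries, so what matters is a recent driver rather than a separate CUDA toolkit install; check that nvidia-smi lists your card. For AMD on Linux, ROCm support depends on your GPU generation.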

Out of memory errors: Use a smaller model or a more aggressively quantised version:

ollama pull llama3.1:8b-instruct-q4_K_S  # 8B at Q4_K_S, slightly smaller than the default Q4_K_M

Slow on CPU: This is expected. CPU inference runs at ~5–20 tokens/second vs 50–150 tokens/second on a mid-range GPU. Upgrade your hardware or use a cloud API for latency-sensitive tasks.
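To turn those rates into waiting time, divide the length of the answer by the generation speed. The sketch below uses the token rates quoted above and an assumed 100-token answer. It ignores model load time and prompt processing, so real responses take somewhat longer.

```python
def response_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to generate an answer, ignoring model load and prompt processing."""
    return output_tokens / tokens_per_second


ANSWER_TOKENS = 100  # assumed length of a short answer

for label, rate in (("CPU, slow end", 5), ("CPU, fast end", 20),
                    ("GPU, slow end", 50), ("GPU, fast end", 150)):
    print(f"{label}: {response_seconds(ANSWER_TOKENS, rate):.1f} s")
```

At these rates a 100-token answer takes 5 to 20 seconds on CPU and about 2 seconds or less on a mid-range GPU.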

Conclusion

Ollama makes local AI genuinely accessible. Whether you’re a security researcher who needs to analyse sensitive data privately, a developer who wants AI assistance without code leaving your machine, or just someone curious about LLMs without paying subscription fees — Ollama is the best starting point in 2026.

Start with llama3.2 for general use, add codellama if you work with code, and front it with Open WebUI for a polished experience. The whole stack is free, open-source, and runs on hardware you already own.