Running AI models locally on your own hardware used to require a PhD and a server rack. In 2026, you can have a capable AI assistant running privately on a mid-range gaming PC in under 10 minutes using Ollama.
No API keys. No usage fees. No data sent to third-party servers. Your conversations stay on your machine.
Why Run AI Locally?
Before diving in, here’s why local models beat cloud services for many use cases:
- Privacy: Your prompts, documents, and queries never leave your machine
- No rate limits: Run as many queries as your hardware allows
- Offline access: Works without internet — useful for travel, air-gapped environments, or unstable connections
- Cost: Free after the initial hardware investment
- Customisation: Fine-tune models on your own data, create custom system prompts, build local tools
The tradeoff: local models require decent hardware and aren’t quite as capable as GPT-4o or Claude 3.5 Sonnet for complex reasoning. For most everyday tasks — summarisation, code review, writing assistance, Q&A — they’re excellent.
Hardware Requirements
| Tier | GPU | VRAM | What You Can Run |
|---|---|---|---|
| Budget | RTX 3060 / RX 6700 | 8–12 GB | 7B and 8B parameter models |
| Mid-range | RTX 4070 / RX 7800 XT | 12–16 GB | 13B models, quantised 34B |
| High-end | RTX 4090 / RX 7900 XTX | 24 GB | 34B models comfortably |
| Workstation | 2× RTX 4090 / A6000 | 48 GB+ | 70B models |
No GPU? Ollama also runs on CPU, just much slower. An 8B model on a modern CPU (Ryzen 7, Core i7) takes 5–15 seconds per response — usable, not fast.
Apple Silicon Macs (M1 and later) use unified memory, so the GPU can draw on system RAM, and they run these models exceptionally well without a discrete GPU.
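The tiers in the table follow from simple arithmetic: memory is roughly parameter count times bytes per weight, plus some headroom. The sketch below is a rough estimator only. The 4.5 bits per weight and the 20% overhead are assumptions chosen for illustration; real usage varies with the quantisation format and context length.

```python
def estimate_model_memory_gb(params_billion, bits_per_weight=4.5, overhead=1.2):
    """Rough memory estimate in GB for loading a model.

    bits_per_weight=4.5 approximates a 4-bit quantised model and
    overhead=1.2 leaves 20% headroom for the KV cache and runtime.
    Both values are illustrative assumptions, not measured figures.
    """
    # One billion parameters at 8 bits each is about 1 GB of weights.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * overhead


if __name__ == "__main__":
    for size in (8, 13, 34, 70):
        estimate = estimate_model_memory_gb(size)
        print(f"{size}B at ~4.5 bits per weight: roughly {estimate:.1f} GB")
```

Passing `bits_per_weight=16, overhead=1.0` gives the unquantised 16-bit weight size for comparison.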
Step 1 — Install Ollama
Windows
Download the installer from ollama.com (the older ollama.ai address redirects there) and run it. Ollama installs as a background service.
Alternatively, via winget:
winget install Ollama.Ollama
Linux
curl -fsSL https://ollama.com/install.sh | sh
This installs the ollama binary and configures a systemd service.
macOS
brew install ollama
Or download the .dmg from the website.
Step 2 — Start Ollama
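On Windows and macOS, the desktop app starts Ollama automatically after installation. You’ll see it in the system tray (Windows) or the menu bar (macOS). If you installed only the command-line tool with Homebrew, start the server yourself with ollama serve.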
On Linux, start the service:
sudo systemctl start ollama
sudo systemctl enable ollama # Auto-start on boot
Check the install. This prints the client version and warns if it cannot reach a running server:
ollama --version
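To confirm the server itself is answering, rather than just the CLI being installed, you can query the HTTP API. This is a minimal sketch that assumes the default address localhost:11434 and the /api/version endpoint described in Ollama's API documentation; the function name is made up for this example.

```python
import json
import urllib.error
import urllib.request


def ollama_version(base_url="http://localhost:11434", timeout=2.0):
    """Return the server's version string, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            data = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON reply.
        return None
    return data.get("version") if isinstance(data, dict) else None


if __name__ == "__main__":
    version = ollama_version()
    print(f"Ollama server version: {version}" if version else "Ollama server not reachable")
```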
Step 3 — Pull Your First Model
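The ollama pull command downloads a model from the Ollama library. Models are stored in ~/.ollama/models on macOS, in C:\Users\<you>\.ollama\models on Windows, and in /usr/share/ollama/.ollama/models when Ollama runs as the Linux systemd service.
# Fast, capable all-rounder (3B parameters by default, about 2 GB)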
ollama pull llama3.2
# Excellent for coding (3.8 GB)
ollama pull codellama
# Very fast, great for quick tasks (4.1 GB)
ollama pull mistral
# Strong reasoning, 27B parameters (15 GB)
ollama pull gemma2:27b
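Once a few models are pulled, it helps to see what is taking up disk space. The sketch below reads the /api/tags endpoint, which lists local models with their size in bytes, and sorts them largest first. The helper names are hypothetical and the default server address is assumed.

```python
import json
import urllib.request


def fetch_tags(base_url="http://localhost:11434", timeout=5.0):
    """Fetch the list of locally installed models from a running server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def summarise_models(tags_response):
    """Turn an /api/tags response into (name, size in GB) pairs, largest first."""
    models = tags_response.get("models", [])
    rows = [(m["name"], round(m["size"] / 1e9, 1)) for m in models]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, size_gb in summarise_models(fetch_tags()):
        print(f"{name:40s} {size_gb:6.1f} GB")
```

The same information is available on the command line with ollama list.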
Step 4 — Run a Model
Interactive chat
ollama run llama3.2
This drops you into a terminal chat interface. Type your prompt and press Enter. Type /bye to exit.
Single query
ollama run mistral "Explain the difference between TCP and UDP in simple terms"
Pipe input
cat suspicious_script.py | ollama run codellama "Review this code for security vulnerabilities"
This is incredibly useful for security research — pipe log files, code, or config files directly into the model for analysis.
Step 5 — Use the API
Ollama exposes a REST API on localhost:11434. Its native endpoints live under /api, and an OpenAI-compatible layer is served under /v1. This means most tools built for OpenAI (LangChain, Open WebUI, the Continue IDE plugin, etc.) work with Ollama with little more than a base-URL change.
# Basic completion
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "What is SQL injection?",
"stream": false
}'
# Chat format (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "Explain XSS in 3 sentences"}
]
}'
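The same two calls can be made from Python with only the standard library. This is a sketch under the assumptions used above (default port, llama3.2 already pulled); the function names are invented for the example. The native endpoint returns its text in a "response" field, while the OpenAI-compatible one nests it under choices.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # default address; adjust if yours differs


def build_request(path, payload, base_url=BASE_URL):
    """Build a JSON POST request for the given API path."""
    return urllib.request.Request(
        f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path, payload, base_url=BASE_URL, timeout=120):
    """Send the request and decode the JSON reply."""
    with urllib.request.urlopen(build_request(path, payload, base_url), timeout=timeout) as resp:
        return json.load(resp)


def generate(prompt, model="llama3.2"):
    """Native endpoint: non-streaming completion."""
    reply = post_json("/api/generate", {"model": model, "prompt": prompt, "stream": False})
    return reply["response"]


def chat(content, model="llama3.2"):
    """OpenAI-compatible endpoint: single-turn chat."""
    payload = {"model": model, "messages": [{"role": "user", "content": content}]}
    reply = post_json("/v1/chat/completions", payload)
    return reply["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(generate("What is SQL injection?"))
    print(chat("Explain XSS in 3 sentences"))
```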
Best Models for Each Use Case
General use / writing
ollama pull llama3.2 # Meta's Llama 3.2 (3B by default): best small all-rounder
ollama pull gemma2 # Google's Gemma 2 9B — excellent instruction following
Coding and security research
ollama pull codellama # Meta's CodeLlama — code generation and review
ollama pull deepseek-coder # DeepSeek Coder — strong at low-level and exploit code
ollama pull qwen2.5-coder # Qwen 2.5 Coder — surprisingly capable at 7B
Fast responses / low VRAM
ollama pull mistral # Mistral 7B — fast and punches above its weight
ollama pull phi3:mini # Microsoft Phi-3 Mini 3.8B — tiny but surprisingly smart
Long context (documents, large codebases)
ollama pull llama3.3:70b # Needs 40+ GB VRAM, but handles 128K context
ollama pull qwen2.5:72b # Alibaba's Qwen 2.5 72B — strong long-context reasoning
Add a Web UI — Open WebUI
The terminal is fine, but most people prefer a chat interface. Open WebUI is a self-hosted, ChatGPT-style interface for Ollama.
Install with Docker
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Then open http://localhost:3000 in your browser.
Install without Docker (Python)
pip install open-webui
open-webui serve
Open WebUI gives you:
- Multi-model chat with conversation history
- File upload and document Q&A (RAG)
- Image generation (with Stable Diffusion integration)
- Model management from the browser
Privacy Use Cases
Local AI is especially powerful for tasks you’d never want to send to a cloud service:
Security research:
cat malware_sample.js | ollama run codellama "What does this script do? Is it malicious?"
Log analysis:
tail -n 500 /var/log/auth.log | ollama run mistral "Summarise any suspicious authentication events"
Document review:
cat confidential_report.txt | ollama run llama3.2 "Summarise the key points in 5 bullet points"
Code review before committing:
git diff | ollama run codellama "Review this diff for bugs, security issues, and code quality problems"
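The diff review can also be scripted, which makes it easier to cap the input size before it reaches the model. The sketch below builds the prompt and the request body for /api/generate. The 12,000-character limit is an arbitrary assumption, not a property of any model, and the function names are hypothetical.

```python
import subprocess

MAX_CHARS = 12_000  # assumed budget to stay within a small model's context


def working_tree_diff():
    """Return the output of `git diff` for the current repository."""
    result = subprocess.run(["git", "diff"], capture_output=True, text=True, check=True)
    return result.stdout


def build_review_prompt(diff_text, max_chars=MAX_CHARS):
    """Prefix the diff with the review instruction, truncating very large diffs."""
    if len(diff_text) > max_chars:
        diff_text = diff_text[:max_chars] + "\n[diff truncated]"
    instruction = "Review this diff for bugs, security issues, and code quality problems:"
    return f"{instruction}\n\n{diff_text}"


def build_review_payload(diff_text, model="codellama"):
    """Request body for POST /api/generate on the local server."""
    return {"model": model, "prompt": build_review_prompt(diff_text), "stream": False}


if __name__ == "__main__":
    payload = build_review_payload(working_tree_diff())
    print(f"Prompt length: {len(payload['prompt'])} characters")
```

Because the request goes to localhost, the diff never leaves the machine, which is the point of this section.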
Performance Tips
Use quantised models: Most default tags in the Ollama library are 4-bit quantised (typically Q4_K_M), which need roughly 70% less memory than 16-bit weights with minimal quality loss.
Keep models in VRAM: The first query after loading a model is slow (loading from disk). Subsequent queries are fast. Keep your most-used model running in the background.
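Ollama's API documentation describes a keep_alive request field (default 5m) that controls how long a model stays loaded after a request, and notes that a request with no prompt simply loads the model. A sketch of using both to warm a model up follows; the helper names and the 30-minute value are illustrative choices.

```python
import json
import urllib.request


def keep_alive_payload(model, keep_alive="30m"):
    """Body for POST /api/generate with no prompt.

    Loads the model and sets how long it stays in memory afterwards.
    Per the API docs, 0 unloads immediately and -1 keeps it loaded.
    """
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def send(payload, base_url="http://localhost:11434", timeout=300):
    """POST the payload to the local server and return the decoded reply."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    send(keep_alive_payload("llama3.2"))  # preload and keep for 30 minutes
```

The OLLAMA_KEEP_ALIVE environment variable sets the same behaviour server-wide.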
Use GPU offloading: If a model is slightly too large for your VRAM, Ollama will automatically offload some layers to CPU. It’s slower but works.
Set concurrency: For API use, Ollama handles multiple requests but defaults to one at a time. Set OLLAMA_NUM_PARALLEL=2 for concurrent requests on high-VRAM GPUs.
# Windows (PowerShell): quit the tray app first, then set the variable and start the server
$env:OLLAMA_NUM_PARALLEL = "2"
ollama serve
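On the client side, concurrent requests only help if something actually sends them in parallel. A minimal sketch with a thread pool is shown here; the worker count of 2 mirrors the OLLAMA_NUM_PARALLEL=2 setting above, and both function names are invented for the example.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def ask_ollama(prompt, model="llama3.2", base_url="http://localhost:11434"):
    """Send one non-streaming prompt to the local server and return the text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=300) as resp:
        return json.load(resp)["response"]


def run_parallel(ask, prompts, workers=2):
    """Call ask(prompt) for each prompt on a small thread pool.

    Results come back in the same order as the prompts.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ask, prompts))


if __name__ == "__main__":
    answers = run_parallel(ask_ollama, ["What is TCP?", "What is UDP?"])
    for answer in answers:
        print(answer[:200])
```

Raising workers above the server's parallel limit just queues the extra requests.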
Troubleshooting
Model downloads slowly: Ollama downloads from Cloudflare-hosted mirrors. If your ISP throttles large downloads, try at a different time or use a VPN.
GPU not detected: Ensure you have up-to-date GPU drivers. For NVIDIA, you need CUDA 11.8+. Check with nvidia-smi. For AMD on Linux, ROCm support depends on your GPU generation.
Out of memory errors: Use a smaller model or a more aggressively quantised version:
ollama pull llama3.1:8b-instruct-q4_K_S # A smaller 4-bit build of the 8B model
Slow on CPU: This is expected. CPU inference runs at ~5–20 tokens/second vs 50–150 tokens/second on a mid-range GPU. Upgrade your hardware or use a cloud API for latency-sensitive tasks.
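To see where your own machine falls in those ranges, use the timing fields in the /api/generate reply: eval_count is the number of tokens generated and eval_duration is the generation time in nanoseconds. The sketch below converts them to tokens per second and estimates how long an answer would take. The 300-token answer length is an assumption for illustration.

```python
def tokens_per_second(reply):
    """Generation speed from an /api/generate reply (eval_duration is in ns)."""
    seconds = reply["eval_duration"] / 1e9
    return reply["eval_count"] / seconds


def seconds_for(tokens, rate):
    """Time in seconds to generate `tokens` tokens at `rate` tokens per second."""
    return tokens / rate


if __name__ == "__main__":
    # An assumed 300-token answer at the CPU and GPU rates quoted above.
    for rate in (5, 20, 50, 150):
        print(f"{rate:>3} tokens/s: about {seconds_for(300, rate):.0f} s for 300 tokens")
```

Running ollama run with the --verbose flag prints similar timing statistics after each response.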
Conclusion
Ollama makes local AI genuinely accessible. Whether you’re a security researcher who needs to analyse sensitive data privately, a developer who wants AI assistance without code leaving your machine, or just someone curious about LLMs without paying subscription fees — Ollama is the best starting point in 2026.
Start with llama3.2 for general use, add codellama if you work with code, and front it with Open WebUI for a polished experience. The whole stack is free, open-source, and runs on hardware you already own.