Ollama has become the go-to tool for running large language models locally on your own hardware. It wraps complex model loading and inference into a simple CLI, so you can pull a model and start chatting in minutes — no cloud account, no API fees, no data leaving your machine. This guide walks you through everything from installation to advanced model management.
What Is Ollama?
Ollama is an open-source runtime that makes running LLMs locally as easy as running a Docker container. It handles model downloading, quantization, and hardware acceleration (CUDA, Metal, ROCm), and exposes both a CLI and a local REST API at http://localhost:11434. Models are stored in ~/.ollama/models by default (the Linux systemd service uses /usr/share/ollama/.ollama/models) and can be swapped instantly.
The project supports macOS, Linux, and Windows (via WSL2 or native installer), and works with NVIDIA GPUs, AMD GPUs, and Apple Silicon out of the box.
Installation
macOS and Windows
On macOS, install Ollama with Homebrew or download the installer from the official site:
# macOS — install via Homebrew
brew install ollama
# Or download the .dmg from https://ollama.com/download
For Windows, download the .exe installer from ollama.com/download. It installs as a background service that starts automatically.
Linux
curl -fsSL https://ollama.com/install.sh | sh
This script detects your GPU (NVIDIA or AMD) and installs the appropriate CUDA or ROCm libraries. After installation, Ollama runs as a systemd service.
Verify it’s running:
ollama --version
# ollama version 0.6.x
Running Your First Model
Pull and run a model with a single command:
ollama run llama3.2
On first run, Ollama downloads the model weights (Llama 3.2 3B is about 2 GB). Once downloaded, you drop into an interactive chat session. Type /bye or press Ctrl+D to exit.
Popular Models to Try
| Model | Size | Best For |
|---|---|---|
| llama3.2 | 2 GB | General chat, fast responses |
| llama3.1:70b | 40 GB | High-quality reasoning |
| mistral | 4 GB | Instruction following |
| codellama | 4 GB | Code generation |
| phi4 | 8 GB | Reasoning, math |
| deepseek-r1 | 4–70 GB | Chain-of-thought reasoning |
| gemma3 | 2–27 GB | Google’s efficient models |
| nomic-embed-text | 274 MB | Text embeddings for RAG |
# Pull without running
ollama pull mistral
# Run a specific parameter size
ollama run llama3.1:70b
# Run a single prompt non-interactively (the argument is a user prompt, not a system prompt)
ollama run codellama "Write a Python function to parse JSON"
Model Management
# List downloaded models
ollama list
# Show model details
ollama show llama3.2
# Remove a model to free disk space
ollama rm mistral
# Copy a model under a new name
ollama cp llama3.2 my-custom-llama
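The same inventory can be read programmatically, since the local REST API lists downloaded models at /api/tags. The sketch below is a minimal example, assuming the response is a JSON object with a `models` array whose entries carry `name` and `size` (in bytes). The helper names `summarize_models` and `fetch_models` are illustrative and not part of Ollama.

```python
import json
import urllib.request


def summarize_models(payload):
    """Return (name, size in GB) pairs from an /api/tags-style payload."""
    rows = []
    for entry in payload.get("models", []):
        size_gb = round(entry.get("size", 0) / 1e9, 1)
        rows.append((entry.get("name", "?"), size_gb))
    # Largest models first, since they free the most disk space when removed.
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_models(base_url="http://localhost:11434"):
    """Query a running Ollama server; raises URLError if it is not reachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return summarize_models(json.load(resp))
```

Calling `fetch_models()` against a running server returns the list sorted by size, which can help decide what to pass to `ollama rm`.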
Using the REST API
Ollama serves a native API under /api and an OpenAI-compatible API under /v1, which makes it a drop-in replacement for many tools:
# One-shot completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain quantum computing in one paragraph",
  "stream": false
}'
# Chat-style endpoint (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "What is Ollama?"}
    ]
  }'
You can point any OpenAI-compatible tool at http://localhost:11434/v1 and set the API key to anything; it is ignored. Tools with a native Ollama integration (LangChain, Open WebUI, Continue.dev) use http://localhost:11434 without the /v1 suffix.
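As a rough illustration of the OpenAI-compatible route, the snippet below builds and sends a chat request using only the Python standard library. It assumes a server on the default port and the usual OpenAI response shape (`choices[0].message.content`). The helpers `build_chat_request`, `extract_reply`, and `chat` are hypothetical names, and the API key is a placeholder value.

```python
import json
import urllib.request


def build_chat_request(messages, model="llama3.2",
                       base_url="http://localhost:11434/v1",
                       api_key="ollama"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Many clients insist on a key; the value here is a placeholder.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_reply(response_json):
    """Pull the assistant text out of an OpenAI-style response body."""
    return response_json["choices"][0]["message"]["content"]


def chat(messages, **kwargs):
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(messages, **kwargs)) as resp:
        return extract_reply(json.load(resp))
```

With a server running, `chat([{"role": "user", "content": "What is Ollama?"}])` should return the assistant's reply as a string.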
Creating Custom Models with Modelfiles
Ollama’s Modelfile system lets you customize models with system prompts, temperature, and parameter tweaks:
# Modelfile
FROM llama3.2
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
SYSTEM """
You are a senior cybersecurity analyst. You answer questions about penetration testing,
network security, and vulnerability research with technical precision.
"""
Build and run it:
ollama create security-expert -f Modelfile
ollama run security-expert
Modelfile Parameters
- temperature: Controls randomness; typical values are 0.0–1.0, and lower is more deterministic.
- num_ctx: Context window size in tokens. Larger windows use more VRAM.
- top_p: Nucleus sampling threshold.
- repeat_penalty: Penalizes repeated phrases.
- stop: Stop sequences that end generation.
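If you maintain several variants, generating the Modelfile text from code can keep them consistent. The following is a minimal sketch, assuming only the FROM, PARAMETER, and SYSTEM instructions shown above; `render_modelfile` is a hypothetical helper, and the stop sequence "User:" is an example value.

```python
def render_modelfile(base, system=None, **parameters):
    """Render Modelfile text from a base model, system prompt, and parameters."""
    lines = [f"FROM {base}"]
    for name, value in parameters.items():
        # A parameter such as `stop` may be a list; each value gets its own line.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            rendered = f'"{item}"' if isinstance(item, str) else item
            lines.append(f"PARAMETER {name} {rendered}")
    if system:
        lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.2",
    system="You are a senior cybersecurity analyst.",
    temperature=0.7,
    num_ctx=4096,
    stop=["User:"],
)
print(text)
```

Writing `text` to a file named Modelfile and running `ollama create security-expert -f Modelfile` would then build the model as described above.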
GPU Acceleration
Ollama auto-detects your GPU. Verify it’s being used:
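ollama run llama3.2
# In another terminal, check which models are loaded and where they run
ollama ps
# The PROCESSOR column shows the split, e.g. "100% GPU"
# The server log also reports how many layers were offloaded to the GPU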
VRAM Requirements
| Model Size | Minimum VRAM |
|---|---|
| 3B parameters | 4 GB |
| 7B parameters | 8 GB |
| 13B parameters | 10 GB |
| 70B parameters | 40 GB |
If you have less VRAM than needed, Ollama offloads layers to RAM, which is slower but still works.
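The figures in the table can be sanity-checked with simple arithmetic: quantized weights take roughly parameters × bits per weight ÷ 8 bytes. The sketch below assumes about 4.5 bits per weight for a 4-bit quantization and a fixed 1.5 GB allowance for cache and buffers. Both numbers are illustrative assumptions, not values published by Ollama, and the function names are hypothetical.

```python
def weights_gb(params_billion, bits_per_weight=4.5):
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    # params * bits / 8 gives bytes; with params in billions the result is in GB.
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion, vram_gb, overhead_gb=1.5):
    """Rough check: weights plus an assumed fixed overhead must fit in VRAM."""
    return weights_gb(params_billion) + overhead_gb <= vram_gb


for size in (3, 7, 13, 70):
    print(f"{size}B -> ~{weights_gb(size):.1f} GB of weights")
```

Under these assumptions a 7B model needs about 3.9 GB for weights alone, which is in line with the roughly 4 GB downloads listed earlier. Larger context windows add to the overhead term.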
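Force CPU-only mode by setting the num_gpu parameter (the number of layers offloaded to the GPU) to 0 inside an interactive session:
/set parameter num_gpu 0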
Running Ollama as a Server
By default Ollama only listens on localhost. To expose it on your network:
OLLAMA_HOST=0.0.0.0:11434 ollama serve
Or set it permanently via a systemd override on Linux:
sudo systemctl edit ollama.service
# Add under [Service]:
# Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl restart ollama
Python Integration
import ollama
# Simple generation
response = ollama.generate(model='llama3.2', prompt='Hello, world!')
print(response['response'])
# Chat with history
messages = [
    {'role': 'user', 'content': 'What is a firewall?'}
]
response = ollama.chat(model='llama3.2', messages=messages)
print(response['message']['content'])
# Streaming output
for chunk in ollama.generate(model='llama3.2', prompt='Count to 10', stream=True):
    print(chunk['response'], end='', flush=True)
Install the library with pip install ollama.
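For a conversation that spans several turns, the model only sees what is in the messages list, so each assistant reply has to be appended before the next request. Here is a minimal sketch of that loop. The `ask` helper is hypothetical, and the chat function is passed in as an argument so the logic can be tried without a running server.

```python
def ask(history, user_text, chat_fn, model="llama3.2"):
    """Send one user turn and record both sides of the exchange in `history`."""
    history.append({"role": "user", "content": user_text})
    response = chat_fn(model=model, messages=history)
    reply = response["message"]["content"]
    # Store the assistant turn so the next call sees the full conversation.
    history.append({"role": "assistant", "content": reply})
    return reply


# With the real client (requires a running server):
#   import ollama
#   history = []
#   ask(history, "What is a firewall?", ollama.chat)
#   ask(history, "How does it differ from an IDS?", ollama.chat)
```

Because the history grows with every turn, long sessions eventually approach the context window set by num_ctx.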
Useful Environment Variables
| Variable | Purpose |
|---|---|
| OLLAMA_MODELS | Change model storage location |
| OLLAMA_HOST | Bind address and port |
| OLLAMA_NUM_PARALLEL | Max simultaneous requests |
| OLLAMA_MAX_LOADED_MODELS | Models kept in memory |
| OLLAMA_FLASH_ATTENTION | Enable flash attention for speed |
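These variables are read by the server process, so they need to be set in the environment where `ollama serve` starts. The sketch below shows one way to assemble such an environment from Python and reject misspelled names. `KNOWN_VARS` simply mirrors the table above, and `server_env` is a hypothetical helper.

```python
import os

# Variable names taken from the table above.
KNOWN_VARS = {
    "OLLAMA_MODELS",
    "OLLAMA_HOST",
    "OLLAMA_NUM_PARALLEL",
    "OLLAMA_MAX_LOADED_MODELS",
    "OLLAMA_FLASH_ATTENTION",
}


def server_env(overrides, base=None):
    """Return a copy of the environment with validated Ollama settings applied."""
    env = dict(os.environ if base is None else base)
    for name, value in overrides.items():
        if name not in KNOWN_VARS:
            raise ValueError(f"unexpected variable: {name}")
        env[name] = str(value)  # environment values must be strings
    return env


env = server_env({"OLLAMA_HOST": "127.0.0.1:11434", "OLLAMA_NUM_PARALLEL": 2})
# To launch the server with these settings:
#   import subprocess
#   subprocess.run(["ollama", "serve"], env=env)
```

Validating the names up front catches typos that the server would otherwise silently ignore.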
Tips for Best Performance
- Use Q4_K_M quantization: The default for most models, balancing quality and speed well.
- Close other GPU-heavy apps: Games and video editors compete for VRAM.
- Increase context only when needed: A 128K context window uses far more VRAM than 4K.
- Use ollama ps: Shows loaded models and their VRAM usage in real time.
- Pair with Open WebUI: Adds a ChatGPT-like browser interface to your local Ollama server.
Ollama is the foundation of any local AI stack. Once it is running, you can layer tools like Open WebUI, Continue.dev for VS Code, or LangChain on top to build powerful, fully offline AI workflows that keep your data private.