Ollama has become the go-to tool for running large language models locally on your own hardware. It wraps complex model loading and inference into a simple CLI, so you can pull a model and start chatting in minutes — no cloud account, no API fees, no data leaving your machine. This guide walks you through everything from installation to advanced model management.

What Is Ollama?

Ollama is an open-source runtime that makes running LLMs locally as easy as running a Docker container. It handles model downloading, quantization, and hardware acceleration (CUDA, Metal, ROCm), and exposes both a CLI and a local REST API at http://localhost:11434. Models are stored in ~/.ollama/models by default (the Linux service install uses /usr/share/ollama/.ollama/models), and you can switch between them with a single command.

The project supports macOS, Linux, and Windows (via WSL2 or native installer), and works with NVIDIA GPUs, AMD GPUs, and Apple Silicon out of the box.

Installation

macOS and Windows

On macOS, install with Homebrew or download the .dmg from the official site:

# macOS — install via Homebrew
brew install ollama

# Or download the .dmg from https://ollama.com/download

For Windows, download the .exe installer from ollama.com/download. It installs as a background service that starts automatically.

Linux

curl -fsSL https://ollama.com/install.sh | sh

This script detects your GPU (NVIDIA or AMD) and installs the appropriate CUDA or ROCm libraries. After installation, Ollama runs as a systemd service.

Verify the installation:

ollama --version
# ollama version 0.6.x
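
Note that `ollama --version` only confirms the CLI is installed. To check that the background server is actually answering, a small script can probe the local API. The following is a minimal sketch using only the Python standard library; the helper name `is_server_up` is hypothetical, and it assumes the default address http://localhost:11434 mentioned earlier.

```python
import urllib.request


def is_server_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error, or malformed URL.
        return False
```

Calling `is_server_up()` with no arguments probes the default address and returns False rather than raising if nothing is listening there.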

Running Your First Model

Pull and run a model with a single command:

ollama run llama3.2

On first run, Ollama downloads the model weights (Llama 3.2 3B is about 2 GB). Once downloaded, you drop into an interactive chat session. Type /bye or press Ctrl+D to exit.

Popular Models to Try

Model	Size	Best For
`llama3.2`	2 GB	General chat, fast responses
`llama3.3:70b`	43 GB	High-quality reasoning
`mistral`	4 GB	Instruction following
`codellama`	4 GB	Code generation
`phi4`	9 GB	Reasoning, math
`deepseek-r1`	1–43 GB (1.5B to 70B variants)	Chain-of-thought reasoning
`gemma3`	0.8–17 GB	Google’s efficient models
`nomic-embed-text`	274 MB	Text embeddings for RAG

# Pull without running
ollama pull mistral

# Run a specific parameter size (Llama 3.2 ships in 1B and 3B text variants)
ollama run llama3.2:1b

# Pass a one-shot prompt (prints the answer and exits instead of opening a chat)
ollama run codellama "Write a Python function to parse JSON"

Model Management

# List downloaded models
ollama list

# Show model details
ollama show llama3.2

# Remove a model to free disk space
ollama rm mistral

# Copy a model under a new name
ollama cp llama3.2 my-custom-llama
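
The same inventory that `ollama list` prints is also available over the local REST API, which is handy in scripts. The sketch below assumes the `/api/tags` endpoint returns a JSON object with a `models` array whose entries carry `name` and `size` (in bytes); the function names are hypothetical helpers, not part of Ollama.

```python
import json
import urllib.request


def summarize_models(payload: dict) -> list:
    """Turn an /api/tags-style payload into (name, size in GB) pairs, largest first."""
    rows = [
        (m["name"], round(m["size"] / 1e9, 1))
        for m in payload.get("models", [])
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_models(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the installed-model list from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

With a server running, `summarize_models(fetch_models())` yields a list sorted by size, which makes it easy to spot the models worth removing with `ollama rm`.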

Using the REST API

Ollama serves two HTTP interfaces on the same port: its native API under /api, and an OpenAI-compatible API under /v1 that works as a drop-in replacement for many tools:

# One-shot completion (native API)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain quantum computing in one paragraph",
  "stream": false
}'

# Chat-style endpoint (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "What is Ollama?"}
    ]
  }'

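You can point any OpenAI-compatible tool (LangChain, Continue.dev, and similar) at http://localhost:11434/v1 and set the API key to any non-empty string; it is ignored. Tools with native Ollama support, such as Open WebUI, take the base address http://localhost:11434 instead.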
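
The curl call to `/api/generate` shown above translates directly to Python's standard library. The sketch below also shows what changes when `"stream"` is true: as far as I know, the server then sends newline-delimited JSON chunks, each with a `response` fragment and a `done` flag. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # default address used throughout this guide


def build_generate_request(model: str, prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build a POST request for the native /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def join_stream(lines) -> str:
    """Concatenate the 'response' fragments from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk is marked done
    return "".join(parts)
```

To use it against a running server, pass the request to `urllib.request.urlopen` and feed the response object to `join_stream`, since iterating an HTTP response yields one line at a time.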

Creating Custom Models with Modelfiles

Ollama’s Modelfile system lets you customize models with system prompts, temperature, and parameter tweaks:

# Modelfile
FROM llama3.2

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

SYSTEM """
You are a senior cybersecurity analyst. You answer questions about penetration testing,
network security, and vulnerability research with technical precision.
"""

Build and run it:

ollama create security-expert -f Modelfile
ollama run security-expert
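
If you maintain several custom models, generating the Modelfile from code keeps the variants consistent. This is a minimal sketch that only covers the three instructions used above (FROM, PARAMETER, SYSTEM); `render_modelfile` is a hypothetical helper, and it does not escape a system prompt that itself contains triple quotes.

```python
def render_modelfile(base: str, parameters: dict, system: str = "") -> str:
    """Render a Modelfile with FROM, PARAMETER, and an optional SYSTEM block."""
    lines = [f"FROM {base}", ""]
    for key, value in parameters.items():
        lines.append(f"PARAMETER {key} {value}")
    if system:
        # Triple quotes let the system prompt span multiple lines.
        lines += ["", f'SYSTEM """\n{system.strip()}\n"""']
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.2",
    {"temperature": 0.7, "top_p": 0.9, "num_ctx": 4096},
    "You are a senior cybersecurity analyst.",
)
print(text)
```

Write the returned text to a file named Modelfile and build it with the same `ollama create` command shown above.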

Modelfile Parameters

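temperature: Controls randomness. Typical values are 0.0–1.0; lower is more deterministic, and values above 1.0 are accepted but increasingly erratic.
num_ctx: Context window size in tokens. Larger windows use more VRAM.
top_p: Nucleus sampling threshold. Only the most likely tokens whose combined probability reaches this value are considered.
repeat_penalty: Penalizes repeated phrases.
stop: Stop sequences that end generation.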
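
To make temperature and top_p less abstract, here is a toy illustration of the general sampling technique on three made-up token scores. It is not Ollama's internal code, and the logits are invented for the example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; temperature must be greater than 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_indices(probs, top_p):
    """Indices of the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 0.5))
print(top_p_indices(softmax_with_temperature(logits, 1.0), 0.9))
```

Halving the temperature raises the top token's probability from roughly 0.67 to roughly 0.87, which is why lower values feel more deterministic. With top_p at 0.9, the least likely of the three tokens is dropped before sampling.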

GPU Acceleration

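Ollama auto-detects your GPU. To verify it is being used, load a model and then inspect it from a second terminal:

ollama run llama3.2
# The number of layers offloaded to the GPU is reported in the server log, not in the chat session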

# Check which models are loaded and GPU usage
ollama ps
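
The information behind `ollama ps` can also be read programmatically. The sketch below assumes the `/api/ps` endpoint returns a `models` array whose entries include `size` (total bytes loaded) and `size_vram` (bytes resident on the GPU); treat those field names as my understanding of the API rather than a guarantee. `gpu_fraction` is a hypothetical helper.

```python
import json
import urllib.request


def gpu_fraction(entry: dict) -> float:
    """Share of a loaded model that sits in VRAM, from 0.0 (CPU only) to 1.0 (fully on GPU)."""
    size = entry.get("size", 0)
    if size <= 0:
        return 0.0
    return min(entry.get("size_vram", 0) / size, 1.0)


def fetch_running(base_url: str = "http://localhost:11434") -> list:
    """Return the list of currently loaded models from a running server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])
```

A value below 1.0 means part of the model was placed in system RAM, which usually shows up as slower generation.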

VRAM Requirements

Model Size	Minimum VRAM
3B parameters	4 GB
7B parameters	8 GB
13B parameters	10 GB
70B parameters	40 GB

If you have less VRAM than needed, Ollama offloads layers to RAM, which is slower but still works.
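
A back-of-envelope estimate helps predict whether a model will fit before you download it. The sketch below assumes roughly 4.8 bits per weight for 4-bit quantized models and reserves a guessed 1.5 GB for context and runtime overhead; both numbers are assumptions for illustration, not values published by Ollama, and the function names are hypothetical.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billion * bits_per_weight / 8


def gpu_share(params_billion: float, vram_gb: float, overhead_gb: float = 1.5) -> float:
    """Rough fraction of the weights that fits in VRAM after reserving overhead."""
    usable = max(vram_gb - overhead_gb, 0.0)
    return min(usable / estimate_weights_gb(params_billion), 1.0)


for params, vram in [(3, 8), (13, 8), (70, 24)]:
    print(f"{params}B on {vram} GB VRAM: {gpu_share(params, vram):.0%} of weights on GPU")
```

Under these assumptions a 70B model needs about 42 GB for weights alone, so a 24 GB card would hold a bit over half of it and the remainder would run from system RAM.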

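Force CPU-only mode by setting num_gpu (the number of layers sent to the GPU) to 0 inside an interactive session:

ollama run llama3.2
>>> /set parameter num_gpu 0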

Running Ollama as a Server

By default Ollama only listens on localhost. To expose it on your network:

OLLAMA_HOST=0.0.0.0:11434 ollama serve

Or set it permanently via a systemd override on Linux:

sudo systemctl edit ollama.service
# Add under [Service]:
# Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl restart ollama
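
Client scripts on other machines then need the server's address. The sketch below normalizes an `OLLAMA_HOST`-style value into a base URL, on the assumption that clients commonly read the same variable to locate the server; `base_url_from_host` is a hypothetical helper, and 192.168.1.50 is a placeholder address.

```python
from typing import Optional
from urllib.parse import urlsplit


def base_url_from_host(host: Optional[str], default_port: int = 11434) -> str:
    """Normalize values like 'gpu-box' or '192.168.1.50:11434' into a full base URL."""
    if not host:
        return f"http://127.0.0.1:{default_port}"
    if "://" not in host:
        host = f"http://{host}"  # bare host[:port] values get an http scheme
    parts = urlsplit(host)
    port = parts.port or default_port
    return f"{parts.scheme}://{parts.hostname}:{port}"


print(base_url_from_host("192.168.1.50:11434"))
```

In a real script you would pass `os.environ.get("OLLAMA_HOST")` and use the result as the prefix for API calls such as `/api/generate`.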

Python Integration

import ollama

# Simple generation
response = ollama.generate(model='llama3.2', prompt='Hello, world!')
print(response['response'])

# Chat with history
messages = [
    {'role': 'user', 'content': 'What is a firewall?'}
]
response = ollama.chat(model='llama3.2', messages=messages)
print(response['message']['content'])

# Streaming output
for chunk in ollama.generate(model='llama3.2', prompt='Count to 10', stream=True):
    print(chunk['response'], end='', flush=True)

Install the library with pip install ollama.
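
The chat example above sends a single message. For a real multi-turn conversation, each reply has to be appended to the message list before the next call, because the model only sees what you send. Here is a minimal sketch; `chat_turn` and `default_send` are hypothetical helpers, and the sender is injectable so the logic can be exercised without a running server.

```python
def default_send(messages):
    """Send the full history to a local model via the ollama library."""
    import ollama  # imported lazily so chat_turn works without the library installed
    return ollama.chat(model="llama3.2", messages=messages)


def chat_turn(history, user_text, send=default_send):
    """Append the user message, call the model, record and return its reply."""
    history.append({"role": "user", "content": user_text})
    reply = send(history)["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply
```

Start with `history = []` and call `chat_turn(history, "What is a firewall?")` followed by `chat_turn(history, "How is that different from an IDS?")`; the second call carries the first exchange along as context.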

Useful Environment Variables

Variable	Purpose
`OLLAMA_MODELS`	Change model storage location
`OLLAMA_HOST`	Bind address and port
`OLLAMA_NUM_PARALLEL`	Max simultaneous requests
`OLLAMA_MAX_LOADED_MODELS`	Models kept in memory
`OLLAMA_FLASH_ATTENTION`	Enable flash attention for speed

Tips for Best Performance

Use Q4_K_M quantization — The default for most models, balancing quality and speed well.
Close other GPU-heavy apps — Games and video editors compete for VRAM.
Increase context only when needed — A 128K context window uses far more VRAM than 4K.
Use ollama ps — Shows loaded models and their VRAM usage in real time.
Pair with Open WebUI — Adds a ChatGPT-like browser interface to your local Ollama server.
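
The context-size tip can be quantified with the standard key/value cache formula: two tensors (keys and values) per layer, per KV head, per token. The architecture numbers below (28 layers, 8 KV heads, head dimension 128, 16-bit cache) are my assumptions for a Llama 3.2 3B class model, so treat the result as an order-of-magnitude sketch.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Memory needed to cache keys and values for every token in the context."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens


for ctx in (4096, 131072):
    size = kv_cache_bytes(28, 8, 128, ctx)
    print(f"{ctx:>7} tokens: {size / 2**20:,.0f} MiB")
```

Under these assumptions the cache grows from 448 MiB at a 4K context to 14 GiB at 128K, several times the size of the model weights themselves.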

Ollama is the foundation of any local AI stack. Once it is running, you can layer tools like Open WebUI, Continue.dev for VS Code, or LangChain on top to build powerful, fully offline AI workflows that keep your data private.