Running Gemma 3 Locally: Google’s Open AI Model Guide

Gemma 3 is Google’s family of open-weight language models released in early 2025 and updated throughout the year. Built on the same research that powers Gemini, Gemma 3 models punch well above their weight class — the 12B model regularly outperforms models twice its size on standard benchmarks. This guide covers everything you need to know to run Gemma 3 locally, from the smallest 1B variant to the capable 27B model.

What Is Gemma 3?

Gemma 3 is a family of decoder-only transformer models available in four sizes: 1B, 4B, 12B, and 27B parameters. Key characteristics:

Architecture: Transformer decoder with grouped-query attention
Context window: 128K tokens (32K for the 1B model)
Multimodal: The 4B, 12B, and 27B models support image input
License: Gemma Terms of Use (allows commercial use with attribution, restrictions apply)
Training: Trained on up to 14 trillion tokens (the 27B model; smaller sizes use fewer), including web text, code, and mathematics

Gemma 3 uses a tokenizer with a vocabulary of roughly 256K entries (vs. 32K in older open models such as Llama 2), which gives it better multilingual performance and more efficient tokenization for technical content.

Model Sizes and Capabilities

Model	Parameters	Context	Multimodal	Best Use Case
Gemma 3 1B	1B	32K	No	Edge devices, embedded, quick tasks
Gemma 3 4B	4B	128K	Yes	Laptop, Raspberry Pi 5, light tasks
Gemma 3 12B	12B	128K	Yes	Main workstation, best balance
Gemma 3 27B	27B	128K	Yes	High-end GPU, maximum quality

The 12B model is widely considered the sweet spot — it delivers near-27B quality on many benchmarks while being accessible to users with a single consumer GPU.

VRAM Requirements

Model	Quantization	VRAM Required
Gemma 3 1B	Q4_K_M	1 GB
Gemma 3 4B	Q4_K_M	3 GB
Gemma 3 4B	Q8_0	5 GB
Gemma 3 12B	Q4_K_M	8 GB
Gemma 3 12B	Q8_0	13 GB
Gemma 3 27B	Q4_K_M	18 GB
Gemma 3 27B	Q8_0	29 GB

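The 12B Q4_K_M is a tight fit on an 8GB GPU (RTX 3060 Ti, 3070, 4060 Ti 8GB): the weights alone take close to 8 GB, so keep the context short or expect Ollama to offload some layers to the CPU. A 12GB card gives more headroom. Even so, it remains the most accessible high-quality option for mainstream hardware.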
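The figures in the table can be roughly reproduced from parameter count and bits per weight. The following is a minimal sketch, assuming approximate bits-per-weight values (about 4.85 for Q4_K_M and 8.5 for Q8_0) and a flat overhead allowance for runtime buffers; both are illustrative assumptions rather than measured values, and the function name is hypothetical.

```python
# Approximate bits per weight for two GGUF quantizations.
# These are assumed, rounded values used only for illustration.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5}


def estimate_vram_gb(params_billions: float, quant: str, overhead_gb: float = 0.5) -> float:
    """Estimate VRAM as weight size plus a flat overhead (assumed default)."""
    # 1e9 parameters * bits per weight / 8 bits per byte = gigabytes of weights
    weights_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return round(weights_gb + overhead_gb, 1)


for size in (1, 4, 12, 27):
    print(f"Gemma 3 {size}B Q4_K_M: ~{estimate_vram_gb(size, 'Q4_K_M')} GB")
```

Actual usage grows with context length, because the KV cache is allocated on top of the weights, so treat these numbers as a lower bound.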

Setup with Ollama

Ollama is the easiest way to run Gemma 3 locally. It downloads pre-quantized models, manages them on disk, and provides a local API server.

Install Ollama

# Linux/macOS
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from ollama.com, or install with winget
winget install Ollama.Ollama

Pull and Run Gemma 3

# 12B model — best all-around choice
ollama pull gemma3:12b

# 4B model — for lighter hardware
ollama pull gemma3:4b

# 1B model — ultra-fast, minimal resources
ollama pull gemma3:1b

# 27B model — maximum quality
ollama pull gemma3:27b

# Run interactively
ollama run gemma3:12b

Specify Quantization

# Tag names may change; check the gemma3 tags page on ollama.com for the current list

# Higher quality (more VRAM)
ollama pull gemma3:12b-it-q8_0

# Balanced (default)
ollama pull gemma3:12b-it-q4_K_M

# 4-bit quantization-aware-trained variant
ollama pull gemma3:12b-it-qat

Using the Ollama API

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but unused
)

response = client.chat.completions.create(
    model="gemma3:12b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to calculate the Fibonacci sequence using memoization."}
    ]
)

print(response.choices[0].message.content)
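If you would rather not install the openai package, the same request can go to Ollama's native /api/chat endpoint using only the standard library. This is a minimal sketch, assuming the server runs at the default local address; the helper names `build_chat_payload` and `chat` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local address


def build_chat_payload(model: str, system: str, user: str) -> dict:
    """Build a non-streaming request body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }


def chat(payload: dict, url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """POST the payload to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["message"]["content"]
```

With the server running, `chat(build_chat_payload("gemma3:12b", "You are a helpful coding assistant.", "Explain memoization in one paragraph."))` should return the reply as a string.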

Multimodal Capabilities

Gemma 3’s 4B, 12B, and 27B models can process images alongside text. This is handled natively through Ollama:

import ollama
import base64

# Load and encode an image
with open("screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = ollama.chat(
    model="gemma3:12b",
    messages=[
        {
            "role": "user",
            "content": "Describe what you see in this image and identify any text present.",
            "images": [image_data]
        }
    ]
)

print(response["message"]["content"])

Gemma 3’s vision capabilities are solid for:

Describing and analyzing images
Extracting text from screenshots
Answering questions about charts and diagrams
Basic UI analysis

It’s not at the level of GPT-4o Vision for complex image reasoning, but it’s impressive for an open model running locally.
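For tasks such as comparing several screenshots, it can help to wrap the encoding step in a small function. The sketch below assumes the same message shape as the example above (a user message with an `images` list of base64 strings); `build_image_message` is a hypothetical helper, not part of the ollama package.

```python
import base64
from pathlib import Path


def build_image_message(prompt: str, image_paths: list[str]) -> dict:
    """Return a user message with each image attached as a base64 string."""
    images = []
    for path in image_paths:
        data = Path(path).read_bytes()  # raises FileNotFoundError if the file is missing
        images.append(base64.b64encode(data).decode("ascii"))
    return {"role": "user", "content": prompt, "images": images}
```

The result can be passed directly as one entry of the `messages` list in `ollama.chat(model="gemma3:12b", messages=[...])`.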

Benchmark Comparisons

Gemma 3 performs exceptionally well for its size. Here’s how it compares on common benchmarks (approximate, as of early 2026):

MMLU (Academic Knowledge)

Model	Score
Gemma 3 27B	75.2%
Gemma 3 12B	74.0%
Llama 3.1 8B	66.7%
Gemma 3 4B	59.6%
Gemma 3 1B	38.4%

HumanEval (Coding)

Model	Pass@1
Gemma 3 27B	72.0%
Gemma 3 12B	71.5%
Llama 3.1 8B	62.2%
Gemma 3 4B	53.8%

The 12B model notably achieves near-27B performance on coding tasks — making it an excellent choice for developer workflows.
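To make "near-27B" concrete, the gap can be computed directly from the two tables above. This short sketch uses only the scores listed there; the function name is hypothetical.

```python
# Scores copied from the MMLU and HumanEval tables above (percent).
scores = {
    "MMLU": {"27B": 75.2, "12B": 74.0},
    "HumanEval": {"27B": 72.0, "12B": 71.5},
}


def gap_points(benchmark: str) -> float:
    """Absolute gap between the 27B and 12B models, in percentage points."""
    row = scores[benchmark]
    return round(row["27B"] - row["12B"], 1)


for name in scores:
    print(f"{name}: 12B trails 27B by {gap_points(name)} points")
```

On these figures the 12B model is within about one point of the 27B model on both benchmarks.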

Generation Speed (Tokens/sec)

Tested on RTX 4090 with Q4_K_M quantization:

Model	Tokens/sec
Gemma 3 1B	~180 t/s
Gemma 3 4B	~90 t/s
Gemma 3 12B	~50 t/s
Gemma 3 27B	~25 t/s

On Apple Silicon M3 Pro:

Model	Tokens/sec
Gemma 3 4B	~65 t/s
Gemma 3 12B	~30 t/s
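Speeds vary with hardware, drivers, and context length, so it is worth measuring on your own machine. Non-streaming Ollama responses include an `eval_count` field (tokens generated) and an `eval_duration` field (nanoseconds), from which throughput follows. A minimal sketch with a made-up sample response:

```python
def tokens_per_second(response: dict) -> float:
    """Compute generation speed from an Ollama response's timing fields."""
    eval_count = response["eval_count"]  # number of tokens generated
    eval_duration_ns = response["eval_duration"]  # generation time in nanoseconds
    return eval_count / (eval_duration_ns / 1e9)


# Illustrative values only, not a real measurement.
sample = {"eval_count": 250, "eval_duration": 5_000_000_000}
print(f"{tokens_per_second(sample):.1f} t/s")
```

Running `ollama run gemma3:12b --verbose` prints similar timing statistics after each reply.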

Creating Custom Modelfiles

Customize Gemma 3’s behavior with Ollama Modelfiles:

cat > Gemma3DevAssistant << 'EOF'
FROM gemma3:12b

SYSTEM """You are an expert software developer assistant. 
Always write clean, production-ready code with:
- Type hints (Python) or proper types (TypeScript)
- Error handling
- Brief inline comments for complex logic
- No placeholder comments like 'add logic here'

When asked for code, provide complete, working implementations."""

PARAMETER temperature 0.2
PARAMETER num_ctx 32768
PARAMETER top_p 0.9
EOF

ollama create gemma3-dev -f Gemma3DevAssistant
ollama run gemma3-dev
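If you maintain several variants, generating the Modelfile text from a script avoids copy-paste drift. The sketch below renders the same FROM, SYSTEM, and PARAMETER instructions used above; `render_modelfile` is a hypothetical helper, and it assumes the system prompt does not itself contain triple quotes.

```python
def render_modelfile(base_model: str, system_prompt: str, parameters: dict) -> str:
    """Render Modelfile text with FROM, SYSTEM, and PARAMETER instructions."""
    lines = [f"FROM {base_model}", "", f'SYSTEM """{system_prompt}"""', ""]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


modelfile = render_modelfile(
    "gemma3:12b",
    "You are an expert software developer assistant.",
    {"temperature": 0.2, "num_ctx": 32768, "top_p": 0.9},
)
print(modelfile)
```

Write the returned string to a file and pass that file to `ollama create` with `-f`, as in the shell example.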

Gemma 3 vs Llama 3.1: Which to Choose?

Both are excellent open-weight models. Here’s when to prefer each:

Choose Gemma 3 if:

You work with long contexts (both models offer 128K, but Gemma 3 interleaves local sliding-window and global attention layers, which keeps KV-cache memory lower in practice)
You’re doing multilingual tasks (Gemma 3’s 256K vocabulary is better suited)
You want built-in multimodal capability at the 4B+ size
You have an 8GB GPU and want the best 12B model

Choose Llama 3.1 if:

You need the widest tool/plugin compatibility (Llama has more ecosystem support)
You’re using a framework that has specific Llama optimizations
You want the 70B parameter tier (Gemma 3 tops out at 27B)

In practice, both are worth having. Run ollama list to see which you have installed and benchmark them on your specific tasks.
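A task-specific comparison can be as simple as timing the same prompts against each model. The following is a rough sketch: `compare_models` is a hypothetical helper that takes the generation function as an argument, so it works with any client and can be tried without a running server.

```python
import time
from typing import Callable


def compare_models(
    models: list[str],
    prompts: list[str],
    generate: Callable[[str, str], str],
) -> dict[str, float]:
    """Return the mean wall-clock seconds per prompt for each model."""
    results = {}
    for model in models:
        start = time.perf_counter()
        for prompt in prompts:
            generate(model, prompt)  # output is discarded; only latency is measured
        results[model] = (time.perf_counter() - start) / len(prompts)
    return results
```

With the ollama package, `generate` could be `lambda m, p: ollama.chat(model=m, messages=[{"role": "user", "content": p}])["message"]["content"]`, called with models such as `gemma3:12b` and `llama3.1:8b`. Latency is only half the picture, so review the outputs for quality as well.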

Tips for Best Results

Set an explicit system prompt: Gemma 3 follows instructions well but benefits from clear role definition
Use longer context selectively: The 128K context window is available, but very long contexts slow generation and increase memory use
Temperature matters: For coding tasks, use 0.1–0.3. For creative writing, use 0.7–0.9
The 12B Q4_K_M is the default recommendation for most users with a modern GPU
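The temperature advice above can be captured as per-request options instead of being baked into a Modelfile. A minimal sketch, assuming midpoints of the suggested ranges and an arbitrary default context size; `options_for` is a hypothetical helper.

```python
# Midpoints of the temperature ranges suggested above; adjust to taste.
TEMPERATURE_BY_TASK = {"coding": 0.2, "creative": 0.8}


def options_for(task: str, num_ctx: int = 8192) -> dict:
    """Build an Ollama `options` dict for a task type (num_ctx default is assumed)."""
    if task not in TEMPERATURE_BY_TASK:
        raise ValueError(f"unknown task: {task!r}")
    return {"temperature": TEMPERATURE_BY_TASK[task], "num_ctx": num_ctx}


print(options_for("coding"))
```

The returned dict can be passed as the `options` argument of `ollama.chat`, which keeps long contexts opt-in per request.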

Gemma 3 demonstrates that you don’t need massive models to get impressive results. The 12B variant in particular is a remarkable achievement — capable, efficient, multimodal, and running entirely on consumer hardware you may already own.