
Running Gemma 3 Locally: Google's Open AI Model Guide

Run Gemma 3 locally with Ollama: sizes 1B to 27B, VRAM requirements, multimodal capabilities, benchmark comparisons, and practical setup instructions.


Gemma 3 is Google’s family of open-weight language models, first released in March 2025 and extended later in the year. Built on the same research that underlies Gemini, the models perform well for their size: in the benchmarks below, the 12B model comes within about a point of the 27B model and clearly beats Llama 3.1 8B. This guide covers how to run Gemma 3 locally, from the smallest 1B variant to the 27B model.

What Is Gemma 3?

Gemma 3 is a family of decoder-only transformer models available in four sizes: 1B, 4B, 12B, and 27B parameters. Key characteristics:

  • Architecture: Transformer decoder with grouped-query attention
  • Context window: 128K tokens (4B, 12B, and 27B); 32K tokens (1B)
  • Multimodal: The 4B, 12B, and 27B models support image input
  • License: Gemma Terms of Use (allows commercial use with attribution, restrictions apply)
  • Training: Up to 14 trillion tokens (for the 27B model; the smaller models were trained on fewer), including web text, code, and mathematics

Gemma 3 uses a tokenizer with a vocabulary of roughly 256K entries (compared with 32K in models such as Llama 2), which helps multilingual performance and tokenizes technical content more efficiently.

Model Sizes and Capabilities

| Model | Parameters | Context | Multimodal | Best Use Case |
|---|---|---|---|---|
| Gemma 3 1B | 1B | 32K | No | Edge devices, embedded, quick tasks |
| Gemma 3 4B | 4B | 128K | Yes | Laptop, Raspberry Pi 5, light tasks |
| Gemma 3 12B | 12B | 128K | Yes | Main workstation, best balance |
| Gemma 3 27B | 27B | 128K | Yes | High-end GPU, maximum quality |

The 12B model is widely considered the sweet spot — it delivers near-27B quality on many benchmarks while being accessible to users with a single consumer GPU.

VRAM Requirements

| Model | Quantization | VRAM Required |
|---|---|---|
| Gemma 3 1B | Q4_K_M | 1 GB |
| Gemma 3 4B | Q4_K_M | 3 GB |
| Gemma 3 4B | Q8_0 | 5 GB |
| Gemma 3 12B | Q4_K_M | 8 GB |
| Gemma 3 12B | Q8_0 | 13 GB |
| Gemma 3 27B | Q4_K_M | 18 GB |
| Gemma 3 27B | Q8_0 | 29 GB |

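At about 8 GB, the 12B Q4_K_M model is a tight fit on an 8 GB GPU (RTX 3060 Ti, 3070, 4060 Ti): once the KV cache is added, Ollama may offload some layers to the CPU, which slows generation. A card with 12 GB or more gives comfortable headroom. Even so, it is the most accessible high-quality option for mainstream hardware.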
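
The figures in the table follow roughly from parameter count times bits per weight. Here is a minimal sketch of that arithmetic. The bits-per-weight values are rough assumptions for illustration, not figures from this guide, and the estimate ignores the KV cache and runtime overhead, which come on top:

```python
# Rough weight-memory estimate: parameters x bits per weight / 8.
# The bits-per-weight values below are approximations assumed for
# illustration; real GGUF files mix several quantization types per tensor.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def estimate_weight_gb(params_billion: float, quant: str) -> float:
    """Return approximate weight size in GB (10^9 bytes), excluding KV cache."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return params_billion * bits / 8


if __name__ == "__main__":
    for size in (1, 4, 12, 27):
        for quant in ("Q4_K_M", "Q8_0"):
            gb = estimate_weight_gb(size, quant)
            print(f"Gemma 3 {size}B {quant}: ~{gb:.1f} GB of weights")
```

Treat the result as a lower bound: longer contexts enlarge the KV cache, so actual usage grows with the `num_ctx` setting.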

Setup with Ollama

Ollama is the easiest way to run Gemma 3 locally. It downloads pre-quantized model files, manages them on disk, and exposes a local API server.

Install Ollama

# Linux/macOS
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from ollama.com, or install with winget
winget install Ollama.Ollama

Pull and Run Gemma 3

# 12B model — best all-around choice
ollama pull gemma3:12b

# 4B model — for lighter hardware
ollama pull gemma3:4b

# 1B model — ultra-fast, minimal resources
ollama pull gemma3:1b

# 27B model — maximum quality
ollama pull gemma3:27b

# Run interactively
ollama run gemma3:12b
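
To confirm from a script that a pull succeeded, you can ask the local Ollama server which models it holds. This is a minimal standard-library sketch. It assumes the server listens on the default port 11434 and that its `/api/tags` endpoint returns a JSON object with a `models` list. The helper names are hypothetical:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def installed_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask the local Ollama server for its installed models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))


def has_model(names: list[str], wanted: str) -> bool:
    """True if `wanted` (for example "gemma3:12b") is among the names."""
    return wanted in names
```

With the server running, `has_model(installed_models(), "gemma3:12b")` reports whether the 12B model is available.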

Specify Quantization

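# Tag names follow the Ollama library listing for gemma3; verify there before pulling

# Higher quality (more VRAM)
ollama pull gemma3:12b-it-q8_0

# Balanced (same quantization as the default gemma3:12b tag)
ollama pull gemma3:12b-it-q4_K_M

# Quantization-aware-trained 4-bit build
ollama pull gemma3:12b-it-qat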

Using the Ollama API

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but unused
)

response = client.chat.completions.create(
    model="gemma3:12b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to calculate the Fibonacci sequence using memoization."}
    ]
)

print(response.choices[0].message.content)
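
The same request can be made without the `openai` package by calling Ollama's native chat endpoint. This minimal sketch assumes the default port and the `/api/chat` route, with `stream` set to false so that a single JSON object comes back. `build_chat_payload` and `chat` are hypothetical helper names:

```python
import json
import urllib.request


def build_chat_payload(model: str, system: str, user: str) -> dict:
    """Build a non-streaming chat request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "stream": False,  # ask for one JSON object instead of a stream
    }


def chat(payload: dict, base_url: str = "http://localhost:11434") -> str:
    """POST the payload to the local server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

For example, `chat(build_chat_payload("gemma3:12b", "You are a helpful coding assistant.", "Explain memoization."))` returns the model's answer as a string.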

Multimodal Capabilities

Gemma 3’s 4B, 12B, and 27B models can process images alongside text. This is handled natively through Ollama:

import ollama
import base64

# Load and encode an image
with open("screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = ollama.chat(
    model="gemma3:12b",
    messages=[
        {
            "role": "user",
            "content": "Describe what you see in this image and identify any text present.",
            "images": [image_data]
        }
    ]
)

print(response["message"]["content"])

Gemma 3’s vision capabilities are solid for:

  • Describing and analyzing images
  • Extracting text from screenshots
  • Answering questions about charts and diagrams
  • Basic UI analysis

It’s not at the level of GPT-4o Vision for complex image reasoning, but it’s impressive for an open model running locally.

Benchmark Comparisons

Gemma 3 performs exceptionally well for its size. Here’s how it compares on common benchmarks (approximate, as of early 2026):

MMLU (Academic Knowledge)

| Model | Score |
|---|---|
| Gemma 3 27B | 75.2% |
| Gemma 3 12B | 74.0% |
| Llama 3.1 8B | 66.7% |
| Gemma 3 4B | 59.6% |
| Gemma 3 1B | 38.4% |

HumanEval (Coding)

| Model | Pass@1 |
|---|---|
| Gemma 3 27B | 72.0% |
| Gemma 3 12B | 71.5% |
| Llama 3.1 8B | 62.2% |
| Gemma 3 4B | 53.8% |

The 12B model notably achieves near-27B performance on coding tasks — making it an excellent choice for developer workflows.

Generation Speed (Tokens/sec)

Tested on RTX 4090 with Q4_K_M quantization:

| Model | Tokens/sec |
|---|---|
| Gemma 3 1B | ~180 t/s |
| Gemma 3 4B | ~90 t/s |
| Gemma 3 12B | ~50 t/s |
| Gemma 3 27B | ~25 t/s |

On Apple Silicon M3 Pro:

| Model | Tokens/sec |
|---|---|
| Gemma 3 4B | ~65 t/s |
| Gemma 3 12B | ~30 t/s |
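
These figures depend heavily on hardware, so it is worth measuring your own. Ollama's non-streaming responses include timing fields. The sketch below assumes that `eval_count` (generated tokens) and `eval_duration` (nanoseconds) are present in the `/api/generate` response. The helper names are hypothetical:

```python
import json
import urllib.request


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration to tokens/sec."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def measure(model: str, prompt: str,
            base_url: str = "http://localhost:11434") -> float:
    """Run one non-streaming generation and return its decode speed."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        body = json.load(resp)
    return tokens_per_second(body["eval_count"], body["eval_duration"])
```

Run `measure` a few times per model and discard the first call, since it includes the time needed to load the model into memory.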

Creating Custom Modelfiles

Customize Gemma 3’s behavior with Ollama Modelfiles:

cat > Gemma3DevAssistant << 'EOF'
FROM gemma3:12b

SYSTEM """You are an expert software developer assistant. 
Always write clean, production-ready code with:
- Type hints (Python) or proper types (TypeScript)
- Error handling
- Brief inline comments for complex logic
- No placeholder comments like 'add logic here'

When asked for code, provide complete, working implementations."""

PARAMETER temperature 0.2
PARAMETER num_ctx 32768
PARAMETER top_p 0.9
EOF

ollama create gemma3-dev -f Gemma3DevAssistant
ollama run gemma3-dev
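
When several variants are needed (say, one per temperature), the Modelfile text can be generated rather than written by hand. This minimal sketch uses only the `FROM`, `SYSTEM`, and `PARAMETER` directives shown above. `render_modelfile` is a hypothetical helper, and it assumes the system prompt contains no triple quotes:

```python
def render_modelfile(base: str, system: str, params: dict[str, object]) -> str:
    """Render Modelfile text from a base model, system prompt, and parameters."""
    if '"""' in system:
        # A triple quote would end the SYSTEM block early.
        raise ValueError("system prompt must not contain triple quotes")
    lines = [f"FROM {base}", "", f'SYSTEM """{system}"""', ""]
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        "gemma3:12b",
        "You are an expert software developer assistant.",
        {"temperature": 0.2, "num_ctx": 32768, "top_p": 0.9},
    )
    print(text)
```

Write the returned text to a file and pass its path to `ollama create <name> -f <path>`, as in the shell example above.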

Gemma 3 vs Llama 3.1: Which to Choose?

Both are excellent open-weight models. Here’s when to prefer each:

Choose Gemma 3 if:

  • You work with long inputs (both families offer 128K context, but Gemma 3 interleaves local sliding-window and global attention layers, which reduces KV-cache memory at long context lengths)
  • You’re doing multilingual tasks (Gemma 3’s 256K vocabulary is better suited)
  • You want built-in multimodal capability at the 4B+ size
  • You have a single consumer GPU (8 to 12 GB) and want a capable 12B model

Choose Llama 3.1 if:

  • You need the widest tool/plugin compatibility (Llama has more ecosystem support)
  • You’re using a framework that has specific Llama optimizations
  • You want the 70B parameter tier (Gemma 3 tops out at 27B)

In practice, both are worth having. Run ollama list to see which you have installed and benchmark them on your specific tasks.

Tips for Best Results

  • Set an explicit system prompt: Gemma 3 follows instructions well but benefits from clear role definition
  • Use longer context selectively: The 128K context window is available but using very long contexts slows generation
  • Temperature matters: For coding tasks, use 0.1–0.3. For creative writing, use 0.7–0.9
  • The 12B Q4_K_M is the default recommendation for most users with a modern GPU
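
The temperature guidance can be applied per request instead of being baked into a Modelfile. This minimal sketch assumes Ollama's native chat endpoint accepts an `options` object with `temperature` and `num_ctx`. The task labels, the midpoint values, and the default context size are illustrative choices within the ranges above:

```python
# Illustrative midpoints of the ranges suggested above (assumed values).
TASK_TEMPERATURE = {
    "coding": 0.2,    # within 0.1 to 0.3
    "creative": 0.8,  # within 0.7 to 0.9
}


def options_for(task: str, num_ctx: int = 8192) -> dict:
    """Return per-request sampling options for a task label."""
    if task not in TASK_TEMPERATURE:
        raise KeyError(f"unknown task: {task!r}")
    return {"temperature": TASK_TEMPERATURE[task], "num_ctx": num_ctx}


def with_options(payload: dict, task: str, num_ctx: int = 8192) -> dict:
    """Return a copy of a chat payload with an `options` object attached."""
    return {**payload, "options": options_for(task, num_ctx)}


if __name__ == "__main__":
    base = {"model": "gemma3:12b", "messages": [], "stream": False}
    print(with_options(base, "coding"))
```

Keeping `num_ctx` modest by default matches the advice on long contexts: raise it only for requests that need the extra window.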

Gemma 3 demonstrates that you don’t need massive models to get impressive results. The 12B variant in particular is a remarkable achievement — capable, efficient, multimodal, and running entirely on consumer hardware you may already own.
