Running Gemma 3 Locally: Google’s Open AI Model Guide
Gemma 3 is Google’s family of open-weight language models released in early 2025 and updated throughout the year. Built on the same research that powers Gemini, Gemma 3 models punch well above their weight class — the 12B model regularly outperforms models twice its size on standard benchmarks. This guide covers everything you need to know to run Gemma 3 locally, from the smallest 1B variant to the capable 27B model.
What Is Gemma 3?
Gemma 3 is a family of decoder-only transformer models available in four sizes: 1B, 4B, 12B, and 27B parameters. Key characteristics:
- Architecture: Transformer decoder with grouped-query attention
- Context window: 128K tokens (4B, 12B, and 27B); 32K tokens for the 1B model
- Multimodal: The 4B, 12B, and 27B models support image input
- License: Gemma Terms of Use (commercial use is permitted, subject to Google’s prohibited-use policy and redistribution conditions)
- Training: Up to 14 trillion tokens (the 27B model; the smaller sizes were trained on fewer), including web text, code, and mathematics
Gemma 3 uses a tokenizer with a vocabulary of roughly 256K entries (vs. 32K in older open models such as Llama 2), which gives it better multilingual performance and more efficient tokenization for technical content.
Model Sizes and Capabilities
| Model | Parameters | Context | Multimodal | Best Use Case |
|---|---|---|---|---|
| Gemma 3 1B | 1B | 32K | No | Edge devices, embedded, quick tasks |
| Gemma 3 4B | 4B | 128K | Yes | Laptop, Raspberry Pi 5, light tasks |
| Gemma 3 12B | 12B | 128K | Yes | Main workstation, best balance |
| Gemma 3 27B | 27B | 128K | Yes | High-end GPU, maximum quality |
The 12B model is the sweet spot for most users: in the benchmark tables later in this guide it scores within roughly a point of the 27B model on both MMLU and HumanEval, while its Q4_K_M build needs less than half the VRAM.
VRAM Requirements
| Model | Quantization | VRAM Required |
|---|---|---|
| Gemma 3 1B | Q4_K_M | 1 GB |
| Gemma 3 4B | Q4_K_M | 3 GB |
| Gemma 3 4B | Q8_0 | 5 GB |
| Gemma 3 12B | Q4_K_M | 8 GB |
| Gemma 3 12B | Q8_0 | 13 GB |
| Gemma 3 27B | Q4_K_M | 18 GB |
| Gemma 3 27B | Q8_0 | 29 GB |
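The 12B Q4_K_M is the most accessible high-quality option for mainstream hardware, but on an 8GB VRAM GPU (RTX 3060 Ti, 3070, 4060 Ti 8GB) it is a tight fit: the weights alone take about 8 GB and the KV cache needs additional memory, so keep the context short or expect Ollama to offload some layers to the CPU.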
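As a rough cross-check of the table above, the weight footprint can be estimated as parameter count times bits per weight. The sketch below is only an approximation: the bits-per-weight values and the 1.5 GB overhead default are assumptions chosen for illustration, and real usage also depends on context length and runtime buffers.

```python
# Rough estimate of quantized weight size: parameters x bits-per-weight / 8.
# The bits-per-weight figures are approximations (assumptions), not exact values.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,  # mixed-precision blocks; approximate average
    "Q8_0": 8.5,     # 8-bit values plus one 16-bit scale per 32-weight block
}


def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Return the approximate size of the quantized weights in GB (10^9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return params_billions * bits / 8


def fits(params_billions: float, quant: str, vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """True if weights plus a guessed overhead (KV cache, buffers) fit in VRAM."""
    return estimate_weight_gb(params_billions, quant) + overhead_gb <= vram_gb


for size in (1, 4, 12, 27):
    for quant in ("Q4_K_M", "Q8_0"):
        print(f"Gemma 3 {size}B {quant}: ~{estimate_weight_gb(size, quant):.1f} GB weights")
```

The estimates come out slightly below the table values, which is expected because the table reflects total memory in use rather than weights alone.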
Setup with Ollama
Ollama is the easiest way to run Gemma 3 locally. It downloads pre-quantized model files, manages them on disk, and exposes a local API server.
Install Ollama
# Linux/macOS
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download the installer from ollama.com, or install with winget
winget install Ollama.Ollama
Pull and Run Gemma 3
# 12B model — best all-around choice
ollama pull gemma3:12b
# 4B model — for lighter hardware
ollama pull gemma3:4b
# 1B model — ultra-fast, minimal resources
ollama pull gemma3:1b
# 27B model — maximum quality
ollama pull gemma3:27b
# Run interactively
ollama run gemma3:12b
Specify Quantization
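# Tag names follow the pattern listed on ollama.com/library/gemma3; check the Tags page there if a pull fails
# Higher quality (more VRAM)
ollama pull gemma3:12b-it-q8_0
# Balanced (default)
ollama pull gemma3:12b-it-q4_K_M
# Quantization-aware-trained 4-bit build
ollama pull gemma3:12b-it-qat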
Using the Ollama API
from openai import OpenAI

# Ollama exposes an OpenAI-compatible endpoint under /v1
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required but unused
)

response = client.chat.completions.create(
    model="gemma3:12b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to calculate the Fibonacci sequence using memoization."},
    ],
)
print(response.choices[0].message.content)
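If you would rather not install the openai package, the same server can also be reached through Ollama's native /api/chat endpoint. The following is a minimal sketch using only the standard library; the helper names and the default temperature and num_ctx values are illustrative assumptions, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # default local Ollama address


def build_chat_payload(model, messages, temperature=0.2, num_ctx=8192):
    """Build the JSON body for a non-streaming /api/chat request."""
    return {
        "model": model,
        "messages": messages,
        "stream": False,  # ask for one complete JSON response
        "options": {"temperature": temperature, "num_ctx": num_ctx},
    }


def chat(model, messages, **options):
    """POST the payload to the local Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(model, messages, **options)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["message"]["content"]


# Usage (requires a running Ollama server with gemma3:12b already pulled):
# print(chat("gemma3:12b", [{"role": "user", "content": "Say hello in one sentence."}]))
```

Keeping the payload builder separate from the network call makes it easy to inspect or log exactly what is sent to the server.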
Multimodal Capabilities
Gemma 3’s 4B, 12B, and 27B models can process images alongside text. This is handled natively through Ollama:
import base64

import ollama

# Load and encode an image
with open("screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = ollama.chat(
    model="gemma3:12b",
    messages=[
        {
            "role": "user",
            "content": "Describe what you see in this image and identify any text present.",
            "images": [image_data],
        }
    ],
)
print(response["message"]["content"])
Gemma 3’s vision capabilities are solid for:
- Describing and analyzing images
- Extracting text from screenshots
- Answering questions about charts and diagrams
- Basic UI analysis
It’s not at the level of GPT-4o for complex image reasoning, but it’s impressive for an open model running locally.
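Because the 1B model is text-only, it can help to guard against sending it images by mistake. The helper below is a small sketch of that idea; the function name and the TEXT_ONLY_TAGS set are hypothetical, based only on the model table in this guide.

```python
import base64

# Tags treated as text-only, per the model table in this guide (assumption).
TEXT_ONLY_TAGS = {"gemma3:1b"}


def build_image_message(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Return a chat message with one base64-encoded image attached."""
    if model in TEXT_ONLY_TAGS:
        raise ValueError(f"{model} is text-only; use the 4B, 12B, or 27B model for images")
    return {
        "role": "user",
        "content": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
    }
```

The returned dictionary has the same shape as the message in the example above, so it can be passed as messages=[message] to ollama.chat.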
Benchmark Comparisons
Gemma 3 performs exceptionally well for its size. Here’s how it compares on common benchmarks (approximate, as of early 2026):
MMLU (Academic Knowledge)
| Model | Score |
|---|---|
| Gemma 3 27B | 75.2% |
| Gemma 3 12B | 74.0% |
| Llama 3.1 8B | 66.7% |
| Gemma 3 4B | 59.6% |
| Gemma 3 1B | 38.4% |
HumanEval (Coding)
| Model | Pass@1 |
|---|---|
| Gemma 3 27B | 72.0% |
| Gemma 3 12B | 71.5% |
| Llama 3.1 8B | 62.2% |
| Gemma 3 4B | 53.8% |
The 12B model comes within half a point of the 27B model on HumanEval (71.5% vs. 72.0%), making it an excellent choice for developer workflows.
Generation Speed (Tokens/sec)
Tested on RTX 4090 with Q4_K_M quantization:
| Model | Tokens/sec |
|---|---|
| Gemma 3 1B | ~180 t/s |
| Gemma 3 4B | ~90 t/s |
| Gemma 3 12B | ~50 t/s |
| Gemma 3 27B | ~25 t/s |
On Apple Silicon M3 Pro:
| Model | Tokens/sec |
|---|---|
| Gemma 3 4B | ~65 t/s |
| Gemma 3 12B | ~30 t/s |
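To reproduce figures like these on your own hardware, you can compute tokens per second from the timing fields Ollama includes in its final, non-streaming response: eval_count (generated tokens) and eval_duration (nanoseconds). This is a minimal sketch; the sample numbers are made up for illustration.

```python
def tokens_per_second(response: dict) -> float:
    """Compute generation speed from Ollama's timing fields.

    eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    """
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] / (duration_ns / 1e9)


# Example with made-up numbers: 500 tokens generated in 10 seconds.
sample = {"eval_count": 500, "eval_duration": 10_000_000_000}
print(f"{tokens_per_second(sample):.1f} t/s")  # 50.0 t/s
```

Running ollama run with the --verbose flag prints similar timing statistics after each reply if you only need a quick reading.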
Creating Custom Modelfiles
Customize Gemma 3’s behavior with Ollama Modelfiles:
cat > Gemma3DevAssistant << 'EOF'
FROM gemma3:12b
SYSTEM """You are an expert software developer assistant.
Always write clean, production-ready code with:
- Type hints (Python) or proper types (TypeScript)
- Error handling
- Brief inline comments for complex logic
- No placeholder comments like 'add logic here'
When asked for code, provide complete, working implementations."""
PARAMETER temperature 0.2
PARAMETER num_ctx 32768
PARAMETER top_p 0.9
EOF
ollama create gemma3-dev -f Gemma3DevAssistant
ollama run gemma3-dev
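If you maintain several variants (for example, the same system prompt on the 4B and 12B models), generating the Modelfile text from a script can be more convenient than editing heredocs by hand. The function below is a hypothetical helper that only emits the FROM, SYSTEM, and PARAMETER instructions used in the example above.

```python
def render_modelfile(base_model: str, system_prompt: str, parameters: dict) -> str:
    """Return Modelfile text with FROM, SYSTEM, and PARAMETER lines."""
    if '"""' in system_prompt:
        raise ValueError("system prompt must not contain triple quotes")
    lines = [f"FROM {base_model}", f'SYSTEM """{system_prompt}"""']
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "gemma3:12b",
    "You are an expert software developer assistant.",
    {"temperature": 0.2, "num_ctx": 32768, "top_p": 0.9},
)
print(text)
```

Write the returned text to a file and pass that file to ollama create with -f, as in the shell example above.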
Gemma 3 vs Llama 3.1: Which to Choose?
Both are excellent open-weight models. Here’s when to prefer each:
Choose Gemma 3 if:
- You work with long contexts (both models offer 128K, so the limit is comparable, but Gemma tends to handle long inputs more efficiently in practice)
- You’re doing multilingual tasks (Gemma 3’s 256K vocabulary is better suited)
- You want built-in multimodal capability at the 4B+ size
- You have an 8GB GPU and want the best 12B model
Choose Llama 3.1 if:
- You need the widest tool/plugin compatibility (Llama has more ecosystem support)
- You’re using a framework that has specific Llama optimizations
- You want the 70B parameter tier (Gemma 3 tops out at 27B)
In practice, both are worth having. Run ollama list to see which you have installed and benchmark them on your specific tasks.
Tips for Best Results
- Set an explicit system prompt: Gemma 3 follows instructions well but benefits from clear role definition
- Use longer context selectively: The 128K context window is available but using very long contexts slows generation
- Temperature matters: For coding tasks, use 0.1–0.3. For creative writing, use 0.7–0.9
- The 12B Q4_K_M is the default recommendation for most users with a modern GPU
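The temperature and context tips above can be captured as per-task presets. This is only a sketch: picking the midpoint of each range and an 8192-token default context are assumptions made for illustration, and the helper name is hypothetical.

```python
# Temperature ranges from the tips above; using the midpoint is an assumption.
TEMPERATURE_RANGES = {
    "coding": (0.1, 0.3),
    "creative": (0.7, 0.9),
}


def options_for(task: str, num_ctx: int = 8192) -> dict:
    """Return Ollama-style sampling options for a task type."""
    low, high = TEMPERATURE_RANGES[task]
    return {"temperature": round((low + high) / 2, 2), "num_ctx": num_ctx}


print(options_for("coding"))    # {'temperature': 0.2, 'num_ctx': 8192}
print(options_for("creative"))  # {'temperature': 0.8, 'num_ctx': 8192}
```

The resulting dictionary can be passed as the options argument of ollama.chat or as the options field of an /api/chat request; raise num_ctx only when a task actually needs the longer context.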
Gemma 3 demonstrates that you don’t need massive models to get impressive results. The 12B variant in particular is a remarkable achievement — capable, efficient, multimodal, and running entirely on consumer hardware you may already own.