Google’s Gemma 3, released in early 2025, brought a meaningful leap in open-weight model quality. The model family spans from a 1B parameter edge-deployment model up to a 27B powerhouse that outcompetes models twice its size on several benchmarks. Critically, Gemma 3 4B and above include multimodal capabilities — the ability to understand and reason about images alongside text.

The Gemma 3 Model Family

Model	Parameters	Context Length	Multimodal	Recommended VRAM
Gemma 3 1B	1B	32K	No	2GB (Q4)
Gemma 3 4B	4B	128K	Yes	4GB (Q4)
Gemma 3 12B	12B	128K	Yes	8GB (Q4)
Gemma 3 27B	27B	128K	Yes	18GB (Q4)

Gemma 3 4B, 12B, and 27B support a 128K-token context window; the 1B model is limited to 32K. That is enough room for large documents, sizable codebases, or long conversation histories, and it is one of Gemma 3’s standout differentiators versus competitors at similar parameter counts.
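
The Q4 figures in the table can be sanity-checked with back-of-envelope arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight. The sketch below assumes exactly 4 bits per weight and a flat 20% allowance for runtime buffers; both numbers are illustrative assumptions, not measurements. The table’s recommendations come out higher, presumably because they leave headroom for the KV cache and other runtime state.

```python
def estimate_weight_memory_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def estimate_vram_gb(
    params_billion: float,
    bits_per_weight: float = 4.0,      # assumption: idealized 4-bit quantization
    overhead_fraction: float = 0.2,    # assumption: flat 20% for runtime buffers
) -> float:
    """Weights plus a flat overhead allowance. Treat the result as a rough lower bound."""
    weights = estimate_weight_memory_gb(params_billion, bits_per_weight)
    return weights * (1 + overhead_fraction)


for name, size in [("1B", 1), ("4B", 4), ("12B", 12), ("27B", 27)]:
    print(f"Gemma 3 {name}: ~{estimate_vram_gb(size):.1f} GB at 4 bits per weight")
```

Long prompts add to this estimate, since the KV cache grows with the number of tokens in context.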

Multimodal Capabilities

Gemma 3 4B+ is a vision-language model — it processes images directly in the conversation. Use cases include:

Describing and analyzing photos
Reading text from images (OCR-like functionality)
Answering questions about charts and diagrams
Code screenshot analysis

The vision encoder uses a SigLIP-based architecture. Performance on vision tasks is competitive with LLaVA-Next at similar scales, though specialized vision models still edge it out on benchmarks.

Downloading from Hugging Face

Gemma 3 is a gated model: accept Google’s license terms on the model’s Hugging Face page first, then log in so the CLI can authenticate the download:

# Login to Hugging Face
huggingface-cli login

# Download the instruction-tuned 12B model
huggingface-cli download google/gemma-3-12b-it --local-dir ./gemma-3-12b

Available model IDs on Hugging Face:

google/gemma-3-1b-it — instruction-tuned 1B
google/gemma-3-4b-it — instruction-tuned 4B with vision
google/gemma-3-12b-it — instruction-tuned 12B with vision
google/gemma-3-27b-it — instruction-tuned 27B with vision
google/gemma-3-27b-pt — pretrained 27B (no instruction tuning)

Running with Ollama

Ollama provides the simplest local deployment:

# Install Ollama (if not already installed)
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 3 models
ollama pull gemma3:1b
ollama pull gemma3:4b
ollama pull gemma3:12b
ollama pull gemma3:27b

# Run interactively
ollama run gemma3:12b

Image Understanding via Ollama API

# Send an image with a question
# (base64 -w 0 is GNU coreutils syntax; on macOS use: base64 -i screenshot.png)
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:12b",
  "prompt": "What is in this image? Describe in detail.",
  "images": ["'"$(base64 -w 0 screenshot.png)"'"]
}'

# The same kind of request through the Ollama Python library
import ollama
import base64

with open("chart.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = ollama.generate(
    model='gemma3:12b',
    prompt='Analyze this chart and summarize the key trends',
    images=[image_data]
)
print(response['response'])
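
One detail worth knowing about the curl request above: by default `/api/generate` streams its answer as newline-delimited JSON, one object per chunk, each carrying a piece of text in `response` and a final object with `done` set to true. Adding `"stream": false` to the request body returns a single JSON object instead. The sketch below stitches a streamed reply back together; `join_streamed_response` is a hypothetical helper and `sample_stream` is made-up illustrative data, not real model output.

```python
import json
from typing import Iterable


def join_streamed_response(lines: Iterable[str]) -> str:
    """Concatenate the 'response' fields of a newline-delimited JSON stream."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk marks the end of the reply
    return "".join(parts)


# Illustrative stream only; a real one carries extra fields such as timings.
sample_stream = [
    '{"model": "gemma3:12b", "response": "A bar ", "done": false}',
    '{"model": "gemma3:12b", "response": "chart.", "done": false}',
    '{"model": "gemma3:12b", "response": "", "done": true}',
]

print(join_streamed_response(sample_stream))
```

In practice the lines would come from the HTTP response body as it arrives, which lets a UI show text while the model is still generating.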

LM Studio Setup

Open LM Studio and navigate to Discover
Search for gemma-3
Select your model size — the 12B Q4_K_M is the recommended starting point for most users
Download and load into the Chat interface
For vision support, use the Chat UI’s image attachment button (paperclip icon)

LM Studio auto-detects the model’s capabilities and enables the image upload UI when you load a vision model.

System Prompt Format

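Gemma 3 uses the <start_of_turn> / <end_of_turn> chat template with two roles, user and model. There is no separate system turn: the bundled template folds a system message into the first user turn. When using raw llama.cpp or the Transformers library, let the tokenizer’s template build the prompt rather than assembling it by hand: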

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-3-12b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

messages = [
    {"role": "system", "content": "You are a helpful Python programming assistant."},
    {"role": "user", "content": "Write a decorator that measures function execution time."}
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

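# The rendered template already starts with <bos>, so don't let the tokenizer add a second one
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
# temperature only takes effect when sampling is enabled
output = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the echoed prompt
response = tokenizer.decode(output[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)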
print(response)
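
To make the template less of a black box, here is a rough, text-only approximation of what `apply_chat_template` produces for Gemma 3. `render_gemma_prompt` is a hypothetical helper written for illustration; the template shipped with the tokenizer is authoritative and also handles image content, which this sketch does not.

```python
def render_gemma_prompt(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Approximate the Gemma turn format for plain-text messages."""
    turns = list(messages)
    system_text = ""
    # Gemma has no system turn: a leading system message is prepended
    # to the first user turn instead.
    if turns and turns[0]["role"] == "system":
        system_text = turns[0]["content"].strip() + "\n\n"
        turns = turns[1:]

    prompt = "<bos>"
    for index, message in enumerate(turns):
        # The assistant side of the conversation is called "model" in Gemma.
        role = "model" if message["role"] == "assistant" else message["role"]
        prefix = system_text if index == 0 else ""
        prompt += f"<start_of_turn>{role}\n{prefix}{message['content'].strip()}<end_of_turn>\n"

    if add_generation_prompt:
        prompt += "<start_of_turn>model\n"  # cue the model to start its reply
    return prompt


print(render_gemma_prompt([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
]))
```

Comparing this output against `tokenizer.apply_chat_template(..., tokenize=False)` is a quick way to confirm which template a given checkpoint actually ships.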

When using Ollama, the chat template is handled automatically — just send standard messages.
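
As a sketch of what “standard messages” looks like in practice, the same system and user prompts can go through `ollama.chat`. The helper names `build_messages` and `ask_gemma` are hypothetical, and the call assumes a local Ollama server with `gemma3:12b` already pulled.

```python
def build_messages(system_prompt: str, user_prompt: str) -> list[dict]:
    """Plain role/content messages; Ollama applies the Gemma template itself."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def ask_gemma(
    user_prompt: str,
    system_prompt: str = "You are a helpful Python programming assistant.",
    model: str = "gemma3:12b",
) -> str:
    """Send one chat request to a local Ollama server and return the reply text."""
    # Imported lazily so build_messages can be used without the ollama package.
    import ollama

    response = ollama.chat(model=model, messages=build_messages(system_prompt, user_prompt))
    return response["message"]["content"]
```

A call such as `ask_gemma("Write a decorator that measures function execution time.")` mirrors the Transformers example above without any manual prompt formatting.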

Performance Benchmarks vs Llama 3 and Mistral

Gemma 3 27B in particular generated attention for outperforming much larger models on several benchmarks:

Benchmark	Gemma 3 12B	Llama 3 8B	Mistral 7B	Gemma 3 27B
MMLU	74.5%	68.4%	64.2%	81.2%
HumanEval	62.3%	62.2%	40.2%	71.5%
MATH	43.7%	30.0%	22.1%	55.8%
HellaSwag	85.1%	82.0%	81.3%	87.9%

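In this table, Gemma 3 12B matches or beats Llama 3 8B and Mistral 7B on every benchmark, with the widest gap on MATH (43.7% versus 30.0% and 22.1%). It is also reported to be competitive with Llama 3 70B on some reasoning tasks, which would be a remarkable efficiency gain, though that model is not included in the table. For coding specifically, Gemma 3 27B is said to approach the quality of GPT-4o-mini.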
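
For readers who want the gaps as numbers, this small sketch computes percentage-point margins directly from the table. It uses only the scores listed above; `margin` is a hypothetical helper name.

```python
# Scores copied from the benchmark table, in percent.
scores = {
    "MMLU":      {"Gemma 3 12B": 74.5, "Llama 3 8B": 68.4, "Mistral 7B": 64.2, "Gemma 3 27B": 81.2},
    "HumanEval": {"Gemma 3 12B": 62.3, "Llama 3 8B": 62.2, "Mistral 7B": 40.2, "Gemma 3 27B": 71.5},
    "MATH":      {"Gemma 3 12B": 43.7, "Llama 3 8B": 30.0, "Mistral 7B": 22.1, "Gemma 3 27B": 55.8},
    "HellaSwag": {"Gemma 3 12B": 85.1, "Llama 3 8B": 82.0, "Mistral 7B": 81.3, "Gemma 3 27B": 87.9},
}


def margin(benchmark: str, model: str, baseline: str) -> float:
    """Percentage-point lead of `model` over `baseline` on one benchmark."""
    return round(scores[benchmark][model] - scores[benchmark][baseline], 1)


for benchmark in scores:
    lead = margin(benchmark, "Gemma 3 12B", "Llama 3 8B")
    print(f"{benchmark}: Gemma 3 12B leads Llama 3 8B by {lead} points")
```

The HumanEval row shows why per-benchmark margins matter: the 12B lead over Llama 3 8B there is a tenth of a point, far smaller than on MATH.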

Google’s Licensing Terms

Gemma 3 is governed by the Gemma Terms of Use, not a standard open-source license. Key points:

Commercial use is permitted
Distributing model weights requires complying with Gemma ToU
Cannot use Gemma to train competing foundation models in certain ways
Fine-tuning is allowed; distributing fine-tuned models requires compliance
Building products or services with Gemma is allowed, subject to the Gemma Prohibited Use Policy

Read the full terms at ai.google.dev/gemma/terms. For most developers building applications, the terms are permissive enough for commercial deployment.

Best Use Cases for Gemma 3

Gemma 3 1B: Edge deployment on devices with limited compute, always-on assistants, mobile apps.

Gemma 3 4B: Lightweight RAG pipelines, summarization, classification tasks, environments with strict memory limits.

Gemma 3 12B: The sweet spot for most developers — code assistance, document Q&A, image understanding, chat assistants with quality approaching GPT-4 for most tasks.

Gemma 3 27B: Demanding tasks where quality is paramount — complex reasoning, nuanced writing, detailed technical documentation, tasks that require extended context and high accuracy.

The 128K-token context window on the 4B, 12B, and 27B models makes Gemma 3 particularly well-suited for document processing workflows such as legal document analysis, code repository understanding, and book-length summarization, where models with shorter context windows have to split the input into chunks.
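
Before sending a long document, a quick size check avoids silent truncation. The sketch below is only a rough estimate: it assumes about four characters per token (a common rule of thumb for English text, not a property of Gemma’s tokenizer), treats “K” as 1,024 tokens, and reserves a hypothetical 1,024 tokens for the reply. For an exact count, tokenize the text with the model’s own tokenizer.

```python
# Context limits from the model family table, taking K = 1024 (an assumption).
CONTEXT_TOKENS = {
    "1b": 32 * 1024,
    "4b": 128 * 1024,
    "12b": 128 * 1024,
    "27b": 128 * 1024,
}


def fits_in_context(
    text: str,
    size: str = "12b",
    chars_per_token: float = 4.0,       # assumption: rough English-text heuristic
    reserved_output_tokens: int = 1024,  # assumption: room left for the reply
) -> bool:
    """Estimate whether `text` plus a reply fits in the model's context window."""
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens + reserved_output_tokens <= CONTEXT_TOKENS[size]


document = "word " * 80_000  # stand-in for a long document, 400,000 characters
print("fits 12B:", fits_in_context(document, "12b"))
print("fits 1B: ", fits_in_context(document, "1b"))
```

Documents that fail the check still need chunking or retrieval, even with Gemma 3.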