Running Google Gemma 3 Locally in 2026

Complete guide to running Gemma 3 locally with Ollama and LM Studio. Covers the 1B to 27B model family, multimodal features, benchmarks, and licensing.

Google’s Gemma 3, released in early 2025, brought a meaningful leap in open-weight model quality. The model family spans from a 1B parameter edge-deployment model up to a 27B powerhouse that outcompetes models twice its size on several benchmarks. Critically, Gemma 3 4B and above include multimodal capabilities — the ability to understand and reason about images alongside text.

The Gemma 3 Model Family

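| Model | Parameters | Context Length | Multimodal | Recommended VRAM |
|---|---|---|---|---|
| Gemma 3 1B | 1B | 32K | No | 2GB (Q4) |
| Gemma 3 4B | 4B | 128K | Yes | 4GB (Q4) |
| Gemma 3 12B | 12B | 128K | Yes | 8GB (Q4) |
| Gemma 3 27B | 27B | 128K | Yes | 18GB (Q4) |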

Every Gemma 3 model except the 1B supports a 128K-token context window (the 1B is limited to 32K), which is enough for large documents, full codebases, or long conversation histories. This is one of Gemma 3’s standout differentiators versus competitors at similar parameter counts.
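
The VRAM column can be sanity-checked with some rough arithmetic. The sketch below assumes about 4.5 bits per weight for a Q4-style quantization and uses the nominal parameter counts; both are assumptions for illustration, not figures from Google or Ollama. The gap between this estimate and the recommended VRAM is headroom for the KV cache, activations, and runtime overhead.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 10^9 bytes).

    bits_per_weight=4.5 is an assumed average for Q4-style quantization.
    """
    return params_billion * bits_per_weight / 8


# Nominal parameter counts (in billions) from the table above
for name, params in [("1B", 1), ("4B", 4), ("12B", 12), ("27B", 27)]:
    print(f"Gemma 3 {name}: ~{estimate_weight_gb(params):.1f} GB of weights")
```

Long prompts increase the KV cache, so memory use near the full context window can be noticeably higher than the weights-only estimate.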

Multimodal Capabilities

Gemma 3 4B+ is a vision-language model — it processes images directly in the conversation. Use cases include:

  • Describing and analyzing photos
  • Reading text from images (OCR-like functionality)
  • Answering questions about charts and diagrams
  • Code screenshot analysis

The vision encoder uses a SigLIP-based architecture. Performance on vision tasks is competitive with LLaVA-Next at similar scales, though specialized vision models still edge it out on benchmarks.

Downloading from Hugging Face

Gemma 3 requires accepting Google’s license terms before downloading:

# Login to Hugging Face
huggingface-cli login

# Download the instruction-tuned 12B model
huggingface-cli download google/gemma-3-12b-it --local-dir ./gemma-3-12b

Available model IDs on Hugging Face:

  • google/gemma-3-1b-it — instruction-tuned 1B
  • google/gemma-3-4b-it — instruction-tuned 4B with vision
  • google/gemma-3-12b-it — instruction-tuned 12B with vision
  • google/gemma-3-27b-it — instruction-tuned 27B with vision
  • google/gemma-3-27b-pt — pretrained 27B (no instruction tuning)
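
The same download can be scripted from Python. This is a minimal sketch that assumes the `huggingface_hub` package is installed and that you have already logged in and accepted the license; `gemma3_repo_id` and `download_gemma3` are hypothetical helper names, and only the model IDs listed above are accepted.

```python
# Model IDs taken from the list above
KNOWN_REPO_IDS = {
    "google/gemma-3-1b-it",
    "google/gemma-3-4b-it",
    "google/gemma-3-12b-it",
    "google/gemma-3-27b-it",
    "google/gemma-3-27b-pt",
}


def gemma3_repo_id(size: str, variant: str = "it") -> str:
    """Build a Hugging Face repo ID such as 'google/gemma-3-12b-it'."""
    repo_id = f"google/gemma-3-{size.lower()}-{variant.lower()}"
    if repo_id not in KNOWN_REPO_IDS:
        raise ValueError(f"Not one of the listed model IDs: {repo_id}")
    return repo_id


def download_gemma3(size: str, variant: str = "it") -> str:
    """Download a model snapshot and return the local directory path."""
    # Imported lazily so the ID helper works without the package installed
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=gemma3_repo_id(size, variant),
        local_dir=f"./gemma-3-{size.lower()}",
    )
```

Calling `download_gemma3("12b")` would mirror the CLI command shown earlier.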

Running with Ollama

Ollama provides the simplest local deployment:

# Install Ollama (if not already installed)
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 3 models
ollama pull gemma3:1b
ollama pull gemma3:4b
ollama pull gemma3:12b
ollama pull gemma3:27b

# Run interactively
ollama run gemma3:12b
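
Once a model is pulled, it can also be queried programmatically. The sketch below uses only the Python standard library and assumes Ollama is listening on its default local port and exposing the `/api/chat` endpoint; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # assumes the default port


def build_chat_request(prompt, model="gemma3:12b", system=None):
    """Build the JSON body for a single-turn chat request."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    # stream=False asks for one JSON object instead of a stream of chunks
    return {"model": model, "messages": messages, "stream": False}


def chat(prompt, **kwargs):
    """Send the request to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_CHAT_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["message"]["content"]


# Example (requires a running Ollama server):
# print(chat("Explain list comprehensions in one paragraph."))
```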

Image Understanding via Ollama API

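# Send an image with a question.
# "stream": false returns a single JSON object instead of a stream of chunks.
# base64 -w 0 is the GNU coreutils form; on macOS use: base64 -i screenshot.png
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:12b",
  "prompt": "What is in this image? Describe in detail.",
  "images": ["'"$(base64 -w 0 screenshot.png)"'"],
  "stream": false
}'

The same kind of request using the ollama Python package (pip install ollama):

import ollama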
import base64

with open("chart.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = ollama.generate(
    model='gemma3:12b',
    prompt='Analyze this chart and summarize the key trends',
    images=[image_data]
)
print(response['response'])
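
When `stream` is not set to false, `/api/generate` answers with newline-delimited JSON, where each object carries a fragment of the text in `response` and the final one has `done` set to true. A minimal sketch of reassembling that stream follows; `join_stream` is a hypothetical helper, and the sample lines are hand-written for illustration rather than captured server output.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[str]) -> str:
    """Concatenate the 'response' fragments from a streamed generate call."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the stream
    return "".join(parts)


# Hand-written sample chunks in the shape described above
sample = [
    '{"model": "gemma3:12b", "response": "A bar ", "done": false}',
    '{"model": "gemma3:12b", "response": "chart.", "done": false}',
    '{"model": "gemma3:12b", "response": "", "done": true}',
]
print(join_stream(sample))  # A bar chart.
```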

LM Studio Setup

  1. Open LM Studio and navigate to Discover
  2. Search for gemma-3
  3. Select your model size — the 12B Q4_K_M is the recommended starting point for most users
  4. Download and load into the Chat interface
  5. For vision support, use the Chat UI’s image attachment button (paperclip icon)

LM Studio auto-detects the model’s capabilities and enables the image upload UI when you load a vision model.
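
Beyond the Chat UI, LM Studio can serve the loaded model through a local OpenAI-compatible endpoint. The sketch below assumes that server is enabled on port 1234 and that the model identifier is `gemma-3-12b-it`; both are assumptions, so check the values shown in your LM Studio instance. `build_vision_request` and `ask_about_image` are hypothetical helper names.

```python
import base64
import json
import urllib.request

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"  # assumed local server address


def build_vision_request(image_bytes, question, model="gemma-3-12b-it", mime="image/png"):
    """Build an OpenAI-style chat payload with the image inlined as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,  # assumed identifier; use the one LM Studio displays
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
                ],
            }
        ],
    }


def ask_about_image(image_path, question):
    """Send the image and question to the local server and return the reply text."""
    with open(image_path, "rb") as f:
        payload = build_vision_request(f.read(), question)
    request = urllib.request.Request(
        LMSTUDIO_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["message"]["content"]
```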

System Prompt Format

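Gemma 3 wraps each turn in <start_of_turn> and <end_of_turn> markers, with the roles user and model. There is no dedicated system turn; the chat template folds a system message into the first user turn. When using raw llama.cpp or the Transformers library, let the tokenizer’s chat template build the prompt rather than concatenating strings by hand: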

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-3-12b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

messages = [
    {"role": "system", "content": "You are a helpful Python programming assistant."},
    {"role": "user", "content": "Write a decorator that measures function execution time."}
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

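# The rendered prompt already starts with <bos>, so skip the tokenizer's own special tokens
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
# temperature only takes effect when sampling is enabled
output = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the echoed prompt
response = tokenizer.decode(output[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)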

When using Ollama, the chat template is handled automatically — just send standard messages.
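
To make the turn format concrete, here is a pure-Python approximation of what the template produces. It is a sketch for illustration only: `render_gemma_prompt` is a hypothetical helper, the exact whitespace handling is an assumption, and the tokenizer's `apply_chat_template` remains the authoritative source.

```python
def render_gemma_prompt(messages):
    """Approximate the Gemma turn format for a list of chat messages."""
    system_prefix = ""
    turns = []
    for message in messages:
        role, content = message["role"], message["content"].strip()
        if role == "system":
            # No dedicated system turn: hold the text for the first user turn
            system_prefix = content + "\n\n"
            continue
        if role == "assistant":
            role = "model"  # Gemma names the assistant role "model"
        elif role != "user":
            raise ValueError(f"Unsupported role: {role}")
        if role == "user" and system_prefix:
            content = system_prefix + content
            system_prefix = ""
        turns.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    # The trailing open model turn asks the model to generate its reply
    return "<bos>" + "".join(turns) + "<start_of_turn>model\n"


print(render_gemma_prompt([
    {"role": "system", "content": "You are a helpful Python programming assistant."},
    {"role": "user", "content": "Write a decorator that measures function execution time."},
]))
```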

Performance Benchmarks vs Llama 3 and Mistral

Gemma 3 27B in particular generated attention for outperforming much larger models on several benchmarks:

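| Benchmark | Gemma 3 12B | Llama 3 8B | Mistral 7B | Gemma 3 27B |
|---|---|---|---|---|
| MMLU | 74.5% | 68.4% | 64.2% | 81.2% |
| HumanEval | 62.3% | 62.2% | 40.2% | 71.5% |
| MATH | 43.7% | 30.0% | 22.1% | 55.8% |
| HellaSwag | 85.1% | 82.0% | 81.3% | 87.9% |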

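In this table, Gemma 3 12B leads Llama 3 8B and Mistral 7B on MMLU, MATH, and HellaSwag, and is essentially tied with Llama 3 8B on HumanEval (62.3% vs. 62.2%). Gemma 3 12B can also compete with the far larger Llama 3 70B on some reasoning tasks, a notable efficiency gain, and for coding Gemma 3 27B approaches the quality of GPT-4o-mini, though neither comparison appears in the table above.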
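
The margins are easier to compare as percentage-point differences. The sketch below simply re-enters the scores from the table above and subtracts them; `lead` is a hypothetical helper name.

```python
# Scores copied from the benchmark table above (percent)
scores = {
    "MMLU": {"Gemma 3 12B": 74.5, "Llama 3 8B": 68.4, "Mistral 7B": 64.2, "Gemma 3 27B": 81.2},
    "HumanEval": {"Gemma 3 12B": 62.3, "Llama 3 8B": 62.2, "Mistral 7B": 40.2, "Gemma 3 27B": 71.5},
    "MATH": {"Gemma 3 12B": 43.7, "Llama 3 8B": 30.0, "Mistral 7B": 22.1, "Gemma 3 27B": 55.8},
    "HellaSwag": {"Gemma 3 12B": 85.1, "Llama 3 8B": 82.0, "Mistral 7B": 81.3, "Gemma 3 27B": 87.9},
}


def lead(benchmark: str, model: str, baseline: str) -> float:
    """Percentage-point lead of `model` over `baseline` on one benchmark."""
    return round(scores[benchmark][model] - scores[benchmark][baseline], 1)


for benchmark in scores:
    print(f"{benchmark}: Gemma 3 12B leads Llama 3 8B by "
          f"{lead(benchmark, 'Gemma 3 12B', 'Llama 3 8B')} points")
```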

Google’s Licensing Terms

Gemma 3 is governed by the Gemma Terms of Use, not a standard open-source license. Key points:

  • Commercial use is permitted
  • Distributing model weights requires complying with Gemma ToU
  • Cannot use Gemma to train competing foundation models in certain ways
  • Fine-tuning is allowed; distributing fine-tuned models requires compliance
  • No restriction on building products or services with Gemma

Read the full terms at ai.google.dev/gemma/terms. For most developers building applications, the terms are permissive enough for commercial deployment.

Best Use Cases for Gemma 3

Gemma 3 1B: Edge deployment on devices with limited compute, always-on assistants, mobile apps.

Gemma 3 4B: Lightweight RAG pipelines, summarization, classification tasks, environments with strict memory limits.

Gemma 3 12B: The sweet spot for most developers. Suited to code assistance, document Q&A, image understanding, and chat assistants, with quality that approaches GPT-4 on many everyday tasks.

Gemma 3 27B: Demanding tasks where quality is paramount — complex reasoning, nuanced writing, detailed technical documentation, tasks that require extended context and high accuracy.
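
These recommendations, combined with the VRAM figures from the model family table, can be turned into a small selection helper. This is a sketch under the assumption that the largest model that fits is the one you want; `pick_model` is a hypothetical name, and the VRAM values are the Q4 recommendations listed earlier.

```python
from typing import Optional

# (Ollama tag, recommended VRAM in GB at Q4, multimodal), ordered small to large
MODELS = [
    ("gemma3:1b", 2, False),
    ("gemma3:4b", 4, True),
    ("gemma3:12b", 8, True),
    ("gemma3:27b", 18, True),
]


def pick_model(vram_gb: float, need_vision: bool = False) -> Optional[str]:
    """Return the largest model that fits in vram_gb, or None if none fits."""
    candidates = [
        tag
        for tag, required_gb, multimodal in MODELS
        if required_gb <= vram_gb and (multimodal or not need_vision)
    ]
    return candidates[-1] if candidates else None


print(pick_model(12))                   # gemma3:12b
print(pick_model(3, need_vision=True))  # None
```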

The long context window (128K tokens on the 4B, 12B, and 27B models) makes Gemma 3 particularly well-suited to document processing workflows such as legal document analysis, code repository understanding, and book-length summarization, where models of similar size with shorter context windows run out of room.
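
Before sending a large document, it helps to check roughly whether it fits. The sketch below assumes about 4 characters per token (a crude heuristic for English text, not Gemma's actual tokenizer) and treats 32K and 128K as multiples of 1024; `estimate_tokens` and `fits_in_context` are hypothetical helper names.

```python
import math

# Context lengths from the model family table, assuming K = 1024 tokens
CONTEXT_TOKENS = {
    "gemma3:1b": 32 * 1024,
    "gemma3:4b": 128 * 1024,
    "gemma3:12b": 128 * 1024,
    "gemma3:27b": 128 * 1024,
}


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; use the real tokenizer for an exact count."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text: str, model: str, reserve_for_output: int = 2048) -> bool:
    """Check whether the text plus room for the reply fits the context window."""
    budget = CONTEXT_TOKENS[model] - reserve_for_output
    return estimate_tokens(text) <= budget


document = "word " * 80_000  # about 400,000 characters
print(fits_in_context(document, "gemma3:12b"))  # True
print(fits_in_context(document, "gemma3:1b"))   # False
```

Note that local runtimes may default to a much smaller context than the model's maximum; with Ollama, for example, the `num_ctx` option may need to be raised for long inputs.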
