Google Gemma 4 Adds Manual Resolution Controls for Precise Vision Tasks

Google Gemma

Apr 30, 2026

Google's Gemma 4 now supports variable aspect ratios and configurable image resolutions through a manual visual token budget. This allows developers to optimize for speed or detail by choosing one of five budgets, from 70 to 1120 tokens per image.

Google designed the vision architecture for Gemma 4 to process images with variable aspect ratios and resolutions. Unlike models that force inputs into fixed squares, this system preserves the original shape of tall or wide images. It maintains spatial accuracy using 2D rotary positional embeddings (encoding 2D coordinates).

Aspect ratio support: Variable (native)
Resolution control: Manual visual token budget
Supported token budgets: 70, 140, 280, 560, 1120 tokens

This flexibility addresses a common bottleneck in multimodal performance, where a model has to process both text and images. With a configurable visual token budget, the model scales its visual input from 70 to 1120 tokens per image, letting you prioritize speed or accuracy for complex visual data.

You can now manually set the resolution for specific images to match task requirements. A 70-token budget is ideal for fast classification, while the 1120-token maximum provides the fine-grained detail needed for document analysis. These controls are available alongside Gemma 4 training on Fireworks AI.
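As a rough illustration of choosing a budget per task, the sketch below snaps a requested size to one of the five supported values. Only the list of budgets and the two task examples come from the update; the function names, the preset table, and the default of 280 are hypothetical and are not part of any Gemma API.

```python
# The five budgets listed in the update. Everything else here is hypothetical.
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)

# Hypothetical presets based on the two examples in the update:
# 70 tokens for fast classification, 1120 for document analysis.
TASK_PRESETS = {
    "classification": 70,
    "document_analysis": 1120,
}


def snap_to_supported(requested: int) -> int:
    """Return the smallest supported budget >= requested, capped at the maximum."""
    if requested <= 0:
        raise ValueError("budget must be positive")
    for budget in SUPPORTED_BUDGETS:
        if budget >= requested:
            return budget
    return SUPPORTED_BUDGETS[-1]


def budget_for_task(task: str, default: int = 280) -> int:
    """Look up a preset budget, falling back to an assumed mid-range default."""
    return TASK_PRESETS.get(task, default)
```

Snapping upward rather than to the nearest value is a design choice in this sketch: it never gives the model less detail than the caller asked for, at the cost of some extra tokens.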

View the full update on ai.google.dev

Source post: Google Gemma (@googlegemma) on X, Apr 24: https://t.co/rTAFbP4z2I

Still wondering? A few quick answers below.

How does Gemma 4 handle different image aspect ratios?

Gemma 4 natively supports variable aspect ratios, meaning it can process very wide or very tall images without forcing them into a fixed square shape. This allows the model to preserve the original dimensions of an input image, which is particularly useful for tasks like document analysis or processing panoramic visuals.

What are the visual token budget options for Gemma 4?

Developers can manually configure the resolution of an image by setting a visual token budget. Gemma 4 supports five specific budget sizes: 70, 140, 280, 560, or 1120 visual tokens. A larger budget generates more tokens, providing the large language model with a finer-grained view of the image for detailed tasks.
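To make the link between budget and aspect ratio concrete, here is a minimal sketch of how a token grid could be laid out so that it stays within a budget while roughly keeping the image's shape. This is an assumed layout rule for illustration only; the update does not describe Gemma 4's actual resizing procedure, and `token_grid` is a hypothetical helper.

```python
import math


def token_grid(width: int, height: int, budget: int) -> tuple[int, int]:
    """Pick (rows, cols) with rows * cols <= budget and cols / rows near width / height.

    Hypothetical layout rule, not Gemma 4's documented behavior.
    """
    if width <= 0 or height <= 0 or budget <= 0:
        raise ValueError("width, height and budget must be positive")
    aspect = width / height
    # For an ideal grid, rows * (rows * aspect) == budget, so rows == sqrt(budget / aspect).
    rows = min(budget, max(1, math.floor(math.sqrt(budget / aspect))))
    # Clamp columns so the grid never exceeds the budget, even for extreme shapes.
    cols = min(budget // rows, max(1, math.floor(rows * aspect)))
    return rows, cols


# A wide 2000x500 image at a 280-token budget keeps its 4:1 shape.
print(token_grid(2000, 500, 280))  # (8, 32)
```

Under this rule a square image at the same budget gets a 16 by 16 grid, so the wide image spends its tokens along the horizontal axis instead of being squashed.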

What is a visual token in the Gemma 4 architecture?

In Gemma 4, a visual token represents a large square region of an image. Each token is created by taking multiple smaller patches within that region, generating an embedding for each, and then averaging those embeddings together. This process converts raw visual data into a format the language model can understand and process.
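The averaging step can be sketched in plain Python. The example assumes the patch embeddings are already computed and arranged in a 2D grid, and that each token covers a k by k block of patches; the block size and the function name `pool_patches` are assumptions, since the update does not give these details.

```python
def pool_patches(patch_embeddings, k):
    """Average each k x k block of patch embeddings into one visual token.

    patch_embeddings is a grid[rows][cols] of equal-length lists of floats.
    Rows and cols must be divisible by k (an assumption of this sketch).
    """
    rows, cols = len(patch_embeddings), len(patch_embeddings[0])
    if rows % k or cols % k:
        raise ValueError("grid dimensions must be divisible by k")
    dim = len(patch_embeddings[0][0])
    tokens = []
    for r in range(0, rows, k):
        token_row = []
        for c in range(0, cols, k):
            # Collect the k * k patch vectors that fall inside this square.
            block = [patch_embeddings[r + i][c + j] for i in range(k) for j in range(k)]
            # Element-wise mean across the block gives one token embedding.
            token_row.append([sum(vec[d] for vec in block) / (k * k) for d in range(dim)])
        tokens.append(token_row)
    return tokens


# A 2x4 grid of 2-dimensional patch embeddings pooled with k=2 gives 1x2 tokens.
grid = [
    [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 2.0]],
    [[5.0, 0.0], [7.0, 0.0], [0.0, 2.0], [0.0, 6.0]],
]
print(pool_patches(grid, 2))  # [[[4.0, 0.0], [0.0, 3.0]]]
```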

How does Gemma 4 maintain spatial awareness for non-square images?

To ensure the model understands the layout of images with variable aspect ratios, Gemma 4 uses 2D rotary positional embeddings. These embeddings act as a coordinate system, allowing the model to know exactly where each visual token is positioned in 2D space, even when the input image is unusually wide or tall.
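One common way to build 2D rotary embeddings is to rotate half of each vector by the token's column index and the other half by its row index. The sketch below follows that general scheme; the half-and-half split, the base of 10000, and the function names are assumptions, as the update only states that the embeddings encode 2D coordinates.

```python
import math


def rotate_pairs(vec, pos, base=10000.0):
    """Standard 1D rotary embedding: rotate consecutive pairs by pos * frequency."""
    n = len(vec)
    out = []
    for i in range(0, n, 2):
        theta = pos * base ** (-i / n)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out


def rope_2d(vec, col, row, base=10000.0):
    """Rotate the first half of vec by the column index and the second half by the row index."""
    if len(vec) % 4:
        raise ValueError("vector length must be divisible by 4")
    half = len(vec) // 2
    return rotate_pairs(vec[:half], col, base) + rotate_pairs(vec[half:], row, base)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25, 1.5, 0.0, -0.5, 1.0]
k = [1.0, 0.5, -0.5, 2.0, 0.0, 1.0, 1.0, -1.0]

# The score depends only on the (column, row) offset between the two tokens,
# which is (2, 1) in both cases, so these two values match up to rounding.
print(dot(rope_2d(q, 5, 2), rope_2d(k, 3, 1)))
print(dot(rope_2d(q, 2, 1), rope_2d(k, 0, 0)))
```

Because the rotation is applied per axis, the same relative-offset property holds for any grid shape, which is why this kind of encoding suits images that are much wider or taller than a square.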


Keep reading

Arena Ranks Google Gemma 4 as Top Open Vision Model

Google's Gemma-4-31b and Gemma-4-26b-a4b have entered the Vision Arena leaderboard as the #2 and #4 ranked open models. These releases shift the price-performance frontier by delivering vision reasoning capabilities that rival proprietary systems at a fraction of the cost.

Google Releases Gemma 4 QAT Checkpoints for Efficient On-Device AI

Google, Jun 5

Google released new Gemma 4 Quantization-Aware Training (QAT) checkpoints, including GGUF (Q4_0) and a custom mobile schema under 1GB. These enable running Gemma 4 models locally on consumer GPUs and mobile devices with reduced memory footprint and accelerated decode speeds, while preserving reasoning quality.

Google Releases Gemma 4 Drafter Models to Accelerate Local Inference Speed

Google Gemma, May 5

Google released a series of specialized drafter models that use speculative decoding to significantly increase the inference speed of the Gemma 4 family. By integrating architectural optimizations like shared activations and KV caches, these tiny models allow larger target models to verify multiple tokens in a single parallel pass.

Fireworks AI Adds Gemma 4 Training to Build Custom Reasoning Agents

Fireworks AI, Apr 28

Fireworks AI integrated Google's Gemma 4 models into its training platform, enabling full-parameter fine-tuning and DPO with a 256K context window. This allows teams to build specialized reasoning agents on a unified stack that transitions from training to production inference in seconds.
