HeadsUpAI

Google Gemma 4 Adds Manual Resolution Controls for Precise Vision Tasks

Google designed the vision architecture for Gemma 4 to process images with variable aspect ratios and resolutions. Unlike models that force inputs into fixed squares, this system preserves the original shape of tall or wide images. It keeps track of spatial layout using 2D rotary positional embeddings, which encode each visual token's position in two dimensions.
Aspect ratio support: Variable (native)
Resolution control: Manual visual token budget
Supported token budgets: 70, 140, 280, 560, or 1120 tokens

This flexibility addresses a common bottleneck in the performance of multimodal models, which handle both text and images. With a configurable visual token budget, the model can represent an image with anywhere from 70 to 1120 tokens. Alongside Gemma 4's frontier reasoning, this lets you prioritize either speed or accuracy on complex visual data.

You can now manually set the resolution for specific images to match task requirements. A 70-token budget is ideal for fast classification, while the 1120-token maximum provides the fine-grained detail needed for document analysis. These controls are available alongside Gemma 4 training on Fireworks AI.

Still wondering? A few quick answers below.

Gemma 4 natively supports variable aspect ratios, meaning it can process very wide or very tall images without forcing them into a fixed square shape. This allows the model to preserve the original dimensions of an input image, which is particularly useful for tasks like document analysis or processing panoramic visuals.
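To make the idea concrete, here is a minimal sketch of how a token grid could follow an image's shape instead of a fixed square. The function name `token_grid` and the rounding strategy are illustrative assumptions; the article does not describe Gemma 4's actual preprocessing.

```python
import math


def token_grid(width: int, height: int, budget: int) -> tuple[int, int]:
    """Return (rows, cols) with rows * cols <= budget and cols / rows near width / height.

    Hypothetical helper for illustration only, not Gemma 4's real resizing logic.
    """
    if width <= 0 or height <= 0 or budget <= 0:
        raise ValueError("width, height and budget must be positive")
    aspect = width / height
    # Ideal fractional grid satisfies cols = aspect * rows and rows * cols = budget.
    rows = max(1, min(budget, math.floor(math.sqrt(budget / aspect))))
    # Round columns down so the grid never exceeds the budget.
    cols = max(1, min(budget // rows, math.floor(rows * aspect)))
    return rows, cols


if __name__ == "__main__":
    # A wide panorama and a tall document page under the same 280-token budget.
    print(token_grid(1600, 400, 280))
    print(token_grid(400, 1600, 280))
```

Under these assumptions, a wide image gets a wide grid and a tall image gets a tall one, so neither is squashed into a square.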

Developers can manually configure the resolution of an image by setting a visual token budget. Gemma 4 supports five specific budget sizes: 70, 140, 280, 560, or 1120 visual tokens. A larger budget generates more tokens, providing the large language model with a finer-grained view of the image for detailed tasks.
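Because only five budget sizes are supported, calling code may want to snap an arbitrary request to a valid value. The sketch below is one way to do that; `snap_budget` is a hypothetical helper, not part of any Gemma 4 API, and the round-up policy is an assumption.

```python
# The five budgets listed in the article.
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)


def snap_budget(requested: int) -> int:
    """Return the smallest supported budget >= requested, capped at the maximum."""
    if requested <= 0:
        raise ValueError("requested budget must be positive")
    for budget in SUPPORTED_BUDGETS:
        if budget >= requested:
            return budget
    # Requests above the largest budget fall back to the maximum.
    return SUPPORTED_BUDGETS[-1]


if __name__ == "__main__":
    print(snap_budget(100))   # fast classification might ask for roughly 100 tokens
    print(snap_budget(5000))  # oversized requests are capped
```

Rounding up favors detail over speed; a latency-sensitive application could round down instead.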

In Gemma 4, a visual token is a large square representing a specific area of an image. Each token is created by taking multiple smaller patches, generating an embedding for each, and then averaging those embeddings together. This process converts raw visual data into a format the language model can understand and process.
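The averaging step can be sketched in a few lines of plain Python. This is an illustration of average pooling over a grid of patch embeddings, assuming each visual token covers a `k` by `k` block of patches; the block size and the function name `average_pool` are assumptions, not details from the article.

```python
def average_pool(patches, k):
    """Average each k x k block of patch embeddings into one visual token.

    patches[r][c] is the embedding (a list of floats) of the patch at row r, column c.
    Returns a smaller grid where each entry is the mean of one block.
    """
    rows, cols = len(patches), len(patches[0])
    if rows % k or cols % k:
        raise ValueError("grid dimensions must be divisible by k")
    dim = len(patches[0][0])
    tokens = []
    for r0 in range(0, rows, k):
        token_row = []
        for c0 in range(0, cols, k):
            acc = [0.0] * dim
            # Sum the embeddings of every patch in this block.
            for r in range(r0, r0 + k):
                for c in range(c0, c0 + k):
                    for d in range(dim):
                        acc[d] += patches[r][c][d]
            token_row.append([v / (k * k) for v in acc])
        tokens.append(token_row)
    return tokens


if __name__ == "__main__":
    # A 2 x 2 grid of 2-dimensional patch embeddings pooled into a single token.
    grid = [[[1.0, 0.0], [3.0, 0.0]],
            [[5.0, 2.0], [7.0, 2.0]]]
    print(average_pool(grid, 2))
```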

To ensure the model understands the layout of images with variable aspect ratios, Gemma 4 uses 2D rotary positional embeddings. These embeddings act as a coordinate system, allowing the model to know exactly where each visual token is positioned in 2D space, even when the input image is unusually wide or tall.
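One common way to build a 2D rotary embedding is to split each vector in half, rotating the first half by angles derived from the token's x position and the second half by angles derived from its y position. The sketch below follows that formulation as an assumption; the article does not specify how Gemma 4 divides dimensions or chooses frequencies, and `rope_2d` is a hypothetical name.

```python
import math


def rope_2d(vec, x, y, base=10000.0):
    """Apply a 2D rotary embedding: first half encodes x, second half encodes y.

    Illustrative formulation only; the frequency schedule is the standard RoPE one.
    """
    dim = len(vec)
    if dim % 4 != 0:
        raise ValueError("dimension must be divisible by 4")
    half = dim // 2
    n_pairs = half // 2
    out = []
    for pos, chunk in ((x, vec[:half]), (y, vec[half:])):
        for i in range(n_pairs):
            # Each pair of dimensions rotates at its own frequency.
            theta = pos * base ** (-i / n_pairs)
            a, b = chunk[2 * i], chunk[2 * i + 1]
            out.append(a * math.cos(theta) - b * math.sin(theta))
            out.append(a * math.sin(theta) + b * math.cos(theta))
    return out


if __name__ == "__main__":
    query = [1.0, 0.0, 0.5, 0.5, 0.0, 1.0, 0.25, 0.75]
    # The same vector placed at two different grid positions.
    print(rope_2d(query, x=0, y=0))
    print(rope_2d(query, x=3, y=5))
```

Because every step is a rotation, the dot product between two rotated vectors depends only on the difference between their positions, which is what lets attention reason about relative layout on grids of any shape.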
