Google Releases Gemma 4 Drafter Models to Accelerate Local Inference Speed

Google Gemma

May 5, 2026

Google released a series of specialized drafter models that use speculative decoding to significantly increase the inference speed of the Gemma 4 family. By integrating architectural optimizations like shared activations and KV caches, these tiny models allow larger target models to verify multiple tokens in a single parallel pass.

The drafters implement an optimized form of speculative decoding, in which a small model proposes tokens and a larger model verifies them. Each drafter, such as the 76M-parameter version for Gemma 4 E2B, acts as a Multi-Token Prediction (MTP) head: it generates a short sequence of candidate tokens that the target model then checks in parallel.

Drafter size (E2B): 76M parameters
Drafter layers (E2B): 4 layers
Input embedding size: 256
Vocabulary size: 262,144 tokens
Optimizations: KV cache sharing, target activation sharing, Efficient Embedder
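As a rough sanity check on the figures listed above, the Python sketch below multiplies the vocabulary size by the input embedding size. It assumes the input embedding is a plain dense table of shape (vocabulary, embedding size) and that this table counts toward the 76M total. Neither assumption is confirmed by the source, so treat the result as an estimate only.

```python
# Back-of-the-envelope parameter count for the E2B drafter's input embedding.
# ASSUMPTION (not confirmed by the source): the embedding is a dense
# vocab x dim table and is included in the quoted 76M total.
VOCAB_SIZE = 262_144          # vocabulary size from the spec list
EMBED_DIM = 256               # input embedding size from the spec list
DRAFTER_PARAMS = 76_000_000   # "76M", a rounded figure

embedding_params = VOCAB_SIZE * EMBED_DIM
remaining_params = DRAFTER_PARAMS - embedding_params

print(f"embedding table: {embedding_params:,} parameters")
print(f"remainder under these assumptions: ~{remaining_params:,}")
```

Under these assumptions, the embedding table alone would come to roughly 67M parameters, which would leave only a few million for the 4 layers.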

This release addresses latency bottlenecks in Google's Gemma 4 models. Generic speculative decoding pairs the target with an independent draft model that keeps its own state. These drafters instead reuse the target model's final activations and cross-attend to its existing KV cache, a coupling intended to maximize speed gains without the memory overhead of maintaining a separate state history.

The drafter checkpoints are available now for high-speed, low-latency generation in on-device applications. The E2B and E4B variants also include an "Efficient Embedder," which uses token clustering to reduce the compute spent on prediction. The open-weight checkpoints align with the Gemini CLI local integration roadmap for private local execution.

View the full update on blog.google

Google Gemma (@googlegemma), May 5: https://t.co/BvHkG5TaBF

Still wondering? A few quick answers below.

What are Gemma 4 drafter models?

Gemma 4 drafter models are small, specialized companion models for the main Gemma 4 lineup, designed to accelerate text generation. These models, such as the 76M-parameter version for Gemma 4 E2B, are much smaller than the primary target models. They work alongside the larger models to predict multiple tokens quickly, which the larger model then verifies.

How do Gemma 4 drafter models speed up inference?

These models use speculative decoding to predict several tokens in the time it takes a larger model to process one. Instead of the large model generating every token sequentially, it verifies the drafter's suggestions in parallel. This reduces the number of forward passes the target model must perform, significantly increasing overall decoding speed.
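The draft-then-verify loop can be illustrated with a minimal Python sketch of greedy speculative decoding. This is a generic textbook version, not Gemma 4's implementation: the toy `toy_target` and `toy_drafter` functions are hypothetical stand-ins for real models, and the verification loop only simulates what a real target would compute in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Return the tokens accepted in one draft-then-verify round."""
    # 1. The drafter proposes k tokens autoregressively (the cheap part).
    draft: List[Token] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2. The target checks every drafted position. A real system does this
    #    in ONE batched forward pass; this loop only simulates that pass.
    accepted: List[Token] = []
    for i in range(k):
        expected = target(context + draft[:i])
        if draft[i] != expected:
            accepted.append(expected)  # fix the first mismatch, then stop
            return accepted
        accepted.append(draft[i])

    # 3. Every draft matched, so the same pass also yields one bonus token.
    accepted.append(target(context + draft))
    return accepted


def generate(target: NextToken, drafter: NextToken, prompt: List[Token],
             n_tokens: int, k: int = 4) -> Tuple[List[Token], int]:
    """Generate n_tokens and count the simulated target passes (rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, drafter, out, k))
        rounds += 1
    return out[:len(prompt) + n_tokens], rounds


# Hypothetical toy models: the target counts upward modulo 10; the drafter
# agrees with it except right after a 6, where it guesses wrong.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 6 else (ctx[-1] + 1) % 10


tokens, rounds = generate(toy_target, toy_drafter, [0], n_tokens=12, k=4)
print(tokens)   # same sequence the target would produce on its own
print(rounds)   # 3 verification rounds instead of 12 sequential passes
```

In this toy run the output is identical to what the target would generate alone, because every drafted token is checked against the target's own greedy choice. Only the number of target passes changes.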

What is Multi-Token Prediction in Gemma 4?

Multi-Token Prediction, or MTP, refers to the drafter's ability to generate a sequence of tokens for verification rather than just one. The drafter uses the hidden states produced by the target model's forward pass to run its own fast, autoregressive predictions. This allows a single pass of the target model to result in multiple accepted tokens.
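To put a rough number on "multiple accepted tokens", the sketch below uses the standard estimate from the speculative decoding literature: if each drafted token is accepted independently with probability a and the drafter proposes k tokens, the expected tokens per target pass is (1 - a^(k+1)) / (1 - a). The independence assumption is a simplification, and the acceptance rates used here are hypothetical, not measurements for Gemma 4.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens produced per target pass.

    Assumes each drafted token is accepted independently with probability
    accept_rate (a simplifying assumption), and that the target always
    contributes one token of its own per pass.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)  # limit of the geometric sum
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Hypothetical acceptance rates, for illustration only.
for rate in (0.5, 0.8, 0.9):
    tokens = expected_tokens_per_pass(rate, draft_len=4)
    print(f"accept rate {rate:.1f}, 4 drafted tokens -> {tokens:.2f} per pass")
```

The estimate shows why drafter quality matters: with a draft length of 4, the benefit ranges from 1 token per pass (nothing accepted) to 5 (everything accepted).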

How does KV cache sharing work in Gemma 4 drafters?

To save memory and compute, Gemma 4 drafter models do not build their own Key-Value (KV) cache, the structure that stores the keys and values already computed for previous tokens. Instead, they cross-attend to the target model's existing KV cache. By reusing these pre-computed representations for local and global attention layers, the drafter avoids redundant processing and operates with much lower latency.
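The idea of reading a cache without owning one can be sketched in a few lines of Python. This is a single-head, single-layer toy under stated assumptions: `TargetKVCache` and `Drafter` are hypothetical names, and the real drafters' layer layout, head counts, and local versus global attention handling are not described in enough detail in the source to reproduce here.

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def attend(query: Sequence[float], keys: Sequence[Vector],
           values: Sequence[Vector]) -> List[float]:
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


class TargetKVCache:
    """Hypothetical stand-in for one layer of the target model's KV cache."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def append(self, key: Sequence[float], value: Sequence[float]) -> None:
        # Only the target model writes to the cache.
        self.keys.append(tuple(key))
        self.values.append(tuple(value))


class Drafter:
    """Toy drafter layer: it allocates no cache and only reads the target's."""

    def __init__(self, shared_cache: TargetKVCache) -> None:
        self.cache = shared_cache  # a reference, not a copy

    def step(self, query: Sequence[float]) -> List[float]:
        return attend(query, self.cache.keys, self.cache.values)


cache = TargetKVCache()
cache.append((1.0, 0.0), (1.0, 0.0))
cache.append((0.0, 1.0), (0.0, 1.0))

drafter = Drafter(cache)
first = drafter.step((0.0, 0.0))   # uniform attention over two entries

cache.append((1.0, 1.0), (1.0, 1.0))  # the target processes another token
second = drafter.step((0.0, 0.0))  # the drafter sees the new entry at once
print(first, second)
```

Because the drafter holds a reference rather than a copy, new entries written by the target are visible to it immediately, and the drafter's own memory use does not grow with sequence length.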

What is the Efficient Embedder in Gemma 4?

The Efficient Embedder is a technique used in the E2B and E4B drafter models to reduce the compute needed for token prediction. It uses clustering to group similar token embeddings together. The model first predicts the most likely clusters for the next token and then only calculates probabilities for tokens within those specific groups, speeding up the process.
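The two-stage idea (pick clusters first, then score only their tokens) can be sketched in Python as follows. This is a generic illustration under assumptions: the source does not say how clusters are built, how many are shortlisted, or how scores are computed, so the mean-of-members centroids, the dot-product scoring, and the `clustered_argmax` function are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def centroid(vectors: Sequence[Vector]) -> Vector:
    """Mean of the member embeddings (an assumed way to summarize a cluster)."""
    n = len(vectors)
    return tuple(sum(v[d] for v in vectors) / n for d in range(len(vectors[0])))


def clustered_argmax(hidden: Vector, clusters: List[List[int]],
                     embeddings: Dict[int, Vector],
                     top_clusters: int) -> Tuple[int, int]:
    """Return (best token id, number of token scores computed)."""
    # Stage 1: score one centroid per cluster instead of every token.
    centroids = [centroid([embeddings[t] for t in members])
                 for members in clusters]
    ranked = sorted(range(len(clusters)),
                    key=lambda c: dot(hidden, centroids[c]), reverse=True)
    shortlist = ranked[:top_clusters]

    # Stage 2: score only the tokens inside the shortlisted clusters.
    candidates = [t for c in shortlist for t in clusters[c]]
    best = max(candidates, key=lambda t: dot(hidden, embeddings[t]))
    return best, len(candidates)


# Toy vocabulary of 6 tokens in 3 hand-made clusters.
embeddings: Dict[int, Vector] = {
    0: (1.0, 0.0), 1: (0.9, 0.1),
    2: (0.0, 1.0), 3: (0.1, 0.9),
    4: (-1.0, 0.0), 5: (-0.9, -0.1),
}
clusters = [[0, 1], [2, 3], [4, 5]]
hidden = (0.2, 1.0)

token, scored = clustered_argmax(hidden, clusters, embeddings, top_clusters=1)
full_best = max(embeddings, key=lambda t: dot(hidden, embeddings[t]))
print(token, scored, full_best)
```

Here the shortlist finds the same token as a full scan while scoring 2 tokens instead of 6. In general this kind of shortlist is approximate: if the best token sits in a cluster that was not selected, it is missed, which is a trade-off any such scheme has to manage.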

Every HeadsUpAI update is written based on its original source and reviewed before it's published.


Keep reading

Google Releases Gemma 4 QAT Checkpoints for Efficient On-Device AI

Google released new Gemma 4 Quantization-Aware Training (QAT) checkpoints, including GGUF (Q4_0) and a custom mobile schema under 1GB. These enable running Gemma 4 models locally on consumer GPUs and mobile devices with reduced memory footprint and accelerated decode speeds, while preserving reasoning quality.

Google Gemma, May 1

Google Gemini CLI Integrates Local Gemma Models for Intelligent Task Routing

Gemini CLI v0.40.0 introduces experimental support for running Gemma models locally to handle intelligent routing decisions. By offloading intent analysis to the user's hardware, the agent reduces cloud API dependency and latency for simple tasks. This marks the first step toward a roadmap of full local execution for Google's terminal-based agent.

Ollama Adds Google DeepMind's Gemma 4 12B for Local Agentic AI

Ollama, Jun 7

Ollama has made Google DeepMind's Gemma 4 12B model available for local execution, including support for chat and agentic applications. This expands access to a powerful, open-weight multimodal model optimized for on-device reasoning and coding, enabling private and offline AI workflows on consumer hardware.

Fireworks AI Adds Gemma 4 Training to Build Custom Reasoning Agents

Fireworks AI, Apr 28

Fireworks AI integrated Google's Gemma 4 models into its training platform, enabling full-parameter fine-tuning and DPO with a 256K context window. This allows teams to build specialized reasoning agents on a unified stack that transitions from training to production inference in seconds.

