https://t.co/BvHkG5TaBF
Google Releases Gemma 4 Drafter Models to Accelerate Local Inference Speed
Google released specialized "drafter" models for the Gemma 4 family, implementing an optimized version of speculative decoding (small models predicting tokens for larger models to verify). These drafters, like the 76M-parameter version for Gemma 4 E2B, act as Multi-Token Prediction heads that generate sequences for parallel verification.
- Drafter size (E2B): 76M parameters
- Drafter layers (E2B): 4 layers
- Input embedding size: 256
- Vocabulary size: 262,144 tokens
- Optimizations: KV cache sharing, target activation sharing, Efficient Embedder
This release addresses latency bottlenecks in Google's Gemma 4 models by moving beyond generic speculative decoding. These drafters recycle the target model's final activations and cross-attend to its existing KV cache. This architectural coupling maximizes speed gains without the memory overhead of maintaining separate state histories.
You can now deploy these drafter checkpoints for high-speed, low-latency generation in on-device applications. The release includes an "Efficient Embedder" for the E2B and E4B variants, which uses token clustering to reduce prediction compute. These open-weight checkpoints follow the Gemini CLI local integration roadmap for private local execution.
Google Gemma
@googlegemma
Still wondering? A few quick answers below.
Gemma 4 drafter models are small, specialized companion models for the main Gemma 4 lineup, designed to accelerate text generation. They are much smaller than the target models they serve; the drafter for Gemma 4 E2B, for example, has 76M parameters. The drafter quickly predicts several tokens ahead, and the larger model then verifies them.
These models use speculative decoding to predict several tokens in the time it takes a larger model to process one. Instead of the large model generating every token sequentially, it verifies the drafter's suggestions in parallel. This reduces the number of forward passes the target model must perform, significantly increasing overall decoding speed.
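The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration of greedy speculative decoding in general, not Google's implementation: `target_next` and `drafter_next` are hypothetical toy stand-ins for the two models, and the verification loop stands in for what a real system would do in one batched forward pass. Production systems that sample rather than decode greedily use a probabilistic acceptance rule instead of exact matching.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(context: List[Token]) -> Token:
    """Hypothetical 'large' model: a deterministic toy rule, not a real LM."""
    return (sum(context) * 31 + len(context)) % 50


def drafter_next(context: List[Token]) -> Token:
    """Hypothetical 'small' model: agrees with the target except when the
    context length is a multiple of 4 (an arbitrary, assumed error pattern)."""
    guess = target_next(context)
    return (guess + 1) % 50 if len(context) % 4 == 0 else guess


def speculative_step(context: List[Token], draft_len: int,
                     draft: NextToken, target: NextToken) -> List[Token]:
    """Run one draft-then-verify round and return the tokens it yields."""
    # 1. The drafter proposes draft_len tokens autoregressively (cheap).
    proposed: List[Token] = []
    for _ in range(draft_len):
        proposed.append(draft(context + proposed))

    # 2. The target checks each position. A real system does this in a single
    #    batched forward pass; the loop here is only for readability.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # target's token replaces the first miss
            return accepted
        accepted.append(tok)

    # 3. Every draft was accepted, so the same pass also yields a bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(prompt: List[Token], n_new: int, draft_len: int) -> Tuple[List[Token], int]:
    """Generate n_new tokens; also return how many target passes were needed."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(out, draft_len, drafter_next, target_next))
        target_passes += 1
    return out[: len(prompt) + n_new], target_passes


def generate_baseline(prompt: List[Token], n_new: int) -> List[Token]:
    """Plain autoregressive decoding: one target pass per token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


tokens, passes = generate([1, 2, 3], n_new=20, draft_len=4)
baseline = generate_baseline([1, 2, 3], n_new=20)
```

With greedy verification the output matches plain decoding token for token (`tokens == baseline`), while `passes` comes out well below the 20 target passes the baseline needs. The saving depends entirely on how often the drafter agrees with the target.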
Multi-Token Prediction, or MTP, refers to the drafter's ability to generate a sequence of tokens for verification rather than just one. The drafter uses the hidden states produced by the target model's forward pass to run its own fast, autoregressive predictions. This allows a single pass of the target model to result in multiple accepted tokens.
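How many tokens one target pass yields depends on how often the drafter's guesses are accepted. A rough way to reason about it is the standard speculative-decoding estimate below, which assumes each drafted token is accepted independently with the same probability. That assumption, the function name, and the example numbers are illustrative only; none of them are measurements from the Gemma 4 release.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens produced by one target pass.

    Assumes each drafted token is accepted independently with probability
    accept_rate. The result counts the accepted drafts plus the one token the
    target itself contributes (a correction or a bonus token), which is the
    geometric sum 1 + a + a^2 + ... + a^draft_len.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Hypothetical numbers: 80% acceptance with 4 drafted tokens per round.
estimate = expected_tokens_per_pass(0.8, 4)
```

Under these assumed numbers the estimate is about 3.36 tokens per target pass. With an acceptance rate of 0 it falls back to exactly 1 token per pass, which is ordinary autoregressive decoding.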
To save memory and compute, Gemma 4 drafter models do not build their own key-value (KV) cache, the store of attention keys and values computed for previous tokens. Instead, they cross-attend to the target model's existing KV cache. By reusing these precomputed keys and values for both local and global attention layers, the drafter avoids redundant processing and runs with much lower latency.
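The idea of attending to someone else's cache can be shown with a toy single-head attention in plain Python. This is a sketch under stated assumptions, not the Gemma 4 architecture: `target_kv_cache`, `drafter_query`, and the two-dimensional vectors are made up for illustration, and the local versus global layer distinction is ignored. The point is only that the drafter supplies a query and reads keys and values the target has already computed.

```python
import math
from typing import Dict, List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query: Vector, cached_keys: List[Vector],
                 cached_values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over an existing cache.

    The caller passes in keys and values that were computed elsewhere, so
    nothing is appended to or stored in a cache here.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in cached_keys]
    weights = softmax(scores)
    dim = len(cached_values[0])
    return [sum(w * value[i] for w, value in zip(weights, cached_values))
            for i in range(dim)]


# Hypothetical cache the target model filled while processing three tokens.
target_kv_cache: Dict[str, List[Vector]] = {
    "keys": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "values": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
}

# The drafter only produces a query; it keeps no keys or values of its own.
drafter_query: Vector = [1.0, 0.0]
context = cross_attend(drafter_query, target_kv_cache["keys"], target_kv_cache["values"])
```

Because the drafter never writes keys or values, its memory cost does not grow with sequence length the way a second, independent cache would.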
The Efficient Embedder is a technique used in the E2B and E4B drafter models to reduce the compute needed for token prediction. It uses clustering to group similar token embeddings together. The model first predicts the most likely clusters for the next token and then only calculates probabilities for tokens within those specific groups, speeding up the process.
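The two-stage lookup can be illustrated with a toy vocabulary. The source does not say how clusters are built or scored, so everything here is an assumption: the function names, mean-embedding centroids, dot-product scoring, and the six-token vocabulary are hypothetical and only show the general shape of cluster-then-token prediction.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors: List[Vector]) -> Vector:
    """Centroid of a cluster, taken here as the mean embedding (an assumption)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def clustered_top_token(hidden: Vector, centroids: List[Vector],
                        cluster_members: List[List[int]],
                        token_embeddings: List[Vector],
                        top_clusters: int = 1) -> int:
    """Pick the best token while scoring only tokens in the best clusters."""
    # Stage 1: score one centroid per cluster instead of every token.
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(hidden, centroids[c]), reverse=True)
    # Stage 2: score only the tokens that belong to the selected clusters.
    candidates = [t for c in ranked[:top_clusters] for t in cluster_members[c]]
    return max(candidates, key=lambda t: dot(hidden, token_embeddings[t]))


# Toy vocabulary of six tokens in three clusters of similar embeddings.
token_embeddings: List[Vector] = [
    [1.0, 0.1], [0.9, 0.2],      # cluster 0
    [0.1, 1.0], [0.2, 0.9],      # cluster 1
    [-1.0, 0.0], [-0.9, -0.1],   # cluster 2
]
cluster_members = [[0, 1], [2, 3], [4, 5]]
centroids = [mean_vector([token_embeddings[t] for t in members])
             for members in cluster_members]

hidden_state: Vector = [0.2, 1.0]
fast_choice = clustered_top_token(hidden_state, centroids, cluster_members, token_embeddings)
full_choice = max(range(len(token_embeddings)),
                  key=lambda t: dot(hidden_state, token_embeddings[t]))
```

In this toy case the clustered lookup scores three centroids and two tokens instead of all six tokens, and still lands on the same token as the full search. The shortcut is approximate in general: if the best token sits in a cluster that stage 1 does not select, it is missed, so the number of clusters kept trades accuracy against compute.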




