Google Releases DiffusionGemma for 4x Faster Parallel Text Generation

Google Gemma

Jun 13, 2026

Google released DiffusionGemma, an experimental open model that generates text using diffusion instead of sequential token prediction. By generating 256 tokens in parallel, it delivers up to 4x faster inference on dedicated GPUs, exceeding 1,000 tokens per second on an H100. The 26B Mixture-of-Experts model supports real-time self-correction for tasks like code infilling and in-line editing.
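
To see why generating a block of tokens in parallel could cut decoding time, it may help to count model forward passes. The sketch below is a back-of-the-envelope illustration, not DiffusionGemma's actual algorithm. The 256-token block size comes from the announcement; `steps_per_block` (the number of denoising passes spent on each block) is a hypothetical value chosen only for illustration, and the function names are made up for this example.

```python
import math


def sequential_passes(num_tokens: int) -> int:
    """Token-by-token decoding: one forward pass per generated token."""
    return num_tokens


def block_parallel_passes(
    num_tokens: int,
    block_size: int = 256,       # tokens generated in parallel (from the announcement)
    steps_per_block: int = 64,   # ASSUMPTION: illustrative value, not from the source
) -> int:
    """Block-parallel decoding: each block is refined for a fixed number of passes."""
    blocks = math.ceil(num_tokens / block_size)
    return blocks * steps_per_block


def pass_ratio(num_tokens: int, **kwargs) -> float:
    """How many times fewer passes the block-parallel scheme needs."""
    return sequential_passes(num_tokens) / block_parallel_passes(num_tokens, **kwargs)
```

Under these assumed numbers, a 1,024-token response needs 256 passes instead of 1,024. Measured speedups also depend on how expensive each pass is, so this ratio should not be read as a prediction of real throughput.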

View the full update on blog.google

Google Gemma (@googlegemma) · 3d ago

Meet DiffusionGemma! An experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license. Moving beyond sequential, token-by-token processes to generate entire blocks of text simultaneously. Here’s what’s new with DiffusionGemma: 👇 https://t.co/iaVMPr0WKx

Keep reading

NVIDIA Adds Day-One Support for Google DeepMind's DiffusionGemma Model

NVIDIA announced day-one support for Google DeepMind's DiffusionGemma, an experimental model that generates 256 tokens in parallel per step. BF16 and NVFP4 checkpoints are available on Hugging Face, alongside free GPU-accelerated endpoints and vLLM deployment. The model delivers over 150 tokens per second on DGX Spark and up to 1,000 on a single H100 GPU.

Google Gemma · May 5

Google Releases Gemma 4 Drafter Models to Accelerate Local Inference Speed

Google released a series of specialized drafter models that use speculative decoding to significantly increase the inference speed of the Gemma 4 family. By integrating architectural optimizations like shared activations and KV caches, these tiny models allow larger target models to verify multiple tokens in a single parallel pass.

Google · Apr 27

Google Optimizes Gemma 4 for High Concurrency Serving on Single GPUs

Google demonstrated that the Gemma 4 26B A4B model can handle more than 10 concurrent sessions on a single GPU without performance bottlenecks. This optimization allows developers to serve high-quality reasoning models at significantly lower hardware costs for multi-user or agentic workflows.

Ollama · 6d ago

Ollama Adds Google DeepMind's Gemma 4 12B for Local Agentic AI

Ollama has made Google DeepMind's Gemma 4 12B model available for local execution, including support for chat and agentic applications. This expands access to a powerful, open-weight multimodal model optimized for on-device reasoning and coding, enabling private and offline AI workflows on consumer hardware.