NVIDIA Adds Day-One Support for Google DeepMind's DiffusionGemma Model

NVIDIA

Jun 13, 2026

NVIDIA announced day-one support for Google DeepMind's DiffusionGemma, an experimental model that generates 256 tokens in parallel per step. BF16 and NVFP4 checkpoints are available on Hugging Face, alongside free GPU-accelerated endpoints and vLLM support with FP8 precision. The model delivers over 150 tokens per second on DGX Spark and over 1,000 on a single H100 GPU.

View the full update on developer.nvidia.com

NVIDIA AI

@NVIDIAAI · 3d ago

Congrats to @GoogleDeepMind on the launch of DiffusionGemma. The model generates 256 tokens in parallel per step, delivering 150+ TPS on DGX Spark, and 1,000+ TPS on a single H100.

We're supporting it from day one with:
• BF16 and NVFP4 checkpoints on @huggingface 🤗
• Free GPU-accelerated endpoints on https://t.co/6T0R9P7EXS
• @vllm_project support with FP8 precision

Get started with DiffusionGemma on NVIDIA: https://t.co/vurk7GCQUs

Keep reading

Google Releases DiffusionGemma for 4x Faster Parallel Text Generation

Google released DiffusionGemma, an experimental open model that generates text using diffusion instead of sequential token prediction. By generating 256 tokens in parallel, it delivers up to 4x faster inference on dedicated GPUs, exceeding 1,000 tokens per second on an H100. This 26B Mixture-of-Experts model supports real-time self-correction for tasks like code infilling and in-line editing.

NVIDIA · May 20

NVIDIA Releases Nemotron-Labs-Diffusion for 6x Faster Parallel Token Generation

NVIDIA released Nemotron-Labs-Diffusion, a family of open-weight models that unify standard autoregressive decoding with parallel diffusion-based generation. By switching attention patterns within a single model, these 3B to 14B parameter models achieve up to 4x higher throughput on modern hardware compared to traditional sequential generation.

Google Releases Gemma 4 QAT Checkpoints for Efficient On-Device AI

Google · Jun 5

Google released new Gemma 4 Quantization-Aware Training (QAT) checkpoints, including GGUF (Q4_0) and a custom mobile schema under 1GB. These enable running Gemma 4 models locally on consumer GPUs and mobile devices with reduced memory footprint and accelerated decode speeds, while preserving reasoning quality.

Ollama Adds Google DeepMind's Gemma 4 12B for Local Agentic AI

Ollama · 6d ago

Ollama has made Google DeepMind's Gemma 4 12B model available for local execution, including support for chat and agentic applications. This expands access to a powerful, open-weight multimodal model optimized for on-device reasoning and coding, enabling private and offline AI workflows on consumer hardware.