NVIDIA Vera Rubin Hits 400 Tokens Per Second for Trillion Parameter Models

NVIDIA

May 6, 2026 · Updated May 14, 2026

NVIDIA's Vera Rubin platform uses a co-designed stack of seven specialized chips to solve the high-latency and cost bottlenecks of autonomous AI agents. By integrating dedicated hardware for token generation and tool execution, the system maintains high interactivity for trillion-parameter models while reducing token costs by 90 percent compared to previous architectures.

NVIDIA detailed the architecture of its Vera Rubin platform, a system designed for the unpredictable workloads of autonomous AI agents. The platform integrates seven specialized chips, including the Rubin GPU, the Vera CPU, and the Groq 3 LPX, to deliver 400+ tokens per second per user on trillion-parameter models with 400K-token context windows.

Throughput: 400+ tokens per second per user
Model capacity: Trillion-parameter MoE models
Context window: 400K tokens
Token cost: One-tenth that of Blackwell
Prompt caching efficiency: Up to 85% cost reduction

Agentic systems increase token consumption by 15x as agents re-read context and spawn sub-agents. Traditional hardware forces a trade-off between throughput and latency, which makes agents economically unviable. With Vera Rubin's extreme co-design, NVIDIA says token costs fall to one-tenth those of the Blackwell architecture.
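As a rough back-of-the-envelope check on these two figures, the sketch below combines the 15x increase in token consumption with the one-tenth per-token cost. It assumes the two multipliers simply multiply together, which is an illustrative simplification and not something NVIDIA states; the function name is hypothetical.

```python
def relative_session_cost(token_multiplier: float, cost_per_token_ratio: float) -> float:
    """Cost of an agentic session relative to a baseline (non-agentic) session.

    token_multiplier: how many times more tokens the agentic session consumes.
    cost_per_token_ratio: new per-token cost divided by the old per-token cost.
    """
    return token_multiplier * cost_per_token_ratio


# Figures quoted in the article: 15x token consumption, one-tenth the token cost.
on_blackwell = relative_session_cost(15.0, 1.0)
on_vera_rubin = relative_session_cost(15.0, 0.1)

print(f"Agentic session at Blackwell token cost:  {on_blackwell:.1f}x baseline")
print(f"Agentic session at Vera Rubin token cost: {on_vera_rubin:.1f}x baseline")
```

Under this simplified model, an agentic session would cost about 1.5x a baseline session instead of 15x.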

You can now architect agentic workflows that utilize massive context windows without the performance degradation of context rot. The stack leverages Dynamo's inference optimization to manage KV cache offloading and context compaction. These capabilities are currently being integrated into frontier systems like Claude Code to support multi-step autonomous sessions.
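To make the idea of context compaction concrete, here is a minimal sketch: when a conversation exceeds a token budget, older messages are folded into a single summary and the most recent ones are kept verbatim. This is not Dynamo's API or NVIDIA's method. The names `compact_context`, `count_tokens`, and `naive_summary` are hypothetical, and the word-count tokenizer is a crude stand-in.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def compact_context(
    messages: List[str],
    budget: int,
    summarize: Callable[[List[str]], str],
    keep_recent: int = 2,
) -> List[str]:
    """Fold older messages into one summary when the context exceeds the budget."""
    total = sum(count_tokens(m) for m in messages)
    if total <= budget or len(messages) <= keep_recent:
        return list(messages)  # Nothing to compact.
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    return [summarize(older)] + recent


def naive_summary(older: List[str]) -> str:
    # Placeholder summarizer; a real system would call a model here.
    return f"[summary of {len(older)} earlier messages]"


history = [
    "user asks to refactor the billing module",
    "agent reads twelve source files and lists their functions",
    "agent runs the test suite and reports three failures",
    "user asks which failure to fix first",
]
compacted = compact_context(history, budget=20, summarize=naive_summary)
print(compacted)
```

The trade-off in a design like this is that the summary loses detail from earlier turns in exchange for lower memory pressure, so the choice of `keep_recent` and the quality of the summarizer matter.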

View the full update on developer.nvidia.com

NVIDIA AI

@NVIDIAAI · May 5

What does it actually take to run agentic workloads at scale? ⚡Agents push token consumption, context length, and latency into extremely demanding regions. Extreme co-design on the Vera Rubin platform is built for these complex workloads, delivering 400+ tokens/sec/user on trillion-parameter MoE models. Tech blog ➡️ https://t.co/DIxW96omML


Still wondering? A few quick answers below.

What is the NVIDIA Vera Rubin platform?

The NVIDIA Vera Rubin platform is a full-stack AI supercomputer architecture designed specifically for the high-token demands of autonomous agents. It integrates seven specialized chips, including the Rubin GPU and Vera CPU, into a co-designed system that optimizes inference for trillion-parameter models and large context windows of up to 400,000 tokens.

How does the Vera Rubin platform handle agentic AI workloads?

Agentic workloads are structurally unpredictable, requiring frequent tool calls and context re-reading. The platform uses extreme co-design to split these tasks across specialized hardware. The Vera CPU handles tool execution and KV cache management, while the Groq 3 LPX provides low-jitter token generation, ensuring the entire agentic loop remains fast and interactive.

What is the performance of the Vera Rubin platform on trillion-parameter models?

The Vera Rubin platform delivers over 400 tokens per second per user when running trillion-parameter Mixture-of-Experts models. This performance level is achieved even with large 400,000-token context windows. By maintaining high throughput at low latency, the system makes complex multi-agent loops and real-time autonomous reasoning viable for large-scale production environments.
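To put the per-user decode rate in perspective, the short calculation below estimates how long one agent step takes to stream its output. Only the 400 tokens per second figure comes from the article; the other rates and the 2,000-token step size are illustrative assumptions, and the model ignores prefill time and time to first token.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream output_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# 400 tok/s is the article's figure; 50 and 100 tok/s are illustrative comparisons.
for rate in (50.0, 100.0, 400.0):
    seconds = generation_seconds(2_000, rate)
    print(f"{rate:>5.0f} tok/s -> {seconds:.1f} s per 2,000-token step")
```

Because an agentic session chains many such steps, per-step savings of this kind add up across the whole loop.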

How does the Vera Rubin platform reduce the cost of AI inference?

The platform reduces the cost per million tokens to one-tenth that of the previous Blackwell architecture. It achieves this through hardware-software co-design, utilizing techniques like prompt caching to reuse previous context and context compaction to reduce memory pressure. These optimizations allow providers to sustain high interactivity without the prohibitive costs usually associated with long-context agentic sessions.
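The prompt caching idea mentioned here can be illustrated with a toy prefix cache: if a new prompt begins with a prompt that was already processed, only the new suffix needs fresh computation. This is a simplified sketch and not NVIDIA's or Dynamo's implementation. The `PrefixCache` class is hypothetical, it tracks only which prefixes were seen, and production systems typically cache KV state at block granularity.

```python
import hashlib
from typing import List, Set, Tuple


class PrefixCache:
    """Toy prompt cache: remembers which token prefixes have already been processed."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    @staticmethod
    def _key(tokens: List[str]) -> str:
        # Join with a separator that will not appear inside a token.
        return hashlib.sha256("\x1f".join(tokens).encode("utf-8")).hexdigest()

    def process(self, tokens: List[str]) -> Tuple[int, int]:
        """Return (reused_tokens, newly_computed_tokens) for this prompt."""
        reused = 0
        # Look for the longest previously processed prompt that is a prefix of this one.
        for end in range(len(tokens), 0, -1):
            if self._key(tokens[:end]) in self._seen:
                reused = end
                break
        # Record the full prompt so a later turn can reuse it.
        self._seen.add(self._key(tokens))
        return reused, len(tokens) - reused


cache = PrefixCache()
turn_1 = "system: you are a coding agent . user: list the files".split()
turn_2 = turn_1 + "agent: main.py utils.py . user: open main.py".split()

first = cache.process(turn_1)
second = cache.process(turn_2)
print(first)   # (reused, newly computed) for the first turn
print(second)  # the second turn reuses the whole first turn as its prefix
```

In this example the second turn recomputes only its 7 new tokens and reuses the 11 tokens of the first turn, which is the kind of saving that makes re-reading long context cheaper.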

What is the role of the Groq 3 LPX in the Vera Rubin platform?

The Groq 3 LPX is a low-latency inference accelerator integrated into the Vera Rubin platform to break the traditional trade-off between throughput and latency. Its SRAM-first architecture provides tightly bounded token generation speeds. This prevents variance in one agent from slowing down the entire multi-agent pipeline, which is critical for maintaining consistent user interactivity.
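A small simulation can show why variance in one agent affects the whole pipeline: a step that waits on several parallel agents finishes only when the slowest one does. The sketch below assumes Gaussian per-agent latency and eight parallel agents, both illustrative choices that do not come from NVIDIA; the function names are hypothetical.

```python
import random
import statistics


def step_latency(num_agents: int, mean_s: float, jitter_s: float, rng: random.Random) -> float:
    """Latency of one pipeline step that waits for every parallel agent to finish."""
    per_agent = [max(0.0, rng.gauss(mean_s, jitter_s)) for _ in range(num_agents)]
    return max(per_agent)  # The slowest agent sets the pace.


def average_step_latency(
    num_agents: int, mean_s: float, jitter_s: float, steps: int = 2_000, seed: int = 0
) -> float:
    rng = random.Random(seed)  # Fixed seed keeps the comparison repeatable.
    samples = [step_latency(num_agents, mean_s, jitter_s, rng) for _ in range(steps)]
    return statistics.fmean(samples)


# Same average per-agent latency (1.0 s), different amounts of jitter.
low_jitter = average_step_latency(num_agents=8, mean_s=1.0, jitter_s=0.02)
high_jitter = average_step_latency(num_agents=8, mean_s=1.0, jitter_s=0.30)
print(f"low jitter:  {low_jitter:.2f} s per step")
print(f"high jitter: {high_jitter:.2f} s per step")
```

Even though both configurations have the same mean per-agent latency, the high-jitter one yields slower steps on average, which is why tightly bounded generation speed matters for multi-agent loops.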


Keep reading

NVIDIA Launches Vera Rubin Platform With Seven Chips for AI Factories

NVIDIA announced the Vera Rubin platform at GTC, putting seven new chips into full production for large-scale AI infrastructure. The NVL72 rack trains mixture-of-experts models with one-fourth the GPUs compared with Blackwell while delivering 10x inference throughput per watt.

Qwen Sets 580 TPS Record for Agentic Workloads on Blackwell GPUs

Qwen · May 27

Qwen achieved a record 580 tokens per second running its Qwen3.5-397B-A17B model on NVIDIA Blackwell GPUs using the TokenSpeed inference engine. The optimization targets agentic workloads, where multi-turn reasoning and tool-calling typically suffer from high latency. By combining a hybrid attention architecture with deep kernel fusion, the system maintains high throughput even as context scales to one million tokens.

NVIDIA Nemotron 3 Ultra Claims Top US Open Weights Intelligence Spot

Artificial Analysis · Jun 1

NVIDIA released Nemotron 3 Ultra, a 550B-parameter model that leads US open-weights benchmarks with an intelligence score of 48. The model delivers high-throughput performance exceeding 300 tokens per second, significantly outpacing similarly sized frontier models from China.

Ollama Adds NVIDIA Nemotron 3 Ultra for Faster, Cheaper AI Agents

Ollama · Jun 7

Ollama has made NVIDIA's Nemotron 3 Ultra model available on its cloud. This 550 billion parameter Mixture of Experts (MoE) model is designed for long-running AI agents, delivering 5x faster inference and up to 30% lower costs for complex agentic tasks.

