HeadsUpAI

NVIDIA Vera Rubin Hits 400 Tokens Per Second for Trillion-Parameter Models

NVIDIA has detailed the architecture of its Vera Rubin platform, a system designed for the unpredictable workloads of autonomous AI agents. The platform integrates seven specialized chips, including the Rubin GPU, the Vera CPU, and the Groq 3 LPX, to deliver 400+ tokens per second per user on trillion-parameter models with 400K-token context windows.
Throughput: 400+ tokens per second per user
Model capacity: Trillion-parameter MoE models
Context window: 400K tokens
Cost reduction: 10x lower than Blackwell
Prompt caching efficiency: Up to 85% cost reduction

Agentic systems increase token consumption by 15x as agents re-read context and spawn sub-agents. Traditional hardware forces a trade-off between throughput and latency, which makes agents economically unviable at scale. Through extreme co-design across the Vera Rubin platform, NVIDIA says it cuts the cost per token to one-tenth of Blackwell's.
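
As a back-of-the-envelope check on how those two ratios interact, the sketch below multiplies the 15x token increase by the one-tenth token cost. Only the 15x and one-tenth figures come from the article; the baseline token count and the price unit are hypothetical placeholders, not NVIDIA numbers.

```python
# Back-of-the-envelope sketch. Baseline values are hypothetical placeholders.
BASELINE_TOKENS = 1_000_000      # assumed tokens for a comparable non-agentic task
BASELINE_COST_PER_MTOK = 1.00    # assumed cost units per million tokens on Blackwell

AGENT_TOKEN_MULTIPLIER = 15      # from the article: agents consume 15x the tokens
RUBIN_COST_FRACTION = 1 / 10     # from the article: one-tenth of Blackwell's token cost


def workload_cost(tokens: int, cost_per_mtok: float) -> float:
    """Cost of a workload given its token count and a per-million-token price."""
    return tokens / 1_000_000 * cost_per_mtok


agentic_tokens = BASELINE_TOKENS * AGENT_TOKEN_MULTIPLIER

baseline = workload_cost(BASELINE_TOKENS, BASELINE_COST_PER_MTOK)
agentic_blackwell = workload_cost(agentic_tokens, BASELINE_COST_PER_MTOK)
agentic_rubin = workload_cost(agentic_tokens, BASELINE_COST_PER_MTOK * RUBIN_COST_FRACTION)

print(f"non-agentic on Blackwell: {baseline:.2f}")
print(f"agentic on Blackwell:     {agentic_blackwell:.2f}")
print(f"agentic on Vera Rubin:    {agentic_rubin:.2f}")
```

Under these assumptions the agentic workload lands at about 1.5 times the cost of the non-agentic baseline instead of 15 times, which is the economic shift the article describes.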

You can now architect agentic workflows that use massive context windows without context rot, the decline in output quality as a session's context grows. The stack uses NVIDIA Dynamo's inference optimizations to manage KV cache offloading and context compaction. These capabilities are being integrated into frontier systems such as Claude Code to support multi-step autonomous sessions.

NVIDIA AI (@NVIDIAAI) on X:

What does it actually take to run agentic workloads at scale? ⚡Agents push token consumption, context length, and latency into extremely demanding regions. Extreme co-design on the Vera Rubin platform is built for these complex workloads, delivering 400+ tokens/sec/user on trillion-parameter MoE models. Tech blog ➡️ https://t.co/DIxW96omML

Still wondering? A few quick answers below.

The NVIDIA Vera Rubin platform is a full-stack AI supercomputer architecture designed specifically for the high-token demands of autonomous agents. It integrates seven specialized chips, including the Rubin GPU and Vera CPU, into a co-designed system that optimizes inference for trillion-parameter models and large context windows of up to 400,000 tokens.

Agentic workloads are structurally unpredictable, requiring frequent tool calls and context re-reading. The platform uses extreme co-design to split these tasks across specialized hardware. The Vera CPU handles tool execution and KV cache management, while the Groq 3 LPX provides low-jitter token generation, ensuring the entire agentic loop remains fast and interactive.

The Vera Rubin platform delivers over 400 tokens per second per user when running trillion-parameter Mixture-of-Experts models. This performance level is achieved even with large 400,000-token context windows. By maintaining high throughput at low latency, the system makes complex multi-agent loops and real-time autonomous reasoning viable for large-scale production environments.
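
To put the per-user rate in wall-clock terms, here is a minimal sketch that converts an output token count into streaming time at 400 tokens per second. The 20-step, 500-token agent loop is a hypothetical example, and the estimate deliberately ignores prefill, tool execution, and network time.

```python
TOKENS_PER_SECOND = 400  # per-user decode rate quoted in the article


def generation_seconds(output_tokens: int, tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Time to stream output_tokens at a steady per-user decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Hypothetical agent loop: 20 sequential steps, each emitting 500 tokens.
steps, tokens_per_step = 20, 500
total = generation_seconds(steps * tokens_per_step)
print(f"{steps} steps x {tokens_per_step} tokens: {total:.1f} s of generation")
```

At the quoted rate, this hypothetical loop spends 25 seconds generating tokens. Because each step waits on the previous one, any drop in per-user speed stretches the whole session proportionally.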

The platform reduces the cost per million tokens to one-tenth that of the previous Blackwell architecture. It achieves this through hardware-software co-design, utilizing techniques like prompt caching to reuse previous context and context compaction to reduce memory pressure. These optimizations allow providers to sustain high interactivity without the prohibitive costs usually associated with long-context agentic sessions.
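
The effect of prompt caching on a long-context request can be sketched with simple arithmetic. The example below assumes the article's "up to 85%" figure applies as a per-token discount on cached input tokens. That interpretation, the price unit, and the 90% cache-hit fraction are all assumptions for illustration, not published pricing.

```python
# Assumption: the article's "up to 85%" reduction is modeled as a discount
# on each input token that is served from cache.
CACHED_DISCOUNT = 0.85


def input_cost(context_tokens: int, cached_fraction: float, price_per_mtok: float) -> float:
    """Input cost for one request when part of the context is served from cache."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = context_tokens * cached_fraction
    fresh = context_tokens - cached
    billable = fresh + cached * (1 - CACHED_DISCOUNT)
    return billable / 1_000_000 * price_per_mtok


# Hypothetical request: 400K-token context, 90% unchanged since the previous step.
full = input_cost(400_000, 0.0, price_per_mtok=1.0)
with_cache = input_cost(400_000, 0.9, price_per_mtok=1.0)
print(f"no cache: {full:.3f}, 90% cached: {with_cache:.3f}")
```

In this hypothetical case the request's input cost falls by roughly three-quarters. The saving approaches the full 85% only as the cached fraction approaches 100%, which is why agents that re-read a mostly stable context benefit most.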

The Groq 3 LPX is a low-latency inference accelerator integrated into the Vera Rubin platform to break the traditional trade-off between throughput and latency. Its SRAM-first architecture provides tightly bounded token generation speeds. This prevents variance in one agent from slowing down the entire multi-agent pipeline, which is critical for maintaining consistent user interactivity.
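
Why bounded variance matters for a multi-agent pipeline can be illustrated with a small simulation: when a step waits on several agents, it finishes only when the slowest one does, so wider jitter raises the typical step time even if the mean per-agent latency is unchanged. The fan-out of 8 agents, the 100 ms mean, and the uniform jitter model are illustrative assumptions, not measurements of the Groq 3 LPX.

```python
import random


def pipeline_step_latency(n_agents: int, mean_ms: float, jitter_ms: float,
                          rng: random.Random) -> float:
    """A step that waits on all agents finishes when the slowest agent does."""
    return max(mean_ms + rng.uniform(-jitter_ms, jitter_ms) for _ in range(n_agents))


def average_step_latency(n_agents: int, mean_ms: float, jitter_ms: float,
                         trials: int = 2_000, seed: int = 0) -> float:
    """Average step latency over many simulated steps, with a fixed seed."""
    rng = random.Random(seed)
    total = sum(pipeline_step_latency(n_agents, mean_ms, jitter_ms, rng)
                for _ in range(trials))
    return total / trials


# Same mean per-agent latency, different jitter bounds (hypothetical values).
low = average_step_latency(n_agents=8, mean_ms=100.0, jitter_ms=5.0)
high = average_step_latency(n_agents=8, mean_ms=100.0, jitter_ms=50.0)
print(f"low jitter: {low:.1f} ms, high jitter: {high:.1f} ms")
```

Both configurations have the same average per-agent latency, yet the high-jitter pipeline is noticeably slower per step because the maximum over eight agents tracks the upper end of the jitter range.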
