What does it actually take to run agentic workloads at scale? ⚡Agents push token consumption, context length, and latency into extremely demanding regions. Extreme co-design on the Vera Rubin platform is built for these complex workloads, delivering 400+ tokens/sec/user on trillion-parameter MoE models. Tech blog ➡️ https://t.co/DIxW96omML
NVIDIA Vera Rubin Hits 400 Tokens Per Second for Trillion Parameter Models
· Updated
- Throughput
- 400+ tokens per second per user
- Model capacity
- Trillion-parameter MoE models
- Context window
- 400K tokens
- Cost reduction
- 10x lower than Blackwell
- Prompt caching efficiency
- Up to 85% cost reduction
Agentic systems increase token consumption by 15x as agents re-read context and spawn sub-agents. Traditional hardware forces a trade-off between throughput and latency, making agents economically unviable. By using Vera Rubin's extreme co-design, NVIDIA reduces token costs to one-tenth that of Blackwell architecture.
You can now architect agentic workflows that utilize massive context windows without the performance degradation of context rot. The stack leverages Dynamo's inference optimization to manage KV cache offloading and context compaction. These capabilities are currently being integrated into frontier systems like Claude Code to support multi-step autonomous sessions.
Still wondering? A few quick answers below.



