NVIDIA DynoSim Simulates LLM Inference Stacks 1,500x Faster Than Real-Time

NVIDIA

May 30, 2026 · Updated Jun 20, 2026

NVIDIA released DynoSim, a Rust-based simulation framework that creates a digital twin of the Dynamo inference stack to model complex LLM serving workloads. Because it runs roughly 1,500x faster than real time in NVIDIA's testing, teams can screen thousands of deployment configurations and autoscaling policies on a laptop before committing expensive GPU resources.

DynoSim is a workload-driven discrete-event simulation of the NVIDIA Dynamo serving stack. Written in Rust, it models the entire inference journey, from routing to KV cache management, on a single virtual timeline. This "digital twin" approach lets teams test in simulation how different schedulers and hardware configurations interact, without burning GPU-hours.
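To make the "virtual timeline" idea concrete, here is a minimal discrete-event simulation sketch in Python. It is not DynoSim's code (DynoSim is written in Rust and models far more than this); the single-worker queue, the `Event` type, and the fixed `service_time` are assumptions chosen purely for illustration. What it shows is the core mechanism: the loop jumps from one event to the next instead of waiting for wall-clock time to pass.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Event:
    time: float                           # virtual timestamp, in seconds
    seq: int                              # tie-breaker for events at the same time
    kind: str = field(compare=False)      # "arrival" or "finish"
    request_id: int = field(compare=False)


def simulate(arrivals, service_time):
    """Single-worker FIFO queue on a virtual clock (illustrative only).

    arrivals: arrival times in seconds on the virtual timeline.
    service_time: hypothetical fixed cost per request, in seconds.
    Returns {request_id: completion_time}.
    """
    queue, seq = [], 0
    for rid, t in enumerate(arrivals):
        heapq.heappush(queue, Event(t, seq, "arrival", rid))
        seq += 1

    worker_free_at = 0.0
    done = {}
    while queue:
        ev = heapq.heappop(queue)         # jump straight to the next event
        if ev.kind == "arrival":
            start = max(ev.time, worker_free_at)
            worker_free_at = start + service_time
            heapq.heappush(queue, Event(worker_free_at, seq, "finish", ev.request_id))
            seq += 1
        else:
            done[ev.request_id] = ev.time
    return done


completions = simulate(arrivals=[0.0, 1.0, 1.5, 600.0], service_time=2.0)
print(completions)
```

The idle gap of almost ten virtual minutes before the last request costs nothing to simulate, which is the property that lets a simulator of this kind replay long traces quickly.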

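Simulation speed: roughly 1,500x faster than real time (NVIDIA testing)
Modeled components: Router, Planner, KVBM (KV cache management), and Schedulers
Supported engines: vLLM and SGLang
Simulation throughput: 23,608 requests (a 60-minute trace) in 2.41 seconds on a laptop
Metrics tracked: TTFT (time to first token), TPOT (time per output token), TPS (tokens per second), and cache reuse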
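For readers unfamiliar with the latency metrics in the list above, the sketch below computes them for a single request from its timestamps. The function name `latency_metrics` and the per-request definition of TPS are assumptions for illustration; the source does not say how DynoSim defines or aggregates these metrics.

```python
def latency_metrics(arrival, token_times):
    """Per-request serving metrics from timestamps (all in seconds).

    arrival: when the request entered the system.
    token_times: emission time of each output token, in order.
    Returns (ttft, tpot, tps).
    """
    first, last = token_times[0], token_times[-1]
    n = len(token_times)
    ttft = first - arrival                            # time to first token
    tpot = (last - first) / (n - 1) if n > 1 else 0.0  # mean gap between output tokens
    tps = n / (last - arrival)                         # output tokens per second, end to end
    return ttft, tpot, tps


# Hypothetical request: arrives at t=10.0 s and emits five tokens.
ttft, tpot, tps = latency_metrics(10.0, [10.5, 10.75, 11.0, 11.25, 11.5])
print(ttft, tpot, tps)
```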

Tuning modern LLM deployments is a massive search problem in which a local improvement often just shifts the bottleneck elsewhere. DynoSim replaces exhaustive hardware testing with a simulate-then-verify loop that maps the Pareto frontier: the set of configurations where cost cannot be reduced without giving up performance, and vice versa. By modeling engine-specific behavior, it aims to give accurate predictions for metrics such as time to first token (TTFT).
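As a rough illustration of what "mapping the Pareto frontier" means in practice, the sketch below filters a set of simulated configurations down to the non-dominated ones. The configuration names, GPU counts, and latency figures are invented for the example and are not DynoSim results; `pareto_frontier` is a hypothetical helper, not part of any NVIDIA API.

```python
def pareto_frontier(configs):
    """Keep configurations that no other configuration dominates.

    configs: list of (name, cost, latency); lower is better for both.
    A config is dominated if another is no worse on both axes
    and strictly better on at least one.
    """
    frontier = []
    for name, cost, lat in configs:
        dominated = any(
            c <= cost and l <= lat and (c < cost or l < lat)
            for _, c, l in configs
        )
        if not dominated:
            frontier.append((name, cost, lat))
    return sorted(frontier, key=lambda item: item[1])


# Hypothetical simulation results: (name, GPUs used, p50 TTFT in ms).
results = [
    ("cfg-a", 4, 800),
    ("cfg-b", 8, 400),
    ("cfg-c", 8, 500),    # same cost as cfg-b but slower
    ("cfg-d", 16, 390),
    ("cfg-e", 12, 450),   # costlier and slower than cfg-b
]
frontier = pareto_frontier(results)
print([name for name, _, _ in frontier])
```

In a simulate-then-verify workflow, only the surviving frontier candidates would then be validated on real hardware.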

You can use the simulator to optimize autoscaling intervals or quantify how Dynamo Snapshot's cold-start reductions impact traffic bursts. The framework also supports an autoresearch loop where AI agents propose and score algorithmic changes to routers. Technical guides for the mocker replay and planner components are now available.

View the full update on developer.nvidia.com

NVIDIA AI

@NVIDIAAI · May 30

There's a better way to serve your inference stack, you just haven't found it yet. DynoSim is a workload-driven simulation of the Dynamo serving stack that turns exhaustive deployment search into a simulate-then-verify loop. Instead of testing every deployment choice, teams can model the whole stack on one virtual timeline, screen thousands of configurations in high fidelity simulation, then validate only the best candidates on real hardware. And because it's a full Rust implementation, it runs extremely fast. In our testing, 1,500x faster than real time.


Still wondering? A few quick answers below.

What is NVIDIA DynoSim?

DynoSim is a workload-driven simulation tool designed to act as a digital twin for the NVIDIA Dynamo inference stack. It uses discrete-event simulation to model how large language models perform under different serving configurations. By simulating the entire stack on a virtual timeline, it helps developers find the most efficient deployment settings without using physical hardware.

How fast does NVIDIA DynoSim run compared to real-time testing?

DynoSim is implemented in Rust and runs significantly faster than real-world execution. In NVIDIA testing, it simulated a 60-minute workload trace containing over 23,000 requests in just 2.41 seconds on a laptop. This performance represents a speedup of roughly 1,500x over real-time testing, allowing teams to screen thousands of different configurations in minutes.
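The "roughly 1,500x" figure follows directly from the reported numbers, as the short calculation below shows. The trace length, request count, and wall-clock time come from the article; the estimate for 1,000 configurations is an extrapolation that assumes every configuration costs about as much to simulate as this one trace and that runs are sequential.

```python
trace_seconds = 60 * 60      # 60-minute workload trace
wall_seconds = 2.41          # reported simulation time on a laptop
requests = 23_608            # requests in the trace

speedup = trace_seconds / wall_seconds        # virtual time per wall-clock second
sim_rps = requests / wall_seconds             # requests simulated per wall-clock second
minutes_for_1000 = 1000 * wall_seconds / 60   # sequential runs, same cost each (assumption)

print(f"speedup: {speedup:.0f}x")
print(f"simulated requests per second: {sim_rps:.0f}")
print(f"1,000 configurations: about {minutes_for_1000:.0f} minutes")
```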

What specific components of the inference stack does DynoSim model?

The simulator models the interaction between several critical serving components, including the request router, the autoscaling planner, and the KV cache management system. It also features scheduler-aware engine simulations for backends like vLLM and SGLang. This level of detail allows it to accurately predict metrics such as time to first token and total throughput.

How does DynoSim help optimize LLM deployment costs?

DynoSim allows teams to map the Pareto frontier, which is the set of best available trade-offs between serving cost and latency. By identifying the best tensor-parallel shapes and autoscaling thresholds in simulation first, organizations can avoid the high expense of running trial-and-error experiments on real GPU clusters, reducing the GPU-hours spent before a configuration reaches production.

Can DynoSim be used to test new AI serving algorithms?

Yes, DynoSim supports an autoresearch loop where an agentic harness can propose code changes to serving components. Because the simulation is so fast, it can act as a scoring function to verify if a new routing cost function or cache policy actually improves performance before any code is deployed to a live cluster.
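The idea of using a simulator as a scoring function can be sketched in a few lines. Everything below is a toy stand-in: the `score` function, the two routing policies, and the four-request trace are hypothetical and have nothing to do with Dynamo's actual router or DynoSim's interfaces. It only shows the shape of the loop, in which a proposed policy is accepted when it beats the baseline on the same replayed trace.

```python
def score(policy, trace, n_workers=2):
    """Mean queueing delay (seconds) when `policy` routes `trace`. Lower is better.

    trace: list of (arrival_time, service_time) pairs, sorted by arrival.
    policy: callable (index, arrival, free_at) -> worker index.
    """
    free_at = [0.0] * n_workers          # when each worker next becomes idle
    total_wait = 0.0
    for i, (arrival, service) in enumerate(trace):
        w = policy(i, arrival, free_at)
        start = max(arrival, free_at[w])
        total_wait += start - arrival
        free_at[w] = start + service
    return total_wait / len(trace)


def round_robin(i, arrival, free_at):
    """Baseline: cycle through workers regardless of load."""
    return i % len(free_at)


def least_loaded(i, arrival, free_at):
    """Candidate: pick the worker that becomes idle soonest."""
    return min(range(len(free_at)), key=free_at.__getitem__)


trace = [(0.0, 4.0), (0.0, 1.0), (1.0, 1.0), (1.0, 1.0)]
baseline = score(round_robin, trace)
candidate = score(least_loaded, trace)
accepted = candidate < baseline          # keep the change only if it scores better
print(baseline, candidate, accepted)
```

Because each evaluation is cheap, a harness could score many proposed policies against the same trace before any of them is tried on a live cluster.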

Every HeadsUpAI update is written based on its original source and reviewed before it's published.

Keep reading

NVIDIA Dynamo Snapshot Cuts AI Inference Startup Times to Under Five Seconds

NVIDIA announced Dynamo Snapshot, a framework that reduces Kubernetes inference startup times from minutes to under five seconds. By decoupling model weights from process state, the system enables near-instant elastic scaling for large language models without maintaining expensive, idle GPU capacity.

Runway · Mar 20

Runway Unveils Real-Time Video Model Built with NVIDIA Hardware

Runway shared a research preview of a real-time video generation model developed with NVIDIA, running on Vera Rubin hardware. HD video generates instantly — time-to-first-frame under 100ms — opening a fundamentally different design space for video generation and world simulation.

LangChain · Jun 7

LangChain Adds NVIDIA Nemotron 3 Ultra for Faster AI Agents

LangChain announced immediate support for NVIDIA Nemotron 3 Ultra, an open frontier model designed for long-running AI agents. This integration makes the model's 5x faster inference and up to 30% lower cost for complex agentic tasks directly available to developers using the LangChain framework.

