HeadsUpAI

NVIDIA DynoSim Simulates LLM Inference Stacks 1,500x Faster Than Real-Time

DynoSim is a workload-driven discrete-event simulation of the NVIDIA Dynamo serving stack. Written in Rust, it models the entire inference journey, from routing to KV cache management, on a single virtual timeline. This "digital twin" approach lets teams test in high fidelity how different schedulers and hardware configurations interact, without burning GPU-hours.

Simulation speed: 1,500x faster than real time
Modeled components: Router, Planner, KVBM, and schedulers
Supported engines: vLLM and SGLang
Simulation throughput: 23,608 requests in 2.41 seconds
Metrics tracked: TTFT, TPOT, TPS, and cache reuse
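
To make "discrete-event simulation on a virtual timeline" concrete, here is a minimal Python sketch of the general technique. It is not DynoSim's code (which is written in Rust), and the function name `simulate`, the constant service time, and the FIFO queue are all assumptions chosen for illustration. The point it shows is that the clock jumps from one event to the next instead of waiting, which is why a simulated hour can finish in seconds.

```python
import heapq
from collections import deque


def simulate(arrivals, service_time, num_workers):
    """Toy discrete-event simulation of a FIFO queue in front of identical workers.

    arrivals:     request arrival times in virtual seconds
    service_time: virtual seconds each request occupies a worker (assumed constant)
    num_workers:  number of workers serving requests in parallel
    Returns {request_id: completion_time}.
    """
    # Each event is (time, sequence, kind, request_id). The unique sequence
    # number breaks ties, so events at the same instant pop in insertion order.
    events = [(t, seq, "arrive", seq) for seq, t in enumerate(arrivals)]
    heapq.heapify(events)
    next_seq = len(events)

    free_workers = num_workers
    waiting = deque()
    completed = {}

    while events:
        # Advance the virtual clock directly to the next event; nothing sleeps.
        now, _, kind, request_id = heapq.heappop(events)
        if kind == "arrive":
            waiting.append(request_id)
        else:  # "finish"
            completed[request_id] = now
            free_workers += 1

        # Start as many queued requests as there are idle workers.
        while free_workers > 0 and waiting:
            started = waiting.popleft()
            free_workers -= 1
            heapq.heappush(events, (now + service_time, next_seq, "finish", started))
            next_seq += 1

    return completed
```

Calling `simulate([0.0, 0.0, 1.0], service_time=2.0, num_workers=1)` returns completion times of 2.0, 4.0, and 6.0 virtual seconds for the three requests. A real serving simulator would replace the constant service time with models of batching, prefill, decode, and cache behavior.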

Tuning modern LLM deployments is a massive search problem in which a local improvement often just moves the bottleneck somewhere else. DynoSim replaces exhaustive hardware testing with a simulate-then-verify loop that maps the Pareto frontier: the set of configurations for which cost cannot be lowered without giving up performance, and vice versa. It predicts metrics such as time to first token (TTFT) by modeling the behavior of specific engines.
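
As a rough illustration of what mapping a Pareto frontier involves, the sketch below filters a list of simulated configurations down to the ones that no other configuration beats on both cost and latency. The `Candidate` type, its fields, and the idea of scoring on exactly two axes are assumptions made for this example, not DynoSim's interface.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str        # hypothetical configuration label
    cost: float      # lower is better, e.g. GPU-hours per million tokens
    latency: float   # lower is better, e.g. p95 TTFT in milliseconds


def dominates(a, b):
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.cost <= b.cost and a.latency <= b.latency
    strictly_better = a.cost < b.cost or a.latency < b.latency
    return no_worse and strictly_better


def pareto_frontier(candidates):
    """Keep every candidate that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]
```

In a simulate-then-verify workflow, only the configurations that survive this filter would be worth validating on real hardware. The pairwise check is quadratic in the number of candidates, which is fine for a few thousand configurations.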

You can use the simulator to optimize autoscaling intervals or quantify how Dynamo Snapshot's cold-start reductions impact traffic bursts. The framework also supports an autoresearch loop where AI agents propose and score algorithmic changes to routers. Technical guides for the mocker replay and planner components are now available.

NVIDIA AI (@NVIDIAAI) on X:

There's a better way to serve your inference stack, you just haven't found it yet. DynoSim is a workload-driven simulation of the Dynamo serving stack that turns exhaustive deployment search into a simulate-then-verify loop. Instead of testing every deployment choice, teams can model the whole stack on one virtual timeline, screen thousands of configurations in high fidelity simulation, then validate only the best candidates on real hardware. And because it's a full Rust implementation, it runs extremely fast. In our testing, 1,500x faster than real time.

Still wondering? A few quick answers below.

DynoSim is a workload-driven simulation tool designed to act as a digital twin for the NVIDIA Dynamo inference stack. It uses discrete-event simulation to model how large language models perform under different serving configurations. By simulating the entire stack on a virtual timeline, it helps developers find the most efficient deployment settings without using physical hardware.

DynoSim is implemented in Rust and runs significantly faster than real-world execution. In NVIDIA testing, it simulated a 60-minute workload trace containing over 23,000 requests in just 2.41 seconds on a laptop. This performance represents a speedup of roughly 1,500x over real-time testing, allowing teams to screen thousands of different configurations in minutes.
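
The quoted speedup can be checked directly from the figures above. The short calculation below uses only the reported numbers; the final line is an extrapolation that assumes every configuration takes as long as the reported run and that runs execute one after another.

```python
trace_seconds = 60 * 60        # 60-minute workload trace
wall_seconds = 2.41            # reported simulation wall-clock time
requests = 23_608              # reported requests in the trace

# Ratio of simulated time to wall-clock time: about 1,494, i.e. roughly 1,500x.
speedup = trace_seconds / wall_seconds

# Simulated requests processed per wall-clock second.
requests_per_wall_second = requests / wall_seconds

# Assumption: 1,000 configurations, each as costly as the reported run, run serially.
minutes_for_1000_configs = 1000 * wall_seconds / 60
```

Under those assumptions, screening 1,000 configurations would take about 40 minutes on a single core, and less if runs are parallelized.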

The simulator models the interaction between several critical serving components, including the request router, the autoscaling planner, and the KV cache management system. It also features scheduler-aware engine simulations for backends like vLLM and SGLang. This level of detail allows it to accurately predict metrics such as time to first token and total throughput.
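
For readers unfamiliar with these metrics, the sketch below computes them from per-request token timestamps using their commonly used definitions. The source does not say exactly how DynoSim defines or aggregates them, so the function names and the throughput window (first arrival to last token) are assumptions.

```python
def time_to_first_token(arrival, token_times):
    """TTFT: delay between a request's arrival and its first output token."""
    return token_times[0] - arrival


def time_per_output_token(token_times):
    """TPOT: mean gap between consecutive output tokens after the first.

    Returns None when fewer than two tokens were produced.
    """
    if len(token_times) < 2:
        return None
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def tokens_per_second(requests):
    """TPS: total output tokens divided by the span from first arrival to last token.

    requests: list of (arrival, token_times) pairs, all times in seconds.
    """
    total_tokens = sum(len(times) for _, times in requests)
    start = min(arrival for arrival, _ in requests)
    end = max(times[-1] for _, times in requests)
    return total_tokens / (end - start)
```

For a request that arrives at 0.0 s and emits tokens at 0.5, 0.75, 1.0, and 1.25 s, TTFT is 0.5 s and TPOT is 0.25 s.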

DynoSim lets teams map the Pareto frontier, the set of configurations that offer the best available trade-offs between serving cost and latency. By identifying promising tensor-parallel shapes and autoscaling thresholds in simulation first, organizations can avoid expensive trial-and-error experiments on real GPU clusters and cut the GPU-hours spent on tuning a deployment.

DynoSim also supports an autoresearch loop in which an agentic harness proposes code changes to serving components. Because the simulation is so fast, it can act as a scoring function that checks whether a new routing cost function or cache policy actually improves performance before any code is deployed to a live cluster.
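
The idea of using a fast simulator as a scoring function can be sketched in a few lines. Everything here is hypothetical: `makespan` is a toy stand-in for a simulator run, the two routing cost functions are invented for illustration, and the assumption that a cached prefix leaves 25% of the work is arbitrary. None of it reflects DynoSim's or Dynamo's actual APIs.

```python
def makespan(cost_fn, requests, num_workers=2, hit_factor=0.25):
    """Toy stand-in for a simulator run: returns the busiest worker's total work.

    requests:   list of (prefix, tokens) pairs
    cost_fn:    cost_fn(load, cache_hit, tokens) -> routing cost; lowest cost wins
    hit_factor: assumed fraction of work remaining when the prefix is cached
    """
    load = [0.0] * num_workers
    cached = [set() for _ in range(num_workers)]
    for prefix, tokens in requests:
        # Route to the worker with the lowest cost; ties go to the lowest index.
        worker = min(
            range(num_workers),
            key=lambda i: cost_fn(load[i], prefix in cached[i], tokens),
        )
        hit = prefix in cached[worker]
        load[worker] += tokens * (hit_factor if hit else 1.0)
        cached[worker].add(prefix)
    return max(load)


def least_loaded(load, cache_hit, tokens):
    """Baseline cost: ignore the cache and prefer the idlest worker."""
    return load


def cache_aware(load, cache_hit, tokens):
    """Proposed cost: load after serving the request, discounted on a cache hit."""
    return load + tokens * (0.25 if cache_hit else 1.0)


def accept(baseline_fn, proposal_fn, requests):
    """Keep a proposal only if it scores strictly better than the baseline."""
    return makespan(proposal_fn, requests) < makespan(baseline_fn, requests)
```

On the toy trace `[("a", 8), ("b", 4), ("a", 8)]`, the baseline scores 12.0 and the cache-aware proposal scores 10.0, so `accept` keeps the proposal. An agentic harness would generate many such proposals and rely on the simulator, rather than a live cluster, to rank them.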
