NVIDIA Dynamo Snapshot Cuts AI Inference Startup Times to Under Five Seconds

NVIDIA

May 28, 2026 · Updated Jun 12, 2026

NVIDIA announced Dynamo Snapshot, a framework that reduces Kubernetes inference startup times from minutes to under five seconds. By decoupling model weights from process state, the system enables near-instant elastic scaling for large language models without maintaining expensive, idle GPU capacity.

Dynamo Snapshot uses checkpointing (saving a running process's state) to eliminate cold-start delays in Kubernetes. It combines CRIU (Checkpoint/Restore In Userspace), which captures host-side state, with the cuda-checkpoint tool, which serializes GPU memory. A fully warmed inference worker can then resume execution nearly instantly on any node in a cluster.
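
To make the CRIU plus cuda-checkpoint pairing concrete, here is a minimal Python sketch that only builds the command lines for the generic, publicly documented manual flow: suspend CUDA state into host memory, dump the process with CRIU, then reverse the two steps on restore. This is an assumption about the general technique, not a description of how the Dynamo Snapshot agent invokes these tools. The function names, the PID, and the image directory are hypothetical, and nothing is executed.

```python
import shlex
from typing import List


def checkpoint_commands(pid: int, images_dir: str) -> List[List[str]]:
    """Build argv lists for a manual checkpoint of one CUDA process (illustrative only)."""
    return [
        # Suspend CUDA in the target process: GPU memory is copied to host
        # memory and the GPU is released, leaving a CPU-only process.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Dump the process tree, now free of GPU state, to image files.
        ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir, "--shell-job"],
    ]


def restore_commands(pid: int, images_dir: str) -> List[List[str]]:
    """Build argv lists for the matching restore (illustrative only)."""
    return [
        # Recreate the process from the image files.
        ["criu", "restore", "--images-dir", images_dir, "--shell-job", "--restore-detached"],
        # Toggle CUDA back to running so GPU memory is repopulated.
        # Assumes the restored process keeps its original PID.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


if __name__ == "__main__":
    # Hypothetical PID and directory, printed rather than run.
    for argv in checkpoint_commands(1234, "/tmp/ckpt") + restore_commands(1234, "/tmp/ckpt"):
        print(shlex.join(argv))
```

The ordering is the point of the sketch: GPU state has to be moved into host memory before CRIU dumps the process, and toggled back only after CRIU has restored it.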

Startup time: Under 5 seconds
Startup speedup (120B model): 21x
Snapshot size (Qwen3-0.6B): 6.2 GiB
Snapshot size (gpt-oss-120b): 129 GiB
Supported engines: vLLM, SGLang

AI demand is elastic, but the minutes required to load weights often cause SLA violations during spikes. This update extends the NVIDIA Dynamo 1.0 inference OS to address that bottleneck, which has forced providers to over-provision hardware. By restoring fully warmed workers in seconds, and by unmapping the KV cache before the snapshot is saved, it makes rapid, serverless-style scaling economically viable.

The experimental release supports single-GPU workloads running vLLM or SGLang. Future updates will add support for TensorRT-LLM and for multi-GPU and multi-node clusters. The snapshot agent is available via Helm to help you implement sub-5-second auto-scaling. Core CRIU optimizations are still pending an upstream merge.

View the full update on developer.nvidia.com

NVIDIA AI

@NVIDIAAI · May 27

Introducing Dynamo Snapshot, our approach for fast startup for inference workloads on Kubernetes, which reduces startup time from minutes to under 5 seconds. In production inference deployments demand fluctuates over time. Cold-starting inference workloads can take minutes, leaving idle GPUs that generate no tokens and serve no requests. Snapshot leverages GMS to enable concurrent weight restoration over a high-speed interconnect, while using Linux native AIO and parallel memfd restoration to accelerate CRIU restore performance.

Still wondering? A few quick answers below.

What is NVIDIA Dynamo Snapshot?

NVIDIA Dynamo Snapshot is a checkpoint and restore framework designed to eliminate cold-start latency for AI inference on Kubernetes. It works by freezing the state of a running inference worker, including both CPU and GPU memory, and saving it as an artifact that can be restored nearly instantly on any node in a cluster.

How does Dynamo Snapshot reduce AI startup times?

The system uses a component called GPU Memory Service (GMS) to decouple heavy model weights from the process state. By restoring weights and process data concurrently over high-speed channels like GPUDirect Storage, it avoids the slow, serial loading process of traditional cold starts. It also shrinks the snapshot by unmapping the KV cache before saving it.
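
The benefit of restoring weights and process state concurrently rather than serially can be illustrated with a toy timing model. The sketch below is not Dynamo Snapshot code: the two functions stand in for the real restore paths and simply sleep, and the durations are made-up values chosen for illustration. It shows that a serial restore costs roughly the sum of the two steps, while a concurrent restore costs roughly the longer of the two.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def restore_weights(seconds: float) -> str:
    """Stand-in for streaming model weights back into GPU memory."""
    time.sleep(seconds)
    return "weights"


def restore_process_state(seconds: float) -> str:
    """Stand-in for restoring host-side process state from a checkpoint."""
    time.sleep(seconds)
    return "process"


def serial_restore(weights_s: float, process_s: float) -> float:
    """Run both steps back to back and return the elapsed wall time."""
    start = time.perf_counter()
    restore_weights(weights_s)
    restore_process_state(process_s)
    return time.perf_counter() - start


def concurrent_restore(weights_s: float, process_s: float) -> float:
    """Run both steps in parallel threads and return the elapsed wall time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(restore_weights, weights_s),
            pool.submit(restore_process_state, process_s),
        ]
        for future in futures:
            future.result()  # wait for completion and surface any exception
    return time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical durations, scaled down so the demo finishes quickly.
    print(f"serial:     {serial_restore(0.3, 0.2):.2f} s")
    print(f"concurrent: {concurrent_restore(0.3, 0.2):.2f} s")
```

In this model the concurrent path is bounded by the slower of the two transfers, which is why separating weights from process state matters: each can then travel over its own channel at the same time.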

What are the performance benefits of using Dynamo Snapshot?

For large-scale models like gpt-oss-120b, Dynamo Snapshot can reduce startup times from several minutes to under five seconds, representing a 21x speedup. This allows production environments to scale elastically to meet traffic spikes without keeping expensive GPUs idle, significantly improving hardware utilization and reducing the risk of service level agreement violations.
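
As a rough sanity check on what a five-second restore implies, the following back-of-the-envelope Python sketch divides the reported snapshot sizes by a five-second budget. It rests on two assumptions the article does not confirm: that the whole snapshot has to be read before the worker serves its first request, and that the five-second figure applies to each listed model. The function and constant names are hypothetical.

```python
def min_read_rate_gib_per_s(snapshot_gib: float, restore_seconds: float) -> float:
    """Aggregate read rate needed to move a whole snapshot within the time budget."""
    if restore_seconds <= 0:
        raise ValueError("restore_seconds must be positive")
    return snapshot_gib / restore_seconds


# Snapshot sizes as reported in the article, in GiB.
SNAPSHOT_SIZES_GIB = {"Qwen3-0.6B": 6.2, "gpt-oss-120b": 129.0}

# Upper bound on startup time quoted in the article, in seconds.
RESTORE_BUDGET_S = 5.0

if __name__ == "__main__":
    for model, size_gib in SNAPSHOT_SIZES_GIB.items():
        rate = min_read_rate_gib_per_s(size_gib, RESTORE_BUDGET_S)
        print(f"{model}: at least {rate:.2f} GiB/s")
```

Under those assumptions the 129 GiB snapshot would need about 25.8 GiB/s of aggregate throughput, which is consistent with the emphasis on concurrent restoration over high-speed channels.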

Which inference engines are compatible with Dynamo Snapshot?

The current experimental release of Dynamo Snapshot supports single-GPU workloads using the vLLM and SGLang inference engines. NVIDIA is working to extend this support to TensorRT-LLM and to add multi-GPU and multi-node capabilities through specialized quiesce and resume hooks for distributed runtimes like NCCL and PyTorch.

Is NVIDIA Dynamo Snapshot available for public use?

Dynamo Snapshot is being rolled out incrementally as an experimental release. Developers can currently access the snapshot agent as a privileged DaemonSet installable via a Helm chart. While some core memory optimizations are still pending an upstream merge into the CRIU project, the agent is fully portable across different Kubernetes environments and cloud providers.

Keep reading

NVIDIA DynoSim Simulates LLM Inference Stacks 1,500x Faster Than Real-Time

NVIDIA released DynoSim, a Rust-based simulation framework that creates a digital twin of the Dynamo inference stack to model complex LLM serving workloads. By running 1,500x faster than real-time, the tool allows teams to screen thousands of deployment configurations and autoscaling policies on a laptop before committing expensive GPU resources.

LangChain · Jun 7

LangChain Adds NVIDIA Nemotron 3 Ultra for Faster AI Agents

LangChain announced immediate support for NVIDIA Nemotron 3 Ultra, an open frontier model designed for long-running AI agents. This integration makes the model's 5x faster inference and up to 30% lower cost for complex agentic tasks directly available to developers using the LangChain framework.

Artificial Analysis · Jun 1

NVIDIA Nemotron 3 Ultra Claims Top US Open Weights Intelligence Spot

NVIDIA released Nemotron 3 Ultra, a 550B-parameter model that leads US open-weights benchmarks with an intelligence score of 48. The model delivers high-throughput performance exceeding 300 tokens per second, significantly outpacing similarly sized frontier models from China.

Runway · Mar 20

Runway Unveils Real-Time Video Model Built with NVIDIA Hardware

Runway shared a research preview of a real-time video generation model developed with NVIDIA, running on Vera Rubin hardware. HD video generates instantly — time-to-first-frame under 100ms — opening a fundamentally different design space for video generation and world simulation.
