HeadsUpAI

NVIDIA Dynamo Snapshot Cuts AI Inference Startup Times to Under Five Seconds

NVIDIA announced Dynamo Snapshot, a framework using checkpointing (saving a running process's state) to eliminate cold-start delays in Kubernetes. It combines CRIU for host-side state with a cuda-checkpoint tool to serialize GPU memory. This allows a fully warmed inference worker to resume execution nearly instantly on any node in a cluster.

Startup time: under 5 seconds
Startup speedup (120B model): 21x
Snapshot size (Qwen3-0.6B): 6.2 GiB
Snapshot size (gpt-oss-120b): 129 GiB
Supported engines: vLLM, SGLang

Inference demand is elastic, but loading model weights takes minutes, which often causes SLA violations during traffic spikes. This update extends the NVIDIA Dynamo 1.0 inference OS to address that bottleneck, which has forced providers to over-provision hardware. By restoring warmed workers in seconds and unmapping the KV cache before the snapshot is saved, it makes rapid, serverless-style scaling economically viable.

The experimental release supports single-GPU workloads running vLLM or SGLang. Future updates will add support for TensorRT-LLM and for multi-GPU and multi-node clusters. The snapshot agent is installable via a Helm chart to help you implement sub-5-second auto-scaling, though some core CRIU optimizations are still pending an upstream merge.

NVIDIA AI (@NVIDIAAI) on X:

Introducing Dynamo Snapshot, our approach for fast startup for inference workloads on Kubernetes, which reduces startup time from minutes to under 5 seconds. In production inference deployments demand fluctuates over time. Cold-starting inference workloads can take minutes, leaving idle GPUs that generate no tokens and serve no requests. Snapshot leverages GMS to enable concurrent weight restoration over a high-speed interconnect, while using Linux native AIO and parallel memfd restoration to accelerate CRIU restore performance.

Still wondering? A few quick answers below.

NVIDIA Dynamo Snapshot is a checkpoint and restore framework designed to eliminate cold-start latency for AI inference on Kubernetes. It works by freezing the state of a running inference worker, including both CPU and GPU memory, and saving it as an artifact that can be restored nearly instantly on any node in a cluster.
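
The freeze-and-restore flow described above can be sketched with the two public tools the article names. The sketch below only builds the command lines and does not run them. It follows the manual sequence from NVIDIA's public cuda-checkpoint examples: toggle the CUDA state into host memory, then `criu dump`; on restore, `criu restore`, then toggle again. How Dynamo Snapshot drives these tools internally is not covered here, so the function names, the PID, and the image directory are illustrative assumptions.

```python
from typing import List


def checkpoint_commands(pid: int, image_dir: str) -> List[List[str]]:
    """Commands to checkpoint a CUDA process (hypothetical helper)."""
    return [
        # Step 1: suspend CUDA and move GPU state into host memory.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Step 2: dump the (now CPU-only) process tree to disk with CRIU.
        ["criu", "dump", "--shell-job", "--images-dir", image_dir,
         "--tree", str(pid)],
    ]


def restore_commands(pid: int, image_dir: str) -> List[List[str]]:
    """Commands to restore the process and re-attach GPU state."""
    return [
        # Step 1: restore the host-side process from the CRIU images.
        ["criu", "restore", "--shell-job", "--restore-detached",
         "--images-dir", image_dir],
        # Step 2: toggle again to move CUDA state back onto the GPU.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


if __name__ == "__main__":
    # PID 4242 and /tmp/snap are placeholders, not values from the article.
    for cmd in checkpoint_commands(4242, "/tmp/snap") + restore_commands(4242, "/tmp/snap"):
        print(" ".join(cmd))
```

The ordering is the point of the sketch: GPU state has to be moved off the device before CRIU dumps the process, and moved back only after CRIU has restored it.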

The system uses a component called GPU Memory Service (GMS) to decouple heavy model weights from the process state. By restoring weights and process data concurrently over high-speed channels like GPUDirect Storage, it avoids the slow, serial loading of a traditional cold start. It also optimizes memory by unmapping the KV cache before saving the snapshot.
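
To see why concurrent restoration helps, here is a toy sketch that compares restoring two pieces of state one after the other with restoring them in parallel. The two `restore_*` functions are hypothetical stand-ins that just sleep; they are not Dynamo or GMS APIs, and the durations are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def restore_weights(seconds: float) -> str:
    """Stand-in for loading model weights (hypothetical)."""
    time.sleep(seconds)
    return "weights"


def restore_process_state(seconds: float) -> str:
    """Stand-in for restoring host-side process state (hypothetical)."""
    time.sleep(seconds)
    return "process"


def serial_restore(weights_s: float, process_s: float) -> float:
    """Restore one piece after the other; returns elapsed seconds."""
    start = time.perf_counter()
    restore_weights(weights_s)
    restore_process_state(process_s)
    return time.perf_counter() - start


def concurrent_restore(weights_s: float, process_s: float) -> float:
    """Restore both pieces at the same time; returns elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        weights = pool.submit(restore_weights, weights_s)
        process = pool.submit(restore_process_state, process_s)
        weights.result()
        process.result()
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"serial:     {serial_restore(0.2, 0.2):.2f} s")
    print(f"concurrent: {concurrent_restore(0.2, 0.2):.2f} s")
```

In the serial case the elapsed time is roughly the sum of the two steps; in the concurrent case it is roughly the longer of the two, which is the effect the paragraph above describes.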

For large-scale models like gpt-oss-120b, Dynamo Snapshot can reduce startup times from several minutes to under five seconds, representing a 21x speedup. This allows production environments to scale elastically to meet traffic spikes without keeping expensive GPUs idle, significantly improving hardware utilization and reducing the risk of service level agreement violations.
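
As a rough back-of-the-envelope check, the snapshot sizes reported for Qwen3-0.6B (6.2 GiB) and gpt-oss-120b (129 GiB) can be turned into a minimum average read rate. This assumes, purely for illustration, that the whole snapshot has to be read within the five-second budget; the article does not say how the restore is actually split between weights and process state.

```python
def required_throughput_gib_per_s(snapshot_gib: float, budget_s: float) -> float:
    """Average read rate needed to move snapshot_gib within budget_s."""
    if budget_s <= 0:
        raise ValueError("budget_s must be positive")
    return snapshot_gib / budget_s


# Snapshot sizes as reported in the article; the 5 s budget is the
# "under 5 seconds" figure treated as an upper bound.
SNAPSHOTS_GIB = {"Qwen3-0.6B": 6.2, "gpt-oss-120b": 129.0}
BUDGET_S = 5.0

if __name__ == "__main__":
    for model, size_gib in SNAPSHOTS_GIB.items():
        rate = required_throughput_gib_per_s(size_gib, BUDGET_S)
        print(f"{model}: {size_gib} GiB in {BUDGET_S:.0f} s needs >= {rate:.2f} GiB/s")
```

Under that assumption the 129 GiB snapshot needs about 25.8 GiB/s on average, which is why the restore path leans on high-speed channels rather than ordinary disk reads.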

The current experimental release of Dynamo Snapshot supports single-GPU workloads using the vLLM and SGLang inference engines. NVIDIA is working to add TensorRT-LLM support, along with multi-GPU and multi-node capabilities through specialized quiesce and resume hooks for distributed runtimes like NCCL and PyTorch.

Dynamo Snapshot is being rolled out incrementally as an experimental release. Developers can currently access the snapshot agent as a privileged DaemonSet installable via a Helm chart. While some core memory optimizations are still pending an upstream merge into the CRIU project, the agent is fully portable across different Kubernetes environments and cloud providers.
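
For readers unfamiliar with the term, a privileged DaemonSet is a Kubernetes workload that runs one pod on each node with elevated host access. The sketch below builds a minimal manifest of that shape using standard Kubernetes fields. It is not the manifest shipped in NVIDIA's Helm chart: the name `snapshot-agent` and the image `example.com/snapshot-agent:dev` are placeholders invented for illustration.

```python
import json


def privileged_daemonset(name: str, image: str, namespace: str = "default") -> dict:
    """Build a minimal privileged DaemonSet manifest (illustrative only)."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            # Privileged mode grants the broad host access
                            # that checkpoint/restore tooling typically needs.
                            "securityContext": {"privileged": True},
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder name and image, not taken from NVIDIA's chart.
    manifest = privileged_daemonset("snapshot-agent", "example.com/snapshot-agent:dev")
    print(json.dumps(manifest, indent=2))
```

Because the agent runs privileged on every node, clusters with strict pod security policies may need an explicit exception before installing it.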
