HeadsUpAI

NVIDIA Releases SANA-WM Open Source World Model for Minute-Long Video

NVIDIA released SANA-WM, a 2.6B-parameter open-source world model (an AI system that simulates physical environments) natively trained for minute-scale video generation. It takes a single image, a text prompt, and a 6-DoF camera trajectory (six degrees of freedom: three of translation and three of rotation) and renders 720p video. A Hybrid Linear Attention design keeps the world coherent across the full 60 seconds.

Parameters (backbone): 2.6B
Parameters (refiner): 17B
Video resolution: 720p
Video duration: 60 seconds
Inference hardware: Single H100 or RTX 5090
Availability: Open source (weights and code)

This release bridges the gap between short-form clips and the Google DeepMind-style navigable environments required for robotics. By reaching quality comparable to industrial baselines on a single GPU, NVIDIA is validating its video world model roadmap as a pretraining paradigm. The focus shifts from open-ended generation to controllable simulation that follows physical camera paths.

You can access the model weights, code, and paper immediately to build simulators or content tools. While training required 64 H100s, inference runs on a single H100. A distilled variant can denoise a 60-second clip in 34 seconds on an RTX 5090, making long-horizon modeling accessible for local development.

NVIDIA AI (@NVIDIAAI) on X:

One image + text + camera trajectory = controllable worlds. All on a single GPU. Our research team just released SANA-WM, a 2.6B open source world model natively trained for 60-second video generation with precise camera control. https://t.co/oXHRCnCRdM

153 retweets, 1.1k likes

Still wondering? A few quick answers below.

SANA-WM is a 2.6B-parameter open-source world model designed to generate high-fidelity, minute-long videos. Unlike standard video generators, it acts as a simulator that turns a single starting image and a specific camera trajectory into a consistent 720p environment. It is specifically optimized to maintain visual coherence for a full 60 seconds.
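To make the camera input concrete, here is a minimal Python sketch of a 6-DoF trajectory stored as one pose per frame. The `CameraPose` class, the `interpolate_trajectory` helper, the Euler-angle convention, and the frame count are all illustrative assumptions; the input format SANA-WM actually expects is not described here.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class CameraPose:
    """Hypothetical 6-DoF pose: three translation and three rotation values."""
    x: float
    y: float
    z: float
    yaw: float    # degrees (assumed convention)
    pitch: float  # degrees
    roll: float   # degrees


def interpolate_trajectory(start, end, num_frames):
    """Linearly blend every pose component from start to end, one pose per frame.

    Naive on purpose: linear blending of Euler angles is only reasonable for
    small rotations; a real pipeline would likely interpolate rotations properly.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames")
    poses = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        blended = (a + t * (b - a) for a, b in zip(astuple(start), astuple(end)))
        poses.append(CameraPose(*blended))
    return poses


# Example: move 2 units along x while turning 90 degrees, over 5 frames.
start = CameraPose(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
end = CameraPose(2.0, 0.0, 0.0, 90.0, 0.0, 0.0)
trajectory = interpolate_trajectory(start, end, num_frames=5)
```

Each entry in `trajectory` fixes where the camera is and where it points for one frame, which is the kind of explicit path a camera-controlled world model is asked to follow.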

The model uses a Hybrid Linear Attention mechanism that combines frame-wise Gated DeltaNet with periodic softmax attention. This allows it to handle long-context video data efficiently without running out of memory. A two-stage pipeline first generates a rollout with the 2.6B backbone, which a 17B long-video refiner then enhances to improve texture and motion quality.
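The memory argument can be illustrated with a toy version of the gated delta rule that Gated DeltaNet is built on. This is a sketch of the generic token-level recurrence, S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T, and not SANA-WM's frame-wise implementation. The function names and the `softmax_every` period are hypothetical.

```python
def gated_delta_step(state, key, value, alpha, beta):
    """One update: S <- alpha * S (I - beta k k^T) + beta v k^T.

    `state` is a len(value) x len(key) matrix stored as a list of rows.
    Its size never grows with the number of steps processed, unlike a
    softmax-attention key/value cache.
    """
    state_k = [sum(s * k for s, k in zip(row, key)) for row in state]  # S k
    return [
        [alpha * (row[j] - beta * state_k[i] * key[j]) + beta * value[i] * key[j]
         for j in range(len(key))]
        for i, row in enumerate(state)
    ]


def read(state, query):
    """Read from the memory: o = S q."""
    return [sum(s * q for s, q in zip(row, query)) for row in state]


def layer_schedule(num_layers, softmax_every):
    """Hypothetical hybrid stack: every `softmax_every`-th layer uses softmax attention."""
    return [
        "softmax" if (i + 1) % softmax_every == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]


state = [[0.0, 0.0], [0.0, 0.0]]
state = gated_delta_step(state, [1.0, 0.0], [3.0, 4.0], alpha=1.0, beta=1.0)
first_read = read(state, [1.0, 0.0])   # [3.0, 4.0]
# Writing a new value under the same key replaces the old one.
state = gated_delta_step(state, [1.0, 0.0], [5.0, 6.0], alpha=1.0, beta=1.0)
second_read = read(state, [1.0, 0.0])  # [5.0, 6.0]
schedule = layer_schedule(num_layers=4, softmax_every=4)
```

The fixed-size state is what keeps memory flat over a long rollout. Interleaving occasional softmax layers is a common way for hybrid designs to recover precise lookups that a compressed state can blur.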

Yes, NVIDIA has released SANA-WM as an open-source project. The release includes the model weights for both the bidirectional variant and the long-video refiner, along with the underlying code and the original research paper. Developers can access these resources on GitHub and Hugging Face to build their own controllable video simulations.

While NVIDIA used 64 H100 GPUs to train the model over 15 days, it is designed for efficient inference on a single GPU. A standard H100 can generate a 60-second 720p clip. Additionally, a distilled version using specialized quantization can run on a consumer-grade RTX 5090, producing a one-minute video in approximately 34 seconds.
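For a rough sense of scale, the figures quoted above can be turned into two derived numbers. This is back-of-the-envelope arithmetic only; it assumes all 64 GPUs ran for the full 15 days and treats the approximate 34-second figure as exact.

```python
TRAIN_GPUS = 64        # H100s used for training
TRAIN_DAYS = 15        # training duration
CLIP_SECONDS = 60      # length of the generated video
RTX5090_SECONDS = 34   # approximate generation time for the distilled variant

# Upper-bound training budget in GPU-hours (assumes full utilization throughout).
train_gpu_hours = TRAIN_GPUS * TRAIN_DAYS * 24

# Seconds of video produced per second of compute on the RTX 5090.
realtime_factor = CLIP_SECONDS / RTX5090_SECONDS

print(f"Training budget: about {train_gpu_hours:,} H100 GPU-hours")
print(f"Distilled variant: about {realtime_factor:.2f}x faster than real time")
```

Under these assumptions the training run costs about 23,040 H100 GPU-hours, and the distilled variant generates video roughly 1.76 times faster than real time.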

NVIDIA states that SANA-WM achieves visual quality comparable to large-scale industrial baselines like LingBot-World and HY-WorldPlay. However, it is significantly more efficient, offering up to 36 times higher throughput than prior open-source baselines. It also demonstrates superior accuracy in following precise 6-DoF camera trajectories compared to existing open-source world models.
