HeadsUpAI

NVIDIA Launches LongLive-2.0 for 4-Bit Long Video Generation Infrastructure

NVIDIA introduced LongLive-2.0, a parallel infrastructure that brings 4-bit floating point (NVFP4) precision to the entire long-video generation workflow. Unlike standard methods that quantize models after training, this system uses 4-bit-aware training and distillation to ensure the model is optimized for low-precision deployment from the start.
Inference speed: 45.7 FPS (GB200)
Training speedup: 2.1x over BF16
Peak memory: 19.4GB (NVFP4 KV cache)
Resolution: 720p
Availability: GitHub (Code, Models, Paper)

Long video generation is a systems challenge because memory and compute requirements scale sharply with duration. LongLive-2.0 addresses this by implementing W4A4 inference and an NVFP4 KV cache, reducing peak memory to 19.4GB. This efficiency allows NVIDIA's SANA-WM world model to run on Blackwell hardware at 45.7 frames per second.
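
To make the memory argument concrete, the following is a rough back-of-the-envelope sketch of how KV cache size grows with sequence length and how a 4-bit block-scaled cache compares with BF16. The model shape (`num_layers`, `num_tokens`, `hidden_dim`) is hypothetical and is not taken from LongLive-2.0, so the output does not reproduce the 19.4GB figure. The assumed layout of one 8-bit scale per 16 values follows NVIDIA's public description of the NVFP4 format. The article does not detail how LongLive-2.0 actually lays out its cache.

```python
def kv_cache_bytes(num_layers, num_tokens, hidden_dim,
                   bits_per_value, block_size=None, scale_bits=0):
    """Approximate KV cache size in bytes for one sequence."""
    # Keys and values each store one hidden_dim vector per token per layer.
    num_values = 2 * num_layers * num_tokens * hidden_dim
    total_bits = num_values * bits_per_value
    if block_size is not None:
        # Block-scaled formats add one scale factor per block of values.
        total_bits += (num_values // block_size) * scale_bits
    return total_bits // 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=30, num_tokens=100_000, hidden_dim=1536)

bf16_bytes = kv_cache_bytes(**shape, bits_per_value=16)
# Assumption: 4-bit values plus one 8-bit scale per block of 16 values.
fp4_bytes = kv_cache_bytes(**shape, bits_per_value=4,
                           block_size=16, scale_bits=8)

print(f"BF16 cache:  {bf16_bytes / 1e9:.2f} GB")
print(f"4-bit cache: {fp4_bytes / 1e9:.2f} GB")
print(f"Reduction:   {bf16_bytes / fp4_bytes:.2f}x")
```

Under these assumptions the cache grows linearly with `num_tokens`, and the 4-bit layout costs 4.5 bits per value instead of 16, so the saving is a little under 4x regardless of the model shape.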

You can generate 720p video with consistent subjects across multiple shots using a new multi-shot attention sink. The framework supports prompt switching at chunk boundaries, making it suitable for complex, minute-scale storytelling. NVIDIA has released the full project, including the research paper, code, and pre-trained models, on GitHub for immediate implementation.

NVIDIA AI (@NVIDIAAI) on X:

Long video generation is a systems problem. Introducing LongLive-2.0 from NVIDIA Research: an end-to-end NVFP4 training and inference system for long video generation. Low-precision deployment often relies on post-training quantization, creating a gap between how models are trained and how they run. LongLive-2.0 aligns NVFP4-aware training, distillation, and W4A4 inference, maintaining strong benchmark quality while improving speed and memory efficiency.


Still wondering? A few quick answers below.

LongLive-2.0 is a parallel infrastructure from NVIDIA Research designed for the training and inference of long-form video generation models. It uses 4-bit floating point precision throughout the entire model lifecycle to solve the systems problem of high memory and compute costs that typically occur when generating videos that are several minutes long.

The system uses NVFP4, a 4-bit floating-point format, for both weights and activations during inference (W4A4). It also implements balanced sequence parallelism to shard encoding tasks across multiple GPUs, along with a multi-shot attention sink mechanism. Together, these optimizations allow the system to maintain subject consistency and high frame rates during extended video generation sessions.
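
As an illustration of what 4-bit floating-point quantization does to a tensor, here is a minimal pure-Python simulation of block-scaled FP4 rounding. It assumes the E2M1 value grid and 16-element blocks from NVIDIA's public NVFP4 description. It keeps the per-block scale in full precision instead of rounding it to FP8, and omits the per-tensor scale. The function names are hypothetical, and this is not the LongLive-2.0 implementation, which runs on dedicated GPU kernels.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign stored separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block):
    """Return (scale, codes): one scale and one FP4 grid value per input."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        mag = min(E2M1_GRID, key=lambda g: abs(target - g))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def fake_quantize(values, block_size=16):
    """Quantize then dequantize, to expose the rounding error."""
    out = []
    for start in range(0, len(values), block_size):
        scale, codes = quantize_block(values[start:start + block_size])
        out.extend(code * scale for code in codes)
    return out


weights = [0.01 * i - 0.2 for i in range(40)]
restored = fake_quantize(weights)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"worst-case rounding error: {worst:.4f}")
```

Simulated rounding of this kind is a common way to expose a model to low-precision error during quantization-aware training, so the model learns to tolerate the same rounding it will see at inference time.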

On NVIDIA Blackwell GB200 hardware, the system reaches an inference speed of 45.7 frames per second for 720p video. Compared with BF16 training, LongLive-2.0 delivers a 2.1x speedup when training on 64-second videos, and its 4-bit NVFP4 KV cache reduces peak memory usage to 19.4GB.

NVIDIA has made the LongLive-2.0 project publicly available for research and development purposes. The release includes the full technical research paper, the underlying source code, pre-trained models, and interactive demos. Developers and researchers can access these resources through the official NVIDIA Research GitHub repository and the project's dedicated website.

The infrastructure uses a specialized multi-shot attention sink that preserves the identity of subjects and backgrounds across different scenes. It employs a global sink to maintain overall identity throughout the entire video and a shot-level sink that rebinds at every scene change. This allows for minute-scale streaming without needing to recompute the entire history.
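
The bookkeeping behind a two-level sink can be sketched as a small cache structure. The class below is a hypothetical illustration, not NVIDIA's implementation: the names `MultiShotSinkCache`, `start_shot`, `sink_size`, and `window_size` are assumptions, and string labels stand in for the key/value tensors of each video chunk. It shows only the retention policy described above: a global sink that is never evicted, a shot-level sink that resets at each scene change, and a bounded window of recent entries.

```python
from collections import deque


class MultiShotSinkCache:
    """Toy retention policy: global sink + per-shot sink + rolling window."""

    def __init__(self, sink_size=1, window_size=2):
        self.sink_size = sink_size
        self.global_sink = []   # filled once, kept for the whole video
        self.shot_sink = []     # refilled at the start of every shot
        self.window = deque(maxlen=window_size)  # drops oldest when full

    def start_shot(self):
        # Rebind the shot-level sink; the global sink is left untouched.
        self.shot_sink = []

    def append(self, entry):
        if len(self.global_sink) < self.sink_size:
            self.global_sink.append(entry)
        elif len(self.shot_sink) < self.sink_size:
            self.shot_sink.append(entry)
        else:
            self.window.append(entry)

    def context(self):
        # Entries attention would see; size never exceeds
        # 2 * sink_size + window_size, however long the video runs.
        return self.global_sink + self.shot_sink + list(self.window)


cache = MultiShotSinkCache(sink_size=1, window_size=2)
for chunk in ["a0", "a1", "a2", "a3", "a4"]:
    cache.append(chunk)
first_shot_context = cache.context()

cache.start_shot()  # scene change
for chunk in ["b0", "b1"]:
    cache.append(chunk)
second_shot_context = cache.context()

print(first_shot_context)
print(second_shot_context)
```

Because the retained context has a fixed upper bound, each new chunk attends to a constant-size history, which is the property that makes streaming possible without recomputing earlier frames.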
