NVIDIA Launches LongLive-2.0 for 4-Bit Long Video Generation Infrastructure

NVIDIA

May 22, 2026 · Updated Jun 13, 2026

NVIDIA released LongLive-2.0, an end-to-end infrastructure that uses 4-bit floating point precision across the entire training and inference lifecycle for long-form video. By aligning 4-bit weights and activations with a multi-shot attention mechanism, the system achieves real-time speeds while maintaining subject consistency over minute-long generations.

NVIDIA introduced LongLive-2.0, a parallel infrastructure that brings 4-bit floating point (NVFP4) precision to the entire long-video generation workflow. Unlike standard methods that quantize models after training, this system uses 4-bit-aware training and distillation to ensure the model is optimized for low-precision deployment from the start.
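To make the idea of 4-bit-aware training more concrete, the sketch below simulates block-scaled FP4 rounding in plain Python. It assumes the publicly described NVFP4 layout (E2M1 values with one shared scale per 16-element block) and keeps the scale in full precision for simplicity, whereas real NVFP4 stores it in FP8. The function names are hypothetical and this is not the LongLive-2.0 code.

```python
from bisect import bisect_left

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed NVFP4 block size: one shared scale per 16 elements


def round_to_grid(x: float) -> float:
    """Round a non-negative value to the nearest E2M1 magnitude (ties go down)."""
    i = bisect_left(E2M1_GRID, x)
    if i == 0:
        return E2M1_GRID[0]
    if i == len(E2M1_GRID):
        return E2M1_GRID[-1]  # clamp anything above the largest magnitude
    lo, hi = E2M1_GRID[i - 1], E2M1_GRID[i]
    return lo if (x - lo) <= (hi - x) else hi


def fake_quantize_block(block):
    """Quantize one block to the FP4 grid and map it back to real values."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Simplification: the scale stays a Python float instead of FP8.
    scale = amax / E2M1_GRID[-1]
    out = []
    for v in block:
        q = round_to_grid(abs(v) / scale)
        out.append((q if v >= 0 else -q) * scale)
    return out


def fake_quantize(values):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        out.extend(fake_quantize_block(values[start:start + BLOCK_SIZE]))
    return out
```

In a quantization-aware setup, a function like this would typically be applied to weights and activations in the forward pass so the model learns under the same rounding it will see at deployment. How LongLive-2.0 handles gradients and distillation is not described in this article.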

Inference speed: 45.7 FPS (GB200)
Training speedup: 2.1x over BF16
Peak memory: 19.4GB (NVFP4 KV cache)
Resolution: 720p
Availability: GitHub (Code, Models, Paper)

Long video generation is a systems challenge because memory and compute requirements scale sharply with duration. LongLive-2.0 addresses this by implementing W4A4 inference and an NVFP4 KV cache, reducing peak memory to 19.4GB. This efficiency enables real-time long-video generation on NVIDIA Blackwell hardware at 45.7 frames per second.
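A rough back-of-the-envelope sketch shows why a 4-bit KV cache matters as duration grows. It assumes the publicly described NVFP4 layout (4-bit values plus one 8-bit scale per 16 elements), and the model dimensions in the usage note are hypothetical placeholders, not the LongLive-2.0 architecture.

```python
def nvfp4_bits_per_element(block_size: int = 16, scale_bits: int = 8) -> float:
    """Effective storage per element: 4 data bits plus the amortized block scale."""
    return 4 + scale_bits / block_size


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_element):
    """Size of a KV cache that stores both keys and values for every layer."""
    elements = 2 * tokens * layers * kv_heads * head_dim  # factor 2: keys and values
    return elements * bits_per_element / 8


def compression_vs_bf16() -> float:
    """How many times smaller an NVFP4 cache is than a BF16 cache."""
    return 16 / nvfp4_bits_per_element()
```

With made-up dimensions such as `kv_cache_bytes(100_000, 30, 12, 128, 16)`, a BF16 cache comes to about 18.4 GB, while the same call with `nvfp4_bits_per_element()` gives about 5.2 GB, a reduction of roughly 3.6x. Because the token count grows with video length, this gap widens in absolute terms for longer videos.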

A new multi-shot attention sink keeps subjects consistent across multiple shots in 720p video. The framework supports prompt switching at chunk boundaries, making it suitable for complex, minute-scale storytelling. NVIDIA has released the full project, including the research paper, code, and pre-trained models, on GitHub.
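Prompt switching at chunk boundaries can be pictured as a schedule that maps each generated chunk to the prompt active at that point. The sketch below is a hypothetical illustration: the schedule format, the function name, and the rule of snapping a requested switch time up to the next chunk boundary are assumptions, not the LongLive-2.0 interface.

```python
import math


def prompt_per_chunk(schedule, total_chunks, seconds_per_chunk):
    """Return the active prompt for each chunk.

    schedule: list of (start_second, prompt), sorted by start time,
    with the first entry starting at second 0.
    """
    if not schedule or schedule[0][0] != 0:
        raise ValueError("schedule must start at second 0")
    # Snap every requested switch time up to the next chunk boundary.
    switch_at = [(math.ceil(start / seconds_per_chunk), prompt)
                 for start, prompt in schedule]
    prompts = []
    active = None
    idx = 0
    for chunk in range(total_chunks):
        # Activate every prompt whose boundary has been reached.
        while idx < len(switch_at) and switch_at[idx][0] <= chunk:
            active = switch_at[idx][1]
            idx += 1
        prompts.append(active)
    return prompts
```

For example, with 2-second chunks, a switch requested at second 5 takes effect at chunk 3, which starts at second 6.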

View the full update on nvlabs.github.io

NVIDIA AI

@NVIDIAAI · May 22

Long video generation is a systems problem. Introducing LongLive-2.0 from NVIDIA Research: an end-to-end NVFP4 training and inference system for long video generation. Low-precision deployment often relies on post-training quantization, creating a gap between how models are trained and how they run. LongLive-2.0 aligns NVFP4-aware training, distillation, and W4A4 inference, maintaining strong benchmark quality while improving speed and memory efficiency.

Still wondering? A few quick answers below.

What is NVIDIA LongLive-2.0?

LongLive-2.0 is a parallel infrastructure from NVIDIA Research for training and running long-form video generation models. It uses 4-bit floating point precision throughout the model lifecycle to address the high memory and compute costs of generating minute-scale video.

How does LongLive-2.0 achieve high performance for long videos?

The system uses NVFP4, a 4-bit floating point format, for both weights and activations during inference (W4A4). It also implements balanced sequence parallelism to shard encoding tasks across multiple GPUs, along with a multi-shot attention sink mechanism. Together, these optimizations let the system maintain subject consistency and high frame rates during extended video generation.
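The article does not say how the sequence-parallel sharding is balanced. As a minimal sketch of the general idea, the hypothetical helper below splits a sequence of items (frames or tokens, for instance) into contiguous shards whose sizes differ by at most one, so no GPU rank receives noticeably more work than another.

```python
def balanced_shards(num_items: int, world_size: int):
    """Split range(num_items) into world_size contiguous, near-equal shards."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    base, extra = divmod(num_items, world_size)
    shards = []
    start = 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional item each.
        size = base + (1 if rank < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards
```

Here `balanced_shards(10, 4)` yields shards of sizes 3, 3, 2, and 2. A real system may balance on estimated compute cost rather than item count.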

What are the performance benchmarks for LongLive-2.0?

On NVIDIA Blackwell GB200 hardware, the system reaches an inference speed of 45.7 frames per second for 720p video. Compared to BF16 training, LongLive-2.0 delivers a 2.1x speedup when training on 64-second videos, and its NVFP4 KV cache reduces peak memory usage to 19.4GB.
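The reported figures translate into per-frame and wall-clock terms with simple arithmetic. In the sketch below, the 45.7 FPS and 2.1x values come from the article, while the playback frame rate is an assumption chosen only for illustration.

```python
def frame_budget_ms(generation_fps: float) -> float:
    """Average time spent producing one frame, in milliseconds."""
    return 1000.0 / generation_fps


def generation_seconds(video_seconds: float, playback_fps: float,
                       generation_fps: float) -> float:
    """Wall-clock time to generate a clip of the given length."""
    total_frames = video_seconds * playback_fps
    return total_frames / generation_fps


def time_saved_fraction(speedup: float) -> float:
    """Fraction of training time removed by a given speedup factor."""
    return 1.0 - 1.0 / speedup
```

At 45.7 FPS, each frame takes about 21.9 ms. Assuming 24 FPS playback, a 64-second clip would take about 33.6 seconds to generate, which is faster than it takes to watch. A 2.1x training speedup corresponds to roughly 52% less training time.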

Is LongLive-2.0 open source and available to the public?

NVIDIA has made the LongLive-2.0 project publicly available for research and development purposes. The release includes the full technical research paper, the underlying source code, pre-trained models, and interactive demos. Developers and researchers can access these resources through the official NVIDIA Research GitHub repository and the project's dedicated website.

How does LongLive-2.0 maintain consistency in multi-shot videos?

The infrastructure uses a specialized multi-shot attention sink that preserves the identity of subjects and backgrounds across different scenes. It employs a global sink to maintain overall identity throughout the entire video and a shot-level sink that rebinds at every scene change. This allows for minute-scale streaming without needing to recompute the entire history.
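One way to picture the two-level sink is as bookkeeping over which chunks stay visible to attention. The sketch below is a hypothetical layout, not the LongLive-2.0 implementation: the class name, the rolling window of recent chunks, and the choice to keep that window across scene changes are all assumptions.

```python
from collections import deque


class MultiShotSinkCache:
    """Tracks which chunk ids remain visible to attention (hypothetical layout)."""

    def __init__(self, sink_chunks: int = 1, window_chunks: int = 4):
        self.sink_chunks = sink_chunks
        self.global_sink = []  # first chunks of the whole video, never evicted
        self.shot_sink = []    # first chunks of the current shot
        self.recent = deque(maxlen=window_chunks)  # rolling local window

    def start_shot(self):
        """Rebind the shot-level sink at a scene change."""
        self.shot_sink = []

    def append(self, chunk_id: int):
        """Register a newly generated chunk."""
        if len(self.global_sink) < self.sink_chunks:
            self.global_sink.append(chunk_id)
        if len(self.shot_sink) < self.sink_chunks:
            self.shot_sink.append(chunk_id)
        self.recent.append(chunk_id)  # the oldest entry drops out automatically

    def visible(self):
        """Chunk ids the next chunk may attend to, deduplicated and sorted."""
        return sorted(set(self.global_sink) | set(self.shot_sink) | set(self.recent))
```

The visible set never grows beyond two sinks plus the window, regardless of how long the video runs, which is what keeps streaming cost bounded in this simplified picture.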

Every HeadsUpAI update is written based on its original source and reviewed before it's published.

Keep reading

Runway Unveils Real-Time Video Model Built with NVIDIA Hardware

Runway shared a research preview of a real-time video generation model developed with NVIDIA, running on Vera Rubin hardware. HD video begins generating with a time-to-first-frame under 100ms, opening a fundamentally different design space for video generation and world simulation.

NVIDIA Blackwell Accelerates Llama 3 Training with NVFP4 Precision

NVIDIA · Jun 9

NVIDIA trained Llama 3 8B and 405B models on its Blackwell platform using NVFP4 precision. This achieved a 1.31–1.73x speedup compared to FP8 precision, with no loss in accuracy. The update demonstrates how specialized hardware and precision formats can significantly boost the efficiency of large language model development.

Hao AI Lab Open Sources Dreamverse for Real Time Video Directing

Hao AI Lab · May 27

Hao AI Lab released Dreamverse, an open-source reference application that generates 30-second 1080p videos in 7 seconds on a single NVIDIA B200 GPU. The system introduces vibe directing, a workflow where creators steer video generation through natural language in a real-time interactive loop.

Cohere Integrates W4A8 Inference into vLLM for Faster Hopper Performance

Cohere · Apr 24

Cohere released production-ready W4A8 quantization kernels for dense and Mixture of Experts models, now integrated into the vLLM inference framework. By combining 4-bit weights with 8-bit activations, the update achieves up to 58 percent faster prefill and 45 percent faster decoding on NVIDIA Hopper GPUs.
