NVIDIA PixelDiT Achieves State-of-the-Art Pixel-Space Image Generation by Removing Autoencoders

NVIDIA

Jun 6, 2026 · Updated Jun 15, 2026

NVIDIA Research's PixelDiT, a single-stage image generation model, was selected as a best paper finalist at CVPR 2026. It removes the autoencoder step common in most image generation models, learning the diffusion process directly in pixel space to preserve fine details and achieve state-of-the-art performance among pixel-space generative models. This approach addresses a key limitation in image quality by eliminating a lossy compression step.

NVIDIA Research developed PixelDiT (Pixel Diffusion Transformers), an image generation model selected as a best paper finalist at CVPR 2026. Most image generation models first compress the image with a pretrained autoencoder, and that compression loses quality. PixelDiT removes this step: it is a single-stage model that learns the diffusion process directly in pixel space, end to end.
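To make "learning diffusion directly in pixel space" concrete, here is a minimal sketch of a generic DDPM-style noising step applied to a full-resolution image. This is an assumption for illustration only: the source does not describe PixelDiT's exact training formulation, and the names `add_noise` and `alpha_bar` are hypothetical.

```python
import numpy as np

def add_noise(x0, alpha_bar, rng):
    """Generic DDPM-style forward step applied directly to pixels.

    x0: image array scaled to [-1, 1], shape (H, W, C).
    alpha_bar: fraction of signal kept, in [0, 1] (hypothetical schedule value).
    Returns the noisy image and the noise that was mixed in.
    """
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(256, 256, 3))  # stand-in for a real image
x_t, eps = add_noise(x0, alpha_bar=0.5, rng=rng)

# In a single-stage setup, the network would see x_t at full pixel resolution.
# A latent diffusion model would instead apply the same step to an encoder's
# compressed output, which is the stage PixelDiT is described as removing.
```

The noisy array has the same shape as the input image, so nothing is compressed before the model sees it.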

Award: CVPR 2026 Best Paper Finalist
FID Score (ImageNet 256): 1.61
Architecture: Single-stage Pixel Diffusion Transformer
Key Mechanism: Direct pixel-space optimization
Availability: arXiv, GitHub, Hugging Face

Working directly on pixels avoids the artifacts and blurred fine details, such as text and texture, that autoencoders can introduce. PixelDiT scored 1.61 FID on ImageNet 256 (lower is better), which is state-of-the-art among pixel-space generative models and competitive with the best latent diffusion models.
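For context on the metric, FID (Fréchet Inception Distance) compares the mean and covariance of feature vectors from real and generated images. The sketch below implements the standard Fréchet distance between two Gaussians with NumPy; the random arrays are stand-ins for Inception-v3 features, and the function names are illustrative rather than taken from the PixelDiT code.

```python
import numpy as np

def feature_stats(feats):
    """Mean and covariance of a (num_samples, dim) feature matrix."""
    return feats.mean(axis=0), np.cov(feats, rowvar=False)

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * (sigma1 @ sigma2)^(1/2))."""
    diff = mu1 - mu2
    # For positive semi-definite inputs, the eigenvalues of sigma1 @ sigma2 are
    # real and non-negative, so the trace of the matrix square root is the sum
    # of their square roots. Clipping removes small numerical noise.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real_feats = rng.standard_normal((500, 8))        # stand-in for real-image features
fake_feats = rng.standard_normal((500, 8)) + 0.1  # stand-in for generated-image features

score = frechet_distance(*feature_stats(real_feats), *feature_stats(fake_feats))
```

Identical statistics give a distance of zero, and the score grows as the two feature distributions drift apart, which is why a lower FID indicates generated images that are statistically closer to real ones.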

PixelDiT can both generate and edit images while preserving intricate details. The project page links to the arXiv paper, the code, and pretrained ImageNet models on Hugging Face for researchers and developers.

View the full update on pixeldit.github.io

NVIDIA AI

@NVIDIAAI · Jun 5

Selected as a best paper finalist at #CVPR2026: PixelDiT from NVIDIA Research. In most image generation models, a pretrained autoencoder compresses the image before any diffusion happens, causing quality loss that accumulates across the entire pipeline. PixelDiT, or Pixel Diffusion Transformers, removes this step entirely. It's a single-stage model that learns the diffusion process directly in pixel space, end-to-end.

Still wondering? A few quick answers below.

What is PixelDiT?

PixelDiT, or Pixel Diffusion Transformers, is an image generation model developed by NVIDIA Research. It's designed to create high-fidelity images by learning the diffusion process directly in pixel space, bypassing the need for a separate autoencoder that can introduce quality loss.

How does PixelDiT improve image quality?

Most image generation models use an autoencoder to compress images, which can blur fine details like text and texture. PixelDiT removes this lossy compression step, allowing it to preserve intricate details and avoid artifacts, resulting in higher fidelity outputs.
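A toy example can show why a compression bottleneck loses fine detail. The sketch below uses 8x average pooling followed by nearest-neighbor upsampling as a crude stand-in for an encode/decode round trip. This is an assumption for illustration: a learned autoencoder preserves far more than pooling does, and the function names are hypothetical.

```python
import numpy as np

def downsample(img, f):
    """Average-pool a 2D image by factor f (stand-in for an encoder)."""
    h, w = img.shape
    return img.reshape(h // f, f, w // f, f).mean(axis=(1, 3))

def upsample(z, f):
    """Nearest-neighbor upsample by factor f (stand-in for a decoder)."""
    return np.repeat(np.repeat(z, f, axis=0), f, axis=1)

def round_trip_error(img, f=8):
    """Mean squared error after compressing and reconstructing the image."""
    recon = upsample(downsample(img, f), f)
    return float(np.mean((img - recon) ** 2))

flat = np.full((64, 64), 0.5)                                    # no fine detail
checker = (np.indices((64, 64)).sum(axis=0) % 2).astype(float)  # 1-pixel checkerboard

flat_err = round_trip_error(flat)
checker_err = round_trip_error(checker)
```

The flat image survives the round trip unchanged, while the one-pixel checkerboard is averaged into uniform gray. Fine, high-frequency structure such as small text or texture is the first thing a bottleneck discards, and a model that works on pixels directly never passes through one.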

What performance has PixelDiT achieved?

PixelDiT scored 1.61 FID on ImageNet 256, which makes it state-of-the-art among pixel-space generative models. This result is also competitive with the best latent diffusion models.

Where can I access PixelDiT?

The project page provides access to the PixelDiT arXiv paper for technical details. The code is available on GitHub, and pretrained ImageNet models are accessible on Hugging Face, allowing researchers and developers to use and build upon the model.

Keep reading

NVIDIA NitroGen Model Wins CVPR Award for Multiverse Embodied Agents

NVIDIA's NitroGen foundation model for generalist gaming agents received a CVPR Best Paper Honorable Mention. This recognition marks progress toward AI agents that master physics across both real-world and simulated environments, a step toward more adaptable, general-purpose embodied AI.

Google DeepMind's TIPSv2 Advances Multimodal AI with Enhanced Spatial Awareness

Google · Jun 7

Google DeepMind is presenting TIPSv2, a new foundational image-text encoder, at CVPR 2026. This model enhances spatial awareness and patch-text alignment, improving performance across vision and multimodal applications, including strong gains in zero-shot segmentation.

NVIDIA Cosmos 3 takes top open weights rank with agentic reasoning

Artificial Analysis · Jun 1

NVIDIA's Cosmos 3 Super models have reached #1 on the Artificial Analysis open-weights leaderboards for both image and video generation. The system uses a reasoning-based architecture to refine prompts before generating high-fidelity visual content.

Runway Unveils Real-Time Video Model Built with NVIDIA Hardware

Runway · Mar 20

Runway shared a research preview of a real-time video generation model developed with NVIDIA, running on Vera Rubin hardware. HD video generates instantly, with time-to-first-frame under 100ms, opening a fundamentally different design space for video generation and world simulation.
