Selected as a best paper finalist at #CVPR2026: PixelDiT from NVIDIA Research. In most image generation models, a pretrained autoencoder compresses the image before any diffusion happens, causing quality loss that accumulates across the entire pipeline. PixelDiT, or Pixel Diffusion Transformers, removes this step entirely. It's a single-stage model that learns the diffusion process directly in pixel space, end-to-end.
NVIDIA PixelDiT Achieves State-of-the-Art Pixel-Space Image Generation by Removing Autoencoders
NVIDIA · Updated
NVIDIA Research's PixelDiT, a single-stage image generation model, was selected as a best paper finalist at CVPR 2026. It removes the autoencoder step common in most image generation models, learning the diffusion process directly in pixel space to preserve fine details and achieve state-of-the-art performance among pixel-space generative models. This approach addresses a key limitation in image quality by eliminating a lossy compression step.
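To make the "lossy compression step" concrete, here is a minimal sketch of why a bottleneck can lose fine detail. It is only a toy: the `encode` and `decode` functions are hypothetical stand-ins (average pooling and pixel repetition), not a learned autoencoder and not anything taken from PixelDiT. A real autoencoder reconstructs far better, but the principle that discarded information cannot be restored by the decoder is the same.

```python
# Toy illustration only: hypothetical stand-ins for an autoencoder,
# NOT a learned VAE and NOT PixelDiT code.

def encode(row, factor=2):
    """Hypothetical 'encoder': average each group of `factor` pixels."""
    return [sum(row[i:i + factor]) / factor for i in range(0, len(row), factor)]

def decode(latent, factor=2):
    """Hypothetical 'decoder': repeat each latent value `factor` times."""
    return [value for value in latent for _ in range(factor)]

def mse(a, b):
    """Mean squared error between two equal-length pixel rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# A high-frequency pattern (think thin text strokes) and a flat region.
fine_detail = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
flat_region = [0.5] * 8

# Round trip through the bottleneck, as a latent pipeline would do.
fine_error = mse(fine_detail, decode(encode(fine_detail)))
flat_error = mse(flat_region, decode(encode(flat_region)))

print(f"fine detail round-trip error: {fine_error}")
print(f"flat region round-trip error: {flat_error}")
```

In this toy the flat region survives the round trip while the alternating pattern is averaged away. A model that works directly on pixels has no such round trip to go through.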
- Award: CVPR 2026 Best Paper Finalist
- FID Score (ImageNet 256): 1.61
- Architecture: Single-stage Pixel Diffusion Transformer
- Key Mechanism: Direct pixel-space optimization
- Availability: arXiv, GitHub, Hugging Face
Because no autoencoder sits between the model and the pixels, this direct approach avoids the artifacts and blurring that autoencoder compression introduces in fine details such as text and texture. PixelDiT achieves an FID score of 1.61 on ImageNet 256 (lower is better), which is state-of-the-art among pixel-space generative models and competitive with latent diffusion models.
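For readers unfamiliar with the metric, FID (Fréchet Inception Distance) fits a Gaussian to feature vectors of real images and another to those of generated images, then measures the Fréchet distance between them: ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2)). The sketch below is a simplified illustration, not the evaluation code behind the 1.61 figure. It assumes diagonal covariances, in which case the trace term reduces to a sum of squared differences of standard deviations. Real FID uses full covariance matrices over Inception-v3 features. The function names and the tiny feature lists are hypothetical.

```python
import math
from statistics import fmean, variance

def gaussian_stats(features):
    """Per-dimension mean and sample variance of a list of feature vectors."""
    columns = list(zip(*features))
    means = [fmean(column) for column in columns]
    variances = [variance(column) for column in columns]
    return means, variances

def frechet_distance_diag(stats_a, stats_b):
    """Frechet distance between two Gaussians with DIAGONAL covariances.

    Simplifying assumption: real FID uses full covariance matrices.
    """
    (mu_a, var_a), (mu_b, var_b) = stats_a, stats_b
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    # With diagonal covariances, Tr(S_a + S_b - 2 (S_a S_b)^(1/2))
    # equals the sum over dimensions of (sqrt(var_a) - sqrt(var_b))^2.
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_a, var_b))
    return mean_term + cov_term

# Hypothetical 2-D "features"; real features are high-dimensional.
real_features = [[0.0, 1.0], [2.0, 3.0]]
shifted_features = [[1.0, 1.0], [3.0, 3.0]]  # first dimension shifted by 1

same = frechet_distance_diag(gaussian_stats(real_features), gaussian_stats(real_features))
shifted = frechet_distance_diag(gaussian_stats(real_features), gaussian_stats(shifted_features))

print(f"identical sets: {same}")
print(f"shifted set:    {shifted}")
```

Identical feature sets give a distance of zero, and the distance grows as the generated distribution drifts from the real one, which is why a lower FID indicates closer agreement.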
Because PixelDiT can generate and edit images while preserving intricate details, it suits generative AI applications where fidelity matters. The project page links to the arXiv paper, the code, and ImageNet models on Hugging Face for researchers and developers.




