NVIDIA PixelDiT Achieves State-of-the-Art Pixel-Space Image Generation by Removing Autoencoders

NVIDIA

NVIDIA Research's PixelDiT, a single-stage image generation model, was selected as a best paper finalist at CVPR 2026. It removes the autoencoder step common in most image generation models, learning the diffusion process directly in pixel space to preserve fine details and achieve state-of-the-art performance among pixel-space generative models. This approach addresses a key limitation in image quality by eliminating a lossy compression step.

PixelDiT (Pixel Diffusion Transformers) drops the pretrained autoencoder that most image generation models use to compress an image before any diffusion happens. According to NVIDIA, that compression causes quality loss that accumulates across the entire pipeline. PixelDiT is instead a single-stage model that learns the diffusion process directly in pixel space, end to end.
Award: CVPR 2026 Best Paper Finalist
FID Score (ImageNet 256): 1.61
Architecture: Single-stage Pixel Diffusion Transformer
Key Mechanism: Direct pixel-space optimization
Availability: arXiv, GitHub, Hugging Face

Working directly on pixels avoids the artifacts and blurring that autoencoders can introduce in fine details such as text and texture. PixelDiT achieved an FID score of 1.61 on ImageNet 256 (lower is better), making it state-of-the-art among pixel-space generative models and competitive with latent diffusion models.
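For readers unfamiliar with the metric: FID (Fréchet Inception Distance) fits a Gaussian to Inception-network features of real images and another to features of generated images, then measures the Fréchet distance between the two. The sketch below is a simplified illustration that assumes diagonal covariances, so the matrix square root in the full formula reduces to per-dimension square roots. The function name is hypothetical, and this is not PixelDiT's evaluation code.

```python
import math

def frechet_distance_diag(mu_a, var_a, mu_b, var_b):
    """Fréchet distance between two Gaussians with DIAGONAL covariances.

    Assumption: each Gaussian is given as a list of per-dimension means and
    variances. The standard FID uses full covariance matrices instead.
    """
    # Squared Euclidean distance between the mean vectors.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    # For diagonal covariances, Tr(S_a + S_b - 2 (S_a S_b)^(1/2)) reduces to
    # the sum over dimensions of (sqrt(var_a) - sqrt(var_b))^2.
    cov_term = sum(
        (math.sqrt(va) - math.sqrt(vb)) ** 2 for va, vb in zip(var_a, var_b)
    )
    return mean_term + cov_term

# Identical statistics give a distance of zero.
same = frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
# Shifted mean and larger variance give a positive distance.
different = frechet_distance_diag([0.0], [1.0], [3.0], [4.0])
```

A score near zero means the generated feature statistics closely match the real ones, which is why a lower FID indicates better image quality.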

Because it generates and edits images without a lossy compression stage, PixelDiT can preserve intricate details in generative AI applications. The project page links to the arXiv paper, the code on GitHub, and pretrained ImageNet models on Hugging Face for researchers and developers.

NVIDIA AI (@NVIDIAAI) on X:

Selected as a best paper finalist at #CVPR2026: PixelDiT from NVIDIA Research. In most image generation models, a pretrained autoencoder compresses the image before any diffusion happens, causing quality loss that accumulates across the entire pipeline. PixelDiT, or Pixel Diffusion Transformers, removes this step entirely. It's a single-stage model that learns the diffusion process directly in pixel space, end-to-end.

Still wondering? A few quick answers below.

PixelDiT, or Pixel Diffusion Transformers, is an image generation model developed by NVIDIA Research. It's designed to create high-fidelity images by learning the diffusion process directly in pixel space, bypassing the need for a separate autoencoder that can introduce quality loss.

Most image generation models use an autoencoder to compress images, which can blur fine details like text and texture. PixelDiT removes this lossy compression step, allowing it to preserve intricate details and avoid artifacts, resulting in higher fidelity outputs.
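As a rough illustration of why a compress-then-reconstruct stage can lose fine detail, the sketch below round-trips a row of pixel values through a toy "encoder" and "decoder". This is only a stand-in: the averaging and repetition here are fixed, hypothetical functions, whereas real autoencoders are learned networks that reconstruct far more faithfully. The point is just that information discarded during compression cannot be recovered afterward.

```python
def compress(pixels, factor=2):
    """Toy 'encoder': average each group of `factor` neighboring pixels."""
    return [
        sum(pixels[i:i + factor]) / factor
        for i in range(0, len(pixels), factor)
    ]

def decompress(latent, factor=2):
    """Toy 'decoder': repeat each compressed value `factor` times."""
    return [value for value in latent for _ in range(factor)]

# A fine stripe pattern, standing in for detail such as text or texture.
stripes = [0, 255, 0, 255, 0, 255, 0, 255]
restored_stripes = decompress(compress(stripes))
stripe_error = max(abs(a - b) for a, b in zip(stripes, restored_stripes))

# A smooth region with no detail finer than the compression factor.
smooth = [10, 10, 20, 20]
restored_smooth = decompress(compress(smooth))
smooth_error = max(abs(a - b) for a, b in zip(smooth, restored_smooth))
```

In this toy case the stripes collapse to a uniform gray while the smooth region survives unchanged, which mirrors the claim that compression mainly hurts fine detail.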

PixelDiT scored an FID of 1.61 on ImageNet 256, which makes it state-of-the-art among pixel-space generative models and competitive with the best latent diffusion models.

The project page provides access to the PixelDiT arXiv paper for technical details. The code is available on GitHub, and pretrained ImageNet models are accessible on Hugging Face, allowing researchers and developers to use and build upon the model.
