Cloudflare Unweight Achieves Lossless LLM Compression to Solve GPU Memory Bottlenecks

Cloudflare

Apr 19, 2026 · Updated Apr 25, 2026

Cloudflare developed Unweight, a lossless compression system that reduces LLM footprints by up to 22% while maintaining bit-exact outputs. By decompressing weights directly in on-chip memory, the system bypasses the memory bandwidth bottleneck that typically leaves high-end GPUs idle during inference.

Cloudflare, a network and security company, introduced Unweight, a lossless compression system that shrinks LLM weights by 15–22% without sacrificing quality. It targets the "memory wall" in inference (running a trained model to generate outputs) by using Huffman coding to compress the redundant exponent bits in BF16 weights. Each BF16 value has 1 sign bit, 8 exponent bits, and 7 mantissa bits, and only the exponent field is compressed.
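To see why the exponent bits are the compressible part, the sketch below splits BF16 values into their sign, exponent, and mantissa fields and measures the Shannon entropy of the last two. It is a minimal illustration, not Cloudflare's code: the weights are synthetic Gaussian samples standing in for a trained model, `to_bf16_bits` uses simple truncation rather than rounding, and all function names are hypothetical.

```python
import math
import random
import struct
from collections import Counter


def to_bf16_bits(x):
    """Return the top 16 bits of x's float32 encoding (BF16 by truncation)."""
    (u32,) = struct.unpack(">I", struct.pack(">f", x))
    return u32 >> 16


def split_fields(bits):
    """Split a 16-bit BF16 word into (sign, exponent, mantissa)."""
    sign = bits >> 15
    exponent = (bits >> 7) & 0xFF
    mantissa = bits & 0x7F
    return sign, exponent, mantissa


def entropy_bits(symbols):
    """Empirical Shannon entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


random.seed(0)
# Assumption: small zero-centred Gaussian values as a stand-in for real weights.
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]
fields = [split_fields(to_bf16_bits(w)) for w in weights]

exp_entropy = entropy_bits([e for _, e, _ in fields])
man_entropy = entropy_bits([m for _, _, m in fields])
print(f"exponent entropy: {exp_entropy:.2f} of 8 bits")
print(f"mantissa entropy: {man_entropy:.2f} of 7 bits")
```

On inputs like these the exponent carries far less than its 8 bits of information, while the mantissa is close to incompressible. That gap is the redundancy an entropy coder such as Huffman coding can remove.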

Compression ratio (MLP weights): ~30%
Total model footprint reduction: 15-22%
VRAM savings (Llama 3.1 8B): ~3 GB
Throughput overhead: 30-40%
Target hardware: NVIDIA H100 (Hopper)
Compression type: Lossless (bit-exact)
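
The headline figures can be cross-checked with some back-of-the-envelope arithmetic. The sketch below uses Llama 3.1 8B's public shapes and assumes, purely for illustration, that the ~30% figure is a size saving that applies to the MLP projections and that nothing else in the model is compressed. The summary above does not spell this out, so treat the result as a plausibility check rather than Cloudflare's own accounting.

```python
# Llama 3.1 8B shapes from the public model configuration.
HIDDEN, INTERMEDIATE, LAYERS = 4096, 14336, 32
TOTAL_PARAMS = 8_030_261_248
BYTES_PER_BF16 = 2

# Each layer's MLP holds gate, up, and down projections of HIDDEN x INTERMEDIATE.
mlp_params = 3 * HIDDEN * INTERMEDIATE * LAYERS
mlp_share = mlp_params / TOTAL_PARAMS

# Assumption: "~30%" is the fraction of bytes saved on MLP weights only.
MLP_SAVING = 0.30
model_saving = mlp_share * MLP_SAVING

# The reported ~3 GB of VRAM savings as a share of the BF16 model size.
baseline_gb = TOTAL_PARAMS * BYTES_PER_BF16 / 1e9
reported_share = 3.0 / baseline_gb

print(f"MLP share of parameters: {mlp_share:.1%}")
print(f"whole-model saving if only MLP weights shrink by 30%: {model_saving:.1%}")
print(f"BF16 baseline: {baseline_gb:.1f} GB, 3 GB is {reported_share:.1%} of it")
```

Under these assumptions both estimates land inside the reported 15-22% range.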

Modern GPUs like the NVIDIA H100 process data 600 times faster than memory can deliver it, so inference is limited by memory bandwidth rather than compute. Unlike standard quantization, which trades accuracy for speed, Unweight reproduces the original weights exactly, so the model's outputs do not change. The approach fits a broader pattern of specialized inference paths designed to maximize hardware throughput.
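
One plausible way to arrive at a ratio of that size is to compare peak tensor-core throughput with memory bandwidth. The figures below are assumptions taken from NVIDIA's published H100 SXM specifications (about 989 TFLOPS of dense BF16 compute and 3.35 TB/s of HBM3 bandwidth), not numbers from Cloudflare's post, and the derivation itself is a guess at how such a ratio could be computed.

```python
# Assumed H100 SXM figures from NVIDIA's public spec sheet.
PEAK_BF16_FLOPS = 989e12   # dense BF16 tensor-core operations per second
HBM_BYTES_PER_S = 3.35e12  # HBM3 bandwidth in bytes per second
BYTES_PER_BF16 = 2

# How many BF16 values memory can deliver each second.
values_per_s = HBM_BYTES_PER_S / BYTES_PER_BF16

# Operations the tensor cores could perform per value delivered.
flops_per_value = PEAK_BF16_FLOPS / values_per_s

# A matrix-vector product (batch size 1) uses each weight for one multiply
# and one add, so it needs only 2 operations per weight loaded.
FLOPS_NEEDED_PER_WEIGHT = 2
compute_utilisation = FLOPS_NEEDED_PER_WEIGHT / flops_per_value

print(f"operations available per BF16 value loaded: {flops_per_value:.0f}")
print(f"upper bound on compute utilisation at batch size 1: {compute_utilisation:.2%}")
```

With these inputs the ratio comes out a little under 600, and a batch-size-1 workload can keep the tensor cores busy well under 1% of the time. Every byte that does not have to cross the memory bus therefore translates fairly directly into headroom.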

Cloudflare has published a technical paper and open-sourced the GPU kernels, so developers can examine and build on its reconstructive matrix multiplication technique, in which weights are rebuilt on chip as part of the multiply. The system is currently optimized for Llama 3.1 8B. The release follows Cloudflare's launch of persistent state management for agents and is pitched as a step toward cheaper, faster model serving across its infrastructure.

View the full update on blog.cloudflare.com

Cloudflare (@Cloudflare) · Apr 18

Running LLMs across Cloudflare’s network requires us to be smarter and more efficient about GPU memory bandwidth. That’s why we developed Unweight, a lossless inference-time compression system that achieves up to a 22% model footprint reduction. https://t.co/3wGOTTnlfQ

Still wondering? A few quick answers below.

What is Cloudflare Unweight?

Unweight is a lossless compression system developed by Cloudflare to reduce the memory footprint of large language models during inference. It achieves a 15% to 22% reduction in model size without any loss in accuracy. By shrinking weights, it allows more models to fit on a single GPU while improving memory bandwidth efficiency.

How does Unweight compress model weights without losing quality?

Unweight uses Huffman coding, a technique that assigns shorter codes to more frequent values, to compress the exponent bits of model weights. Unlike lossy quantization, which reduces precision, Unweight leaves the sign and mantissa bits untouched and restores every exponent exactly. The decompressed weights are therefore bit-identical to the originals, and so is the model's output, while fewer bytes have to cross the GPU memory bus.
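
The round trip described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Cloudflare's implementation: the weights are synthetic Gaussian samples, the Huffman bitstream is kept as a Python string of "0" and "1" characters for readability, the size of the code table is ignored, and every function name is hypothetical.

```python
import heapq
import random
import struct
from collections import Counter


def to_bf16_bits(x):
    """Return the top 16 bits of x's float32 encoding (BF16 by truncation)."""
    (u32,) = struct.unpack(">I", struct.pack(">f", x))
    return u32 >> 16


def huffman_code(symbols):
    """Build a Huffman code {symbol: bitstring} from symbol frequencies."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries are (count, unique tiebreak, {symbol: code so far}).
    heap = [(c, i, {s: ""}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        c0, _, left = heapq.heappop(heap)
        c1, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (c0 + c1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def compress(words):
    """Huffman-code the exponents; pass sign and mantissa through as one byte."""
    exponents = [(w >> 7) & 0xFF for w in words]
    code = huffman_code(exponents)
    bitstream = "".join(code[e] for e in exponents)
    sign_mantissa = bytes(((w >> 15) << 7) | (w & 0x7F) for w in words)
    return code, bitstream, sign_mantissa


def decompress(code, bitstream, sign_mantissa):
    """Decode exponents bit by bit and reassemble the original 16-bit words."""
    decode = {bits: sym for sym, bits in code.items()}
    words, current = [], ""
    remaining = iter(sign_mantissa)
    for bit in bitstream:
        current += bit
        if current in decode:  # prefix-free code, so the first match is correct
            sm = next(remaining)
            words.append(((sm >> 7) << 15) | (decode[current] << 7) | (sm & 0x7F))
            current = ""
    return words


random.seed(0)
# Assumption: synthetic Gaussian values as a stand-in for trained weights.
original = [to_bf16_bits(random.gauss(0.0, 0.02)) for _ in range(20_000)]
code, bitstream, sign_mantissa = compress(original)
restored = decompress(code, bitstream, sign_mantissa)

compressed_bits = len(bitstream) + 8 * len(sign_mantissa)
saving = 1 - compressed_bits / (16 * len(original))
print(f"bit-exact: {restored == original}, size saving: {saving:.1%}")
```

Because the sign and mantissa bytes are copied verbatim and Huffman decoding is exact, the restored words match the originals bit for bit. The saving on real model weights depends on their actual exponent distribution.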

How does Unweight solve the GPU memory bandwidth bottleneck?

High-end GPUs like the NVIDIA H100 often sit idle waiting for data from off-chip memory (HBM). Unweight addresses this by decompressing weights inside the GPU's fast on-chip shared memory. In this reconstructive approach, weights are rebuilt during the matrix multiplication itself and fed directly to the tensor cores, so only the compressed bytes have to travel across the slower memory bus.
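
The data flow of such a reconstructive multiply can be mimicked on a CPU. The sketch below is an assumption-heavy illustration, not the open-sourced kernels: it is plain Python rather than CUDA, a small list stands in for shared memory, the entropy coding step is omitted (exponents are simply stored in a separate plane), and `TILE` and all function names are hypothetical. What it does show is that the full weight matrix is never materialized: each tile is rebuilt, used, and discarded.

```python
import random
import struct

TILE = 4  # hypothetical tile width; real kernels choose sizes by autotuning


def to_bf16_bits(x):
    """Return the top 16 bits of x's float32 encoding (BF16 by truncation)."""
    (u32,) = struct.unpack(">I", struct.pack(">f", x))
    return u32 >> 16


def bf16_to_float(bits):
    """Expand a 16-bit BF16 word back to a Python float."""
    (x,) = struct.unpack(">f", struct.pack(">I", bits << 16))
    return x


def split_planes(words):
    """Store exponents and sign+mantissa bytes separately (coding omitted)."""
    exps = [[(w >> 7) & 0xFF for w in row] for row in words]
    sms = [[((w >> 15) << 7) | (w & 0x7F) for w in row] for row in words]
    return exps, sms


def rebuild_tile(exps_row, sms_row, col0):
    """Reassemble TILE weights into a small scratch buffer."""
    pairs = zip(exps_row[col0:col0 + TILE], sms_row[col0:col0 + TILE])
    return [bf16_to_float(((sm >> 7) << 15) | (e << 7) | (sm & 0x7F))
            for e, sm in pairs]


def reconstructive_matvec(exps, sms, x):
    """Compute y = W @ x, rebuilding W one tile at a time."""
    y = []
    for exps_row, sms_row in zip(exps, sms):
        acc = 0.0
        for col0 in range(0, len(x), TILE):
            tile = rebuild_tile(exps_row, sms_row, col0)  # scratch only
            acc += sum(w * xi for w, xi in zip(tile, x[col0:col0 + TILE]))
        y.append(acc)
    return y


def plain_matvec(weights, x):
    """Reference y = W @ x on uncompressed weights, same tiling order."""
    y = []
    for row in weights:
        acc = 0.0
        for col0 in range(0, len(x), TILE):
            acc += sum(w * xi for w, xi in
                       zip(row[col0:col0 + TILE], x[col0:col0 + TILE]))
        y.append(acc)
    return y


random.seed(0)
ROWS, COLS = 8, 16
words = [[to_bf16_bits(random.gauss(0.0, 0.02)) for _ in range(COLS)]
         for _ in range(ROWS)]
weights = [[bf16_to_float(w) for w in row] for row in words]
x = [random.uniform(-1.0, 1.0) for _ in range(COLS)]

exps, sms = split_planes(words)
y = reconstructive_matvec(exps, sms, x)
y_ref = plain_matvec(weights, x)
print("identical to uncompressed result:", y == y_ref)
```

Since the rebuilt tiles are bit-identical to the original weights and both versions accumulate in the same order, the two results match exactly. On a GPU the rebuild would happen in shared memory next to the tensor cores, which is where the bandwidth saving comes from.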

Is Cloudflare Unweight open source?

Yes, Cloudflare has open-sourced the GPU kernels for Unweight and published a technical research paper detailing the system's architecture. The project is intended to encourage innovation in GPU efficiency and compression. Developers can examine the reconstructive matrix multiplication kernels and the autotuning process used to optimize performance across different batch sizes.

What is the performance impact of using Unweight?

While Unweight reduces memory usage, the on-chip decompression adds computational work. Current results on Llama 3.1 8B show a throughput overhead of roughly 30% to 40%, depending on the batch size. Cloudflare says it is working to narrow this gap, specifically by targeting the down projection weights and improving small-batch execution efficiency.

Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

Keep reading

Cloudflare Launches Agent Memory to Give AI Agents Persistent Long Term State

Cloudflare introduced Agent Memory, a managed service that extracts and stores key information from agent conversations to prevent context rot. By moving state management to a dedicated pipeline, agents can recall past decisions and facts across sessions without exhausting their context windows.

Cohere · May 21

Cohere Releases Command A+ W4A4 Weights for Single GPU Serving

Cohere released W4A4 quantized weights for its 218-billion parameter Command A+ model, enabling frontier-class reasoning on a single NVIDIA B200 GPU. By using quantization-aware distillation to maintain performance, the update allows enterprises to deploy massive agentic models with a significantly smaller hardware footprint.

MiniMax · Jun 2

Cloudflare adds MiniMax M3 with 1M context for agentic coding

Cloudflare has integrated the MiniMax M3 foundation model into its AI Gateway platform. The update provides developers with a high-context, multimodal model specialized for autonomous coding tasks directly within their existing infrastructure.

Qwen · May 27

Qwen Sets 580 TPS Record for Agentic Workloads on Blackwell GPUs

Qwen achieved a record 580 tokens per second running its Qwen3.5-397B-A17B model on NVIDIA Blackwell GPUs using the TokenSpeed inference engine. The optimization targets agentic workloads, where multi-turn reasoning and tool-calling typically suffer from high latency. By combining a hybrid attention architecture with deep kernel fusion, the system maintains high throughput even as context scales to one million tokens.
