Running LLMs across Cloudflare’s network requires us to be smarter and more efficient about GPU memory bandwidth. That’s why we developed Unweight, a lossless inference-time compression system that achieves up to a 22% model footprint reduction. https://t.co/3wGOTTnlfQ
Cloudflare Unweight Achieves Lossless LLM Compression to Solve GPU Memory Bottlenecks
Cloudflare· Updated
Cloudflare developed Unweight, a lossless compression system that reduces LLM footprints by up to 22% while maintaining bit-exact outputs. By decompressing weights directly in on-chip memory, the system bypasses the memory bandwidth bottleneck that typically leaves high-end GPUs idle during inference.
For BF16 weights:
- Compression ratio (MLP weights): ~30%
- Total model footprint reduction: 15-22%
- VRAM savings (Llama 3.1 8B): ~3 GB
- Throughput overhead: 30-40%
- Target hardware: NVIDIA H100 (Hopper)
- Compression type: Lossless (bit-exact)
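The footprint figures above can be sanity-checked with a little arithmetic. The sketch below assumes a parameter count of roughly 8.03 billion for Llama 3.1 8B, a number that does not appear in this summary, and applies the quoted 15-22% reduction to the BF16 baseline. The function names are illustrative only.

```python
# Assumption: about 8.03 billion parameters for Llama 3.1 8B (not stated in this summary).
PARAMS = 8.03e9
BYTES_PER_BF16 = 2  # BF16 stores each weight in 16 bits


def footprint_gb(params: float, bytes_per_param: float = BYTES_PER_BF16) -> float:
    """Uncompressed weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def savings_gb(baseline_gb: float, reduction: float) -> float:
    """Memory saved for a given fractional footprint reduction."""
    return baseline_gb * reduction


baseline = footprint_gb(PARAMS)
low = savings_gb(baseline, 0.15)   # lower end of the quoted 15-22% range
high = savings_gb(baseline, 0.22)  # upper end of the quoted range
print(f"baseline: {baseline:.1f} GB, savings: {low:.1f} to {high:.1f} GB")
```

Under that assumption the quoted ~3 GB of VRAM savings falls inside the computed range, so the percentage and the absolute figure are consistent with each other.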
Modern GPUs like the NVIDIA H100 can process data 600 times faster than memory can deliver it, so memory bandwidth, not compute, becomes the limiting factor during inference. Unlike standard quantization, which trades accuracy for speed, Unweight is lossless: the decompressed weights are bit-exact, so the model's outputs do not change. This efficiency mirrors the pattern seen in specialized inference paths designed to maximize hardware throughput.
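To make the difference from quantization concrete, here is a minimal sketch of what lossless BF16 compression can look like. It is not Unweight's actual algorithm, which this summary does not describe. It assumes synthetic Gaussian weights, a simple truncating float-to-BF16 conversion, and `zlib` as a stand-in entropy coder applied only to the exponent bytes, which tend to cluster in a narrow range for trained weights.

```python
import math
import random
import struct
import zlib
from collections import Counter


def to_bf16_bits(x: float) -> int:
    """Truncate a float32 to its top 16 bits (a simple, non-rounding BF16 conversion)."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16


def split_planes(words):
    """Split BF16 words into an exponent byte stream and a sign+mantissa byte stream."""
    exps = bytes((w >> 7) & 0xFF for w in words)                  # bits 14..7
    rest = bytes(((w >> 15) << 7) | (w & 0x7F) for w in words)    # bit 15 and bits 6..0
    return exps, rest


def merge_planes(exps, rest):
    """Inverse of split_planes: rebuild the original 16-bit words."""
    return [((r >> 7) << 15) | (e << 7) | (r & 0x7F) for e, r in zip(exps, rest)]


def entropy_bits(data: bytes) -> float:
    """Empirical Shannon entropy in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# Synthetic stand-in for model weights (assumption: small, zero-centred values).
rng = random.Random(0)
words = [to_bf16_bits(rng.gauss(0.0, 0.02)) for _ in range(50_000)]

exps, rest = split_planes(words)
packed = zlib.compress(exps, 9)          # compress only the exponent stream
restored = merge_planes(zlib.decompress(packed), rest)

assert restored == words                 # bit-exact round trip, unlike quantization
exp_entropy = entropy_bits(exps)
compressed_size = len(packed) + len(rest)
print(f"exponent entropy: {exp_entropy:.2f} bits of 8")
print(f"size: {compressed_size} bytes vs {2 * len(words)} bytes uncompressed")
```

The round-trip assertion is the important part: every 16-bit word comes back unchanged, which is what "bit-exact" means. The ratio printed here depends entirely on the synthetic data and should not be read as a prediction for real models.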
The technical paper and open-source GPU kernels are now available for anyone who wants to implement these reconstructive matrix multiplication techniques. The system is currently optimized for Llama 3.1 8B. It follows the launch of persistent state management and extends Cloudflare's autonomous agent infrastructure, with the goal of cheaper, faster model serving.
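The idea of decompressing weights next to the compute, rather than materializing the full matrix in main memory first, can be illustrated on the CPU. The sketch below is a hypothetical simplification and not the published kernels: weights are small integers stored one per byte instead of BF16, `zlib` stands in for the real codec, a Python `bytes` scratch buffer stands in for on-chip memory, and the tile size and function names are invented for illustration.

```python
import zlib

TILE = 4  # rows per compressed tile (hypothetical tile size)


def compress_tiles(weight_rows):
    """Compress a weight matrix (rows of ints in 0..255) in tiles of TILE rows."""
    tiles = []
    for start in range(0, len(weight_rows), TILE):
        block = weight_rows[start:start + TILE]
        raw = bytes(v for row in block for v in row)
        tiles.append((len(block), zlib.compress(raw)))
    return tiles


def reconstructive_matvec(tiles, x):
    """Compute W @ x, decompressing one tile at a time into a small scratch buffer."""
    n = len(x)
    out = []
    for rows, blob in tiles:
        scratch = zlib.decompress(blob)  # only one tile is ever held uncompressed
        for r in range(rows):
            row = scratch[r * n:(r + 1) * n]
            out.append(sum(w * xi for w, xi in zip(row, x)))
    return out


# Small deterministic example: a 10 x 6 weight matrix and an input vector.
W = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(10)]
x = [1, -2, 3, 0, 4, -1]

reference = [sum(w * xi for w, xi in zip(row, x)) for row in W]
result = reconstructive_matvec(compress_tiles(W), x)
assert result == reference  # same answer as multiplying the uncompressed matrix
```

The design point is that the compressed form is what travels over the slow path, and each tile is expanded only in the buffer where it is consumed. On a GPU that buffer would be shared memory or registers, and the trade-off is extra decode work per tile in exchange for fewer bytes read from memory.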




