Cohere Integrates W4A8 Inference into vLLM for Faster Hopper Performance

LLM
Enterprise AI
AI Hardware
Performance

Cohere, an AI company building enterprise models for search and business applications, integrated W4A8 inference (a mixed-precision scheme using 4-bit weights and 8-bit activations) into the vLLM framework. This update targets NVIDIA Hopper architecture, optimizing for FP8 Tensor Cores to accelerate both prefill and decoding phases.

Standard 4-bit quantization often pairs 4-bit weights with 16-bit activations, saving memory but bypassing the fastest compute engines on modern GPUs. This release bridges that gap, reflecting the broader industry shift toward hardware-specialized quantization. A lookup-table approach dequantizes weights without the scalar math overhead that typically bottlenecks FP8 kernels.
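The lookup-table idea can be sketched in a few lines: instead of computing `scale * (code - zero_point)` per element, each 4-bit code simply indexes a precomputed 16-entry table of dequantized values. The codebook below is hypothetical (the post does not specify the actual table or target values); this is a minimal illustration of the technique, not Cohere's kernel.

```python
# Hypothetical 16-entry codebook mapping 4-bit codes to dequantized values.
# A real kernel would hold target-precision (e.g. FP8-representable) entries.
LUT = [(-8 + i) * 0.5 for i in range(16)]  # -4.0 .. 3.5 in steps of 0.5

def unpack_nibbles(byte: int) -> tuple[int, int]:
    """Split one packed byte into its low and high 4-bit codes."""
    return byte & 0x0F, (byte >> 4) & 0x0F

def dequantize(packed: bytes) -> list[float]:
    """Dequantize packed 4-bit codes by table lookup: no per-element scalar math."""
    out = []
    for byte in packed:
        lo, hi = unpack_nibbles(byte)
        out.append(LUT[lo])
        out.append(LUT[hi])
    return out

weights = dequantize(bytes([0x10, 0xFF]))  # codes 0, 1, 15, 15
```

On a GPU the same lookup maps naturally onto register shuffles or shared-memory tables, which is why it sidesteps the scalar-arithmetic bottleneck the post describes.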

Deploy Command A or Mixture of Experts models with lower latency for long-context agentic workflows. The integration includes a new token masking feature in the llm-compressor library, allowing calibration on traces up to 64k tokens while excluding repetitive system prompts. These kernels are available now via the official vLLM repository.


Frequently asked questions

What is W4A8 quantization?
W4A8 is a mixed-precision quantization scheme that uses 4-bit weights to reduce memory footprint and 8-bit activations to maximize compute throughput. This approach targets the NVIDIA Hopper architecture, allowing models to run efficiently in both memory-bound decoding and compute-bound prefill regimes by utilizing high-performance FP8 Tensor Cores for matrix multiplication.
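The numerics can be illustrated with a toy symmetric-quantization sketch: weights are stored as 4-bit integers (16 levels), activations are quantized to 8 bits at runtime (real kernels use FP8 on tensor cores, not the integer math shown here), and the dot product is rescaled back to floating point at the end. The `quantize` helper and the per-tensor scales are illustrative assumptions, not the production scheme.

```python
def quantize(values, bits):
    """Symmetric quantization: map floats to signed ints in [-2^(b-1), 2^(b-1)-1]."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

weights = [0.9, -0.4, 0.1, -0.7]   # stored as 4-bit ints: 4x memory savings vs FP16
acts = [1.2, -0.3, 0.8, 0.05]      # quantized to 8 bits at runtime

qw, sw = quantize(weights, 4)      # 4-bit: only 16 levels
qa, sa = quantize(acts, 8)         # 8-bit: 256 levels

# Low-precision dot product, rescaled back to float once at the end.
dot = sum(w * a for w, a in zip(qw, qa)) * sw * sa
ref = sum(w * a for w, a in zip(weights, acts))  # full-precision reference
```

Because the inner product runs entirely in low precision, it can use the fast tensor-core paths, while the single rescale preserves the result's magnitude.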
How much faster is Cohere's W4A8 inference compared to W4A16?
Cohere's W4A8 kernels deliver significant speed improvements on NVIDIA Hopper GPUs compared to the previous W4A16 standard. Users can expect up to a 58 percent reduction in Time to First Token and a 45 percent improvement in Time Per Output Token. These gains scale consistently across different batch sizes for both dense and Mixture of Experts models.
What is token masking in the context of model calibration?
Token masking is a technique added to the llm-compressor library that allows developers to exclude repetitive or non-informative tokens from calibration statistics. By masking system prompts, templates, and tool descriptions, the calibration process focuses on useful optimization space. This is essential for preserving performance in long-context agentic workflows where repetitive prompts can bias quantization decisions.
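A minimal sketch of the idea (illustrative only; it does not reproduce the llm-compressor API): when accumulating the statistics used to pick quantization scales, positions belonging to repeated system prompts are simply excluded, so the scale reflects the informative content of the trace.

```python
def calibration_max(activations, mask):
    """Max-abs calibration statistic over unmasked token positions only."""
    kept = [a for a, keep in zip(activations, mask) if keep]
    return max(abs(a) for a in kept)

# Per-token activation magnitudes for one trace; the first three tokens
# are a repeated system prompt we don't want biasing the scale.
acts = [9.0, 8.5, 9.2, 0.4, -0.7, 0.2, 0.5]
mask = [False, False, False, True, True, True, True]

scale_all = max(abs(a) for a in acts)       # 9.2: dominated by the prompt
scale_masked = calibration_max(acts, mask)  # 0.7: reflects the real content
```

With long agentic traces, the same system prompt can account for a large fraction of every sample, so an unmasked statistic would be dominated by it, which is exactly the bias the feature removes.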
Is the Cohere W4A8 integration available in vLLM?
Yes, Cohere has integrated its production-ready W4A8 dense and grouped GEMM kernels directly into the vLLM inference framework. This integration supports both dense models and Mixture of Experts architectures. Developers can access these performance optimizations through the official vLLM repository to deploy models like Command A with improved efficiency on compatible NVIDIA hardware.
How does Cohere maintain model quality at 4-bit precision?
Cohere uses two primary techniques to recover model quality: Quantization-Aware Distillation and a specialized lookup-table approach. Distillation trains a quantized student model to match a higher-precision teacher's output. Additionally, they apply per-channel scales and manual scaling to ensure weights stay within the FP8 range without clipping, restoring accuracy to within 99.5 percent of the original baseline.
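The per-channel scaling step can be sketched as follows. FP8 E4M3 has a maximum finite magnitude of 448, so each output channel gets its own scale that maps its largest weight onto that limit; after scaling, casting to FP8 never clips. The helper names and two-row example are illustrative assumptions, not Cohere's implementation.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def per_channel_scales(weight_rows):
    """One scale per output channel, mapping its max |w| onto the FP8 range."""
    return [max(abs(w) for w in row) / FP8_E4M3_MAX for row in weight_rows]

def scale_to_fp8_range(weight_rows):
    """Divide each channel by its scale so every value fits in FP8 without clipping."""
    scales = per_channel_scales(weight_rows)
    scaled = [[w / s for w in row] for row, s in zip(weight_rows, scales)]
    return scaled, scales

# Two channels with very different dynamic ranges: a single shared scale
# would either clip the first row or crush the second into few FP8 levels.
rows = [[600.0, -120.0], [0.5, -0.25]]
scaled, scales = scale_to_fp8_range(rows)
```

The scales are kept alongside the weights and folded back into the output, so the matmul itself runs entirely in FP8.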