Command A+ is available on @huggingface with W4A4 quantization 🤗 Cut your serving footprint dramatically with virtually zero performance degradation. Try it now: https://t.co/USXpmpid01
Cohere Releases Command A+ W4A4 Weights for Single GPU Serving
Cohere, an enterprise AI company, released W4A4 quantized weights for its Command A+ launch on Hugging Face. This 4-bit format uses NVFP4 quantization (compressing weights and activations to 4 bits) to shrink the serving footprint of the 218-billion parameter model. The release includes specialized support for
vLLM and a new response parsing library.- Active parameters
- 25B
- Total parameters
- 218B
- Context window
- 128K tokens
- Hardware requirement
- 1x B200 or 2x H100
- License
- Apache 2.0
Reasoning models typically suffer a performance penalty when compressed, as errors compound during long decoding steps. Cohere mitigated this by using quantization-aware distillation, training a smaller student model to match the full-precision teacher. This allows the model to run on a single NVIDIA B200 or two H100s with virtually no degradation in benchmark quality.
You can now download the W4A4 weights under the Apache 2.0 license for private deployment. The model supports 128K context and 48 languages, making it a viable option for global agentic workflows requiring local data residency. To run the model, you will need vLLM version 0.21.0 or higher and the cohere_melody library.
Cohere
@cohere
12retweets79likes
View on XStill wondering? A few quick answers below.
Command A+ W4A4 is a highly compressed version of Cohere's 218-billion parameter model that uses 4-bit weights and 4-bit activations. This specific quantization methodology, known as NVFP4, targets the model's experts to reduce the hardware required for inference while maintaining the reasoning and multilingual capabilities of the original full-precision version.
The W4A4 quantization significantly reduces the serving footprint, allowing the model to run on a single NVIDIA B200 GPU or two H100 GPUs. In comparison, the standard 16-bit version requires eight H100 GPUs. This reduction makes it possible for enterprises to deploy frontier-class AI models on much smaller and more cost-effective private hardware setups.
Cohere claims that the W4A4 version shows virtually zero performance degradation on benchmarks compared to the full-precision model. To achieve this, they used quantization-aware distillation, a technique where a compressed student model is trained to mimic a high-quality teacher model, effectively closing the quality gap that usually occurs during heavy model compression.
Yes, Command A+ is released under the Apache 2.0 license, making it an open-weights model available for commercial and private use. Users can download the weights from Hugging Face in various formats, including the optimized W4A4 version, to host the model entirely within their own secure infrastructure without relying on external cloud providers.
To run this specific version, you must use vLLM version 0.21.0 or higher. Additionally, users need to install Cohere's melody library to ensure accurate response parsing and tool-use execution. The model is compatible with standard transformers pipelines and supports advanced features like conversational tool use, native citations, and internal reasoning tokens.




