Cohere Releases Command A+ W4A4 Weights for Single GPU Serving

Cohere

May 21, 2026 · Updated Jun 13, 2026

Cohere released W4A4 quantized weights for its 218-billion parameter Command A+ model, enabling frontier-class reasoning on a single NVIDIA B200 GPU. By using quantization-aware distillation to maintain performance, the update allows enterprises to deploy massive agentic models with a significantly smaller hardware footprint.

Cohere, an enterprise AI company, released W4A4 quantized weights for its Command A+ launch on Hugging Face. This 4-bit format uses NVFP4 quantization (compressing weights and activations to 4 bits) to shrink the serving footprint of the 218-billion parameter model. The release includes specialized support for vLLM and a new response parsing library.

Active parameters: 25B
Total parameters: 218B
Context window: 128K tokens
Hardware requirement: 1x B200 or 2x H100
License: Apache 2.0

Reasoning models typically suffer a performance penalty when compressed, as errors compound during long decoding steps. Cohere mitigated this by using quantization-aware distillation, training a smaller student model to match the full-precision teacher. This allows the model to run on a single NVIDIA B200 or two H100s with virtually no degradation in benchmark quality.

You can now download the W4A4 weights under the Apache 2.0 license for private deployment. The model supports 128K context and 48 languages, making it a viable option for global agentic workflows requiring local data residency. To run the model, you will need vLLM version 0.21.0 or higher and the cohere_melody library.

View the full update on huggingface.co

Cohere

@cohereMay 21

Command A+ is available on @huggingface with W4A4 quantization 🤗 Cut your serving footprint dramatically with virtually zero performance degradation. Try it now: https://t.co/USXpmpid01

1279

View on X

Still wondering? A few quick answers below.

Command A+ W4A4 is a highly compressed version of Cohere's 218-billion parameter model that uses 4-bit weights and 4-bit activations. This specific quantization methodology, known as NVFP4, targets the model's experts to reduce the hardware required for inference while maintaining the reasoning and multilingual capabilities of the original full-precision version.

The W4A4 quantization significantly reduces the serving footprint, allowing the model to run on a single NVIDIA B200 GPU or two H100 GPUs. In comparison, the standard 16-bit version requires eight H100 GPUs. This reduction makes it possible for enterprises to deploy frontier-class AI models on much smaller and more cost-effective private hardware setups.

Cohere claims that the W4A4 version shows virtually zero performance degradation on benchmarks compared to the full-precision model. To achieve this, they used quantization-aware distillation, a technique where a compressed student model is trained to mimic a high-quality teacher model, effectively closing the quality gap that usually occurs during heavy model compression.

Yes, Command A+ is released under the Apache 2.0 license, making it an open-weights model available for commercial and private use. Users can download the weights from Hugging Face in various formats, including the optimized W4A4 version, to host the model entirely within their own secure infrastructure without relying on external cloud providers.

To run this specific version, you must use vLLM version 0.21.0 or higher. Additionally, users need to install Cohere's melody library to ensure accurate response parsing and tool-use execution. The model is compatible with standard transformers pipelines and supports advanced features like conversational tool use, native citations, and internal reasoning tokens.

Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

See all AI news & updates from Cohere →

Keep reading

Cohere Launches Command A+ to Bring Frontier Agentic AI to Private Hardware

Cohere released Command A+, a 218-billion parameter open-source model optimized for complex reasoning and multimodal agentic tasks. By achieving high performance on as little as two H100 GPUs, the model allows enterprises to deploy frontier-class agents entirely within their own private infrastructure.

Cohere Releases North Mini Code, a Small Open-Weight Model for Coding

Artificial AnalysisJun 10

Cohere Releases North Mini Code, a Small Open-Weight Model for Coding

Cohere released North Mini Code, a small 30B parameter (3B active) open weights coding model. This model achieves competitive coding performance for its size and speed, positioning it as a focused option in the open-weight ecosystem.

Qwen 3.5 Series Releases GPTQ-Int4 Weights for Limited-GPU Inference

QwenMar 3

Qwen 3.5 Series Releases GPTQ-Int4 Weights for Limited-GPU Inference

Qwen 3.5 Series, Alibaba's open-weight model family, now has GPTQ-Int4 quantized weights available for its larger model sizes. They work natively with vLLM and SGLang, cutting VRAM needs so teams can run larger Qwen 3.5 models on constrained GPU setups.

GoogleApr 27

Google Optimizes Gemma 4 for High Concurrency Serving on Single GPUs

Google demonstrated that the Gemma 4 26B A4B model can handle more than 10 concurrent sessions on a single GPU without performance bottlenecks. This optimization allows developers to serve high-quality reasoning models at significantly lower hardware costs for multi-user or agentic workflows.

What is the Command A+ W4A4 quantization?

What are the hardware requirements for running Command A+ W4A4?

Does the W4A4 quantization reduce the performance of Command A+?

Is Cohere Command A+ open source?

How do you run the Command A+ W4A4 model?

Keep reading

Cohere Launches Command A+ to Bring Frontier Agentic AI to Private Hardware

Cohere Launches Command A+ to Bring Frontier Agentic AI to Private Hardware

Cohere Releases North Mini Code, a Small Open-Weight Model for Coding

Cohere Releases North Mini Code, a Small Open-Weight Model for Coding

Qwen 3.5 Series Releases GPTQ-Int4 Weights for Limited-GPU Inference

Qwen 3.5 Series Releases GPTQ-Int4 Weights for Limited-GPU Inference

Google Optimizes Gemma 4 for High Concurrency Serving on Single GPUs

Google Optimizes Gemma 4 for High Concurrency Serving on Single GPUs

Keep reading

Cohere Launches Command A+ to Bring Frontier Agentic AI to Private Hardware

Cohere Launches Command A+ to Bring Frontier Agentic AI to Private Hardware

Cohere Releases North Mini Code, a Small Open-Weight Model for Coding

Cohere Releases North Mini Code, a Small Open-Weight Model for Coding

Qwen 3.5 Series Releases GPTQ-Int4 Weights for Limited-GPU Inference

Qwen 3.5 Series Releases GPTQ-Int4 Weights for Limited-GPU Inference

Google Optimizes Gemma 4 for High Concurrency Serving on Single GPUs

Google Optimizes Gemma 4 for High Concurrency Serving on Single GPUs