Cohere Proves Mixture of Experts Models Amplify Speculative Decoding Gains

Cohere

Apr 24, 2026 · Updated Apr 25, 2026

Cohere validated that Mixture-of-Experts models achieve higher speedups from speculative decoding than dense models by staying in a memory-bandwidth-bound sweet spot. The research shows that consecutive tokens naturally reuse the same experts, significantly reducing the data-loading bottleneck during parallel verification.

Cohere, an AI company building enterprise models for search and business applications, released research showing that Mixture-of-Experts (MoE) architectures, which activate only a subset of parameters per token, enhance speculative decoding, a technique in which a small draft model proposes tokens that the larger target model then verifies in parallel.

Unique expert reduction: 31%
Expert overlap (step 1): 38%
Draft model cost: 14.3% of target decode
Verification ratio (BS=1): 1.25x
Acceptance length (AL): 2.73

This challenges the assumption that loading multiple experts during verification would erase speed gains. It mirrors the pattern seen in optimized inference paths for Blackwell GPUs, where reducing data-shuffling overhead is critical. Cohere proved that temporal correlation between adjacent tokens reduces unique weight loading by up to 31%.
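
As a rough illustration of how the headline figures combine, the Python sketch below applies a common back-of-the-envelope estimate: tokens accepted per speculation round divided by the cost of that round, measured in ordinary target decode steps. It assumes the 14.3% draft cost covers all drafting in a round and that the 1.25x verification ratio is the cost of one verification pass relative to one decode step. Cohere's post may define these quantities differently, and the function name is hypothetical.

```python
def estimated_speedup(acceptance_length, draft_cost, verification_ratio):
    """Tokens gained per round divided by the round's cost in decode-step units.

    Assumption: draft_cost and verification_ratio are both expressed relative
    to one ordinary target decode step, and together make up the whole round.
    """
    round_cost = draft_cost + verification_ratio
    return acceptance_length / round_cost


# Figures quoted in the article (batch size 1).
speedup = estimated_speedup(
    acceptance_length=2.73,
    draft_cost=0.143,
    verification_ratio=1.25,
)
print(f"estimated speedup: {speedup:.2f}x")
```

Under those assumptions the estimate lands near 1.96x. That number is derived here for illustration only and is not a result reported by Cohere.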

You can apply these insights by co-optimizing model sparsity and batch sizes to stay in the bandwidth-bound regime. For high-volume workloads, lowering the active expert ratio preserves these speedups at scale. These findings follow the release of optimized W4A8 quantization kernels for the vLLM engine and Command models.

View the full update on cohere.com

Cohere

@cohere · Apr 22

Get more from speculative decoding in MoE models https://t.co/JHVcCUAmZT

Still wondering? A few quick answers below.

What is speculative decoding in Mixture of Experts models?

Speculative decoding is an inference technique where a small, fast draft model predicts upcoming tokens that a larger target model then verifies in parallel. In Mixture of Experts models, which only activate a small portion of their parameters per token, this process allows the system to generate multiple tokens for nearly the cost of a single forward pass.
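
The draft-then-verify loop can be sketched in a few lines of Python. This is a minimal greedy toy, not Cohere's implementation: `target_next` and `draft_next` are hypothetical stand-ins for real models, and the verification loop only mimics what a real system computes in one batched forward pass.

```python
def target_next(context):
    """Hypothetical target model: a deterministic toy rule, not a real LLM."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    """Hypothetical draft model: matches the target unless the token is a multiple of 5."""
    guess = target_next(context)
    return guess if guess % 5 else (guess + 1) % 50


def speculative_step(context, num_draft_tokens=3):
    """Run one round and return the tokens it contributes."""
    # 1. The draft model proposes tokens one at a time.
    proposed = []
    for _ in range(num_draft_tokens):
        proposed.append(draft_next(context + proposed))

    # 2. The target checks each position (a single parallel pass in practice).
    accepted = []
    for token in proposed:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)  # first mismatch: keep the target's token and stop
            return accepted
        accepted.append(token)

    # 3. Every draft token matched, so the target's next token comes for free.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(prompt, num_tokens, speculative):
    """Greedy generation, with or without speculation."""
    output = list(prompt)
    while len(output) - len(prompt) < num_tokens:
        if speculative:
            output += speculative_step(output)
        else:
            output.append(target_next(output))
    return output[: len(prompt) + num_tokens]


prompt = [3, 1, 4]
assert generate(prompt, 20, speculative=True) == generate(prompt, 20, speculative=False)
```

The final assertion checks the property that makes greedy speculative decoding safe: the output is identical to plain decoding with the target model, only produced in fewer target passes.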

Why do Mixture of Experts models get better speedups from speculative decoding than dense models?

Mixture of Experts models have lower arithmetic intensity, meaning they stay in a memory-bandwidth-bound state for longer than dense models. This creates a sweet spot at moderate batch sizes where the target model can verify multiple predicted tokens without hitting compute limits, making the extra verification tokens essentially free in terms of processing time.
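
A simplified roofline calculation shows why lower arithmetic intensity delays the compute-bound regime. The sketch below counts only weight traffic, assumes every weight is read once per step and that each token costs about 2 FLOPs per active parameter. The ridge point of 300 FLOPs per byte and the 4-of-64 expert configuration are hypothetical values chosen for illustration, not figures from Cohere.

```python
def arithmetic_intensity(tokens, active_fraction, bytes_per_param=1.0):
    """FLOPs performed per byte of weights read in one forward step.

    Assumes all weights are loaded once and each token uses only
    `active_fraction` of them, at roughly 2 FLOPs per active parameter.
    """
    return 2.0 * tokens * active_fraction / bytes_per_param


def tokens_to_compute_bound(ridge_flops_per_byte, active_fraction, bytes_per_param=1.0):
    """Token count at which intensity reaches the hardware ridge point."""
    return ridge_flops_per_byte * bytes_per_param / (2.0 * active_fraction)


RIDGE = 300.0  # hypothetical accelerator: peak FLOPs divided by memory bandwidth

dense_limit = tokens_to_compute_bound(RIDGE, active_fraction=1.0)
moe_limit = tokens_to_compute_bound(RIDGE, active_fraction=4 / 64)

print(f"dense model turns compute-bound near {dense_limit:.0f} tokens per step")
print(f"MoE model turns compute-bound near {moe_limit:.0f} tokens per step")
```

In this idealized setting the MoE model tolerates 16 times as many tokens per step before compute becomes the limit, which leaves more headroom for the extra tokens that verification adds.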

How does expert routing correlation affect speculative decoding performance?

Research shows that consecutive tokens in a sequence tend to activate the same experts, a property called temporal correlation. Because of this overlap, verifying four tokens only requires loading about 2.5 times the unique expert weights rather than four times. This significantly reduces the amount of data that must be moved from memory during the verification step.
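
The counting behind that figure is straightforward to sketch. The routing table below is a hypothetical example with top-2 routing, picked so that four tokens touch five distinct experts; it is not measured data.

```python
def unique_expert_load(routes):
    """Count distinct experts across tokens and express it as a multiple
    of the experts a single token needs.

    `routes` holds one set of expert ids per token in the verification window.
    """
    unique = set().union(*routes)
    experts_per_token = len(routes[0])
    return len(unique), len(unique) / experts_per_token


# Hypothetical top-2 routing for four consecutive tokens with overlapping experts.
routes = [{0, 3}, {0, 5}, {3, 7}, {5, 1}]

count, multiple = unique_expert_load(routes)
naive_multiple = float(len(routes))  # no overlap: every token loads fresh experts

print(f"unique experts: {count}")
print(f"load relative to one token: {multiple}x (vs {naive_multiple}x with no overlap)")
```

Only the unique experts have to be fetched from memory, so overlap between adjacent tokens directly shrinks the weight traffic of a verification pass.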

How does batch size impact the speed of speculative decoding for MoE models?

Unlike dense models where speedup decreases as batch size grows, MoE models show a non-monotonic curve. Speedup first increases as batch size moves toward a sweet spot where expert loading is amortized, then eventually declines once the batch size becomes large enough to make the model compute-bound rather than limited by memory bandwidth.
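
A toy cost model reproduces the shape of that curve. Everything in the sketch below is an assumption made for illustration: the constants are hypothetical, tokens are routed uniformly and independently (which ignores the temporal correlation Cohere describes), draft cost is left out, four tokens are verified per sequence, and the acceptance length of 2.73 is borrowed from the figures quoted above.

```python
NUM_EXPERTS = 64          # hypothetical expert count
ACTIVE = 4                # hypothetical experts used per token
SHARED_LOAD = 8.0         # memory time for attention and shared weights (arbitrary units)
FLOP_COST = 0.2           # compute time per token (same arbitrary units)
ACCEPTANCE_LENGTH = 2.73  # tokens gained per round, from the article's figures
VERIFY_TOKENS = 4         # assumed tokens checked per sequence in one round


def step_time(tokens):
    """Roofline-style time for one forward step over `tokens` tokens."""
    # Expected number of distinct experts touched under uniform routing;
    # each expert costs one unit of memory time to load.
    p_unused = (1 - ACTIVE / NUM_EXPERTS) ** tokens
    memory = SHARED_LOAD + NUM_EXPERTS * (1 - p_unused)
    compute = FLOP_COST * tokens
    return max(memory, compute)


def speedup(batch_size):
    """Tokens gained per round divided by verification cost relative to plain decode."""
    decode = step_time(batch_size)
    verify = step_time(batch_size * VERIFY_TOKENS)
    return ACCEPTANCE_LENGTH * decode / verify


for batch_size in (1, 8, 32, 128, 512):
    print(f"batch {batch_size:>3}: {speedup(batch_size):.2f}x")
```

At small batches, verification loads many experts that plain decoding would not. At moderate batches nearly all experts are loaded either way, so the extra tokens cost little. At large batches compute dominates and verification costs roughly four times a decode step.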

What role does fixed-overhead amortization play in low batch size inference?

At a batch size of one, speculative decoding provides an extra boost by spreading fixed costs, such as attention mechanisms and kernel launches, across multiple tokens. Since these operations cost roughly the same regardless of the number of tokens being processed, verifying several tokens at once significantly improves efficiency compared to generating them one by one.
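
The amortization argument reduces to simple arithmetic, sketched below with hypothetical cost values in arbitrary units. The calculation treats every verified token as accepted, so it shows the best case rather than a measured result.

```python
def cost_per_token(fixed_overhead, per_token_work, tokens):
    """Average cost per token when one forward pass handles `tokens` tokens."""
    return (fixed_overhead + per_token_work * tokens) / tokens


FIXED = 1.0      # hypothetical fixed cost per pass (kernel launches and similar)
PER_TOKEN = 0.1  # hypothetical extra work for each token in the pass

single = cost_per_token(FIXED, PER_TOKEN, tokens=1)
verified = cost_per_token(FIXED, PER_TOKEN, tokens=4)

print(f"one token per pass:   {single:.2f} per token")
print(f"four tokens per pass: {verified:.2f} per token")
```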

Keep reading

Cohere Integrates W4A8 Inference into vLLM for Faster Hopper Performance

Cohere released production-ready W4A8 quantization kernels for dense and Mixture of Experts models, now integrated into the vLLM inference framework. By combining 4-bit weights with 8-bit activations, the update achieves up to 58 percent faster prefill and 45 percent faster decoding on NVIDIA Hopper GPUs.

Cursor Releases Warp Decode for 1.84x Faster MoE Inference on Blackwell GPUs

Cursor · Apr 7

Anysphere rebuilt the Mixture of Experts inference path for NVIDIA Blackwell GPUs, achieving 1.84x faster throughput by assigning GPU warps to individual output neurons. This warp decode approach eliminates the data-shuffling overhead typical of expert-centric models while improving output accuracy by 1.4x.

Perplexity Benchmarks Blackwell Performance for High Throughput Large Model Inference

Perplexity · May 12

Perplexity published research showing that NVIDIA's GB200 Blackwell architecture nearly halves communication latency for large Mixture-of-Experts models compared to the previous generation. The findings suggest that Blackwell is a primary platform for reducing the cost and latency of serving frontier-scale AI search.

Together AI Optimizes MiniMax M3 Inference with New Systems Kernels

Together AI · 5h ago

Together AI implemented custom engineering optimizations to serve MiniMax M3 at production scale. The team built a KV-block-major sparse attention kernel, integrated paged attention for MSA, and optimized decode index scoring. These changes, alongside a Rust-based multimodal preprocessing gateway, delivered 81–125% throughput improvements across varying concurrency levels for the 1-million-token context model.
