Cohere Proves Mixture of Experts Models Amplify Speculative Decoding Gains

Cohere

Cohere validated that Mixture-of-Experts models achieve higher speedups from speculative decoding than dense models by staying in a memory-bandwidth-bound sweet spot. The research shows that consecutive tokens naturally reuse the same experts, significantly reducing the data-loading bottleneck during parallel verification.

Cohere, an AI company building enterprise models for search and business applications, released research showing that Mixture-of-Experts (MoE) architectures, which activate only a subset of parameters per token, actually increase the gains from speculative decoding (a technique where a small draft model proposes tokens that the larger target model then verifies in parallel).

Unique expert reduction: 31%
Expert overlap (step 1): 38%
Draft model cost: 14.3% of target decode
Verification ratio (BS=1): 1.25x
Acceptance length (AL): 2.73

This challenges the assumption that loading multiple experts during verification would erase the speed gains. It mirrors the pattern seen in optimized inference paths for Blackwell GPUs, where reducing data-shuffling overhead is critical. Cohere showed that temporal correlation between adjacent tokens reduces the unique expert weights loaded during verification by up to 31%.

You can apply these insights by co-optimizing model sparsity and batch sizes to stay in the bandwidth-bound regime. For high-volume workloads, lowering the active expert ratio preserves these speedups at scale. These findings follow the release of optimized W4A8 quantization kernels for the vLLM engine and Command models.

Still wondering? A few quick answers below.

Speculative decoding is an inference technique where a small, fast draft model predicts upcoming tokens that a larger target model then verifies in parallel. In Mixture of Experts models, which only activate a small portion of their parameters per token, this process allows the system to generate multiple tokens for nearly the cost of a single forward pass.
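
To make the draft-then-verify loop concrete, here is a minimal sketch of one speculative step with greedy verification. It is an illustration under stated assumptions, not Cohere's implementation: `draft_next` and `target_next` are hypothetical callables that return the greedy next token for a prefix, and the target is called in a loop where a real system would score all positions in a single forward pass.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: maps a token prefix to the model's greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(
    prefix: List[Token],
    draft_next: NextToken,
    target_next: NextToken,
    num_draft: int,
) -> List[Token]:
    """Return the tokens produced by one draft-and-verify cycle."""
    # 1. The draft model proposes num_draft tokens one at a time.
    proposed: List[Token] = []
    for _ in range(num_draft):
        proposed.append(draft_next(prefix + proposed))

    # 2. The target model gives its own choice at every position.
    #    (Looped here for clarity; in practice this is one parallel pass.)
    target_choices = [
        target_next(prefix + proposed[:i]) for i in range(num_draft + 1)
    ]

    # 3. Keep the longest run of draft tokens the target agrees with,
    #    then append the target's own token at the first disagreement
    #    (or a bonus token if every draft token was accepted).
    accepted: List[Token] = []
    for i, tok in enumerate(proposed):
        if tok != target_choices[i]:
            break
        accepted.append(tok)
    accepted.append(target_choices[len(accepted)])
    return accepted
```

Each cycle yields at least one token and at most `num_draft + 1`. Production systems that sample rather than decode greedily typically replace the equality check with a rejection-sampling rule.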

Mixture of Experts models have lower arithmetic intensity, meaning they stay in a memory-bandwidth-bound state for longer than dense models. This creates a sweet spot at moderate batch sizes where the target model can verify multiple predicted tokens without hitting compute limits, making the extra verification tokens essentially free in terms of processing time.
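
The arithmetic-intensity argument can be sketched with a simple roofline estimate. The function below is a rough model, not a measurement: it assumes every weight is read from memory once per forward pass, that each token costs about 2 FLOPs per active parameter, and that attention and other overheads are ignored. The hardware numbers in the usage note are hypothetical.

```python
def compute_bound_threshold(
    peak_flops: float,
    mem_bandwidth: float,
    bytes_per_param: float,
    active_ratio: float,
) -> float:
    """Tokens per forward pass at which compute time equals weight-loading time.

    With P parameters:
      memory time  = P * bytes_per_param / mem_bandwidth
      compute time = 2 * active_ratio * P * tokens / peak_flops
    Setting the two equal and solving for tokens gives the value returned
    (P cancels out). Below this token count the pass is bandwidth-bound.
    """
    if not 0.0 < active_ratio <= 1.0:
        raise ValueError("active_ratio must be in (0, 1]")
    return peak_flops * bytes_per_param / (2.0 * active_ratio * mem_bandwidth)
```

With a hypothetical accelerator at 1e15 FLOP/s and 2e12 bytes/s and 2-byte weights, a dense model (`active_ratio=1.0`) reaches the threshold at 500 tokens, while an MoE model that activates one eighth of its weights per token (`active_ratio=0.125`) reaches it at 4000. Under these assumptions, the sparser model has eight times more room for extra verification tokens before compute becomes the limit.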

Research shows that consecutive tokens in a sequence tend to activate the same experts, a property called temporal correlation. Because of this overlap, verifying four tokens only requires loading about 2.5 times the unique expert weights rather than four times. This significantly reduces the amount of data that must be moved from memory during the verification step.
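
The bookkeeping behind that comparison is easy to sketch. The helper below counts the distinct experts touched by a window of tokens in one MoE layer and expresses it in units of a single token's expert count. The routing data in the usage note is made up for illustration and is not taken from Cohere's measurements.

```python
from typing import List, Set


def unique_expert_load(routing: List[Set[int]]) -> float:
    """Distinct experts used by a token window, relative to one token's share.

    routing[i] is the set of expert ids selected for token i in one layer.
    A result equal to len(routing) means no reuse at all; smaller values
    mean consecutive tokens shared experts.
    """
    if not routing:
        raise ValueError("routing must contain at least one token")
    top_k = len(routing[0])
    unique = set().union(*routing)
    return len(unique) / top_k
```

For example, with top-2 routing and four tokens routed to `{0, 1}`, `{1, 2}`, `{2, 3}` and `{3, 4}`, only five distinct experts are loaded instead of eight, so the window costs 2.5 times one token's expert weights rather than 4 times.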

Unlike dense models where speedup decreases as batch size grows, MoE models show a non-monotonic curve. Speedup first increases as batch size moves toward a sweet spot where expert loading is amortized, then eventually declines once the batch size becomes large enough to make the model compute-bound rather than limited by memory bandwidth.
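
A toy model can reproduce the shape of that curve. The sketch below estimates how much more a verification pass costs than a normal decode pass as batch size grows. All of it is assumption: routing is treated as uniform and independent (so it ignores the temporal correlation described above), attention and fixed overheads are left out, and the layer and hardware parameters are hypothetical defaults. It is meant to show the trend, not to match Cohere's reported numbers.

```python
def verification_cost_ratio(
    batch_size: int,
    verify_tokens: int = 4,
    num_experts: int = 64,
    top_k: int = 4,
    expert_bytes: float = 1e8,
    expert_flops_per_token: float = 1e8,
    mem_bandwidth: float = 1e12,
    peak_flops: float = 1e15,
) -> float:
    """Cost of verifying verify_tokens per sequence, relative to one decode step."""

    def step_time(tokens: int) -> float:
        # Expected number of distinct experts hit when each token picks
        # top_k of num_experts uniformly at random.
        miss_prob = (1.0 - top_k / num_experts) ** tokens
        expected_unique = num_experts * (1.0 - miss_prob)
        memory_time = expected_unique * expert_bytes / mem_bandwidth
        compute_time = tokens * top_k * expert_flops_per_token / peak_flops
        # Roofline: the slower of the two resources sets the step time.
        return max(memory_time, compute_time)

    return step_time(batch_size * verify_tokens) / step_time(batch_size)
```

With these defaults the ratio is well above 1 at batch size 1, because each extra token pulls in new experts, drops close to 1 around batch size 64, where nearly every expert is loaded anyway, and climbs back toward `verify_tokens` at very large batches once compute dominates. A lower ratio means cheaper verification, so the speedup peaks in the middle.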

At a batch size of one, speculative decoding provides an extra boost by spreading fixed costs—like attention mechanisms and kernel launches—across multiple tokens. Since these operations cost roughly the same regardless of the number of tokens being processed, verifying several tokens at once significantly improves efficiency compared to generating them one by one.
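
The headline figures can be combined into a back-of-the-envelope speedup estimate. The formula below is a common simplification rather than Cohere's stated method: it measures everything in units of one target decode step and divides the tokens gained per cycle by the cost of that cycle. Treating the 14.3% draft cost as the cost of the whole drafting phase is an assumption, since the article does not say whether it is per token or per cycle.

```python
def estimated_speedup(
    acceptance_length: float,
    verify_ratio: float,
    draft_cost: float,
) -> float:
    """Rough speculative decoding speedup over plain decoding.

    acceptance_length: average tokens produced per draft-and-verify cycle.
    verify_ratio: cost of one verification pass, in target decode steps.
    draft_cost: cost of drafting for one cycle, in target decode steps
                (assumed to cover the whole cycle).
    """
    cycle_cost = verify_ratio + draft_cost
    if cycle_cost <= 0:
        raise ValueError("cycle cost must be positive")
    return acceptance_length / cycle_cost
```

Plugging in the figures above, `estimated_speedup(2.73, 1.25, 0.143)` comes out at roughly 1.96. That is a derived estimate under the stated assumption, not a result reported by Cohere.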
