Cohere Proves Mixture of Experts Models Amplify Speculative Decoding Gains

Cohere's benchmarks show that Mixture of Experts (MoE) models benefit more from speculative decoding than dense models do, challenging the assumption that loading multiple experts during verification would erase the speed gains. The result mirrors the pattern seen in optimized inference paths for Blackwell GPUs, where reducing data-shuffling overhead is critical. Cohere found that temporal correlation between adjacent tokens reduces unique expert weight loading by up to 31%.
You can apply these insights by co-optimizing model sparsity and batch sizes to stay in the bandwidth-bound regime. For high-volume workloads, lowering the active expert ratio preserves these speedups at scale. These findings are based on production benchmarks using the vLLM engine and Command models.
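The bandwidth-bound regime mentioned above can be sketched with a toy roofline check. All hardware and model numbers below are illustrative assumptions (roughly H100-class specs), not Cohere's measured figures; the point is only that a sparse model, which loads many expert weights while computing with few, stays memory-bound up to much larger batch sizes than a dense model:

```python
# Toy roofline check: is a decode step memory-bandwidth-bound or compute-bound?
# Hardware numbers are illustrative assumptions (roughly H100-class).

def decode_regime(batch_size, loaded_params_b, active_params_b,
                  peak_tflops=989.0, bandwidth_tbps=3.35):
    """Classify a decode step. `loaded_params_b` is billions of parameters
    streamed from memory (for MoE, every expert the batch touches);
    `active_params_b` is billions of parameters each token computes with.
    Weights assumed fp16 (2 bytes each), ~2 FLOPs per weight per token."""
    time_mem = (loaded_params_b * 1e9 * 2) / (bandwidth_tbps * 1e12)
    time_compute = (2 * active_params_b * 1e9 * batch_size) / (peak_tflops * 1e12)
    return "memory-bound" if time_mem > time_compute else "compute-bound"

# Dense model: every loaded parameter is also computed with.
print(decode_regime(batch_size=512, loaded_params_b=70, active_params_b=70))

# MoE: the batch collectively touches ~100B expert parameters, but each
# token only activates ~10B, so the same batch size stays memory-bound,
# leaving headroom to verify speculative tokens "for free".
print(decode_regime(batch_size=512, loaded_params_b=100, active_params_b=10))
```

In this toy model the crossover batch size scales with the ratio of loaded to active parameters, which is exactly why sparsity extends the bandwidth-bound sweet spot.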
Frequently asked questions
- What is speculative decoding in Mixture of Experts models?
- Speculative decoding is an inference technique where a small, fast draft model predicts upcoming tokens that a larger target model then verifies in parallel. In Mixture of Experts models, which only activate a small portion of their parameters per token, this process allows the system to generate multiple tokens for nearly the cost of a single forward pass.
- Why do Mixture of Experts models get better speedups from speculative decoding than dense models?
- Mixture of Experts models have lower arithmetic intensity, meaning they stay in a memory-bandwidth-bound state for longer than dense models. This creates a sweet spot at moderate batch sizes where the target model can verify multiple predicted tokens without hitting compute limits, making the extra verification tokens essentially free in terms of processing time.
- How does expert routing correlation affect speculative decoding performance?
- Research shows that consecutive tokens in a sequence tend to activate the same experts, a property called temporal correlation. Because of this overlap, verifying four tokens only requires loading about 2.5 times the unique expert weights rather than four times. This significantly reduces the amount of data that must be moved from memory during the verification step.
- How does batch size impact the speed of speculative decoding for MoE models?
- Unlike dense models where speedup decreases as batch size grows, MoE models show a non-monotonic curve. Speedup first increases as batch size moves toward a sweet spot where expert loading is amortized, then eventually declines once the batch size becomes large enough to make the model compute-bound rather than limited by memory bandwidth.
- What role does fixed-overhead amortization play in low batch size inference?
- At a batch size of one, speculative decoding provides an extra boost by spreading fixed costs—like attention mechanisms and kernel launches—across multiple tokens. Since these operations cost roughly the same regardless of the number of tokens being processed, verifying several tokens at once significantly improves efficiency compared to generating them one by one.
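The temporal-correlation effect described above can be illustrated with a small routing simulation. This is a hypothetical model, not Cohere's methodology: `stickiness` (the probability that a token reuses each of the previous token's experts) and the expert counts are made-up parameters standing in for the correlation observed in real MoE routers:

```python
import random

def unique_expert_load_factor(num_experts=64, top_k=8, window=4,
                              stickiness=0.7, steps=5_000, seed=0):
    """Average number of unique experts touched by `window` consecutive
    tokens, divided by top_k. 1.0 means perfect reuse across the window;
    `window` (here 4.0) means no reuse at all. `stickiness` is a stand-in
    for temporal routing correlation between adjacent tokens."""
    rng = random.Random(seed)
    seq = [set(rng.sample(range(num_experts), top_k))]
    for _ in range(steps + window - 1):
        # Keep each of the previous token's experts with prob. `stickiness`,
        # then fill the remaining top-k slots with random experts.
        kept = {e for e in seq[-1] if rng.random() < stickiness}
        while len(kept) < top_k:
            kept.add(rng.randrange(num_experts))
        seq.append(kept)
    factors = [len(set().union(*seq[i:i + window])) / top_k
               for i in range(steps)]
    return sum(factors) / steps

correlated = unique_expert_load_factor(stickiness=0.7)
independent = unique_expert_load_factor(stickiness=0.0)
print(f"correlated routing:  {correlated:.2f}x unique expert weights per 4 tokens")
print(f"independent routing: {independent:.2f}x unique expert weights per 4 tokens")
```

With correlated routing the four-token verification window loads well under 4x the unique expert weights, which is the mechanism behind the ~2.5x figure cited above; with independent routing the factor climbs back toward 4x.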