HeadsUpAI

Fireworks AI Launches Infrastructure for Training Trillion Parameter MoE Models


Fireworks AI updated its Training SDK with a specialized engine for trillion-parameter Mixture-of-Experts models like Qwen3.5 and Kimi K2.5. The system introduces composable 4D parallelism, which automatically orchestrates data, pipeline, context, and expert sharding. This infrastructure recently powered the training of Cursor's Composer 2 model.
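To make the four sharding axes concrete, here is a minimal sketch of how a flat GPU rank might map onto a 4D grid of data, pipeline, context, and expert coordinates. It is a generic illustration, not the Training SDK's API: the names `MeshCoord` and `rank_to_coord` are hypothetical, and it assumes the four degrees multiply to the total GPU count, which real systems may relax (for example by folding expert parallelism into the data-parallel group).

```python
from typing import NamedTuple


class MeshCoord(NamedTuple):
    """Position of one GPU along each parallelism axis (hypothetical type)."""
    data: int
    pipeline: int
    context: int
    expert: int


def rank_to_coord(rank: int, dp: int, pp: int, cp: int, ep: int) -> MeshCoord:
    """Map a flat rank to 4D coordinates, with the expert axis varying fastest.

    Assumes world_size == dp * pp * cp * ep (a simplification for illustration).
    """
    world_size = dp * pp * cp * ep
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside a world of {world_size} GPUs")
    rest, expert = divmod(rank, ep)      # innermost axis: expert
    rest, context = divmod(rest, cp)     # next: context (sequence) shards
    data, pipeline = divmod(rest, pp)    # outermost: data replicas over pipeline stages
    return MeshCoord(data, pipeline, context, expert)


if __name__ == "__main__":
    # 32 GPUs split as 2 data replicas x 2 pipeline stages x 2 context shards x 4 expert shards
    for rank in (0, 13, 31):
        print(rank, rank_to_coord(rank, dp=2, pp=2, cp=2, ep=4))
```

GPUs that share every coordinate except one form the communication group for that axis, which is why the ordering of axes matters in practice: the fastest-varying axis usually lands on the GPUs with the fastest interconnect.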

Training frontier models is increasingly limited by infrastructure rather than by modeling. The new stack uses MXFP8 expert kernels on NVIDIA Blackwell hardware, which Fireworks says deliver significant speedups over BF16 without losing numerical accuracy. Fused reinforcement learning losses also make PPO roughly 2x faster by eliminating redundant forward passes.
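For readers unfamiliar with MXFP8, the sketch below simulates the general microscaling idea in plain Python: a block of 32 values shares one power-of-two scale, and each value is stored as an 8-bit E4M3 float. It follows the publicly documented MX layout as a simplified assumption and says nothing about how Fireworks' kernels are implemented. All function names are hypothetical, and clamping of the shared exponent to its 8-bit range is omitted.

```python
import math

BLOCK = 32         # elements that share one scale in the MX layout
E4M3_EMAX = 8      # largest E4M3 exponent: max normal value is 1.75 * 2**8 = 448
E4M3_EMIN = -6     # smallest normal exponent; below this, values are subnormal
E4M3_MAX = 448.0


def floor_log2(x: float) -> int:
    """Exact floor(log2(x)) for x > 0, using the float's own exponent."""
    return math.frexp(x)[1] - 1


def round_e4m3(v: float) -> float:
    """Round to the nearest value on the E4M3 grid (3 mantissa bits), saturating at 448."""
    if v == 0.0:
        return 0.0
    mag = min(abs(v), E4M3_MAX)
    exp = max(floor_log2(mag), E4M3_EMIN)
    step = 2.0 ** (exp - 3)              # spacing between neighbours at this exponent
    return math.copysign(round(mag / step) * step, v)


def quantize_block(block):
    """Return (shared_exponent, elements) for one block of up to BLOCK values."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    shared_exp = floor_log2(amax) - E4M3_EMAX   # puts the largest value near the top of E4M3's range
    scale = 2.0 ** shared_exp
    return shared_exp, [round_e4m3(x / scale) for x in block]


def dequantize_block(shared_exp, elements):
    """Undo the shared scale to recover approximate original values."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]
```

Because the scale is chosen per 32-element block rather than per tensor, one outlier only reduces precision for its own block, which is the usual argument for why this format can track BF16 results closely.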

You can now access these training shapes through the Training SDK to fine-tune models at context lengths up to one million tokens. For resource-constrained environments, the platform supports LoRA fine-tuning of trillion-parameter models on a single 8-GPU node using 4x expert quantization. Managed fine-tuning and custom training loops are available via the API.
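A back-of-the-envelope calculation shows why quantization matters for the single-node case. The sketch below is illustrative only: it reads "4x" as weights four times smaller than BF16, treats every parameter as a quantized expert weight, ignores activations, LoRA adapters, and optimizer state, and assumes a hypothetical 192 GB of memory per GPU. None of these figures come from the announcement.

```python
def weight_gigabytes(num_params: float, bits_per_param: float) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 1e12        # one trillion parameters
BF16_BITS = 16
COMPRESSION = 4          # assumption: "4x" means 4x smaller than BF16
GPUS_PER_NODE = 8
HBM_PER_GPU_GB = 192     # hypothetical per-GPU memory, not from the announcement

bf16_gb = weight_gigabytes(NUM_PARAMS, BF16_BITS)
quantized_gb = bf16_gb / COMPRESSION
node_gb = GPUS_PER_NODE * HBM_PER_GPU_GB

fits_bf16 = bf16_gb <= node_gb
fits_quantized = quantized_gb <= node_gb

if __name__ == "__main__":
    print(f"BF16 weights:      {bf16_gb:.0f} GB (fits on node: {fits_bf16})")
    print(f"Quantized weights: {quantized_gb:.0f} GB (fits on node: {fits_quantized})")
    print(f"Node memory:       {node_gb} GB")
```

Under these assumptions the BF16 weights alone exceed the node's memory, while the quantized weights leave headroom for the small trainable LoRA adapters and activations.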

Fireworks AI (@FireworksAI_HQ) on X:

Training trillion-parameter MoEs is an infra problem disguised as a modeling problem. So we built the infra solution. Cursor used it to train Composer 2. Now it's available for Kimi K2.5, Qwen3.5 397B, MiniMax M2.5, and more:
→ Fused RL loss (~2x faster PPO)
→ MXFP8 expert kernels on Blackwell
→ Composable 4D parallelism
→ 1M+ token context training validated
Here's how it all works ↓ https://t.co/PA20I8EFaD

26 retweets, 245 likes