Training trillion-parameter MoEs is an infra problem disguised as a modeling problem. So we built the infra solution. Cursor used it to train Composer 2. Now it's available for Kimi K2.5, Qwen3.5 397B, MiniMax M2.5, and more:
→ Fused RL loss (~2x faster PPO)
→ MXFP8 expert kernels on Blackwell
→ Composable 4D parallelism
→ 1M+ token context training validated
Here's how it all works ↓ https://t.co/PA20I8EFaD
Fireworks AI Launches Infrastructure for Training Trillion Parameter MoE Models
Fireworks AI has launched training infrastructure for trillion-parameter MoE models, with support for models including Qwen3.5 and Kimi K2.5. The system introduces composable 4D parallelism, which automatically orchestrates data, pipeline, context, and expert sharding. This infrastructure recently powered the training of Cursor's Composer 2 model.
Training frontier models is increasingly an infrastructure bottleneck rather than a modeling one. The new stack uses MXFP8 kernels on NVIDIA Blackwell hardware to deliver significant speedups over BF16 without losing numerical accuracy. Fused reinforcement learning losses also make PPO roughly 2x faster by eliminating redundant forward passes.
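To make the idea of 4D parallelism concrete, here is a minimal sketch of how a pool of GPUs might be laid out along data, pipeline, context, and expert axes. It is not Fireworks' implementation: the names `Coord`, `build_mesh`, and `group_along` are hypothetical, the axis ordering is an assumption, and real systems often let the expert axis share devices with the data axis rather than treating all four as independent.

```python
from itertools import product
from typing import Dict, List, NamedTuple, Tuple


class Coord(NamedTuple):
    """Position of one rank in a hypothetical 4D mesh."""
    data: int
    pipeline: int
    context: int
    expert: int


def build_mesh(world_size: int, dp: int, pp: int, cp: int, ep: int) -> List[Coord]:
    """Assign every rank a (data, pipeline, context, expert) coordinate.

    Assumption: the four degrees are independent and must multiply to
    the world size. The list index is the rank.
    """
    if dp * pp * cp * ep != world_size:
        raise ValueError(
            f"dp*pp*cp*ep = {dp * pp * cp * ep} does not match world_size = {world_size}"
        )
    # itertools.product varies the last axis fastest, so consecutive
    # ranks differ only in their expert index.
    return [Coord(*c) for c in product(range(dp), range(pp), range(cp), range(ep))]


def group_along(mesh: List[Coord], axis: str) -> List[List[int]]:
    """Return the rank groups that communicate along one axis.

    Ranks belong to the same group when they agree on every coordinate
    except the one named by `axis`.
    """
    groups: Dict[Tuple[int, ...], List[int]] = {}
    for rank, coord in enumerate(mesh):
        key = tuple(v for name, v in coord._asdict().items() if name != axis)
        groups.setdefault(key, []).append(rank)
    return list(groups.values())


if __name__ == "__main__":
    mesh = build_mesh(world_size=16, dp=2, pp=2, cp=2, ep=2)
    print("rank 5 sits at", mesh[5])
    print("expert groups:", group_along(mesh, "expert"))
    print("data groups:  ", group_along(mesh, "data"))
```

In this layout, placing the expert axis innermost keeps expert-parallel peers on neighbouring ranks, which is one common way to keep the heaviest all-to-all traffic on the fastest links. Whether the launched system makes the same choice is not stated in the source.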
You can now access these training shapes through the Training SDK to fine-tune models at context lengths up to one million tokens. For resource-constrained environments, the platform supports LoRA fine-tuning of trillion-parameter models on a single 8-GPU node using 4x expert quantization. Managed fine-tuning and custom training loops are available via the API.
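A rough back-of-envelope calculation suggests why quantization matters for fitting a trillion-parameter model on one 8-GPU node. The sketch below rests on several assumptions that the source does not confirm: that "4x" means a 4x size reduction relative to BF16 (about 4 bits per weight), that essentially all parameters are expert weights and get quantized, and that weights are split evenly across GPUs. It also ignores activations, LoRA adapter weights, optimizer state, and any scaling-factor overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# All values below are illustrative assumptions, not published figures.
TOTAL_PARAMS = 1e12   # "trillion-parameter" model
BF16_BITS = 16        # baseline precision
COMPRESSION = 4       # assumed meaning of "4x expert quantization"
NUM_GPUS = 8          # single node

bf16_total_gb = weight_memory_gb(TOTAL_PARAMS, BF16_BITS)
quant_total_gb = weight_memory_gb(TOTAL_PARAMS, BF16_BITS / COMPRESSION)
quant_per_gpu_gb = quant_total_gb / NUM_GPUS

if __name__ == "__main__":
    print(f"BF16 weights:       {bf16_total_gb:.1f} GB total")
    print(f"Quantized weights:  {quant_total_gb:.1f} GB total")
    print(f"Quantized, per GPU: {quant_per_gpu_gb:.1f} GB")
```

Under these assumptions the frozen base weights shrink from about 2 TB to about 0.5 TB, or roughly 62.5 GB per GPU. Since LoRA trains only small adapter matrices, the optimizer state it adds is small relative to the base weights.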





