HeadsUpAI

Fireworks AI adds Step 3.7 Flash for high speed agentic reasoning

Fireworks AI is now hosting Step 3.7 Flash, a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model (an architecture that activates only a subset of parameters for each task). Developed by StepFun, the model pairs a 196B language backbone with a 1.8B vision encoder for native multimodal understanding.
Total Parameters
198B
Active Parameters
11B
Throughput
Up to 400 tokens per second
Context Window
256k tokens
Reasoning Levels
Low, Medium, High

This deployment follows the addition of MiniMax M3 to the platform. Engineered for high-frequency production workloads, the model activates only 11B parameters per token despite its massive total count. This sparse activation lets it reach up to 400 tokens per second, enabling real-time agentic loops.

While also available via the Nous Portal integration, the Fireworks deployment offers a 256k context window (the total information a model processes at once). The implementation includes three selectable reasoning levels—low, medium, and high—and uses an Apache 2.0 license.

Fireworks AI
Fireworks AI
@FireworksAI_HQ
X

Many research labs only consider inference efficiency after the fact. Step 3.7 Flash is a 198B sparse MoE VLM designed by @StepFun_ai for inference from the start. 196B language backbone with a 1.8B vision encoder. Built for real-world agent workloads, running at up to 400 tok/sec. Native multimodal understanding and action, reliable tool use, and enhanced web and visual search. Apache 2.0. Try it now → https://t.co/OYqzBUBxqL

2retweets10likes
View on X

Still wondering? A few quick answers below.

Step 3.7 Flash is a 198B-parameter vision-language model developed by StepFun. It uses a sparse Mixture-of-Experts architecture to activate only 11B parameters per token, allowing for high-speed inference. It is designed for agentic workloads, including coding, tool use, and multimodal reasoning across a 256k context window.

On the Fireworks AI platform, Step 3.7 Flash can reach a throughput of up to 400 tokens per second. This speed is achieved through the model's sparse architecture and Fireworks' optimized inference stack, making it suitable for real-time agentic loops that require rapid, multi-step reasoning and action.

The model features three selectable reasoning levels: low, medium, and high. This allows developers to dynamically adjust the model's cognitive depth based on the complexity of the task. By choosing a level, users can balance the trade-offs between generation speed, operational cost, and the accuracy of complex reasoning.

Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

Share this update