What is Fireworks AI?

Fireworks AI is AI inference platform for fast, customizable model serving and compound AI systems at scale. HeadsUpAI tracks Fireworks AI across the AI ecosystem and curates every significant update — the latest being "Fireworks AI Adds Managed Fine-Tuning for DeepSeek V4 Flash" (July 29, 2026) — so you get the whole story in a 30-second read.

What's new from Fireworks AI?

The most recent Fireworks AI update is "Fireworks AI Adds Managed Fine-Tuning for DeepSeek V4 Flash" (July 29, 2026). HeadsUpAI curates every significant Fireworks AI release as a 30-second read — what shipped and why it matters.

What are the latest Fireworks AI updates and releases?

The latest Fireworks AI updates: "Fireworks AI Adds Managed Fine-Tuning for DeepSeek V4 Flash", "Fireworks AI Launches Kimi K3 With Serverless Inference and Training", "Fireworks AI Launches Nexus to Route Tasks to Open Models", "Fireworks AI Launches Kimi K3 Frontier Open-Weight Model", and "Fireworks AI and Arize Benchmark Models on Cost per Successful Task". HeadsUpAI has curated 40 Fireworks AI updates over the last 90 days, covering product updates, analysis, and launches — listed newest first, presented straight, no hype, no bias.

What does Fireworks AI do?

Fireworks AI is AI inference platform for fast, customizable model serving and compound AI systems at scale. On this page you'll find every significant Fireworks AI development HeadsUpAI has tracked recently — product updates, analysis, and launches — so you can keep up with where Fireworks AI is heading without reading a dozen sources.

How often is Fireworks AI news updated here?

Continuously. HeadsUpAI adds new Fireworks AI updates as they're announced — usually within hours — and the 40 updates currently shown cover the past 90 days, newest first.

Fireworks AI AI News & Updates — Latest Releases & Features

Fireworks AI9h ago

Fireworks AI Adds Managed Fine-Tuning for DeepSeek V4 Flash

Fireworks AI now supports fine-tuning for the DeepSeek V4 Flash model. The platform provides supervised fine-tuning, preference tuning, and combined preference optimization through its managed UI, alongside reinforcement learning via the Training API. These tools allow for the customization of the 284B-parameter model for coding agents and high-volume production workloads.

Fireworks AIJul 27

Fireworks AI Launches Kimi K3 With Serverless Inference and Training

Fireworks AI launched the 2.8-trillion-parameter Kimi K3 model, featuring a 1-million-token context window and native vision capabilities. The platform provides US-hosted serverless inference with zero data retention, priced at $3 per million input tokens and $15 per million output tokens. Additionally, the model supports serverless training in private preview, removing the need for reserved GPU capacity.

Fireworks AIJul 27

Fireworks AI Launches Nexus to Route Tasks to Open Models

Fireworks AI launched Fireworks Nexus, a platform that routes routine AI tasks from expensive proprietary models to high-performing open-weight models like GLM-5.2 and Kimi-K3. The system integrates with existing developer tools via FireConnect to reduce overall AI spend by 3–5x. It provides centralized cost observability and enterprise controls for managing AI usage across engineering teams.

Fireworks AIJul 27

Fireworks AI Launches Kimi K3 Frontier Open-Weight Model

Fireworks AI launched the 2.8-trillion-parameter Kimi K3 model with day-0 support for serverless inference and training. The model features a 1-million-token context window, native vision, and reasoning performance matching closed frontier models. Fireworks provides US-hosted endpoints with zero data retention, priced at $3 per million input tokens and $15 per million output tokens.

Fireworks AIJul 23

Fireworks AI and Arize Benchmark Models on Cost per Successful Task

Fireworks AI and Arize AI benchmarked 10 models across 2,400 agent runs, finding that routing by task difficulty improves both cost and coverage. The study concludes that measuring cost per successful task, rather than per token, reveals the true economic impact of retries and silent failures. Kimi K3 performed competitively with GPT-5.5 in these agentic evaluations.

Fireworks AIJul 22

Fireworks AI Adds Managed Fine-Tuning and Training for MiniMax M3

Fireworks AI now supports training for the MiniMax M3 model. The platform provides managed LoRA SFT and DPO for standard workflows, alongside a Training API for custom SFT, DPO, and reinforcement learning loops. This API includes support for checkpointing, rollout inference, and adapter hotloading, allowing the adaptation of the 428B-parameter model to specific tasks.

Fireworks AIJul 21

Fireworks AI Benchmarks Kimi K3 and Fable for Per-Task Routing

Fireworks AI benchmarked Kimi K3 against Fable across 1,000 agentic tasks, finding that per-task routing achieves 93% accuracy and up to 50x lower cost than using Fable alone. The study shows K3 handles 72-96% of traffic, making the frontier model a fallback. Kimi K3 arrives on the Fireworks platform on July 27.

Fireworks AIJul 21

Heidi Health Fine-Tunes Open Model on Fireworks for Faster Performance

Heidi Health fine-tuned an open model on Fireworks AI, achieving higher quality than Gemini Pro in internal clinical note evaluations. The optimized model reduced latency from 25 seconds to 7 seconds, a 3.5x improvement. The team reached production in four weeks by using aggressive data filtering and scaling effective batch sizes to 1.5 million tokens during training.

Fireworks AIJul 14

Fireworks AI Hosts LangChain Deep Agents on NVIDIA Nemotron 3 Ultra

Fireworks AI now hosts the LangChain Deep Agents harness tuned for NVIDIA Nemotron 3 Ultra. This stack achieves benchmark-leading agent performance among open models at approximately 10x lower cost than closed alternatives. The platform supports day-zero deployment and allows users to post-train the model into specialized, business-owned intelligence.

Fireworks AIJul 11

Fireworks AI Builds Blackwell-Optimized Sparse Attention Kernel for MiniMax M3

Fireworks AI built a KV-stationary sparse-attention kernel for MiniMax M3 on NVIDIA Blackwell B200 GPUs. The kernel reaches ~980 TFLOP/s, delivering a 1.9–2.4× speedup over query-stationary baselines. By loading each KV block once and gathering queries, the implementation reduces irregular memory access, achieving 1.18–1.43× higher performance for the full attention module.

Fireworks AIJul 2

Fireworks AI Refreshes Batch API With 50% Lower Pricing

Fireworks AI launched a refreshed Batch API for asynchronous workloads, priced at 50% below serverless rates. The update introduces selectable completion windows of 12, 24, 48, or 72 hours and applies automatic prompt caching to further reduce costs. Datasets are submitted in JSONL format to process large-scale tasks without requiring real-time inference.

Fireworks AIJul 1

Fireworks AI Adds GLM 5.2 Support to Microsoft Foundry Routing

Fireworks AI launched GLM 5.2 on Microsoft Foundry, enabling enterprise-governed model serving. The FireConnect CLI now routes Codex, OpenCode, and Pi requests through Azure-deployed models, billing directly against Microsoft Foundry provisioned throughput units. This integration allows teams to use Fireworks-hosted open models within existing coding agent workflows while maintaining Azure-based billing and infrastructure control.

Fireworks AIJul 1

Fireworks AI Launches Serverless 2.0 for On-Demand Production Reliability

Fireworks AI launched Serverless 2.0, providing production-grade inference reliability without requiring reserved GPU capacity or long-term contracts. The update introduces a priority tier that allows users to pay for high-reliability compute only when needed. This model removes the need for guessing peak throughput requirements, offering dedicated-deployment performance on a pay-as-you-go basis.

Fireworks AIJun 30

Fireworks AI Ships GLM 5.2 Fast for Accelerated Model Serving

Fireworks AI launched GLM 5.2 Fast, a serving path for Z.ai's GLM 5.2 model that runs 2-3x faster than the Standard tier. It achieves 140 tokens per second on shared serverless infrastructure without reserved GPUs. The path maintains identical model quality and structured output behavior, accessible by updating the model ID to accounts/fireworks/routers/glm-5p2-fast.

Fireworks AIJun 28

Fireworks AI Details Distributed RL Infrastructure for Cursor Composer 2

Fireworks AI detailed the distributed reinforcement learning infrastructure used to train Cursor's Composer 2, finding that models exploit training environment flaws before learning user intent. The system uses 3–4 global clusters with compressed weight synchronization to run large-scale rollouts, achieving frontier coding performance at 6–10x lower inference costs than comparable models.

Fireworks AIJun 26

Fireworks AI Adds Reinforcement Learning Fine-Tuning for NVIDIA Nemotron 3

Fireworks AI now supports reinforcement learning fine-tuning for NVIDIA Nemotron 3, starting with the Nemotron 3 Super variant using LoRA. The platform utilizes the GRPO algorithm for training and allows deployment on the same infrastructure. Billing occurs by GPU-hour rather than per token, providing cost predictability for long multi-turn training rollouts.

Fireworks AIJun 25

Fireworks AI Makes Z.ai's GLM 5.2 Available Within Cursor Editor

Fireworks AI now provides inference for Z.ai's GLM 5.2 model directly within the Cursor code editor. This integration enables access to the open-source frontier model without switching the existing coding workflow or editor environment. The partnership between Fireworks AI and Cursor brings the model to the Cursor interface for immediate use.

Fireworks AIJun 24

Fireworks AI Launches Managed Reinforcement Learning Service for Frontier Models

Fireworks AI launched a managed reinforcement learning service that maintains numerical identity between training and inference. By ensuring zero Kullback-Leibler divergence end-to-end, the platform eliminates the numerical drift common in fragmented training stacks. The service is available now, starting with support for Z.ai's GLM 5.2 model.

Fireworks AIJun 24

Fireworks AI Adds Fine-Tuning for Z.ai's GLM 5.2 Coding Model

Fireworks AI opened private access for fine-tuning Z.ai's GLM 5.2 coding model, supporting supervised fine-tuning, direct preference optimization, and reinforcement learning. Trained models deploy directly on the same infrastructure used for inference, eliminating the need for handoffs or migration between training and production environments.

Fireworks AIJun 24

Fireworks AI Launches FireConnect CLI for Coding Agent Model Routing

Fireworks AI launched FireConnect, a CLI tool that redirects model requests from Claude Code, Pi, OpenCode, and Codex to Fireworks-hosted open models. Installing FireConnect with a single command swaps the default provider to alternatives like GLM-5.2, MiniMax, Qwen, DeepSeek, or Kimi, with automatic backup and restoration of prior settings included.

Fireworks AIJun 17

Fireworks AI Adds Day-Zero Inference Hosting for Z.ai's GLM 5.2

Fireworks AI added day-zero inference hosting for Z.ai's GLM 5.2, a 1M-token context coding model. The platform serves the model directly on its own infrastructure, ensuring zero data retention and production-grade latency. Integration occurs via OpenAI or Anthropic-compatible APIs, with input pricing at 1.40 dollars per million tokens and cached input tokens at 0.26 dollars.

Fireworks AIJun 17

Fireworks AI Adds Full Training Support for Kimi 2.7 Models

Fireworks AI now supports full training for Moonshot AI's Kimi 2.7 model. The platform provides SFT, DPO, and RL capabilities through a managed UI or raw API. Large context windows and high LoRA ranks allow for model customization, creating specialized systems that outperform frontier models at lower inference costs.

Fireworks AIJun 13

Fireworks AI Adds Qwen 3.7 Plus With Agentic Reasoning and Caching

Fireworks AI now serves Qwen 3.7 Plus as a direct inference provider, offering full control over latency and data paths. The model supports thinking and non-thinking modes, preserved reasoning history, and prompt caching by default. It is available on serverless endpoints compatible with OpenAI and Anthropic APIs, priced at 0.50 dollars per million input tokens.

Fireworks AIJun 13

Fireworks AI Adds Day-0 Support for MiniMax M3 Multimodal Model

Fireworks AI launched day-0 support for MiniMax M3, a multimodal model featuring native image and video input. Powered by MiniMax Sparse Attention, the model delivers 9× faster prefill and 15× faster decode speeds. It is available now on serverless and on-demand endpoints, priced at parity with M2.7 at $0.30 per million input tokens.

Fireworks AIJun 4

Fireworks AI Adds NVIDIA Nemotron 3 Ultra for Agentic Reasoning

Fireworks AI now offers NVIDIA Nemotron 3 Ultra, an open model for advanced autonomous agents, with immediate deployment support. This provides developers with optimized infrastructure for long-running agentic tasks that require frontier reasoning and orchestration.

Fireworks AIJun 4

Fireworks AI adds Step 3.7 Flash for high speed agentic reasoning

Fireworks AI has deployed Step 3.7 Flash, a 198B-parameter vision-language model designed for rapid inference. The model enables real-time agentic workflows by delivering up to 400 tokens per second with selectable reasoning depths.

Fireworks AIJun 4

Fireworks AI hosts MiniMax M3 with 15x faster long context decoding

Fireworks AI is now powering inference for MiniMax M3, a multimodal model featuring a novel sparse attention architecture. The partnership enables 15.6x faster decoding at 1-million-token context, making real-time agentic workflows viable at scale.

Fireworks AIJun 4

Fireworks AI Research Shows Hybrid Agents Outperform Monolithic Frontier Models

Fireworks AI demonstrated that a GLM 5.1 worker using Claude Opus 4.7 as a sparse advisor beats standalone Opus on legal benchmarks. This architectural shift achieves higher accuracy on complex tasks while reducing inference costs by over 60%.

Fireworks AIJun 2

Microsoft brings MAI reasoning models to Fireworks for enterprise fine-tuning

Microsoft AI is launching a family of seven in-house models and partnering with Fireworks AI to enable weight-level customization. This allows organizations to integrate proprietary institutional knowledge directly into Microsoft's frontier reasoning models.

Fireworks AIMay 30

Fireworks AI Serverless 2.0 Adds Priority Lanes Without Reserved GPUs

Fireworks AI launched Serverless 2.0, introducing three distinct serving paths—Standard, Priority, and Fast—to its inference platform. By allowing developers to choose between cost-efficiency, congestion reliability, or high throughput at the request level, the update removes the binary choice between shared fleets and expensive reserved capacity.

Fireworks AIMay 29

Fireworks AI earns NVIDIA CEO Jensen Huang endorsement as AI foundry

NVIDIA CEO Jensen Huang characterized Fireworks AI as the TSMC of AI factories, highlighting the company's specialized role in the inference stack. This endorsement signals a shift where high-performance inference providers are becoming the essential foundries for the generative AI era.

Watch

Fireworks AIMay 29

Ramp Labs Deploys 10,000 Agents on Fireworks AI to Slash Security Costs

Ramp Labs used a fleet of 10,000 autonomous agents powered by open-weight models to identify high-severity vulnerabilities in its production backend. The deployment achieved a fivefold reduction in token costs compared to GPT 5.5 while maintaining the reasoning depth required for complex security auditing.

Fireworks AIMay 27

Cursor and Fireworks AI Detail the Specialized Training Infrastructure Behind Composer 2.5

Cursor and Fireworks AI shared a technical breakdown of the distributed reinforcement learning infrastructure used to build the Composer 2.5 coding model. The team treats model weights as finite storage bits dedicated entirely to software engineering, allowing the model to match frontier performance at one-tenth the cost. This shift demonstrates how specialized products can use real-world usage as a proprietary training loop.

Fireworks AIMay 20

Fireworks AI Benchmark Reveals Hidden Execution Tax Draining Agent Budgets

Fireworks AI released a benchmark report on browser agents revealing that malformed outputs create a hidden execution tax that inflates production costs. The study found that reliability in multi-step loops matters more than raw intelligence, with some frontier models wasting nearly a quarter of their inference budget on retries.

Fireworks AIMay 16

Fireworks AI Adds RL for Gemma 4 Dense to Build Reasoning Agents

Fireworks AI expanded its training platform to support full-parameter and LoRA-based reinforcement learning for Google's Gemma 4 Dense model. This allows developers to perform SFT, DPO, or RL on the model's full 256K context window using a unified stack that eliminates numerical drift between training and production.

Fireworks AIMay 15

Fireworks AI Adds Managed Fine-Tuning for Qwen 3.6 27B

Fireworks AI launched managed fine-tuning for Alibaba's Qwen 3.6 27B model, supporting 256K context windows and out-of-the-box DPO. This allows developers to specialize a high-performance dense model for complex coding and reasoning tasks on a production-ready stack.

Fireworks AIMay 15

Fireworks AI Expands Azure Foundry Catalog With Frontier Reasoning Models

Fireworks AI added DeepSeek V4 Pro and Kimi K2.6 to Microsoft Azure AI Foundry while expanding provisioned throughput support to the US Data Zone. The update allows enterprise teams to run high-performance open models with guaranteed throughput and data residency within their existing Azure environment.

Fireworks AIMay 15

Fireworks AI Adds Reinforcement Learning for GLM 5.1 to Build Custom Agents

Fireworks AI expanded its training platform to support LoRA-based reinforcement learning for Z.ai's GLM 5.1 model. This allows developers to align the model's reasoning steps with specific domain logic using custom loss functions on a 200K context window.

Fireworks AIMay 13

Fireworks AI Launches Full Parameter RL Training for Kimi K2.6

Fireworks AI added full-parameter reinforcement learning support for Moonshot AI's 1-trillion parameter Kimi K2.6 model. This allows developers to tune the entire model weight set on proprietary data to build specialized agentic moats that outperform off-the-shelf frontier systems.

Fireworks AIMay 5

Fireworks AI Uses Delta Compression to Reduce Frontier RL Training Costs

Fireworks AI introduced a distributed reinforcement learning architecture that uses delta-compressed weight updates to sync training and inference clusters across different regions. By shipping only the 2% of weights that change between checkpoints, teams can train frontier-scale models using fragmented GPU capacity instead of expensive mega-clusters.