Cursor ships improved Composer checkpoints every five hours using real-time RL

Cursor

Mar 28, 2026 · Updated Apr 25, 2026

Cursor now trains its Composer agent with real-time reinforcement learning, using actual user interactions as reward signals instead of relying solely on simulations. The pipeline lets the team deploy an improved model checkpoint every five hours, which keeps the training data on-policy: each checkpoint learns from interactions generated by a recent version of the model. Because real user responses serve as the ground truth, the system avoids the modeling errors that come from simulating users.

Cursor has implemented a real-time RL pipeline for Composer, its agentic coding feature. The system collects billions of tokens from user interactions to produce each new model checkpoint. A full cycle, including safety evaluations through CursorBench, takes five hours, so the team can ship multiple production updates every day.
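To make the cadence concrete, here is a minimal sketch of one collect, train, evaluate, deploy cycle and of the arithmetic behind "multiple updates every day". It is an illustration under stated assumptions, not Cursor's implementation: the `Checkpoint` type, the `run_cycle` and `toy_train` functions, the scalar `eval_score`, and the `min_score` threshold are all hypothetical names introduced here. Only the five-hour cycle length comes from the article.

```python
from dataclasses import dataclass
from typing import Callable, List

CYCLE_HOURS = 5  # cycle length reported in the article


@dataclass
class Checkpoint:
    version: int
    eval_score: float  # hypothetical scalar standing in for an evaluation suite


def full_cycles_per_day(cycle_hours: float = CYCLE_HOURS) -> int:
    """Number of complete cycles that fit into 24 hours."""
    return int(24 // cycle_hours)


def run_cycle(
    deployed: Checkpoint,
    interactions: List[str],
    train: Callable[[Checkpoint, List[str]], Checkpoint],
    min_score: float,
) -> Checkpoint:
    """Train a candidate on interactions produced by the deployed checkpoint.

    The candidate replaces the deployed checkpoint only if it passes the
    evaluation gate; otherwise the deployed checkpoint stays in place.
    """
    candidate = train(deployed, interactions)
    if candidate.eval_score >= min_score:
        return candidate
    return deployed


def toy_train(cp: Checkpoint, data: List[str]) -> Checkpoint:
    """Stand-in trainer (assumption): the score grows with the amount of data."""
    return Checkpoint(cp.version + 1, cp.eval_score + 0.01 * len(data))


if __name__ == "__main__":
    print(full_cycles_per_day())  # 4
    shipped = run_cycle(Checkpoint(0, 0.5), ["a", "b"], toy_train, min_score=0.51)
    print(shipped.version)  # 1
```

Because each candidate is trained on interactions generated by the checkpoint that is currently deployed, a short cycle keeps the gap between the data-generating model and the model being trained small, which is the on-policy property the article describes.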

Traditional reinforcement learning for coding relies on simulations that often fail to model the human in the loop. By training on inference tokens from real users, Cursor addresses the train-test mismatch in which a model trained in simulation can learn to game benchmarks rather than help actual users. The result is a proprietary data loop that specializes the model for the way its users actually work.

Improved checkpoints deploy automatically behind the Auto setting in Cursor. This continuous training has already yielded a 2.28% increase in edit persistence and a 10.3% reduction in latency. Future iterations will focus on longer-horizon tasks where user feedback provides higher-fidelity signals for complex agentic outcomes.

View the full update on cursor.com

Cursor (@cursor_ai) · Mar 26

Earlier this week, we published our technical report on Composer 2. We're sharing additional research on how we train new checkpoints. With real-time RL, we can ship improved versions of the model every five hours. https://t.co/f75l7Qa4fr


