Cognition's SWE-1.6 Preview Beats SWE-1.5 by Over 11 Points on Agentic Coding Benchmark

Cognition

Mar 1, 2026 · Updated Apr 25, 2026

Cognition released an early SWE-1.6 preview that scores 51.7% on SWE-Bench Pro, more than 11 percentage points above SWE-1.5, at the same 950 tok/s speed. It beats top open-source models on the benchmark, and early access is rolling out to select users.

Cognition, the team behind Devin, shared an early checkpoint of its ongoing SWE-1.6 training run. Trained with 100x more RL compute than SWE-1.5 on a 6x faster stack, it scores 51.7% on SWE-Bench Pro, 11.6 percentage points above SWE-1.5's 40.1%, at the same 950 tok/s speed.
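The gap between the two quoted scores can be read in two ways: as an absolute gain in percentage points or as a relative gain over the SWE-1.5 score. The sketch below works through both using only the 40.1% and 51.7% figures reported here. The function name `score_gain` is purely illustrative and does not come from Cognition.

```python
def score_gain(old_pct: float, new_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_points = new_pct - old_pct
    relative_percent = absolute_points / old_pct * 100
    return absolute_points, relative_percent


# Scores quoted in the article: SWE-1.5 at 40.1%, SWE-1.6 preview at 51.7%.
points, relative = score_gain(40.1, 51.7)
print(f"{points:.1f} points, {relative:.1f}% relative")
# prints "11.6 points, 28.9% relative"
```

The "11-point" figure therefore refers to the absolute gap. Measured relative to SWE-1.5's score, the improvement is closer to 29%.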

The training surfaced a finding Cognition calls the "Model UX" gap: higher benchmark scores don't always mean better real-world behavior. The model iterates harder on tough problems but also overthinks and over-verifies in ways that hurt interactivity, which remains an active area of research.

Early access is rolling out to a small group to help tune this behavior. Follow Cognition's platform updates if you want to test SWE-1.6 before the full release.

View the full update on cognition.ai

Cognition (@cognition) · Mar 1

We are sharing an early preview of our ongoing SWE-1.6 training run. It significantly improves upon SWE-1.5 while being post-trained on the same pre-trained model - and it runs equally as fast at 950 tok/s. On SWE-Bench Pro it exceeds top open-source models. The preview model https://t.co/poTXyKcKnj


