Mercor Launches APEX-SWE Benchmark for Real Production Software Engineering

adarsh

Mar 26, 2026 · Updated Apr 25, 2026

Mercor and Cognition launched APEX-SWE, a benchmark that tests AI models on real software engineering work such as system integration and debugging production failures, not just writing code. Code-writing benchmarks cover only the roughly 16% of developer time spent writing code and leave the other 84% untested. Even the top model scores just 41.5%.

APEX-SWE, built by Mercor and Cognition, tests whether AI models can handle real production software engineering. It covers integration tasks (building systems across cloud services and business APIs) and observability tasks (diagnosing production failures from logs). GPT-5.3 Codex leads the launch leaderboard at 41.5% Pass@1; every model fails over half the tasks.
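
For readers unfamiliar with the metric, here is a minimal sketch of how a Pass@1 score is commonly computed. It assumes the standard unbiased pass@k estimator used in code-generation evaluations; the source does not say exactly how APEX-SWE aggregates its scores. The task IDs and results below are made up for illustration and are not APEX-SWE data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n attempts, c of which passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Too few failures to fill k draws, so any k draws include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Average the per-task pass@1 estimates across all tasks."""
    scores = [
        pass_at_k(len(attempts), sum(attempts), 1)
        for attempts in results.values()
    ]
    return sum(scores) / len(scores)


# Hypothetical per-task outcomes (True = the attempt passed the task's checks).
results = {
    "integration-001": [True, False],
    "integration-002": [False, False],
    "observability-001": [True, True],
    "observability-002": [False, False],
}

score = benchmark_pass_at_1(results)
print(f"Pass@1: {score:.1%}")
```

With k = 1 the estimator reduces to the fraction of attempts that pass on each task, averaged over tasks. Under that definition, a 41.5% Pass@1 means a single attempt succeeds on roughly four tasks in ten.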

Traditional coding benchmarks have become saturated and give a misleading picture of AI coding ability: developers spend only 16% of their time writing code, per IDC. The remaining 84% goes to work such as deployment, monitoring, and debugging, which is exactly what APEX-SWE measures. Models that top 75% on SWE-bench Verified score under 42% here.

If you're deciding how much to trust AI agents with production engineering, not just code writing, APEX-SWE gives you a direct measure. The eval harness and a 50-task dev set are open source on GitHub and Hugging Face for anyone who wants to test models against real work.

View the full update on mercor.com

adarsh

@adarsh_exe · Mar 24

Traditional coding benchmarks do not reflect how software is actually built and maintained. That's why we built a new benchmark, APEX-SWE, in partnership with @cognition. It measures whether AI models can perform complex, real-world software engineering work to ship systems https://t.co/5nJfKQBxfA


