Traditional coding benchmarks do not reflect how software is actually built and maintained. That's why we built a new benchmark, APEX-SWE, in partnership with @cognition. It measures whether AI models can perform complex, real-world software engineering work to ship systems https://t.co/5nJfKQBxfA
Mercor Launches APEX-SWE Benchmark for Real Production Software Engineering
APEX-SWE, built by Mercor and Cognition, tests whether AI models can handle real production software engineering. It covers integration tasks (building systems across cloud services and business APIs) and observability tasks (diagnosing production failures from logs). GPT-5.3 Codex leads the launch leaderboard at 41.5% Pass@1; every model fails over half the tasks.
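For readers unfamiliar with the metric, Pass@1 is commonly read as the share of tasks a model solves on a single attempt. The sketch below shows that arithmetic under stated assumptions: the function name `pass_at_1`, the task ids, and the outcomes are all hypothetical, and how the APEX-SWE harness actually aggregates runs is not described here.

```python
def pass_at_1(results):
    """Mean per-task success rate for a single attempt.

    `results` maps a task id to a list of booleans, one per independent
    attempt (hypothetical data layout, not the APEX-SWE format).
    """
    # Fraction of successful attempts for each task.
    per_task = [sum(attempts) / len(attempts) for attempts in results.values()]
    # Average those fractions across all tasks.
    return sum(per_task) / len(per_task)


# Made-up outcomes for illustration only; these are not benchmark results.
example = {
    "task-a": [True, False],
    "task-b": [False, False],
    "task-c": [True, True],
    "task-d": [False, True],
}

score = pass_at_1(example)
print(f"Pass@1 = {score:.1%}")
```

On this reading, a 41.5% score means that a single attempt succeeds on roughly four tasks in ten.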
Traditional coding benchmarks have become saturated and give a misleading picture of AI coding ability: developers spend only 16% of their time writing code, per IDC. The remaining 84% goes to deployment, monitoring, and debugging, which is exactly what APEX-SWE measures. Models that top 75% on SWE-bench Verified drop to under 42% here.
If you're deciding how much to trust AI agents for production engineering, not just code writing, APEX-SWE gives you the number that matters. The eval harness and a 50-task dev set are open-source on GitHub and Hugging Face for anyone testing models against real work.
adarsh
@adarsh_exe
117 retweets