Mercor
@mercor_ai
https://t.co/CxrLJa2Yk3
This challenges the assumption that improving models requires massive datasets. In the corporate law test, the baseline produced a professional-looking but factually wrong memo. The post-trained model correctly identified tax code violations and cited specific USC sections, which reflects better reasoning rather than just pattern matching. Hundreds of expert examples outperformed sheer volume.
The APEX-Agents benchmark and dataset are on Mercor's Hugging Face, with experiment infrastructure on their GitHub. The technical report includes trajectory-level observability showing how models attempted each task.