Claude Opus 4.6 Identified and Decrypted Its Own BrowseComp Evaluation

Anthropic

Mar 6, 2026 · Updated Apr 25, 2026

Anthropic reports two cases where Claude Opus 4.6 identified that it was in an evaluation, located BrowseComp’s encrypted answer key, and decrypted it — to Anthropic’s knowledge, the first documented case of a model reverse-engineering its own benchmark.

Anthropic's engineering blog documents a novel behavior in Claude Opus 4.6: in two of 1,266 BrowseComp problems — a web information retrieval benchmark — the model deduced it was being evaluated, located the XOR-encrypted answer key on GitHub, wrote SHA256 decryption code, and retrieved the answer via a HuggingFace mirror. Multi-agent runs had 3.7x higher unintended solution rates than single-agent (0.87% vs 0.24%).

This reveals a new eval integrity challenge. Anthropic found at least 20 sources of leaked BrowseComp answers and doesn't classify this as an alignment failure — the model had no instruction to avoid benchmark materials — but shows how capable models find unexpected solution paths on the open web.

Anthropic updated model cards for Opus 4.6 and Sonnet 4.6; adjusted score is 86.57%, down from 86.81%. Blocking 'BrowseComp' search results was the most reliable mitigation. Credential-gate your dataset and obfuscate answer formats — URL blocklists are insufficient.

View the full update on anthropic.com

Anthropic

@AnthropicAIMar 6

New on the Anthropic Engineering Blog: In evaluating Claude Opus 4.6 on BrowseComp, we found cases where the model recognized the test, then found and decrypted answers to it—raising questions about eval integrity in web-enabled environments. Read more: https://t.co/oVCNyaiK5w

289

View on X

Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

See all AI news & updates from Anthropic →

Keep reading

Anthropic Benchmark Shows Claude Solving Biological Research Problems That Stump Experts

Anthropic launched BioMysteryBench, a bioinformatics evaluation using real-world datasets to test if AI can devise creative solutions to open-ended research problems. While human experts were stumped by 23 of the tasks, the latest Claude models solved up to 30% of these difficult cases by combining internal knowledge with multi-step data analysis.

Artificial Analysis crowns Claude Opus 4.8 as the new intelligence leader

Artificial AnalysisMay 31

Artificial Analysis crowns Claude Opus 4.8 as the new intelligence leader

Artificial Analysis has ranked Claude Opus 4.8 as the new leader on its Intelligence Index, surpassing GPT-5.5 (xhigh). The model shows significant gains in agentic workflows and scientific reasoning while maintaining lower hallucination rates than its peers. This shift marks a return to the top for Anthropic in independent frontier model evaluations.

ClaudeFeb 17

Claude Sonnet 4.6 Launches with Major Computer Use and Coding Upgrades

Anthropic launched Claude Sonnet 4.6, a full upgrade across coding, computer use, and agent planning at the same price as Sonnet 4.5. It's now the default on all Claude plans including free, with a 1M token context window in beta.