Anthropic Benchmark Shows Claude Solving Biological Research Problems That Stump Experts

Anthropic

Apr 30, 2026 · Updated May 8, 2026

Anthropic launched BioMysteryBench, a bioinformatics evaluation that uses real-world datasets to test whether AI can devise creative solutions to open-ended research problems. A panel of human experts was stumped by 23 of the tasks; the latest Claude models solved up to 30% of those cases by combining internal knowledge with multi-step data analysis.

Anthropic released BioMysteryBench, a bioinformatics benchmark that tasks AI with analyzing raw DNA, RNA, and metabolic data. Unlike earlier evaluations built on multiple-choice questions, it poses 99 problems with objectively verifiable answers, drawn from messy, real-world biological datasets.

Total evaluation problems: 99
Human-difficult problems: 23
Claude Mythos Preview solve rate (difficult): 30%
Claude Opus 4.6 solve rate (human-solvable): 77.4%
Claude Sonnet 4.6 solve rate (human-solvable): approx. 70%
Availability: Hugging Face (BioMysteryBench-preview)
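
The percentages above can be turned into approximate problem counts. The sketch below does that arithmetic under two assumptions that the summary does not state outright: that the human-solvable set is simply the 99 problems minus the 23 that stumped the panel, and that each reported rate is a plain fraction of its subset. All variable and function names are illustrative.

```python
# Back-of-the-envelope counts implied by the reported BioMysteryBench figures.
# Assumption: human-solvable problems = total - human-difficult (not stated in the source).
TOTAL_PROBLEMS = 99
HUMAN_DIFFICULT = 23
HUMAN_SOLVABLE = TOTAL_PROBLEMS - HUMAN_DIFFICULT  # 76 under the assumption above

# (model, subset) -> (reported solve rate, assumed subset size)
reported_rates = {
    ("Claude Mythos Preview", "human-difficult"): (0.30, HUMAN_DIFFICULT),
    ("Claude Opus 4.6", "human-solvable"): (0.774, HUMAN_SOLVABLE),
    ("Claude Sonnet 4.6", "human-solvable"): (0.70, HUMAN_SOLVABLE),  # reported as approximate
}


def implied_count(rate: float, subset_size: int) -> float:
    """Return the (possibly fractional) number of problems a solve rate implies."""
    return rate * subset_size


implied = {key: implied_count(rate, size) for key, (rate, size) in reported_rates.items()}

for (model, subset), count in implied.items():
    size = reported_rates[(model, subset)][1]
    print(f"{model}: about {count:.1f} of {size} {subset} problems")
```

Under these assumptions, 30% of the 23 human-difficult problems is roughly seven problems. The implied counts are not whole numbers, so the published rates are probably rounded or computed in a way the summary does not describe.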

This shift toward agentic science evaluations addresses a critical gap in AI research. Models have long passed medical exams; BioMysteryBench measures whether they can function as autonomous researchers. The results show Claude Mythos Preview, a restricted model with near-superhuman reasoning, solving 30% of the human-difficult problems that a panel of five domain experts could not crack.

You can now access the benchmark preview on Hugging Face to test scientific agents in containerized environments. The data suggests frontier models are becoming viable collaborators for scientific discovery, though their superhuman wins remain brittle and less reliable than their performance on standard tasks.

View the full update on anthropic.com

Anthropic

@AnthropicAI · Apr 29

New on the Science Blog: We gave Claude 99 problems analyzing real biological data and compared its performance against an expert panel. On 23 problems, the experts were stumped. Our most recent models solved roughly 30% of those—and most of the rest. https://t.co/BYqr76zxhk

Still wondering? A few quick answers below.

What is BioMysteryBench?

BioMysteryBench is a bioinformatics evaluation framework developed by Anthropic to test AI models on complex, open-ended research problems. It consists of 99 questions derived from real-world biological data, such as DNA and RNA sequencing. Unlike traditional benchmarks, it requires models to use bioinformatics tools and databases to reach objective, verifiable conclusions about messy biological systems.

How did Claude perform on BioMysteryBench compared to human experts?

In Anthropic's testing, the latest Claude models demonstrated significant scientific reasoning capabilities. A panel of five domain experts was unable to solve 23 of the most difficult problems, and the Claude Mythos Preview model solved roughly 30% of those cases. On tasks that humans could solve, the models achieved high reliability, often mirroring or improving upon human analytical strategies.

What strategies does Claude use to solve bioinformatics problems?

Claude employs two primary strategies when tackling bioinformatics tasks. First, it draws on a vast internal knowledge base of structural biology and molecular profiles, built from hundreds of thousands of scientific papers. Second, when uncertain, the model layers multiple analytical methods and combines different lines of evidence to verify its conclusions, a technique that helps it navigate noisy or complex datasets.
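
One simple way to picture the second strategy is a vote across independent analyses. The following is a minimal sketch, not Anthropic's or Claude's actual procedure: it assumes each analytical method returns one candidate answer, and it accepts an answer only when more than a chosen fraction of methods agree. The method names, the answers, and the `min_agreement` threshold are all hypothetical.

```python
from collections import Counter
from typing import Optional


def consensus_answer(candidates: dict[str, str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the answer backed by more than `min_agreement` of the methods, else None."""
    if not candidates:
        return None
    votes = Counter(candidates.values())
    answer, count = votes.most_common(1)[0]
    return answer if count / len(candidates) > min_agreement else None


# Hypothetical outputs from three independent analyses of the same sample.
evidence = {
    "marker_gene_expression": "liver",
    "reference_profile_correlation": "liver",
    "pathway_enrichment": "kidney",
}

result = consensus_answer(evidence)
print(result)
```

Returning `None` when the methods disagree lets the caller treat the case as unresolved instead of forcing an answer from weak evidence.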

Is BioMysteryBench available for public use?

Yes, Anthropic has made a preview of BioMysteryBench available to the research community. Interested users can access the dataset and evaluation framework through Hugging Face to test the scientific capabilities of their own AI agents. The benchmark is designed to run in containerized environments where models have access to standard bioinformatics tools and external databases.

What are the limitations of Claude's performance on this benchmark?

While Claude solves problems that stump experts, its performance on difficult tasks is currently brittle. On human-solvable problems, the model is highly consistent, but on harder questions, many correct answers come from reasoning paths it cannot reliably reproduce. This reliability gap suggests that while the capability frontier is moving, consistent superhuman performance is still developing.
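
The reliability gap can be made concrete by attempting each problem several times and comparing "solved at least once" with "solved every time". The sketch below uses invented run outcomes purely for illustration. It is not BioMysteryBench data, and the article does not specify how Anthropic measured reproducibility; the problem names and the `summarize` helper are hypothetical.

```python
# Each list holds pass/fail outcomes from repeated, independent attempts at one problem.
# All values are invented for illustration only.
runs_by_problem = {
    "problem_a": [True, True, True, True],      # solved every time
    "problem_b": [True, False, False, True],    # solved only sometimes
    "problem_c": [False, False, True, False],   # solved once
    "problem_d": [False, False, False, False],  # never solved
}


def summarize(runs: dict[str, list[bool]]) -> tuple[float, float, float]:
    """Return (solved-at-least-once rate, solved-every-time rate, gap between them)."""
    n = len(runs)
    any_rate = sum(any(outcomes) for outcomes in runs.values()) / n
    all_rate = sum(all(outcomes) for outcomes in runs.values()) / n
    return any_rate, all_rate, any_rate - all_rate


any_rate, all_rate, reliability_gap = summarize(runs_by_problem)
print(f"solved at least once: {any_rate:.0%}")
print(f"solved every time:    {all_rate:.0%}")
print(f"reliability gap:      {reliability_gap:.0%}")
```

A large gap between the two rates is one way to quantify answers that are correct but not reproducible, which is the brittleness described above.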

Every HeadsUpAI update is written based on its original source and reviewed before it's published.

Keep reading

Claude Accelerates Scientific Discoveries at Three Research Labs

Anthropic published case studies from three Stanford and MIT labs where Claude compressed months of analysis into minutes and surfaced discoveries human experts missed. The results show research AI moving from literature reviews into hypothesis generation and experimental design.

Claude Opus 4.8 takes top spot on agentic work benchmark

Artificial Analysis · Jun 1

Anthropic's Claude Opus 4.8 has claimed the lead on the GDPval-AA leaderboard for agentic professional tasks. The model achieved an 1890 Elo rating, demonstrating a 67% win rate against GPT-5.5 xhigh in real-world work scenarios. This update establishes a new performance ceiling for AI agents capable of producing complex office deliverables.

Anthropic Launches Cowork: Claude Code for Non-Technical Tasks

Claude · Jan 12

Anthropic released Cowork, bringing Claude Code's approach to non-technical work. Give Claude access to a folder, describe your task, and it reads, edits, or creates files, turning scattered notes into drafts or screenshots into spreadsheets.

Cursor Adds Claude Fable 5, Sets New Coding Benchmark Record

Cursor · Jun 10

Cursor, an AI-first code editor, has made Anthropic's Claude Fable 5 model available within its platform. The model achieved a new state of the art on CursorBench 3.1 with a score of 72.9%, surpassing the previous best by 8 points. This update signifies a notable improvement in AI coding capabilities for complex development tasks.
