Details
- Anthropic released BioMysteryBench, a new bioinformatics benchmark of 99 real-world problems built on messy biological datasets, designed to test whether AI systems can devise creative research solutions.
- Human experts were stumped by 23 of the problems; Anthropic's latest Claude models, including Mythos, solved about 30% of those expert-stumping problems and most of the rest.
- The benchmark features method-agnostic evaluation, objective ground-truth answers derived from properties of the data, and a "superhuman" subset of questions that human experts could not solve.
- Domain experts solved 76 of the 99 tasks; on many tasks, Claude matches or exceeds panels of five experts, using diverse solution strategies.
- Capabilities improved rapidly across Claude generations, with recent models outperforming humans on some tasks experts found difficult, such as identifying an organism from a crystal structure or detecting a virus via RNA-seq.
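The announcement does not describe the grading implementation, but the design described above implies a specific shape: only the final answer is scored against a ground truth derived from the dataset, so any analysis strategy that reaches the correct answer passes. A minimal sketch of such a method-agnostic grader, with all names and the example organism purely hypothetical:

```python
# Hypothetical sketch of method-agnostic, ground-truth grading:
# the evaluator compares only the final answer to an objective ground
# truth derived from the data itself; the solving method is never inspected.
from dataclasses import dataclass


@dataclass
class Task:
    question: str
    ground_truth: str  # objective answer derived from the dataset's properties


def grade(task: Task, submitted_answer: str) -> bool:
    """Return True if the answer matches the ground truth, ignoring method."""
    normalize = lambda s: s.strip().lower()
    return normalize(submitted_answer) == normalize(task.ground_truth)


# Illustrative task (not taken from the actual benchmark): identify a
# contaminating organism hidden in a sequencing run.
task = Task(
    question="Which organism contaminates this RNA-seq sample?",
    ground_truth="Mycoplasma hyorhinis",
)
print(grade(task, "mycoplasma hyorhinis"))  # matches regardless of how it was found
```

Because grading touches only the answer string, an expert's manual BLAST search and a model's agentic pipeline are scored identically, which is what makes fair human-versus-model comparison possible.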
Impact
Anthropic's BioMysteryBench positions Claude as a leader in AI-driven bioinformatics, solving 30% of tasks that stumped expert panels. That outpaces early-2025 results on the similar BixBench benchmark, where GPT-4o and Claude 3.5 Sonnet scored only about 17% on open-answer tasks. This advances autonomous scientific discovery, potentially accelerating biological research by handling noisy, open-ended problems beyond human limits. It also pressures competitors to strengthen agentic capabilities in specialized domains, widening AI's edge in biotech applications amid growing demand for reliable scientific tools.
