Details
- Scale AI has introduced VisualToolBench (VTB), a new benchmark that tests whether AI models can actively manipulate images to solve complex problems ("thinking with images") rather than merely analyze them. No evaluated model surpassed 19% accuracy, and most scored under 10%.
- Frontier models such as GPT-5, o3, and Gemini-2.5-pro were tested on 1,204 expert-crafted tasks with 2,893 images across five domains: STEM, medicine, finance, sports, and general knowledge. The tasks required models to perform image manipulations such as cropping, rotating, or enhancing, and were evaluated in both single-turn and multi-turn interaction settings.
- VTB uses rigorous pass/fail grading, measuring visual understanding, factual accuracy, instruction compliance, reasoning, and output clarity. Only tasks that were unsolved by at least two out of three top-tier models made the final set, ensuring meaningful difficulty.
- The release addresses increasing concerns about “benchmark saturation” as noted in Stanford’s 2025 AI Index Report, which found many legacy benchmarks insufficient for distinguishing next-generation system capabilities. Prior Scale AI benchmarks like SWE-Bench Pro have revealed similar steep drops in model performance under realistic conditions.
- Error analysis attributed between 71% and 82% of failures to visual perception mistakes rather than reasoning errors. Notably, heavier tool use didn't guarantee better results: GPT-5 invoked fewer image-editing operations than o3 but used them more effectively, while Gemini-2.5-pro achieved slightly higher accuracy without using tools at all.
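To make the tool actions above concrete, here is a minimal sketch of what crop, rotate, and enhance operations look like on an image treated as a 2D grid of pixel intensities. This is an illustrative stand-in, not VTB's actual tool interface; the function names and grid representation are assumptions.

```python
# Hypothetical stand-ins for the image-manipulation "tools" VTB tasks
# call for (crop, rotate, enhance), operating on a 2D list of pixel
# values rather than a real image format.

def crop(img, top, left, height, width):
    """Return the sub-grid of size height x width starting at (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]

def rotate90(img):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def enhance(img, gain=2):
    """Scale pixel intensities by `gain`, clipping to the 0-255 range."""
    return [[min(255, p * gain) for p in row] for row in img]

img = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]

patch = crop(img, 0, 1, 2, 2)   # [[20, 30], [50, 60]]
turned = rotate90(patch)        # [[50, 20], [60, 30]]
bright = enhance(turned)        # [[100, 40], [120, 60]]
```

An agent "thinking with images" would chain operations like these — zooming into a region, reorienting it, boosting contrast — before answering, which is exactly the behavior the benchmark probes.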
Impact
VisualToolBench reveals a persistent shortfall in multimodal AI: while top models handle language tasks well, they falter at dynamically interpreting and manipulating visual information. As sectors like medicine and engineering demand such active visual problem-solving, the benchmark signals the need for new architectures focused on perceptual and agentic abilities—suggesting that scaling up alone will not close this critical gap for tomorrow’s AI systems.