Details
- Scale AI has launched the Multilingual Native Reasoning Challenge (MultiNRC), a benchmark created directly in French, Spanish, and Chinese to assess AI models on cultural and linguistic reasoning, not just translation.
- Developed in collaboration with native speakers, the benchmark includes over 1,000 questions spanning four categories: linguistic reasoning, wordplay and riddles, cultural and tradition-based reasoning, and culturally relevant math.
- Questions made the cut only if at least three of five leading language models answered them incorrectly (a sketch of this filter appears after this list). Fourteen state-of-the-art large language models were then evaluated on MultiNRC, with the best achieving just 49% accuracy; the math questions proved especially difficult, averaging 23.3%.
- Unlike most existing benchmarks, MultiNRC was not translated from English but written natively, capturing nuances such as French grammatical gender and culture-specific idioms in Chinese and Spanish.
- An experiment found that translating the cultural reasoning questions into English did not boost model performance, suggesting that the real barrier for AI is cultural understanding, not language conversion.
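To make the selection rule concrete, here is a minimal Python sketch of the "at least three of five models fail" filter described above. The `grade` callable and the model names are hypothetical placeholders for illustration, not Scale AI's actual evaluation pipeline.

```python
from typing import Callable

def keep_question(reference: str,
                  model_outputs: dict[str, str],
                  grade: Callable[[str, str], bool]) -> bool:
    """Keep a candidate question only if at least 3 of the 5 models fail it.

    `grade` is assumed to return True when a model's answer is correct.
    """
    failures = sum(
        not grade(answer, reference) for answer in model_outputs.values()
    )
    return failures >= 3

# Example: three of five (hypothetical) models miss the question,
# so it would be admitted into the benchmark.
outputs = {
    "model_a": "wrong", "model_b": "wrong", "model_c": "wrong",
    "model_d": "right", "model_e": "right",
}
print(keep_question("right", outputs, lambda ans, ref: ans == ref))  # True
```

Filtering on multi-model failure like this biases the benchmark toward genuinely hard items, which helps explain the low headline scores.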
Impact
MultiNRC highlights the limitations of current multilingual AI, which often falters on culturally grounded tasks despite strong translation abilities. By making this benchmark public, Scale AI pushes the industry to develop genuinely equitable models that respect cultural context. This move underscores a shift toward richer benchmarks and could shape the next generation of global AI systems.