Benchmark Design Criteria for Mathematical Reasoning in LLMs

Abstract

As AI models increasingly tackle complex tasks, evaluating their mathematical reasoning capabilities has become essential. However, designing effective benchmarks that accurately assess a model's reasoning abilities in mathematics requires careful consideration of a range of design parameters. This paper outlines key considerations in developing robust benchmarks for evaluating large language models (LLMs) in mathematical reasoning, highlights limitations of existing assessments, and proposes criteria for comprehensive evaluations.

Publication
Center for Curriculum Redesign
Marko Tešić
Research Associate