Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers

We explore whether benchmarks can be solved using simple n-gram patterns and whether LLMs exploit these patterns to solve benchmark tasks.

Marko Tešić, Lorenzo Pacchiardi, Lucy Cheke, José Hernández-Orallo

A little less conversation, a little more action, please: Investigating the physical common-sense of LLMs in a 3D embodied environment

We evaluate the physical common-sense reasoning abilities of LLMs (Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro) by embedding them in a 3D environment (the Animal-AI Testbed) and comparing their performance to that of other agents and human children.

Matteo Gabriel Mecattaf, Ben Slater, Marko Tešić, Jonathan Prunty, Konstantinos Voudouris, Lucy Cheke

Melting Pot Contest: Charting the Future of Generalized Cooperative Intelligence

An analysis of the design and outcomes of the Melting Pot competition, which measures agents’ ability to cooperate with others. We developed cognitive profiles for the agents submitted to the competition.

Rakshit S. Trivedi, Akbir Khan, Jesse Clifton, Lewis Hammond, Edgar A. Duéñez-Guzmán, John P. Agapiou, Jayd Matyas, Sasha Vezhnevets, Dipam Chakraborty, Yue Zhao, Marko Tešić, Barna Pásztor, Yunke Ao, Omar G. Younis, Jiawei Huang, Benjamin Swain, Haoyuan Qin, Mian Deng, Ziwei Deng, Utku Erdoğanaras, Natasha Jaques, Jakob Nicolaus Foerster, Vincent Conitzer, José Hernández-Orallo, Dylan Hadfield-Menell, Joel Z. Leibo