I am a Research Associate at the Leverhulme Centre for the Future of Intelligence, University of Cambridge, where I focus on AI evaluation. My work includes assessing the validity of benchmarks, evaluating the cognitive abilities of large language models, and mapping AI capabilities onto job demands in the human workforce. Some of my research is supported by the OECD.
Previously, I was a Royal Academy of Engineering UK IC postdoctoral research fellow investigating the impact of explanations of AI predictions on our beliefs. I also studied people’s causal and probabilistic reasoning and have a strong interest in data analysis, causal modeling and Bayesian network analysis.
I received a Ph.D. in Psychology from the Department of Psychological Sciences at Birkbeck, University of London, an M.A. in Logic and Philosophy of Science from the Munich Center for Mathematical Philosophy, LMU Munich, and a B.A. in Philosophy from the University of Belgrade, Serbia. See my CV for more information on my background, research, and work experience.
I play the violin in Paprika: The Balkan and East European band.
We explore whether benchmarks can be solved using simple n-gram patterns and whether LLMs exploit these patterns to solve benchmark tasks.
Evaluation of the physical common-sense reasoning abilities of LLMs (Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro) by embedding them in a 3D environment (Animal-AI Testbed) and comparing their performance to other agents and human children.
An analysis of the design and outcomes of the Melting Pot competition, which measures agents’ ability to cooperate with others. We developed cognitive profiles for the agents submitted to the competition.
Applying the Maximum Entropy approach to awareness growth in the Bayesian framework, i.e., to incorporating new events that we had previously not considered possible.
Investigating the effects of (good) explanations and the explainer’s reliability on our beliefs in what is being explained.
We bring together two closely related, but distinct, notions: argument and explanation. We provide a review of relevant research on these notions, drawn both from the cognitive science and the artificial intelligence (AI) literatures. We identify key directions for future research, indicating areas where bringing together cognitive science and AI perspectives would be mutually beneficial.
We explore some of the undesirable effects of providing explanations of AI systems to human users and ways to mitigate such effects. We show how providing counterfactual explanations of AI systems’ predictions unjustifiably changes people’s beliefs about causal relationships in the real world. We also show how health warning style messaging can prevent such a change in beliefs.
A discussion of the consequences of directly applying the insights from the psychology of explanation (that mostly focuses on causal explanations) to explainable AI (where most AI systems are based on associations).
What do we do with our existing models when we encounter new variables to consider? Does the order in which we learn variables matter? The paper investigates two modeling strategies and experimentally tests how people reason when presented with new variables and in different orders.
Empirical testing of the effects of the propensity interpretation of probability and ‘diagnostic split’ reasoning in the context of explaining away.
An experimental exploration of whether a Bayesian network modeling tool helps lay people to find correct solutions to complex problems.
Investigating people’s reasoning in explaining-away situations by manipulating the priors of causes and the structural complexity of the causal Bayesian networks.
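A minimal sketch of the explaining-away effect these studies center on: in a collider network A → E ← B, observing the effect E raises belief in cause A, but additionally learning that the alternative cause B occurred lowers it again. All parameters below (priors, noisy-OR weights) are made-up illustrative values, not taken from the papers.

```python
def joint(a, b, e):
    """Joint probability P(A=a, B=b, E=e) for a noisy-OR collider A -> E <- B."""
    p_a, p_b = 0.1, 0.1               # illustrative priors of the two causes
    leak, w_a, w_b = 0.01, 0.8, 0.8   # illustrative noisy-OR parameters
    p_e = 1 - (1 - leak) * (1 - w_a) ** a * (1 - w_b) ** b
    pa = p_a if a else 1 - p_a
    pb = p_b if b else 1 - p_b
    pe = p_e if e else 1 - p_e
    return pa * pb * pe

def posterior_a(evidence):
    """P(A=1 | evidence) by exact enumeration; evidence may fix 'B' and/or 'E'."""
    def consistent(b, e):
        return evidence.get("B", b) == b and evidence.get("E", e) == e
    num = sum(joint(1, b, e) for b in (0, 1) for e in (0, 1) if consistent(b, e))
    den = sum(joint(a, b, e) for a in (0, 1) for b in (0, 1) for e in (0, 1)
              if consistent(b, e))
    return num / den

# Observing the effect raises belief in cause A...
p_a_given_e = posterior_a({"E": 1})
# ...but additionally learning that cause B occurred "explains away" A.
p_a_given_e_b = posterior_a({"B": 1, "E": 1})
print(p_a_given_e > p_a_given_e_b)  # explaining away: prints True
```

The experiments manipulate exactly the quantities hard-coded here, such as the priors of the causes, to see whether people's judgments track the normative Bayesian answer.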
A justification for Inference to the Best Explanation (IBE) is provided by identifying conditions under which the best explanation of evidence can offer a confirmatory boost to the hypotheses under consideration.
Analyzing confirmation between theories in cases of intertheoretic reduction (e.g. reducing thermodynamics to statistical mechanics) using Bayesian networks.
My email address is marko dot tesic375 little monkey gmail dot com.