This project develops a framework for assessing the suitability of large language models (LLMs) for workforce tasks. It links annotated AI benchmarks, model performance evaluations, LLM capability profiles, and human work activities to estimate how suitable different LLMs are for specific job domains. The framework can inform decisions about which models to pilot for particular workforce activities. A presentation describing the project and results can be found here.
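The linkage idea can be illustrated with a minimal sketch: benchmark scores annotated with the capability each benchmark measures, and work activities mapped to the capabilities they require, joined to produce a per-activity suitability score. All names, scores, and the averaging rule below are hypothetical placeholders, not the project's actual data or method.

```python
# Hypothetical benchmark results for one model, annotated with the
# capability each benchmark measures (illustrative 0-1 scores).
benchmark_scores = {
    "reading_comprehension": 0.88,
    "numeric_reasoning": 0.72,
    "code_generation": 0.65,
}

# Hypothetical work activities mapped to the capabilities they require.
activity_requirements = {
    "summarize_reports": ["reading_comprehension"],
    "prepare_budget_estimates": ["numeric_reasoning", "reading_comprehension"],
    "maintain_scripts": ["code_generation"],
}

def suitability(activity: str) -> float:
    """Average the model's scores over the capabilities the activity needs."""
    caps = activity_requirements[activity]
    return sum(benchmark_scores[c] for c in caps) / len(caps)

for act in activity_requirements:
    print(f"{act}: {suitability(act):.2f}")
```

A real pipeline would replace the hand-written dictionaries with the annotated benchmark and work-activity datasets and a more principled aggregation, but the join structure is the same.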