Can Out-of-Distribution Evaluations Uncover Reliance on Shortcuts? A Case Study in Question Answering

Varování

Publikace nespadá pod Ústav výpočetní techniky, ale pod Fakultu informatiky. Oficiální stránka publikace je na webu muni.cz.
Autoři

ŠTEFÁNIK Michal MICKUS Timothee SPIEGEL Michal KADLČÍK Marek KUCHAŘ Josef

Rok publikování 2025
Druh Článek ve sborníku
Konference Findings of the Association for Computational Linguistics: EMNLP 2025
Fakulta / Pracoviště MU

Fakulta informatiky

Citace
www https://aclanthology.org/2025.findings-emnlp.1232/
Klíčová slova generalization; robustness; evaluation; question answering; language models; natural language processing; machine learning
Popis A majority of recent work in AI assesses models' generalization capabilities through the lens of performance on out-of-distribution (OOD) datasets. Despite their practicality, such evaluations build upon a strong assumption: that OOD evaluations can capture and reflect upon possible failures in a real-world deployment. In this work, we challenge this assumption and confront the results obtained from OOD evaluations with a set of specific failure modes documented in existing question-answering (QA) models, referred to as a reliance on spurious features or prediction shortcuts. We find that different datasets used for OOD evaluations in QA provide an estimate of models' robustness to shortcuts that have a vastly different quality, some largely under-performing even a simple, in-distribution evaluation. We partially attribute this to the observation that spurious shortcuts are shared across ID+OOD datasets, but also find cases where a dataset's quality for training and evaluation is largely disconnected. Our work underlines limitations of commonly-used OOD-based evaluations of generalization, and provides methodology and recommendations for evaluating generalization within and beyond QA more robustly.
Související projekty:

Používáte starou verzi internetového prohlížeče. Doporučujeme aktualizovat Váš prohlížeč na nejnovější verzi.

Další info