Evaluating Bilingual Lexicon Induction without Lexical Data

Denisová, Michaela; Rychlý, Pavel

Warning

This publication does not belong to the Institute of Computer Science, but to the Faculty of Informatics. The official publication page is on the muni.cz website.

Authors	DENISOVÁ Michaela, RYCHLÝ Pavel
Year of publication	2025
Type	Article in proceedings
Conference	Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing (RANLP)
MU Faculty / Unit	Faculty of Informatics
Citation
Web	Full text
DOI	https://doi.org/10.26615/978-954-452-098-4-034
Keywords	bilingual lexicon induction; evaluation; cross-lingual embedding models
Description	Bilingual Lexicon Induction (BLI) is a fundamental task in cross-lingual word embedding (CWE) evaluation, aimed at retrieving word translations from monolingual corpora in two languages. Despite the task’s central role, existing evaluation datasets based on lexical data often contain biases, such as a lack of morphological diversity, frequency skew, semantic leakage, and overrepresentation of proper names, which undermine the validity of reported performance. In this paper, we propose a novel, language-agnostic evaluation methodology that entirely eliminates the dependency on lexical data. By training two sets of monolingual word embeddings (MWEs) on identical data with identical algorithms but different weight initialisations, we enable assessment on the BLI task that is unaffected by the quality of the evaluation dataset. We evaluate three baseline CWE models and analyse the impact of key hyperparameters. Our results provide a more reliable and bias-free perspective on CWE models’ performance.
Related projects:	Umělá inteligence a správa komplexních rozsáhlých dat
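
The evaluation idea in the description can be illustrated with a small sketch. When both embedding sets come from the same corpus, the correct "translation" of every word is the word itself, so no external dictionary is needed to score BLI. The code below is only an illustration under stated assumptions, not the authors' implementation: the two training runs are simulated as a random rotation plus noise (`simulate_two_runs` is a hypothetical helper), and orthogonal Procrustes is used as a stand-in mapping method; the record does not say which three CWE models the paper evaluates.

```python
import numpy as np


def normalize(m):
    """Scale each row to unit length so dot products are cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def procrustes(src, tgt):
    """Orthogonal W minimising ||src @ W - tgt||_F (stand-in mapping method)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def precision_at_1(src, tgt, w, test_idx):
    """Share of test words whose nearest target neighbour is the same word.

    The gold translation of word i is word i, because both embedding sets
    share one vocabulary; this is what replaces the lexical dataset.
    """
    mapped = normalize(src[test_idx] @ w)
    sims = mapped @ normalize(tgt).T
    pred = sims.argmax(axis=1)
    return float((pred == test_idx).mean())


def simulate_two_runs(n_words=500, dim=50, noise=0.05, seed=0):
    """ASSUMPTION: fakes two training runs with different initialisations.

    Real use would train two embedding models on the same corpus; here the
    second run is a random rotation of the first plus small Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    run_a = rng.standard_normal((n_words, dim))
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    run_b = run_a @ q + noise * rng.standard_normal((n_words, dim))
    return run_a, run_b


run_a, run_b = simulate_two_runs()
seed_idx = np.arange(0, 100)    # identity pairs used to fit the mapping
test_idx = np.arange(100, 500)  # held-out identity pairs used for scoring
w = procrustes(run_a[seed_idx], run_b[seed_idx])
score = precision_at_1(run_a, run_b, w, test_idx)
print(f"P@1 on held-out identity pairs: {score:.3f}")
```

Because the gold pairs are generated by construction, a score computed this way cannot be distorted by proper names, frequency skew, or missing inflected forms in a hand-built dictionary, although the simulated data here says nothing about how real embeddings behave.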