Retrieving Semantically Similar Decisions under Noisy Institutional Labels: Robust Comparison of Embedding Methods

Logo poskytovatele

Varování

Publikace nespadá pod Ústav výpočetní techniky, ale pod Právnickou fakultu. Oficiální stránka publikace je na webu muni.cz.
Autoři

NOVOTNÁ Tereza HARAŠTA Jakub

Rok publikování 2025
Druh Článek v odborném periodiku
Časopis / Zdroj arXiv
Fakulta / Pracoviště MU

Právnická fakulta

Citace
www Text (arXiv.org)
Doi https://doi.org/10.48550/arXiv.2512.05681
Klíčová slova legal information retrieval; case law; embeddings; evaluation under noisy labels; Czech Constitutional Court
Popis Retrieving case law is a time-consuming task predominantly carried out by querying databases. We provide a comparison of two models in three different settings for Czech Constitutional Court decisions: (i) a large general-purpose embedder (OpenAI), (ii) a domain-specific BERT-trained from scratch on ~30,000 decisions using sliding windows and attention pooling. We propose a noise-aware evaluation including IDF-weighted keyword overlap as graded relevance, binarization via two thresholds (0.20 balanced, 0.28 strict), significance via paired bootstrap, and an nDCG diagnosis supported with qualitative analysis. Despite modest absolute nDCG (expected under noisy labels), the general OpenAI embedder decisively outperforms the domain pre-trained BERT in both settings at @10/@20/@100 across both thresholds; differences are statistically significant. Diagnostics attribute low absolutes to label drift and strong ideals rather than lack of utility. Additionally, our framework is robust enough to be used for evaluation under a noisy gold dataset, which is typical when handling data with heterogeneous labels stemming from legacy judicial databases.
Související projekty:

Používáte starou verzi internetového prohlížeče. Doporučujeme aktualizovat Váš prohlížeč na nejnovější verzi.

Další info