Are We There Yet? A Thorough Evaluation of POS Tagging on Czech

Warning

This publication belongs to the Faculty of Informatics, not to the Institute of Computer Science. The official publication page is on the muni.cz website.
Authors

OHLÍDALOVÁ Vlasta, JAKUBÍČEK Miloš, RYCHLÝ Pavel

Year of publication 2025
Type Article in proceedings
Conference Text, Speech, and Dialogue, 28th International Conference, TSD 2025
MU Faculty or unit

Faculty of Informatics

Citation
www Conference proceedings
Doi https://doi.org/10.1007/978-3-032-02551-7_23
Keywords morphological analysis; evaluation; POS tagging
Description With recent advances in natural language processing, part-of-speech (POS) tagging is one of the areas that has seen significant improvements. Contemporary state-of-the-art tools report accuracies approaching 100% even for morphologically rich languages such as Czech, which used to pose a challenge in the past. In this study, we investigate whether such accuracy is reproducible on real-world data, as previous research has demonstrated substantial discrepancies between evaluations conducted on gold-standard corpora and those based on text typically occurring on the web. To address this issue, we selected a set of widely used and well-established POS taggers and applied them to a random sample of documents from the csTenTen23 web corpus. Tokens for which the taggers produced differing outputs were then manually annotated. Our results indicate that the ability of modern POS taggers to handle real-world data – including a broad range of genres and topics – has improved significantly in comparison to the earlier statistically based POS taggers. Furthermore, we observe a shift in the most problematic tagging category: whereas case assignment was previously a major source of errors, the best current models struggle more with POS category distinctions. We argue that this shift may reflect ambiguities inherent in the POS category itself, where even human annotators may not fully agree.
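
The disagreement-based procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the tagger names, the simplified tag labels, and both function names are hypothetical, and it assumes every tagger has been run on the same tokenization so that outputs align position by position.

```python
from typing import Dict, List


def disagreement_indices(outputs: Dict[str, List[str]]) -> List[int]:
    """Return token positions where the taggers do not all emit the same tag.

    `outputs` maps a (hypothetical) tagger name to its tag sequence.
    Assumption: all sequences are aligned to the same tokens.
    """
    lengths = {len(tags) for tags in outputs.values()}
    if len(lengths) != 1:
        raise ValueError("all taggers must tag the same token sequence")
    n = lengths.pop()
    return [
        i for i in range(n)
        if len({tags[i] for tags in outputs.values()}) > 1
    ]


def accuracy_on_disputed(predicted: List[str], gold: Dict[int, str]) -> float:
    """Share of manually annotated (disputed) tokens a tagger got right.

    `gold` maps a token position to the tag chosen by a human annotator.
    """
    if not gold:
        raise ValueError("no manually annotated tokens to score against")
    correct = sum(predicted[i] == tag for i, tag in gold.items())
    return correct / len(gold)


# Illustrative data only: two made-up taggers, three tokens.
outputs = {
    "tagger_a": ["NOUN", "VERB", "ADV"],
    "tagger_b": ["NOUN", "VERB", "NOUN"],
}
disputed = disagreement_indices(outputs)   # positions sent to annotators
gold = {2: "ADV"}                          # hypothetical manual annotation
scores = {name: accuracy_on_disputed(tags, gold) for name, tags in outputs.items()}
```

Annotating only the disputed positions keeps the manual effort small, but the resulting scores cover only tokens where the taggers disagree; positions where all taggers agree on a wrong tag are not detected by this kind of comparison.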