Tailored Fine-Tuning For The Comma Insertion In Czech

Authors	MACHURA Jakub; ŽIŽKOVÁ Hana; STANO Patrik; VRABCOVÁ Tereza; HLAVÁČKOVÁ Dana; TRNOVEC Ondřej
Year of publication	2025
Type	Article in Periodical
Magazine / Source	Jazykovedný časopis
MU Faculty or unit	Faculty of Arts
Web	https://www.juls.savba.sk/ediela/jc/2025/1/jc25-01.pdf
DOI	https://doi.org/10.2478/jazcas-2025-0024
Keywords	comma; Czech language; Fine-tuning; Large Language Model (LLM)
Description	Transfer learning, particularly with Transformers pre-trained on vast amounts of text in a particular language, allows a model to be tailored to specific grammar correction tasks, such as automatic punctuation correction. The Czech pre-trained RoBERTa model demonstrates outstanding performance in this task (Machura et al. 2022); however, previous attempts to improve the model have so far led to a slight degradation in performance (Machura et al. 2023). In this paper, we present a more targeted fine-tuning of this model, addressing linguistic phenomena that the base model overlooked. Additionally, we provide a comparison with other models trained on a more diverse dataset that is not limited to web texts.
Related projects	Oscars - Opravidlo 2.0 – Public Online Proofreading Service