Hyph-bench: Benchmark Dataset of Hyphenated Words for Generating Hyphenation Patterns

Metelka, Ondřej; Sojka, Petr

Warning

This publication does not belong to the Institute of Computer Science but to the Faculty of Informatics. The official publication page is on the muni.cz website.

Authors	METELKA Ondřej; SOJKA Petr
Year of publication	2025
Type	Article in proceedings
Conference	Recent Advances in Slavonic Natural Language Processing, RASLAN 2025
MU Faculty or unit	Faculty of Informatics
www	https://github.com/tondach01/hyph-bench paper PDF
Keywords	hyphenation; dictionary problem; effectiveness; hyphenation patterns; Patgen; pattern generation; benchmark dataset; hyphenated wordlists; supervised learning
Description	The hyphenation algorithm based on hyphenation patterns, developed primarily for TeX, is the method used almost exclusively by typesetting systems, web browsers, and other applications that need to break lines of text. The core of minimal pattern generation is an NP-complete task: optimizing the size and coverage of patterns generated from a list of hyphenated words while maintaining near 100% accuracy. The problem of optimizing pattern generation has already been studied for several Slavic languages; however, the heuristic setting of parameters for the pattern generation process depends on the quality and quantity of the hyphenated word list. We have designed and collected datasets of hyphenated word lists for Slavic and non-Slavic languages, with the primary goal of benchmarking and optimizing pattern generation by Patgen. We have set and computed baselines for the included languages. The resulting benchmark dataset comes with baselines that often beat the currently widely used patterns in precision and/or recall. We are paving the road towards more accurate, smaller patterns with better and more consistent coverage of word hyphenation in several Slavic and non-Slavic languages. The dataset also enables experiments in generating segmentations or hyphenations for language families, or even universal syllabic hyphenation patterns.
Related projects:	Umělá inteligence a správa komplexních rozsáhlých dat (Artificial Intelligence and Management of Complex Large-Scale Data)
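
To make the terms in the description concrete, here is a minimal Python sketch of how pattern-based hyphenation is applied to a word, and how precision and recall could be computed against a hyphenated word list entry. It is an illustration under stated assumptions, not the authors' code or Patgen itself: the function names, the `left_min`/`right_min` defaults, and the toy patterns are all hypothetical, and the patterns are not taken from the hyph-bench dataset. Precision and recall are computed in the usual way over break positions, which may differ in detail from how the paper or Patgen reports results.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters and inter-letter values."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            values[pos] = int(ch)  # value for the gap before letters[pos]
        else:
            pos += 1
    return letters, values


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Return break positions; position i means a break before word[i]."""
    table = dict(parse_pattern(p) for p in patterns)
    dotted = "." + word.lower() + "."  # dots mark the word boundaries
    scores = [0] * (len(dotted) + 1)
    for start in range(len(dotted)):
        for end in range(start + 1, len(dotted) + 1):
            values = table.get(dotted[start:end])
            if values:
                for offset, value in enumerate(values):
                    # competing patterns are resolved by taking the maximum
                    scores[start + offset] = max(scores[start + offset], value)
    # an odd value allows a break, an even value forbids it
    return {i for i in range(left_min, len(word) - right_min + 1)
            if scores[i + 1] % 2 == 1}


def gold_breaks(hyphenated):
    """Read break positions from a word list entry such as 'hy-phen-ation'."""
    breaks, pos = set(), 0
    for ch in hyphenated:
        if ch == "-":
            breaks.add(pos)
        else:
            pos += 1
    return breaks


def precision_recall(predicted, gold):
    """Precision and recall of predicted break positions against the gold ones."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 1.0
    recall = true_pos / len(gold) if gold else 1.0
    return precision, recall


# Toy patterns for illustration only (an assumption, not from the dataset).
TOY_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io"]

predicted = hyphenate("hyphenation", TOY_PATTERNS)
gold = gold_breaks("hy-phen-ation")
print(sorted(predicted), precision_recall(predicted, gold))
```

Dropping a pattern from the toy set (for example `hen5at`) removes one of the predicted breaks, which lowers recall while leaving precision unchanged. This is the kind of size versus coverage trade-off the description refers to.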