Hyph-bench: Benchmark Dataset of Hyphenated Words for Generating Hyphenation Patterns
| Authors | |
|---|---|
| Year of publication | 2025 |
| Type | Article in Proceedings |
| Conference | Recent Advances in Slavonic Natural Language Processing, RASLAN 2025 |
| MU Faculty or unit | |
| Citation | |
| Web | |
| Keywords | hyphenation; dictionary problem; effectiveness; hyphenation patterns; Patgen; pattern generation; benchmark dataset; hyphenated wordlists; supervised learning |
| Description | The hyphenation algorithm based on hyphenation patterns, developed primarily for TeX, is the one that typesetting systems, web browsers, and other line-breaking applications use almost exclusively. At the core of generating minimal patterns is an NP-complete task: optimizing pattern size and coverage when generating patterns from a list of hyphenated words, while keeping accuracy near 100%. The problem of optimizing pattern generation has already been studied for several Slavic languages; however, the heuristic setting of parameters for the pattern generation process depends on the quality and quantity of the hyphenated word list. We have designed and collected datasets of hyphenated word lists for Slavic and non-Slavic languages, with the primary goal of benchmarking and optimizing pattern generation by Patgen. We have set and computed baselines for the included languages. The resulting benchmark dataset comes with baselines that often beat the currently widely used patterns in precision and/or recall. We are paving the way towards more accurate, smaller patterns with better and more consistent coverage of word hyphenation in several Slavic and non-Slavic languages. The dataset also enables experiments in generating segmentation or hyphenation for language families, or even universal, syllabic hyphenation patterns. |
| Related projects | |
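
To make the notions of pattern-based hyphenation and of precision and recall in the description above more concrete, here is a minimal Python sketch. It is not the authors' code or Patgen itself. It assumes TeX-style patterns (letters interleaved with digits, where an odd digit permits a break and an even digit forbids one) and a word list in which hyphenation points are marked with `-`. The function names (`parse_pattern`, `hyphenate`, `gold_breaks`, `evaluate`), the `left_min`/`right_min` parameters, and the toy patterns are hypothetical and chosen only for illustration.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'a1b' into its letters 'ab' and gap values [0, 1, 0]."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            values[pos] = int(ch)  # digit applies to the gap before the next letter
        else:
            pos += 1
    return letters, values


def hyphenate(word, patterns, left_min=1, right_min=1):
    """Return the set of break positions k (a break before word[k])."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[i] is the gap before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            values = table.get(padded[start:end])
            if values:
                for offset, value in enumerate(values):
                    # competing patterns are resolved by taking the maximum
                    scores[start + offset] = max(scores[start + offset], value)
    return {
        k
        for k in range(left_min, len(word) - right_min + 1)
        if scores[k + 1] % 2 == 1  # odd value: break allowed
    }


def gold_breaks(hyphenated):
    """Extract break positions from an entry such as 'a-bc' (assumed format)."""
    breaks, count = set(), 0
    for ch in hyphenated:
        if ch == "-":
            breaks.add(count)
        else:
            count += 1
    return breaks


def evaluate(wordlist, patterns):
    """Compute (precision, recall) of the patterns against a hyphenated word list."""
    tp = fp = fn = 0
    for entry in wordlist:
        gold = gold_breaks(entry)
        found = hyphenate(entry.replace("-", ""), patterns)
        tp += len(gold & found)   # correctly found hyphens
        fp += len(found - gold)   # wrongly inserted hyphens
        fn += len(gold - found)   # missed hyphens
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Toy patterns, purely illustrative; not taken from the dataset.
toy_patterns = ["a1b", "1c", "b2c"]
print(hyphenate("abc", toy_patterns))
print(evaluate(["a-bc", "ab-c"], toy_patterns))
```

In this sketch, precision is the share of predicted hyphens that are correct and recall is the share of gold hyphens that are found. How the published baselines were actually scored may differ in detail from this simplified counting.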