When Tesseract Does It Alone: Optical Character Recognition of Medieval Texts

Warning

This publication does not belong to the Institute of Computer Science but to the Faculty of Informatics. The official page of the publication is on the muni.cz website.

Authors	NOVOTNÝ Vít
Year of publication	2020
Type	Article in proceedings
Conference	Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020
MU Faculty or unit	Faculty of Informatics
www	Workshop homepage PDF
Keywords	Optical character recognition; OCR; Historical texts
Description	Optical character recognition of scanned images of contemporary printed texts is widely considered a solved problem. However, the optical character recognition of early printed books and reprints of Medieval texts remains an open challenge. In our work, we present a dataset of 19th and 20th century letterpress reprints of documents from the Hussite era (1419–1436) and perform a quantitative and qualitative evaluation of the speed and accuracy of six existing OCR algorithms. We conclude that the Tesseract family of OCR algorithms is the fastest and the most accurate on our dataset, and we suggest improvements to our dataset.
Related projects:	Applied research: software architectures of critical infrastructures, computer systems security, natural language processing and language engineering, big data visualization and augmented reality. Involvement of students of the Faculty of Informatics in the international scientific community 20