Utok: The Fast Rule-based Tokenizer

Authors

RYCHLÝ Pavel, ŠPALEK Samuel

Year of publication 2022
Type Article in Proceedings
Conference Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2022
MU Faculty or unit

Faculty of Informatics

Keywords tokenizer; tokenization; text processing
Description Tokenization is one of the first processing steps in most natural language processing applications. The paper introduces a new tokenizer, Utok, which follows the Unitok tokenizer in keeping configuration for different languages simple while offering much faster processing.