Utok: The Fast Rule-based Tokenizer
| Authors | |
|---|---|
| Year of publication | 2022 |
| Type | Article in proceedings |
| Conference | Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022 |
| MU Faculty / Unit | |
| Citation | |
| www | |
| Keywords | tokenizer; tokenization; text processing |
| Description | Tokenization is one of the first processing steps in most natural language processing applications. The paper introduces Utok, a new tokenizer that follows the Unitok tokenizer in its simple per-language configuration while processing text much faster. |
| Related projects | |