Words’ Burstiness in Language Models
| Authors | |
|---|---|
| Year of publication | 2011 |
| Type | Article in proceedings |
| Conference | Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011 |
| Faculty / MU workplace | |
| Citation | |
| www | https://nlp.fi.muni.cz/raslan/2011/paper17.pdf |
| Field | Linguistics |
| Keywords | Burstiness; Language models; Words' probability |
| Description | Good estimation of the probability of a single word is a crucial part of language modelling. The estimate is based on the raw frequency of the word in a training corpus. This works well for function words and most very frequent words, but poorly for most content words, because such words tend to occur in clusters. This paper analyses words' burstiness and proposes a new unigram language model that handles bursty words much better. Evaluation on two data sets shows consistently lower perplexity and cross-entropy for the new model. |
| Related projects | |
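
The raw-frequency estimate and the two evaluation measures named in the description can be illustrated with a minimal sketch. It shows only the frequency-based baseline, not the paper's proposed burstiness-aware model, which this record does not specify. The function names, the toy corpus, and the add-one smoothing for unseen words are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Return a word-probability function estimated from raw frequencies."""
    counts = Counter(tokens)
    total = len(tokens)
    # Assumption: reserve one extra vocabulary slot for any unseen word.
    vocab_size = len(counts) + 1

    def prob(word):
        # Add-one smoothing (an assumption, not taken from the paper),
        # so that unseen words receive non-zero probability.
        return (counts[word] + 1) / (total + vocab_size)

    return prob

def cross_entropy(prob, tokens):
    """Average negative log2 probability per token, in bits."""
    return -sum(math.log2(prob(w)) for w in tokens) / len(tokens)

def perplexity(prob, tokens):
    """Perplexity is 2 raised to the cross-entropy."""
    return 2 ** cross_entropy(prob, tokens)

# Toy data, purely illustrative.
train = "the cat sat on the mat the dog sat on the log".split()
test = "the cat sat on the log".split()

model = train_unigram(train)
h = cross_entropy(model, test)
ppl = perplexity(model, test)
print(f"cross-entropy: {h:.3f} bits, perplexity: {ppl:.3f}")
```

Because this baseline assigns each word a single fixed probability, it cannot raise the probability of a content word that has already appeared nearby, which is the clustering effect the description refers to.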