Words’ Burstiness in Language Models

Warning

This publication does not fall under the Institute of Computer Science but under the Faculty of Informatics. The official publication page is on the muni.cz website.

Authors	RYCHLÝ Pavel
Year of publication	2011
Type	Article in proceedings
Conference	Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011
MU Faculty or unit	Faculty of Informatics
www	https://nlp.fi.muni.cz/raslan/2011/paper17.pdf
Field	Linguistics
Keywords	Burstiness; Language models; Words' probability
Description	Good estimation of the probability of a single word is a crucial part of language modelling. It is based on the raw frequency of the word in a training corpus. Such computation is a good estimation for functional words and most very frequent words, but it is a poor estimation for most content words because of words' tendency to occur in clusters. This paper provides an analysis of words' burstiness and proposes a new unigram language model which handles bursty words much better. The evaluation of the model on two data sets shows consistently lower perplexity and cross-entropy in the new model.
Related projects:	Právní e-slovník - PES; Temporální aspekty znalostí a informací
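
The description above refers to a raw-frequency unigram estimate and to perplexity and cross-entropy as evaluation measures. The following is a minimal Python sketch of that baseline only, not of the model proposed in the paper. The function names, the add-one smoothing and the single extra vocabulary slot for unseen words are illustrative assumptions that do not come from the publication.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Count raw word frequencies in a training corpus (baseline model)."""
    counts = Counter(tokens)
    total = len(tokens)
    # Assumption: reserve one extra vocabulary slot for all unseen words.
    vocab_size = len(counts) + 1
    return counts, total, vocab_size


def unigram_prob(word, counts, total, vocab_size):
    """Raw-frequency estimate with add-one smoothing (an assumption,
    used here only so that unseen words get a non-zero probability)."""
    return (counts[word] + 1) / (total + vocab_size)


def cross_entropy(test_tokens, counts, total, vocab_size):
    """Average negative log2 probability per test token, in bits."""
    log_sum = sum(
        math.log2(unigram_prob(w, counts, total, vocab_size))
        for w in test_tokens
    )
    return -log_sum / len(test_tokens)


def perplexity(test_tokens, counts, total, vocab_size):
    """Perplexity is 2 raised to the cross-entropy."""
    return 2 ** cross_entropy(test_tokens, counts, total, vocab_size)


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    train = "the cat sat on the mat the cat slept".split()
    test = "the cat sat on the sofa".split()
    model = train_unigram(train)
    print("cross-entropy (bits):", cross_entropy(test, *model))
    print("perplexity:", perplexity(test, *model))
```

Such a baseline assigns a word the same probability at every position. That is why it fits content words poorly when they occur in clusters: a second occurrence shortly after the first is more likely than the corpus-wide frequency suggests.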