Corpus Factory

Authors

KILGARRIFF Adam, REDDY Siva, POMIKÁLEK Jan

Year of publication 2009
Type Article in Proceedings
MU Faculty or unit

Faculty of Informatics

Citation
Web http://www.kilgarriff.co.uk/Publications/2009-KilgReddyPomikalek-asialex-CorpFactory.doc
Description State-of-the-art lexicography requires corpora, but for many languages no large, general-language corpora are available. Until recently, all but the richest publishing houses could do little but shake their heads in dismay, as corpus-building was long, slow and expensive. With the advent of the Web, however, corpus-building can be highly automated and thereby fast and inexpensive. We have developed a ‘corpus factory’ where we build lexicographic corpora. In this paper we describe the method we use, how it has worked, and how various problems were solved for five languages: Dutch, Hindi, Telugu, Thai and Vietnamese. The corpora we have developed are available for use in the Sketch Engine corpus query tool.