Mining Relevant Text Documents Using Ranking-Based k-NN Algorithms Trained by Only Positive Examples

Hroza, Jiří; Žižka, Jan

Authors	HROZA Jiří; ŽIŽKA Jan
Year of publication	2005
Type	Article in Proceedings
Conference	Znalosti 2005, conference proceedings
MU Faculty or unit	Faculty of Informatics
Field	Informatics
Keywords	ranking; text categorization; k-NN
Description	The problem of mining relevant information from large numbers of unstructured text documents is often handled with machine learning algorithms trained on both positive and negative examples prepared by an expert in a given domain. However, when only positive examples are available, the task requires algorithms adapted to that situation. A modified k-nearest neighbors algorithm, trained using only positive examples, can classify by ranking unlabeled instances according to their similarity to the training examples. This procedure retrieves a significant portion of the unlabeled positive instances with high precision. The main objective is to find a method for mining relevant documents from large volumes (hundreds or thousands) of similar medical text files. Experiments and comparisons on various real data sets, obtained from several Internet resources and represented as bags of words, gave, under specific conditions, quite acceptable results from the precision-recall point of view.
Related projects:	Human-computer interaction, dialog systems and assistive technologies
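
The ranking idea in the description can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenization, raw term counts, cosine similarity, the scoring rule (mean similarity to the k most similar positive examples), the default k=3, and the names bag_of_words, cosine, and rank_unlabeled are all assumptions made here for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Assumed representation: lowercase, whitespace-split term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_unlabeled(positives, unlabeled, k=3):
    """Rank unlabeled documents using positive training examples only.

    Assumed score: mean cosine similarity to the k most similar
    positive examples. Returns (score, document) pairs, best first.
    """
    pos_vecs = [bag_of_words(doc) for doc in positives]
    scored = []
    for doc in unlabeled:
        vec = bag_of_words(doc)
        # Similarities to all positives, highest first; keep the k nearest.
        sims = sorted((cosine(vec, p) for p in pos_vecs), reverse=True)
        top = sims[:k]
        score = sum(top) / len(top) if top else 0.0
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


if __name__ == "__main__":
    # Toy data invented for this example, not taken from the paper.
    positives = [
        "patient treated with insulin for diabetes",
        "diabetes therapy insulin dose",
        "blood glucose insulin patient",
    ]
    unlabeled = [
        "football match result score",
        "insulin dose for diabetes patient",
    ]
    for score, doc in rank_unlabeled(positives, unlabeled):
        print(f"{score:.3f}  {doc}")
```

Taking only the top of the returned ranking, for example by cutting it at a score threshold, trades recall for precision. This matches the description's point that a significant part of the positive instances can be retrieved with high precision.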