Nearest-neighbor Search from Large Datasets using Narrow Sketches

Naoya, Higuchi; Yasunobu, Imamura; Míč, Vladimír; Takeshi, Shinohara; Kouichi, Hirata; Tetsuji, Kuboyama

Authors	NAOYA Higuchi YASUNOBU Imamura MÍČ Vladimír TAKESHI Shinohara KOUICHI Hirata TETSUJI Kuboyama
Year of publication	2022
Type	Article in Proceedings
Conference	Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - ICPRAM
MU Faculty or unit	Faculty of Informatics
web	https://www.scitepress.org/PublicationsDetail.aspx?ID=s5xL4A2YSOs=&t=1
Doi	https://doi.org/10.5220/0010817600003122
Keywords	Narrow Sketch;Nearest-neighbor Search;Large Dataset;Sketch Enumeration;Partially Restored Distance
Description	We consider the nearest-neighbor search on large-scale high-dimensional datasets that cannot fit in the main memory. Sketches are bit strings that compactly express data points. Although it is usually thought that wide sketches are needed for high-precision searches, we use relatively narrow sketches, such as 22-bit or 24-bit sketches, to select a small set of candidates for the search. We use an asymmetric distance between data points and sketches as the criterion for candidate selection, instead of the traditionally used Hamming distance. This distance can be regarded as partially restoring the quantization error. We utilize an efficient one-by-one sketch enumeration in the order of the partially restored distance to realize a fast candidate selection. We use two datasets to demonstrate the effectiveness of the method: YFCC100M-HNfc6, consisting of about 100 million 4,096-dimensional image descriptors, and DEEP1B, consisting of 1 billion 96-dimensional vectors. Using a standard desktop computer, we conducted a nearest-neighbor search for a query on datasets stored on SSD, where vectors are represented by 8-bit integers. The proposed method executes the search in 5.8 seconds for the 400GB dataset YFCC100M and in 0.24 seconds for the 100GB dataset DEEP1B, while maintaining a recall of 90%.
Related projects:	CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence
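
The candidate selection outlined in the description can be illustrated with a minimal Python sketch. Several details here are assumptions that the abstract does not confirm: sketches are built by ball partitioning (bit i is 1 when a point lies inside a ball with center `pivots[i]` and radius `radii[i]`), the asymmetric distance from a query to a sketch is the sum of the query's distances to the ball boundaries of the bits that differ from the query's own sketch, and the helper names `make_sketch`, `enumerate_sketches` and `search` are hypothetical. The one-by-one enumeration uses a generic heap-based subset enumeration in order of increasing sum, which may differ from the algorithm in the paper.

```python
import heapq
import math


def make_sketch(x, pivots, radii):
    """Assumed ball partitioning: bit i is 1 when x lies inside ball i."""
    bits = 0
    for i, (p, r) in enumerate(zip(pivots, radii)):
        if math.dist(x, p) <= r:
            bits |= 1 << i
    return bits


def enumerate_sketches(q, pivots, radii):
    """Yield (cost, sketch) for every sketch, in non-decreasing cost.

    The cost of a sketch is the sum, over the bits that differ from the
    query's own sketch, of the query's distance to that ball boundary
    (an assumed form of the asymmetric distance).
    """
    base = make_sketch(q, pivots, radii)
    # Flip costs sorted in ascending order, each paired with its bit position.
    costs = sorted(
        (abs(math.dist(q, p) - r), i)
        for i, (p, r) in enumerate(zip(pivots, radii))
    )
    yield 0.0, base
    if not costs:
        return
    # Heap entry: (total cost, rank of the largest flipped bit, flip mask).
    heap = [(costs[0][0], 0, 1 << costs[0][1])]
    while heap:
        total, k, mask = heapq.heappop(heap)
        yield total, base ^ mask
        if k + 1 < len(costs):
            w_k, bit_k = costs[k]
            w_next, bit_next = costs[k + 1]
            # Child 1: additionally flip the next-cheapest bit.
            heapq.heappush(heap, (total + w_next, k + 1, mask | (1 << bit_next)))
            # Child 2: flip the next-cheapest bit instead of the current last one.
            swapped = (mask ^ (1 << bit_k)) | (1 << bit_next)
            heapq.heappush(heap, (total - w_k + w_next, k + 1, swapped))


def search(q, data, buckets, pivots, radii, n_candidates):
    """Collect candidates bucket by bucket, then rerank by true distance."""
    candidates = []
    for _, sketch in enumerate_sketches(q, pivots, radii):
        candidates.extend(buckets.get(sketch, ()))
        if len(candidates) >= n_candidates:
            break
    return min(candidates, key=lambda i: math.dist(q, data[i]))
```

Here `buckets` is a dictionary mapping each sketch to the list of indices of the data points that have it. Because a narrow sketch has few bits, every possible sketch can serve as a bucket key, and only the selected candidates need to be read from storage and compared with the query by the true distance.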