Laura Heinrich-Litan:
Exact L_infinity Nearest Neighbor Search in High Dimensions
Abstract
In this thesis we consider the nearest-neighbor problem, which is defined as follows: given a fixed set P of n data points in some metric space X, build a data structure such that for each given query point q, a data point from P closest to q can be found efficiently. The underlying metric space is usually the d-dimensional real space R^d together with one of the L_p-metrics, 1 ≤ p ≤ ∞. In many applications, the dimension d of the search space is quite high and can reach several hundreds or even several thousands. Running times and storage requirements exponential in d are therefore prohibitive in these cases. Because of their exponential dependence on the dimension, all known techniques for the exact nearest-neighbor problem are in fact not competitive in high dimensions with the brute-force method, which simply determines the distance of q to each point in P and selects the minimum.
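As a point of reference, the brute-force method described above can be sketched as follows for the L_∞-distance. This is a minimal illustration, not code from the thesis; the names `linf_distance` and `brute_force_nn` are hypothetical.

```python
from typing import Sequence

Point = Sequence[float]


def linf_distance(a: Point, b: Point) -> float:
    """L_infinity distance: the largest coordinate-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def brute_force_nn(points: Sequence[Point], q: Point) -> int:
    """Return the index of a point in `points` closest to q.

    Every one of the n points is compared with q in all d coordinates,
    so a query costs O(n * d) time and needs no preprocessing.
    """
    if not points:
        raise ValueError("points must be non-empty")
    return min(range(len(points)), key=lambda i: linf_distance(points[i], q))
```

For example, `brute_force_nn([(0.0, 0.0), (1.0, 5.0), (2.0, 2.0)], (1.5, 1.0))` selects the third point, whose L_∞-distance to the query is 1.0.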
This thesis presents algorithms for solving the high-dimensional exact nearest-neighbor problem with respect to the L_∞-distance. We analyze the average-case situation in which the data points are chosen independently at random under the uniform distribution. The algorithms improve considerably on the brute-force method and are simple and easy to implement.
In Chapter 2 we consider query algorithms that need no preprocessing and require storage only for the point set P. Their average running time is O(n + nd/ln n).
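To give a feel for how a query without preprocessing can avoid inspecting all d coordinates of every point, here is a sketch of a generic early-abandoning scan. It is an assumption made for illustration: it is not claimed to be the algorithm of Chapter 2, and it is not claimed to achieve the running time stated above. The name `early_abandon_nn` is hypothetical.

```python
from typing import Sequence, Tuple

Point = Sequence[float]


def early_abandon_nn(points: Sequence[Point], q: Point) -> Tuple[int, float]:
    """Return (index, distance) of an L_infinity nearest neighbor of q.

    A candidate is dropped as soon as a single coordinate difference
    reaches the best distance found so far, because its L_infinity
    distance can then no longer be smaller than the current best.
    """
    if not points:
        raise ValueError("points must be non-empty")
    best_index = -1
    best_dist = float("inf")
    for i, p in enumerate(points):
        dist = 0.0
        for x, y in zip(p, q):
            diff = abs(x - y)
            if diff > dist:
                dist = diff
                if dist >= best_dist:
                    break  # this candidate cannot beat the current best
        else:
            # All coordinates were scanned without abandoning the candidate.
            best_index, best_dist = i, dist
    return best_index, best_dist
```

The result is the same as that of a full scan; only the number of coordinates examined per point changes, which is where an average-case saving would come from.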
In Chapter 3 we present two strategies which speed up the search by using preprocessing. The query algorithm introduced in Section 3.1.2 requires linear storage and has an expected running time of O(n ln(d/ln n + 1) + n). The data structure developed in Section 3.2 is based on a preprocessed partition of the data set into sequences, which are monotone with respect to some of the dimensions. The query algorithm has an expected running time of O(√d · n^(1-1/√d) · ln n) for dimensions d < (ln n / ln ln n)^2.
Chapter 4 presents several generalizations, in particular to the important problem of finding the k nearest neighbors to a query point. We generalize the analysis of the considered algorithms to other "well-behaved" probability distributions. Furthermore, we develop extensions of the algorithms which work efficiently in the external-memory model of computation.
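For orientation, the k-nearest-neighbor variant of the problem can be stated in code as a brute-force baseline. This sketch only fixes what is being computed; it is not one of the thesis's algorithms, and the name `k_nearest_linf` is hypothetical.

```python
import heapq
from typing import List, Sequence

Point = Sequence[float]


def k_nearest_linf(points: Sequence[Point], q: Point, k: int) -> List[int]:
    """Return the indices of the k points closest to q in L_infinity distance.

    Indices are ordered from nearest to farthest. If k exceeds the number
    of points, all indices are returned.
    """
    def dist(i: int) -> float:
        return max(abs(x - y) for x, y in zip(points[i], q))

    return heapq.nsmallest(k, range(len(points)), key=dist)
```

With k = 1 this reduces to the ordinary nearest-neighbor query.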
In Chapter 5 we present a method which provides tradeoffs between the space complexity of the data structure and the time complexity of the query algorithm.