Title: Fast Nearest Neighbor Methods
Author: Thomas Breuel
Institution: UniKL
import cv2
Fast Nearest Neighbor Methods:
practical:
approaches:
Nearest Neighbor vs Range Query:
Database: a set of $n$ points in $d$-dimensional feature space.
Range Query: given a query point and a radius $\epsilon$, return all database points within distance $\epsilon$ of the query.
Nearest Neighbor Query: given a query point, return the closest database point (or the $k$ closest points).
Dimensionality Problem:
As we saw at the beginning, when the intrinsic dimension of the data is high, all pairwise distances become very similar to each other.
Finding the exact nearest neighbor then effectively reduces to linear search for all known algorithms.
Solutions to Dimensionality Problem:
Purpose of Nearest Neighbor:
Almost always, the purpose of $k$-nearest neighbor searches is classification.
The classification problem may be implicit or dynamic, but it still relies on the relationship between nonparametric density estimation and nearest neighbors.
Linear Search:
possible speedups:
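As a minimal sketch (not from the original slides), vectorized brute-force search is the baseline; computing all distances in a single array operation and using a partial sort for the top-$k$ are two of the simple speedups available:
import numpy as np

def linear_search(data, q, k=1):
    # brute-force k-nearest neighbors: one vectorized pass over all points
    dists = np.linalg.norm(data - q, axis=1)
    idx = np.argpartition(dists, k)[:k]      # partial sort: the k smallest, unordered
    idx = idx[np.argsort(dists[idx])]        # order those k by distance
    return idx, dists[idx]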
Range Query with Infinity Metric:
In the $\infty$-metric, the distance between two points is the maximum absolute difference in any single coordinate: $d_\infty(x,y)=\max_i |x_i-y_i|$.
For a range query, we require that every coordinate of a match is within the given $\epsilon$ of the query point.
We can execute this query coordinate by coordinate, even in a relational database:
select * from t where abs(t.x1-1.7)<0.1 and abs(t.x2-3.9)<0.1 and abs(t.x3-0.2)<1
Internally, this can be executed using sorted (inverted) indexes on each column and intersecting the resulting candidate sets.
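A sketch of the same coordinate-by-coordinate filtering outside a database (the array, query point, and tolerances below are illustrative assumptions mirroring the SQL example):
import numpy as np

data = np.random.randn(10000, 3)              # stand-in for table t with columns x1, x2, x3
q = np.array([1.7, 3.9, 0.2])                 # query point from the SQL example above
eps = np.array([0.1, 0.1, 1.0])               # per-coordinate tolerances

# infinity-metric range query: every coordinate must lie within its tolerance
mask = np.all(np.abs(data - q) < eps, axis=1)
matches = data[mask]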
kD-Tree:
construction:
retrieval:
kD-tree complexity:
In practice, kD-trees do not work well for exact search in high dimensions: the pruning test almost never eliminates branches, so retrieval degrades toward linear search.
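Since construction and retrieval are not spelled out above, here is a minimal sketch of a kD-tree (median split on cycling coordinates, branch-and-bound retrieval); the function names and structure are illustrative assumptions:
import numpy as np

def build_kdtree(points, depth=0):
    # construction: split at the median along one coordinate, cycling through coordinates
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nn_search(node, q, best=None):
    # retrieval: descend toward the query; backtrack into the far subtree only if
    # the splitting plane is closer than the best distance found so far
    if node is None:
        return best
    dist = np.linalg.norm(q - node["point"])
    if best is None or dist < best[0]:
        best = (dist, node["point"])
    diff = q[node["axis"]] - node["point"][node["axis"]]
    near, far = ("left", "right") if diff < 0 else ("right", "left")
    best = nn_search(node[near], q, best)
    if abs(diff) < best[0]:
        best = nn_search(node[far], q, best)
    return best

tree = build_kdtree(np.random.randn(1000, 3))
print(nn_search(tree, np.zeros(3)))
In high dimensions the pruning test almost never fires, which is why exact kD-tree retrieval degrades toward linear search.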
Locality Sensitive Hashing
Assume a family of hash functions $F$, a distance $d$, a distance threshold $R>0$, and an approximation factor $c>1$.
For every $h\in F$, and with $p_1 > p_2$:
if $d(p,q) \le R$, then $\Pr[h(p)=h(q)] \ge p_1$;
if $d(p,q) \ge cR$, then $\Pr[h(p)=h(q)] \le p_2$.
We call this an $(R,cR,p_1,p_2)$-sensitive family of hash functions.
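As a concrete example (not from the original slides): for bit vectors of length $n$ under Hamming distance, sampling a single random coordinate, $h_i(x)=x_i$, gives $\Pr[h(p)=h(q)] = 1 - d(p,q)/n$, so this family is $(R, cR, 1-R/n, 1-cR/n)$-sensitive.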
Amplification:
AND-construction:
Given a $(d_1,d_2,p_1,p_2)$-sensitive family of hash functions, choose $k$ random hash functions $h_1,...,h_k$ and define a new hash function $h=(h_1,...,h_k)$ under which two points collide only if they collide under all $k$ of them. This gives rise to a $(d_1,d_2,p_1^k,p_2^k)$-sensitive family.
OR-construction:
Given a $(d_1,d_2,p_1,p_2)$-sensitive family of hash functions, choose $k$ random hash functions $h_1,...,h_k$ and declare two points to collide if they collide under at least one of them. This gives rise to a $(d_1,d_2,1-(1-p_1)^k,1-(1-p_2)^k)$-sensitive family.
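A quick numeric sketch of how an AND-construction followed by an OR-construction separates near from far pairs (the values $p_1=0.9$, $p_2=0.3$, $k=4$, $L=8$ are illustrative assumptions, not from the slides):
p1, p2 = 0.9, 0.3            # collision probabilities for near and far pairs
k, L = 4, 8                  # AND over k hashes, then OR over L such composite hashes

def amplify(p):
    # AND-construction gives p**k; OR-construction over L copies gives 1-(1-p**k)**L
    return 1 - (1 - p**k)**L

print(amplify(p1))           # ~0.9998: near pairs almost always collide somewhere
print(amplify(p2))           # ~0.063:  far pairs rarely collide anywhere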
Approximate nearest neighbor search using LSH:
Assume a locality sensitive family $F$ of hash functions, and parameters $k$ and $L$.
During preprocessing:
During query:
Complexity:
Query time is sublinear, roughly $O(n^\rho)$ with $\rho=\log(1/p_1)/\log(1/p_2) < 1$ (assuming constant-time hash functions); the search finds approximate, not exact, nearest neighbors.
Which random hash functions?
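Common choices (not specified in the original slides) include bit sampling for Hamming distance, min-hash for Jaccard similarity, signed random projections (random hyperplanes) for cosine distance, and quantized random projections for Euclidean distance. The following sketch instantiates the preprocessing and query steps above with signed random projections; the class name, parameters, and choice of hash family are all assumptions:
import numpy as np

class LSHIndex:
    # L hash tables, each keyed by k concatenated sign bits of random projections
    def __init__(self, data, k=8, L=16, seed=0):
        rng = np.random.default_rng(seed)
        self.data = data
        self.projections = [rng.standard_normal((data.shape[1], k)) for _ in range(L)]
        self.tables = []
        for P in self.projections:
            table = {}
            for i, key in enumerate(self._hash(data, P)):
                table.setdefault(key, []).append(i)   # preprocessing: bucket every point
            self.tables.append(table)

    def _hash(self, x, P):
        # k sign bits per point, packed into a tuple usable as a dictionary key
        return [tuple(row) for row in (x @ P) > 0]

    def query(self, q, n=1):
        # query: union the buckets the query falls into, then do an exact scan over them
        candidates = set()
        for P, table in zip(self.projections, self.tables):
            candidates.update(table.get(self._hash(q[None, :], P)[0], []))
        if not candidates:
            return np.array([], int), np.array([])
        cand = np.fromiter(candidates, int)
        dists = np.linalg.norm(self.data[cand] - q, axis=1)
        order = np.argsort(dists)[:n]
        return cand[order], dists[order]

data = np.random.randn(1000, 10).astype("f")
index = LSHIndex(data)
print(index.query(data[0], n=5))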
Hierarchical k-Means
Construction:
Lookup:
Notes:
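Since construction and lookup are not detailed above, here is a minimal sketch of one common variant (recursive k-means partitioning with greedy descent at query time); the branching factor, leaf size, and all names are illustrative assumptions:
import numpy as np

def build_kmeans_tree(points, ids, branching=8, leaf_size=32, iters=5, rng=None):
    # construction: recursively partition the data with k-means; leaves store point indices
    rng = np.random.default_rng(0) if rng is None else rng
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    centers = points[rng.choice(ids, branching, replace=False)].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(points[ids][:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(branching):
            if (assign == j).any():
                centers[j] = points[ids[assign == j]].mean(axis=0)
    keep = [j for j in range(branching) if (assign == j).any()]   # drop empty clusters
    if len(keep) == 1:                                            # degenerate split: stop here
        return {"leaf": ids}
    return {"centers": centers[keep],
            "children": [build_kmeans_tree(points, ids[assign == j], branching, leaf_size, iters, rng)
                         for j in keep]}

def greedy_lookup(node, points, q):
    # lookup: descend to the nearest center at each level, then scan the leaf (approximate)
    while "leaf" not in node:
        j = np.linalg.norm(node["centers"] - q, axis=1).argmin()
        node = node["children"][j]
    leaf = node["leaf"]
    d = np.linalg.norm(points[leaf] - q, axis=1)
    return leaf[d.argmin()], d.min()

pts = np.random.randn(2000, 10)
tree = build_kmeans_tree(pts, np.arange(len(pts)))
print(greedy_lookup(tree, pts, pts[0]))
Real implementations (e.g., FLANN's k-means tree index) are built along these lines but additionally backtrack into unexplored branches via a priority queue rather than stopping after a single greedy descent.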
from numpy import array
from numpy.random import randn

data = array(randn(1000, 10), 'f')                         # 1000 random 10-d points, float32
flann = cv2.flann_Index(data, dict(algorithm=1, trees=4))  # FLANN index (algorithm=1: randomized kd-trees)
index, dist = flann.knnSearch(data, 5, params={})          # 5 nearest neighbors of every point
print(index[:10])
[[  0 638 899 940 525]
 [  1 944 749 230 497]
 [  2 430 409 933 928]
 [  3  79 248 788 951]
 [  4 103  58 507 480]
 [  5 829 696 150 955]
 [  6 588 981 990  28]
 [  7 361 780 722 107]
 [  8 722 707 666 685]
 [  9 287 163 125 272]]
print(dist[:10])   # first column is 0: each point is its own nearest neighbor
[[ 0.          2.46911836  2.62438083  3.03617215  3.18475151]
 [ 0.          4.27013683  4.83362865  6.21951389  6.43545055]
 [ 0.          4.56065941  4.59481716  8.74168491  9.18355083]
 [ 0.          5.56775475  6.28863955  6.48139763  6.63649178]
 [ 0.          4.99117756  5.16120672  5.90600395  6.00450516]
 [ 0.          4.00080204  5.02649212  5.42283583  5.60694599]
 [ 0.          4.98363495  5.5444622   5.70466614  6.22260761]
 [ 0.          9.51819229 11.93401909 12.31844997 12.35668182]
 [ 0.          4.21605206  4.53588057  4.59677792  4.67584229]
 [ 0.          4.1922102   4.80881023  4.81959963  5.81781578]]