
yaImpute (version 1.0-15)

ann: Approximate nearest neighbor search routines

Description

Given a set of reference data points $S$, ann constructs a kd-tree or box-decomposition tree (bd-tree) for efficient $k$-nearest neighbor searches.

Usage

ann(ref, target, k=1, eps=0.0, tree.type="kd",
    search.type="standard", bucket.size=1, split.rule="sl_midpt",
    shrink.rule="simple", verbose=TRUE, ...)

Arguments

ref
an $n \times d$ matrix containing the reference point set $S$. Each row in ref corresponds to a point in $d$-dimensional space.
target
an $m \times d$ matrix containing the points for which $k$ nearest neighbor reference points are sought.
k
defines the number of nearest neighbors to find. The default is $k$=1.
eps
the error bound for approximate searching: the $i^{th}$ nearest neighbor returned is at most a factor of (1+eps) farther from the target than the true $i^{th}$ nearest neighbor, where eps$\ge 0$. Setting eps = 0.0 (the default) specifies an exact search; larger values trade accuracy for search speed.
tree.type
the search tree data structure, specified with the quoted key words kd (kd-tree) or bd (bd-tree). A brute force search can be specified with the quoted key word brute; if brute is specified, then all subsequent arguments are ignored. The default is kd.
search.type
either standard or priority search in the kd-tree or bd-tree, specified by quoted key words standard and priority, respectively. The default is the standard search.
bucket.size
the maximum number of reference points in the leaf nodes. The default is 1.
split.rule
the strategy for recursively splitting nodes that hold more points than the bucket size. The splitting rule applies to both the kd-tree and bd-tree. Options are the quoted key words:
  • standard: the standard kd-tree splitting rule
  • midpt: the midpoint splitting rule
  • fair: the fair splitting rule
  • sl_midpt: the sliding midpoint splitting rule (the default)
  • sl_fair: the sliding fair splitting rule
  • suggest: the splitting rule suggested by the ANN library authors
shrink.rule
the strategy for shrinking cells; it applies only to the bd-tree and is ignored for the kd-tree. Options are the quoted key words:
  • none: no shrinking, which yields a tree equivalent to the kd-tree
  • simple: simple shrinking (the default)
  • centroid: centroid-based shrinking
verbose
if TRUE, a summary of the search is printed to the screen. The default is TRUE.
...
currently no additional arguments.

Value

An object of class ann, which is a list with some or all of the following tags:

  • knnIndexDist: an $m \times 2k$ matrix. Each row corresponds to a point in target; columns 1:$k$ hold the ref matrix row indices of the nearest neighbors, such that column 1 holds the ref row index of the first nearest neighbor and column $k$ that of the $k^{th}$ nearest neighbor. Columns $k+1$:2$k$ hold the Euclidean distances from the target to each of the $k$ nearest neighbors indexed in columns 1:$k$.
  • searchTime: total search time, not including data structure construction, etc.
  • k, eps, tree.type, search.type, bucket.size, split.rule, shrink.rule: as defined in the ann function call.
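The $m \times 2k$ knnIndexDist layout can be illustrated with a small base-R brute force search that builds the same arrangement by hand. This is a sketch only: the ref, target, and k objects below are made up for illustration, and ann itself returns this matrix directly.

```r
## Sketch: build an m x 2k matrix in the layout ann uses for knnIndexDist.
## Columns 1:k are ref row indices of the nearest neighbors; columns
## (k+1):(2k) are the matching Euclidean distances, sorted ascending.
set.seed(1)
ref    <- matrix(rnorm(20), ncol = 2)   # 10 illustrative reference points in 2-d
target <- matrix(rnorm(6),  ncol = 2)   # 3 illustrative target points
k <- 2

knnIndexDist <- t(apply(target, 1, function(p) {
  d   <- sqrt(colSums((t(ref) - p)^2))  # distance from p to every reference point
  idx <- order(d)[1:k]                  # row indices of the k nearest references
  c(idx, d[idx])                        # one row: k indices, then k distances
}))
```

Column 1 and column k+1 together describe the first nearest neighbor; column k and column 2k describe the $k^{th}$.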

Details

The ann function calls portions of the Approximate Nearest Neighbor Library, written by David M. Mount. All of the ann function arguments are detailed in the ANN Programming Manual found at http://www.cs.umd.edu/~mount/ANN.

Examples

## Make a couple of bivariate normal classes
rmvn <- function(n, mu=0, V = matrix(1))
{
  p <- length(mu)
  if(any(is.na(match(dim(V),p))))
    stop("Dimension problem!")
  D <- chol(V)
  matrix(rnorm(n*p), ncol=p) %*% D + rep(mu,rep(n,p))
}

m <- 10000

## Class 1.
mu.1 <- c(20, 40)
V.1 <- matrix(c(-5,1,0,5),2,2); V.1 <- V.1%*%t(V.1)
c.1 <- cbind(rmvn(m, mu.1, V.1), rep(1, m))

## Class 2.
mu.2 <- c(30, 60)
V.2 <- matrix(c(4,2,0,2),2,2); V.2 <- V.2%*%t(V.2)
c.2 <- cbind(rmvn(m, mu.2, V.2), rep(2, m))

## Class 3.
mu.3 <- c(15, 60)
V.3 <- matrix(c(5,5,0,5),2,2); V.3 <- V.3%*%t(V.3)
c.3 <- cbind(rmvn(m, mu.3, V.3), rep(3, m))

c.all <- rbind(c.1, c.2, c.3)
max.x <- max(c.all[,1]); min.x <- min(c.all[,1])
max.y <- max(c.all[,2]); min.y <- min(c.all[,2])

## Check them out.
plot(c.1[,1], c.1[,2], xlim=c(min.x, max.x), ylim=c(min.y, max.y),
     pch=19, cex=0.5,
     col="blue", xlab="Variable 1", ylab="Variable 2")
points(c.2[,1], c.2[,2], pch=19, cex=0.5, col="green")
points(c.3[,1], c.3[,2], pch=19, cex=0.5, col="red")


## Take a reference sample.
n <- 2000
ref <- c.all[sample(1:nrow(c.all), n),]

## Compare search times
k <- 10
## Do a simple brute force search.
brute <- ann(ref=ref[,1:2], target=c.all[,1:2],
             tree.type="brute", k=k, verbose=FALSE)
print(brute$searchTime)

## Do an exact kd-tree search.
kd.exact <- ann(ref=ref[,1:2], target=c.all[,1:2],
                tree.type="kd", k=k, verbose=FALSE)
print(kd.exact$searchTime)

## Do an approximate kd-tree search.
kd.approx <- ann(ref=ref[,1:2], target=c.all[,1:2],
                 tree.type="kd", k=k, eps=100, verbose=FALSE)
print(kd.approx$searchTime)

## Takes too long to calculate for this many targets.
## Compare overall accuracy of the exact vs. approximate search
##knn.mode <- function(knn.indx, ref){
##  x <- ref[knn.indx,]
##  as.numeric(names(sort(as.matrix(table(x))[,1],
##                        decreasing=TRUE))[1])
##}
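## One quick accuracy check that does run in reasonable time: the fraction
## of first nearest neighbor indices on which the exact and approximate
## searches agree (a sketch using the kd.exact and kd.approx objects
## created above; with a large eps, expect agreement well below 1).
print(mean(kd.exact$knnIndexDist[,1] == kd.approx$knnIndexDist[,1]))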
