yaImpute (version 1.0-32)

ann: Approximate nearest neighbor search routines

Description

Given a set of reference data points \(S\), ann constructs a kd-tree or box-decomposition tree (bd-tree) for efficient \(k\)-nearest neighbor searches.

Usage

ann(ref, target, k=1, eps=0.0, tree.type="kd",
    search.type="standard", bucket.size=1, split.rule="sl_midpt",
    shrink.rule="simple", verbose=TRUE, ...)

Value

An object of class ann, which is a list with some or all of the following tags:

knnIndexDist

an \(m \times 2k\) matrix. Each row corresponds to a target point in target. Columns 1:\(k\) hold the ref matrix row indices of the nearest neighbors, ordered so that column 1 holds the index of the first nearest neighbor and column \(k\) the index of the \(k^{th}\) nearest neighbor. Columns \(k+1\):\(2k\) hold the Euclidean distances from the target to each of the \(k\) nearest neighbors indexed in columns 1:\(k\).
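
For example, the index and distance blocks can be separated by column position. A minimal sketch with made-up data (ref.x, target.x, and out are illustrative names, not package objects):

library(yaImpute)

ref.x    <- matrix(rnorm(200), ncol=2)  ## 100 reference points in 2-d
target.x <- matrix(rnorm(20),  ncol=2)  ## 10 target points
k <- 3

out <- ann(ref=ref.x, target=target.x, k=k, verbose=FALSE)
nn.index <- out$knnIndexDist[, 1:k]          ## ref row indices
nn.dist  <- out$knnIndexDist[, (k+1):(2*k)]  ## distances to those rows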

searchTime

total search time, excluding data structure (tree) construction and other setup.

k

as defined in the ann function call.

eps

as defined in the ann function call.

tree.type

as defined in the ann function call.

search.type

as defined in the ann function call.

bucket.size

as defined in the ann function call.

split.rule

as defined in the ann function call.

shrink.rule

as defined in the ann function call.

Arguments

ref

an \(n \times d\) matrix containing the reference point set \(S\). Each row in ref corresponds to a point in \(d\)-dimensional space.

target

an \(m \times d\) matrix containing the points for which \(k\) nearest neighbor reference points are sought.

k

defines the number of nearest neighbors to find. The default is \(k\)=1.

eps

the error bound for an approximate search: the reported \(i^{th}\) nearest neighbor is at most a factor of (1+eps) farther from the target than the true \(i^{th}\) nearest neighbor, where eps \(\ge 0\). The bound applies to the true (not squared) distances. The default, eps=0, requests an exact search.
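
For example, with eps=0.5 each reported distance should be within a factor of 1.5 of the true distance. A minimal sketch with made-up data (yaImpute loaded as above):

set.seed(42)
ref.x    <- matrix(runif(2000), ncol=2)
target.x <- matrix(runif(200),  ncol=2)

exact  <- ann(ref=ref.x, target=target.x, k=1, eps=0,   verbose=FALSE)
approx <- ann(ref=ref.x, target=target.x, k=1, eps=0.5, verbose=FALSE)

## with k=1, column 1 holds the neighbor index and column 2 the
## distance; each ratio should be at most 1 + eps = 1.5
summary(approx$knnIndexDist[,2] / exact$knnIndexDist[,2])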

tree.type

the data structure to build: a kd-tree or bd-tree, specified with the quoted key words kd and bd, respectively. A brute force search can be specified with the quoted key word brute; if brute is specified, all subsequent arguments are ignored. The default is the kd-tree.

search.type

either standard or priority search in the kd-tree or bd-tree, specified by quoted key words standard and priority, respectively. The default is the standard search.
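
A sketch comparing the two strategies on the same data (illustrative; timings are machine dependent, and for an exact search both strategies return the same neighbors):

X <- matrix(rnorm(20000), ncol=2)

std <- ann(ref=X, target=X, k=5, search.type="standard", verbose=FALSE)
pri <- ann(ref=X, target=X, k=5, search.type="priority", verbose=FALSE)

c(standard=std$searchTime, priority=pri$searchTime)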

bucket.size

the maximum number of reference points in the leaf nodes. The default is 1.
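
For example, raising the bucket size yields a shallower tree; an exact search finds the same neighbors either way (illustrative sketch):

X <- matrix(rnorm(2000), ncol=2)

b1  <- ann(ref=X, target=X, k=1, bucket.size=1,  verbose=FALSE)
b10 <- ann(ref=X, target=X, k=1, bucket.size=10, verbose=FALSE)

## should be TRUE for exact searches
identical(b1$knnIndexDist[,1], b10$knnIndexDist[,1])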

split.rule

is the strategy for the recursive splitting of those nodes with more points than the bucket size. The splitting rule applies to both the kd-tree and bd-tree. Splitting rule options are the quoted key words:

  • standard - standard kd-tree

  • midpt - midpoint

  • fair - fair-split

  • sl_midpt - sliding-midpoint (default)

  • sl_fair - sliding fair-split

See the supporting documentation referenced in Details for a thorough description and discussion of these splitting rules.
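
For example, to request the standard kd-tree rule in place of the sliding-midpoint default (an illustrative sketch; X is made-up data and yaImpute is assumed loaded):

X <- matrix(rnorm(2000), ncol=2)

out <- ann(ref=X, target=X, k=2, split.rule="standard", verbose=FALSE)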

shrink.rule

applies only to the bd-tree and is an additional strategy (beyond the splitting rule) for the recursive partitioning of nodes. This argument is ignored if tree.type is specified as kd. Shrinking rule options are quoted key words:

  • none - equivalent to the kd-tree

  • simple - simple shrink (default)

  • centroid - centroid shrink

See the supporting documentation referenced in Details for a thorough description and discussion of these shrinking rules.
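
A usage sketch for the bd-tree with the centroid shrinking rule (illustrative data; yaImpute assumed loaded):

X <- matrix(rnorm(2000), ncol=2)

bd <- ann(ref=X, target=X, k=2, tree.type="bd",
          shrink.rule="centroid", verbose=FALSE)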

verbose

if TRUE, search progress is printed to the screen.

...

currently no additional arguments.

Author

Andrew O. Finley finleya@msu.edu

Details

The ann function calls portions of the Approximate Nearest Neighbor Library, written by David M. Mount. All of the ann function arguments are detailed in the ANN Programming Manual found at http://www.cs.umd.edu/~mount/ANN.

Examples


library(yaImpute)

## Make a couple of bivariate normal classes.
rmvn <- function(n, mu=0, V = matrix(1))
{
  p <- length(mu)
  if(any(is.na(match(dim(V), p))))
    stop("Dimension problem!")
  D <- chol(V)                     ## upper Cholesky factor of V
  ## scale iid standard normals by D, then shift by mu
  matrix(rnorm(n*p), ncol=p) %*% D + rep(mu, rep(n, p))
}

m <- 10000

## Class 1.
mu.1 <- c(20, 40)
V.1 <- matrix(c(-5,1,0,5),2,2); V.1 <- V.1%*%t(V.1)
c.1 <- cbind(rmvn(m, mu.1, V.1), rep(1, m))

## Class 2.
mu.2 <- c(30, 60)
V.2 <- matrix(c(4,2,0,2),2,2); V.2 <- V.2%*%t(V.2)
c.2 <- cbind(rmvn(m, mu.2, V.2), rep(2, m))

## Class 3.
mu.3 <- c(15, 60)
V.3 <- matrix(c(5,5,0,5),2,2); V.3 <- V.3%*%t(V.3)
c.3 <- cbind(rmvn(m, mu.3, V.3), rep(3, m))

c.all <- rbind(c.1, c.2, c.3)
max.x <- max(c.all[,1]); min.x <- min(c.all[,1])
max.y <- max(c.all[,2]); min.y <- min(c.all[,2])

## Check them out.
plot(c.1[,1], c.1[,2], xlim=c(min.x, max.x), ylim=c(min.y, max.y),
     pch=19, cex=0.5,
     col="blue", xlab="Variable 1", ylab="Variable 2")
points(c.2[,1], c.2[,2], pch=19, cex=0.5, col="green")
points(c.3[,1], c.3[,2], pch=19, cex=0.5, col="red")


## Take a reference sample.
n <- 2000
ref <- c.all[sample(1:nrow(c.all), n),]

## Compare search times
k <- 10
## Do a simple brute force search.
brute <- ann(ref=ref[,1:2], target=c.all[,1:2],
             tree.type="brute", k=k, verbose=FALSE)
print(brute$searchTime)

## Do an exact kd-tree search.
kd.exact <- ann(ref=ref[,1:2], target=c.all[,1:2],
                tree.type="kd", k=k, verbose=FALSE)
print(kd.exact$searchTime)

## Do an approximate kd-tree search.
kd.approx <- ann(ref=ref[,1:2], target=c.all[,1:2],
                 tree.type="kd", k=k, eps=100, verbose=FALSE)
print(kd.approx$searchTime)

## Compare overall accuracy of the exact vs. approximate search.
## The mode-based summary below is commented out because it takes
## too long to calculate for this many targets.
##knn.mode <- function(knn.indx, ref){
##  x <- ref[knn.indx,]
##  as.numeric(names(sort(as.matrix(table(x))[,1],
##                        decreasing=TRUE))[1])
##}
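
## As a quick accuracy check, compare how often the exact and
## approximate searches agree on the first nearest neighbor.
mean(kd.exact$knnIndexDist[,1] == kd.approx$knnIndexDist[,1])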
