shipunov (version 1.5)

DNN: Distance matrix-based kNN classification

Description

DNN uses a pre-computed distance matrix to replace missing values in class labels.

Usage

DNN(dst, cl, k, d, details=FALSE, self=FALSE)
Dnn(trn, tst, classes, FUN=function(.x) dist(.x), ...)

Arguments

dst

Distance matrix (object of class 'dist').

cl

Factor of class labels; it should contain NAs to designate the testing sub-group.

k

How many neighbors to select; odd numbers are preferable. If specified, do not use 'd'.

d

Distance to consider for the neighborhood, as a fraction of the maximal distance. If specified, do not use 'k'.

details

If TRUE, the function returns the voting matrix. Default is FALSE.

self

Allow self-training? Default is FALSE.

trn

Data to train from, with the classes variable excluded.

tst

Data with unknown classes.

classes

Classes variable for training data.

FUN

Function to calculate distances; by default, dist() (i.e., Euclidean distance).

...

Additional arguments passed from Dnn() to DNN(); note that either 'k' or 'd' must be specified.

Value

Character vector with predicted class labels, or a matrix if 'details=TRUE'.

Details

If classic kNN is a lazy classifier, DNN is super-lazy: it does not even calculate the distance matrix itself. Instead, you supply it with a distance matrix (object of class 'dist') pre-computed with _any_ possible tool. This lifts many restrictions. For example, an arbitrary distance can be used (like the Gower distance, which allows any type of variable). This is also much faster than typical kNN.
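Any tool that produces a 'dist' object (or an object inheriting from it) will therefore do. A minimal sketch of two such alternatives (cluster::daisy() is an assumption here; the Examples below use Gower.dist() from this package instead):

iris.manh <- dist(iris[, -5], method="manhattan") ## Manhattan instead of Euclidean
library(cluster)
iris.gowd <- daisy(iris[, -5], metric="gower") ## inherits from 'dist'; allows mixed variable types
## either object can then be passed to DNN() as 'dst'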

In addition to neighbor-based kNN classification, DNN implements _neighborhood_ classification, where all neighbors within the selected distance are used for voting.
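To illustrate the idea (this is a sketch, not DNN()'s actual code), neighborhood voting for a single unlabeled point could be written as:

m <- as.matrix(dist(iris[, -5])) ## all pairwise distances
cl <- iris$Species
cl[1] <- NA ## pretend point 1 is unknown
nbrs <- which(m[1, ] <= 0.05*max(m) & !is.na(cl)) ## labeled points within 5% of the maximal distance
names(which.max(table(cl[nbrs]))) ## majority vote; DNN() itself breaks ties at random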

As usual in kNN, ties are broken at random. DNN also handles the situation when no neighbors fall within the given distance (and returns NA), and also when all points are relevant neighbors, as with a too-large 'd' (it likewise returns NA).

By default, DNN() returns the missing part of the class labels, completely or partially filled with new (predicted) class labels. If 'cl' has no NAs and self=FALSE (the default), DNN() returns it back with a warning. This allows for combined and stepwise extensions (see examples). If 'details=TRUE', DNN() returns a matrix where each column represents the table used for voting. If self=TRUE, DNN() can be used to calculate a class proximity surrogate.

Dnn() is based on DNN() but has a more class::knn()-like interface (see examples).
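Conceptually, Dnn() wraps DNN() along the lines of the following sketch (the real implementation may differ in details):

Dnn.sketch <- function(trn, tst, classes, FUN=function(.x) dist(.x), ...) {
  dst <- FUN(rbind(trn, tst)) ## distances over the combined data
  cl <- factor(c(as.character(classes), rep(NA, nrow(tst)))) ## testing rows get NA
  DNN(dst=dst, cl=cl, ...) ## 'k' or 'd' must come through '...'
}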

See Also

class::knn

Examples

# NOT RUN {
iris.d <- dist(iris[, -5])

cl1 <- iris$Species
sam <- c(rep(0, 4), 1) > 0 ## recycled logical index: every 5th observation is TRUE
cl1[!sam] <- NA ## the rest lose their labels and become the testing sub-group
table(cl1, useNA="ifany")

## based on neighbor number
iris.pred <- DNN(dst=iris.d, cl=cl1, k=5)
Misclass(iris$Species[is.na(cl1)], iris.pred)

## based on neighborhood size
iris.pred <- DNN(dst=iris.d, cl=cl1, d=0.05)
table(iris.pred, useNA="ifany")
Misclass(iris$Species[is.na(cl1)], iris.pred)

## protection against "all points relevant"
DNN(dst=iris.d, cl=cl1, d=1)[1:5]
## and all are ties:
DNN(dst=iris.d, cl=cl1, d=1, details=TRUE)[, 1:5]

## any distance works
iris.d2 <- Gower.dist(iris[, -5])
iris.pred <- DNN(dst=iris.d2, cl=cl1, k=5)
Misclass(iris$Species[is.na(cl1)], iris.pred)

## combined
cl2 <- cl1
iris.pred <- DNN(dst=iris.d, cl=cl2, d=0.05)
cl2[is.na(cl2)] <- iris.pred
table(cl2, useNA="ifany")
iris.pred2 <- DNN(dst=iris.d, cl=cl2, k=5)
cl2[is.na(cl2)] <- iris.pred2
table(cl2, useNA="ifany")
Misclass(iris$Species, cl2)

## self-training and class proximity surrogate
cl3 <- iris$Species
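## dividing the transposed voting matrix by k=5 turns counts into per-class proportions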
t(DNN(dst=iris.d, cl=cl3, k=5, details=TRUE, self=TRUE))/5

## Dnn() with more class::knn()-like interface
iris.trn <- iris[sam, ]
iris.tst <- iris[!sam, ]
Dnn(iris.trn[, -5], iris.tst[, -5], iris.trn[, 5], k=7)

## stepwise DNN, note the warning when no NAs left
cl4 <- cl1
for (d in (5:14)/100) {
  iris.pred <- DNN(dst=iris.d, cl=cl4, d=d)
  cl4[is.na(cl4)] <- iris.pred
}
table(cl4, useNA="ifany")
Misclass(iris$Species, cl4)
## rushing to d=14% gives much worse results
iris.pred <- DNN(dst=iris.d, cl=cl1, d=0.14)
table(iris.pred, useNA="ifany")
Misclass(iris$Species[is.na(cl1)], iris.pred)
# }
