CVknn: Cross-validation for K nearest-neighbor regression

Description

This function calculates the estimated cross-validation prediction error for K nearest-neighbor regression and returns a suitable choice for K.

Usage

CVknn(X, Dvec, V, K.list = NULL, type = "eucli", plot = FALSE)

Arguments

X

a numeric design matrix, which is used in rhoKNN to estimate the probabilities of the disease status.

Dvec

an n * 3 binary matrix whose columns correspond to the three classes of the disease status. In row i, a 1 in column j indicates that the i-th subject belongs to class j, for j = 1, 2, 3. A row of NA values indicates a non-verified subject.

V

a binary vector containing the verification status (1 = verified, 0 = not verified).

K.list

a list of candidate values for K. If NULL (the default), the set \(\{1, 2, \ldots, n.ver\}\) is used, where \(n.ver\) is the number of verified subjects.

type

the type of distance; see rhoKNN for more details. Default "eucli".

plot

if TRUE, a plot of cross-validation prediction error is produced.
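
To make the expected layout of Dvec and V concrete, the following base R sketch builds both from a toy disease-status vector. The data and the column names are hypothetical and only illustrate the encoding described above; in practice the package's preDATA function (used in the Examples) prepares Dvec.

```r
# Hypothetical toy data: disease class (1, 2 or 3) for six subjects;
# NA marks subjects whose true status was never verified.
D <- c(1, 3, NA, 2, NA, 1)

# Verification indicator: 1 if the status is known, 0 otherwise.
V <- as.integer(!is.na(D))

# n * 3 indicator matrix; non-verified subjects get a row of NA values.
Dvec <- t(sapply(D, function(d) {
  if (is.na(d)) rep(NA_integer_, 3) else as.integer(1:3 == d)
}))
colnames(Dvec) <- c("D1", "D2", "D3")  # column names are an assumption

Dvec
```

Each verified row contains exactly one 1, and the rows for non-verified subjects are entirely NA.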

Value

A suitable choice for K is returned.

Details

Data are divided into two groups: the first contains the subjects with V = 1 (verified), and the second contains the subjects with V = 0 (not verified). In the first group, the discrepancy between the true disease status and the KNN estimates of the disease-status probabilities is computed for each candidate K (by default, from 1 to the number of verified subjects); see To Duc et al. (2016). The optimal value of K is the one that gives the smallest discrepancy.
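
The procedure can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the function name cv_knn_sketch is hypothetical, only the Euclidean distance is used, each verified subject is excluded from its own neighborhood (so K is capped at n.ver - 1), and the discrepancy is taken to be an average absolute difference.

```r
# Sketch only: leave-one-out KNN on the verified subjects.
# cv_knn_sketch and its return format are hypothetical.
cv_knn_sketch <- function(X, Dvec, V, K.list = NULL) {
  X <- as.matrix(X)
  ver <- V == 1
  X.ver <- X[ver, , drop = FALSE]     # covariates of verified subjects
  D.ver <- Dvec[ver, , drop = FALSE]  # their true class indicators
  n.ver <- nrow(X.ver)

  # Assumption: a subject is not its own neighbor, so K < n.ver.
  if (is.null(K.list)) K.list <- seq_len(n.ver - 1)
  K.list <- unlist(K.list)
  stopifnot(all(K.list >= 1), all(K.list < n.ver))

  # Pairwise Euclidean distances among verified subjects.
  dist.mat <- as.matrix(dist(X.ver))
  diag(dist.mat) <- Inf

  cv.err <- sapply(K.list, function(K) {
    # KNN estimate of the class probabilities for each verified subject.
    rho <- t(sapply(seq_len(n.ver), function(i) {
      nb <- order(dist.mat[i, ])[seq_len(K)]
      colMeans(D.ver[nb, , drop = FALSE])
    }))
    # Assumed discrepancy: mean absolute difference over all entries.
    sum(abs(D.ver - rho)) / (3 * n.ver)
  })

  list(K = K.list[which.min(cv.err)], cv.err = cv.err)
}

# Toy usage with simulated data (hypothetical, for illustration only).
set.seed(1)
n <- 60
D <- sample(1:3, n, replace = TRUE)
X <- cbind(rnorm(n, mean = D), rnorm(n, mean = 2 * D))
V <- rep(c(1, 1, 1, 0), length.out = n)  # 45 verified subjects
Dvec <- t(sapply(D, function(d) as.integer(1:3 == d)))
Dvec[V == 0, ] <- NA

res <- cv_knn_sketch(X, Dvec, V, K.list = 1:20)
res$K
```

The returned cv.err vector holds one discrepancy per candidate K, which is the quantity a plot of the cross-validation error would display.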

References

To Duc, K., Chiogna, M., Adimari, G. (2016): Nonparametric Estimation of ROC Surfaces Under Verification Bias. https://arxiv.org/abs/1604.04656v1. Submitted.

Examples


# NOT RUN {
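library(bcROCsurface)  # provides CVknn, preDATA and the EOC data
data(EOC)
# Covariates used to estimate the disease-status probabilities
XX <- cbind(EOC$CA125, EOC$CA153, EOC$Age)
# Build the n * 3 disease-status matrix (NA rows for non-verified subjects)
Dna <- preDATA(EOC$D, EOC$CA125)
Dvec.na <- Dna$Dvec
# Choose K by cross-validation with the "mahala" distance and plot the error
CVknn(XX, Dvec.na, EOC$V, type = "mahala", plot = TRUE)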

# }
