knncatimpute: Missing Value Imputation with kNN

Description

Imputes missing values in a matrix composed of categorical variables using \(k\) Nearest Neighbors.

Usage

knncatimpute(x, dist = NULL, nn = 3, weights = TRUE)

Value

A matrix of the same size as x in which all the missing values have been imputed.

Arguments

x: a numeric matrix containing missing values. All non-missing values must be integers between 1 and \(n_{cat}\), where \(n_{cat}\) is the maximum number of levels the categorical variables in x can take. If the \(k\) nearest observations should be used to replace the missing values of an observation, then each row must represent one of the observations and each column one of the variables. If the \(k\) nearest variables should be used to impute the missing values of a variable, then each row must correspond to a variable and each column to an observation.
dist: either a character string naming the distance measure or a distance matrix. If the former, dist must be either "smc", "cohen", or "pcc". If the latter, dist must be a symmetric matrix having the same number of rows as x. In this case, both the upper and the lower triangle of dist must contain the distances, and the row and column names of dist must be equal to the row names of x. If NULL, dist = "smc" is used.
nn: an integer specifying \(k\), i.e. the number of nearest neighbors used to impute the missing values.
weights: a logical value indicating whether weighted \(k\)NN should be used to impute the missing values. If TRUE, the vote of each nearest neighbor is weighted by the reciprocal of its distance to the observation or variable whose missing values are being replaced.
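The weighted vote described under weights can be illustrated with a small base R sketch. This is not the code used inside knncatimpute. The helpers smc_dist and impute_entry are hypothetical names introduced here, the distance is assumed to be the proportion of mismatching values among jointly observed columns, and a small constant is assumed as a guard against division by zero.

```r
# Hypothetical helper (not part of the package): distance between two rows,
# taken as the share of mismatches among the columns observed in both.
smc_dist <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)
  mean(a[ok] != b[ok])
}

# Hypothetical helper: impute entry (i, j) of x by a vote among the nn
# nearest rows that have an observed value in column j.
impute_entry <- function(x, i, j, nn = 3, weights = TRUE) {
  cand <- setdiff(which(!is.na(x[, j])), i)
  d <- vapply(cand, function(r) smc_dist(x[i, ], x[r, ]), numeric(1))
  keep <- order(d)[seq_len(min(nn, length(cand)))]
  # Reciprocal-distance weights; the floor of 1e-8 is an assumption of this
  # sketch that keeps the weight finite when a neighbor has distance 0.
  w <- if (weights) 1 / pmax(d[keep], 1e-8) else rep(1, length(keep))
  votes <- tapply(w, x[cand[keep], j], sum)
  as.integer(names(votes)[which.max(votes)])
}

x <- rbind(c(1, 2, 3, NA),
           c(1, 2, 3, 2),
           c(1, 2, 1, 2),
           c(3, 1, 1, 1),
           c(3, 3, 2, 1))

# Rows 2 and 3 are closest to row 1 and both carry the value 2 in column 4.
impute_entry(x, i = 1, j = 4, nn = 3)
```

With weights = FALSE every neighbor contributes one vote, so the result is a plain majority vote among the nn nearest rows.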

Author

Holger Schwender, holger.schwender@udo.edu

References

Schwender, H. (2007). Statistical Analysis of Genotype and Gene Expression Data. Dissertation, Department of Statistics, University of Dortmund.

Examples

if (FALSE) {
# Generate a data set consisting of 200 rows and 50 columns
# in which the values are integers between 1 and 3.
# Afterwards, remove 20 of the values randomly.

mat <- matrix(sample(3, 10000, TRUE), 200)
mat[sample(10000, 20)] <- NA

# Replace the missing values.

mat2 <- knncatimpute(mat)

# Replace the missing values using the 5 nearest neighbors
# and Cohen's Kappa.

mat3 <- knncatimpute(mat, nn = 5, dist = "cohen")

}
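The requirements on a user-supplied distance matrix (symmetric, both triangles filled, row and column names equal to the row names of x) can be checked before the call. The following is a minimal sketch under assumptions: mismatch_dist is a hypothetical helper that is not part of the package, the row names obs1, obs2, ... are made up for illustration, and the final call is guarded because it presumes that knncatimpute is available from an installed package named scrime.

```r
set.seed(1)
mat <- matrix(sample(3, 200, TRUE), 20)
rownames(mat) <- paste0("obs", seq_len(nrow(mat)))
mat[sample(length(mat), 5)] <- NA

# Hypothetical helper: full symmetric matrix holding, for each pair of rows,
# the proportion of mismatches among the columns observed in both rows.
mismatch_dist <- function(x) {
  n <- nrow(x)
  d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      ok <- !is.na(x[i, ]) & !is.na(x[j, ])
      d[i, j] <- mean(x[i, ok] != x[j, ok])
    }
  }
  d
}

d <- mismatch_dist(mat)

# Check the documented requirements on dist before passing it on.
stopifnot(isSymmetric(d),
          nrow(d) == nrow(mat),
          identical(rownames(d), rownames(mat)),
          identical(colnames(d), rownames(mat)))

# Assumption: the function is provided by the scrime package.
if (requireNamespace("scrime", quietly = TRUE)) {
  mat4 <- scrime::knncatimpute(mat, dist = d, nn = 5)
}
```

Filling both triangles explicitly, rather than passing a dist object, mirrors the requirement stated for the dist argument.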
