knncatimputeLarge: Missing Value Imputation with kNN for High-Dimensional Data

Description

Imputes missing values in a high-dimensional matrix composed of categorical variables using \(k\) Nearest Neighbors.

Usage

knncatimputeLarge(data, mat.na = NULL, fac = NULL, fac.na = NULL,
   nn = 3, distance = c("smc", "cohen", "snp1norm", "pcc"), 
   n.num = 100, use.weights = TRUE, verbose = FALSE)

Value

If mat.na = NULL, then a matrix of the same size as data in which the missing values have been replaced. If mat.na has been specified, then a matrix of the same size as mat.na in which the missing values have been replaced.

Arguments

data

a numeric matrix consisting of integers between 1 and \(n_{cat}\), where \(n_{cat}\) is the maximum number of levels the categorical variables can take. If mat.na is specified, data is assumed to contain only non-missing data, and the rows of data are used to impute the missing values in mat.na. Otherwise, data is also allowed to contain missing values, and the missing values in the rows of data are imputed using the rows of data that show no missing values.

Each row of data represents one of the objects that should be used to identify the \(k\) nearest neighbors, i.e., if the \(k\) nearest variables should be used to replace the missing values, then each row must represent one of the variables. If the \(k\) nearest observations should be used to impute the missing values, then each row must correspond to one of the observations.

mat.na

a numeric matrix containing missing values. Must have the same number of columns as data. All non-missing values must be integers between 1 and \(n_{cat}\). If NULL, data is assumed to also contain the rows with missing values.
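The relationship between data and mat.na can be illustrated with a small sketch. It assumes that knncatimputeLarge is provided by the scrime package (the package name is not stated on this page), and the object names complete and incomplete are hypothetical. The base R part splits a matrix into rows without and with missing values; the call itself is only attempted if the package is installed.

```r
set.seed(1)

# Toy data: 100 rows (e.g., SNPs) and 20 columns, categories 1, 2, 3.
mat <- matrix(sample(3, 2000, TRUE), nrow = 100)
mat[sample(length(mat), 10)] <- NA

# Rows without missing values go into 'data', the others into 'mat.na'.
has.na <- rowSums(is.na(mat)) > 0
complete <- mat[!has.na, , drop = FALSE]
incomplete <- mat[has.na, , drop = FALSE]

# Assumption: the function lives in the scrime package.
if (requireNamespace("scrime", quietly = TRUE)) {
  imputed <- scrime::knncatimputeLarge(complete, mat.na = incomplete)
  dim(imputed)         # same size as 'incomplete'
  sum(is.na(imputed))  # no missing values should remain
}
```

Both matrices keep the same number of columns, which mat.na requires.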

fac

a numeric or character vector of length nrow(data) specifying the values of a factor used to split data into subsets. If, e.g., the values of fac are given by the chromosomes to which the SNPs represented by the rows of data belong, then \(k\) nearest neighbors is applied chromosomewise to the missing values in mat.na (or data). If NULL, no such splitting is done. Must be specified if fac.na is specified.

fac.na

a numeric or character vector of length nrow(mat.na) specifying the values of a factor by which mat.na is split into subsets. Each possible value of fac.na must occur at least nn times in fac. Must be specified if both fac and mat.na are specified. If both fac and fac.na are NULL, then no splitting is done.
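The requirement that each value of fac.na occurs at least nn times in fac can be checked before calling the function. The following base R sketch uses made-up toy vectors and a hypothetical helper variable n.in.fac; it is not part of the package.

```r
# Toy grouping: 15 rows of 'data' spread over three chromosomes,
# and three rows of 'mat.na' lying on chromosomes 1 and 3.
fac <- rep(1:3, each = 5)
fac.na <- c(1, 1, 3)
nn <- 3

# For every group appearing in fac.na, count its rows in fac.
n.in.fac <- vapply(unique(fac.na), function(v) sum(fac == v), integer(1))

# TRUE if every group offers at least nn candidate neighbors.
all(n.in.fac >= nn)
```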

nn

an integer specifying \(k\), i.e., the number of nearest neighbors used to impute the missing values.

distance

character string naming the distance measure used in \(k\) Nearest Neighbors. Must be either "smc" (default), "cohen", "snp1norm" (which denotes the Manhattan distance for SNPs), or "pcc".
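To give an idea of what a matching-based distance between two categorical rows looks like, here is a minimal base R sketch. It assumes that "smc" refers to the simple matching coefficient and that the distance is one minus the proportion of matching positions, computed over the positions observed in both rows. The helper smc_dist is hypothetical, and the package's own computation (in particular its handling of missing values) may differ.

```r
# Distance based on the simple matching coefficient (illustrative only):
# 1 - (share of positions at which both rows show the same category).
smc_dist <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)   # compare only positions observed in both rows
  1 - mean(x[ok] == y[ok])
}

x <- c(1, 2, 3, 1, NA)
y <- c(1, 2, 1, 1, 2)
smc_dist(x, y)  # 3 of the 4 comparable positions match, so the distance is 0.25
```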

n.num

an integer giving the number of rows of mat.na considered simultaneously when replacing the missing values in mat.na.

use.weights

should weighted \(k\) nearest neighbors be used to impute the missing values? If TRUE, the votes of the nearest neighbors are weighted by the reciprocal of their distances to the variable (or observation) whose missing values are imputed.
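The effect of weighting the votes by reciprocal distances can be sketched in base R as follows. The helper names weighted_vote and majority_vote are hypothetical, the sketch assumes strictly positive distances, and it is meant as an illustration of the idea rather than the package's implementation.

```r
# Weighted vote: each neighbor votes for its category with weight 1 / distance.
weighted_vote <- function(neighbor.values, neighbor.dist) {
  w <- 1 / neighbor.dist                     # assumes all distances are > 0
  scores <- tapply(w, neighbor.values, sum)  # total weight per category
  as.integer(names(scores)[which.max(scores)])
}

# Unweighted vote: the most frequent category among the neighbors.
majority_vote <- function(neighbor.values) {
  as.integer(names(which.max(table(neighbor.values))))
}

values <- c(1, 2, 2)         # categories shown by the 3 nearest neighbors
dists <- c(0.1, 0.4, 0.5)    # their distances to the row being imputed

weighted_vote(values, dists)  # 1: the single very close neighbor dominates
majority_vote(values)         # 2: two of the three neighbors show category 2
```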

verbose

should more information about the progress of the imputation be printed?

Author

Holger Schwender, holger.schwender@udo.edu

References

Schwender, H. and Ickstadt, K. (2008). Imputing Missing Genotypes with \(k\) Nearest Neighbors. Technical Report, SFB 475, Department of Statistics, University of Dortmund. Appears soon.

Examples

if (FALSE) {
# Generate a data set consisting of 2000 rows and 100 columns (actually,
# knncatimputeLarge is made for much larger data sets), where the values
# are randomly drawn from the integers 1, 2, and 3.
# Afterwards, set 20 randomly chosen values to NA.

mat <- matrix(sample(3, 200000, TRUE), 2000)
mat[sample(200000, 20)] <- NA

# Apply knncatimputeLarge to mat to remove the missing values.

mat2 <- knncatimputeLarge(mat)
sum(is.na(mat))
sum(is.na(mat2))

# Now assume that the first 100 rows belong to SNPs from chromosome 1,
# the second 100 rows to SNPs from chromosome 2, and so on.

chromosome <- rep(1:20, each = 100)

# Apply knncatimputeLarge to mat chromosomewise, i.e. only consider
# the SNPs that belong to the same chromosome when replacing missing
# genotypes.

mat4 <- knncatimputeLarge(mat, fac = chromosome)

}
