knncatimputeLarge: Missing Value Imputation with kNN for High-Dimensional Data

Description

Imputes missing values in a high-dimensional matrix composed of categorical variables using $k$ Nearest Neighbors.

Usage

knncatimputeLarge(data, mat.na = NULL, fac = NULL, fac.na = NULL,
   nn = 3, distance = c("smc", "cohen", "snp1norm", "pcc"), 
   n.num = 100, use.weights = TRUE, verbose = FALSE)

Arguments

data

a numeric matrix consisting of integers between 1 and $n_{cat}$, where $n_{cat}$ is the maximum number of levels the categorical variables can take. If mat.na is specified, data is assumed to contain only non-missing d

mat.na

a numeric matrix containing missing values. Must have the same number of columns as data. All non-missing values must be integers between 1 and $n_{cat}$. If NULL, data is assumed to also contain the r

fac

a numeric or character vector of length nrow(data) specifying the values of a factor used to split data into subsets. If, e.g., the values of fac are given by the chromosomes to which the SNPs represented b

fac.na

a numeric or character vector of length nrow(mat.na) specifying the values of a factor by which mat.na is split into subsets. Each possible value of fac.na must occur at least nn times in fa

nn

an integer specifying $k$, i.e. the number of nearest neighbors used to impute the missing values.

distance

character string naming the distance measure used in $k$ Nearest Neighbors. Must be either "smc" (default), "cohen", "snp1norm" (which denotes the Manhattan distance for SNPs), or "pcc".
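
To make the default distance concrete, here is a minimal base-R sketch of a distance based on the simple matching coefficient: the proportion of positions at which two categorical vectors differ, computed over the positions observed in both. The helper name smc_dist is hypothetical and not part of the package, and the package's own implementation may treat missing values or scaling differently.

```r
# Hypothetical helper (not from the package): simple matching distance
# between two categorical vectors, using only positions observed in both.
smc_dist <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)      # pairwise-complete positions
  if (!any(ok)) return(NA_real_)   # nothing to compare
  mean(x[ok] != y[ok])             # share of mismatching categories
}

a <- c(1, 2, 3, 1, NA)
b <- c(1, 3, 3, 2, 2)
smc_dist(a, b)  # 2 mismatches among the 4 comparable positions
```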

n.num

an integer giving the number of rows of mat.na considered simultaneously when replacing the missing values in mat.na.
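
As a rough illustration of what processing n.num rows at a time means, the following base-R sketch splits row indices into consecutive blocks of at most n.num rows. How the function batches rows internally is an assumption here; the variable names n_rows and chunks are made up for the example.

```r
# Illustration only: split 250 row indices into blocks of at most n.num rows.
n_rows <- 250
n.num <- 100
chunks <- split(seq_len(n_rows), ceiling(seq_len(n_rows) / n.num))
lengths(chunks)  # block sizes: two full blocks and one partial block
```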

use.weights

should weighted $k$ nearest neighbors be used to impute the missing values? If TRUE, the votes of the nearest neighbors are weighted by the reciprocal of their distances to the variable (or observation) whose missing values are impu
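
The effect of weighting by reciprocal distances can be sketched in base R as follows. The helper weighted_vote, its arguments, and the small eps guard against zero distances are all assumptions made for illustration, not the package's code.

```r
# Hypothetical helper: pick the category with the largest sum of
# reciprocal-distance weights among the k nearest neighbors.
weighted_vote <- function(values, dists, n_cat, eps = 1e-8) {
  w <- 1 / pmax(dists, eps)  # eps avoids division by zero (an assumption)
  scores <- tapply(w, factor(values, levels = seq_len(n_cat)), sum)
  scores[is.na(scores)] <- 0  # categories not seen among the neighbors
  unname(which.max(scores))
}

# Two neighbors vote for category 2, but the single neighbor voting
# for category 1 is much closer and therefore outweighs them.
weighted_vote(values = c(1, 2, 2), dists = c(0.1, 0.4, 0.5), n_cat = 3)
```

With equal distances the weights are identical, so the result reduces to an ordinary majority vote.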

verbose

should more information about the progress of the imputation be printed?

Value

If mat.na = NULL, then a matrix of the same size as data in which the missing values have been replaced. If mat.na has been specified, then a matrix of the same size as mat.na in which the missing values have been replaced.

References

Schwender, H. and Ickstadt, K. (2008). Imputing Missing Genotypes with $k$ Nearest Neighbors. Technical Report, SFB 475, Department of Statistics, University of Dortmund. Appears soon.

Examples

# Generate a data set consisting of 100 columns and 2000 rows (actually,
# knncatimputeLarge is made for much larger data sets), where the values
# are randomly drawn from the integers 1, 2, and 3.
# Afterwards, randomly set 20 of the values to NA.

mat <- matrix(sample(3, 200000, TRUE), 2000)
mat[sample(200000, 20)] <- NA

# Apply knncatimputeLarge to mat to remove the missing values.

mat2 <- knncatimputeLarge(mat)
sum(is.na(mat))
sum(is.na(mat2))

# Now assume that the first 100 rows belong to SNPs from chromosome 1,
# the second 100 rows to SNPs from chromosome 2, and so on.

chromosome <- rep(1:20, each = 100)

# Apply knncatimputeLarge to mat chromosomewise, i.e. only consider
# the SNPs that belong to the same chromosome when replacing missing
# genotypes.

mat4 <- knncatimputeLarge(mat, fac = chromosome)
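
The mat.na argument is not exercised above. The following sketch shows one plausible way to use it, assuming (per the argument descriptions) that data then holds only the complete rows and mat.na the rows with missing values. The names has_na, complete, and incomplete are illustrative, and the call is guarded so the snippet also runs when the scrime package is not installed.

```r
# Sketch under the stated assumptions; rebuilds a small example matrix.
set.seed(1)
mat <- matrix(sample(3, 200000, TRUE), 2000)
mat[sample(200000, 20)] <- NA

# Separate rows with at least one missing value from complete rows.
has_na <- rowSums(is.na(mat)) > 0
complete <- mat[!has_na, , drop = FALSE]
incomplete <- mat[has_na, , drop = FALSE]

# Impute the incomplete rows using only the complete rows as neighbors.
if (requireNamespace("scrime", quietly = TRUE)) {
  imputed <- scrime::knncatimputeLarge(complete, mat.na = incomplete)
  sum(is.na(imputed))
}
```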
