impute.knn(data, k = 10, rowmax = 0.5, colmax = 0.8, maxp = 1500, rng.seed = 362436069)
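rowmax
Rows with more than rowmax missing data (given as a fraction; default 0.5, i.e. 50%) are imputed using the overall mean per sample.
colmax
If any column has more than colmax missing data (given as a fraction; default 0.8, i.e. 80%), the program halts and reports an error.
maxp
The largest block of genes imputed using the knn algorithm inside impute.knn (default 1500); larger blocks are divided by two-means clustering (recursively) prior to imputation. If maxp=p, only knn imputation is done.
rng.state
The state of the random number generator, if available, prior to the call to set.seed. Otherwise, it is NULL. If necessary, this can be used in the calling code to undo the side-effect of changing the random number generator sequence.
impute.knn uses $k$-nearest neighbors in the space of genes to impute missing
expression values.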
For each gene with missing values, we find the $k$ nearest neighbors using
a Euclidean metric, confined to the columns for which that gene is NOT
missing. Each candidate neighbor might be missing some of the
coordinates used to calculate the distance. In this case we average the
distance from the non-missing coordinates. Having found the k nearest
neighbors for a gene, we impute the missing elements by averaging those
(non-missing) elements of its neighbors. This can fail if ALL the
neighbors are missing in a particular element. In this case we use the
overall column mean for that block of genes.
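
The neighbor search and averaging described above can be sketched in base R for a single gene. This is only an illustration of the idea, not the impute package's internal code: the function name knn_impute_row is hypothetical, and the distance used here (mean squared difference over the coordinates observed in both genes) is one assumed reading of "average the distance from the non-missing coordinates".

```r
# Illustrative sketch only; knn_impute_row is a hypothetical helper,
# not part of the impute package.
# x: numeric matrix, genes in rows, samples in columns, NA = missing.
knn_impute_row <- function(x, i, k = 10) {
  target <- x[i, ]
  obs <- !is.na(target)                  # columns where gene i is observed
  others <- setdiff(seq_len(nrow(x)), i) # candidate neighbor rows
  # Averaged squared Euclidean distance over coordinates observed in both genes
  d <- vapply(others, function(j) {
    shared <- obs & !is.na(x[j, ])
    if (!any(shared)) return(Inf)        # no shared coordinates: never a neighbor first
    mean((target[shared] - x[j, shared])^2)
  }, numeric(1))
  nbrs <- others[order(d)][seq_len(min(k, length(others)))]
  for (col in which(!obs)) {
    vals <- x[nbrs, col]
    target[col] <- if (all(is.na(vals))) {
      mean(x[, col], na.rm = TRUE)       # fallback: column mean of the block
    } else {
      mean(vals, na.rm = TRUE)           # average the neighbors' observed values
    }
  }
  target
}

# Toy data: gene 1 is missing its third sample.
x <- rbind(c(1, 2, NA),
           c(1, 2, 3),
           c(1, 2, 5),
           c(9, 9, 9))
knn_impute_row(x, 1, k = 2)
```

In the toy matrix, genes 2 and 3 are the two nearest neighbors of gene 1, so the missing entry is filled with the average of their third-sample values.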
Since nearest neighbor imputation costs $O(p \log p)$ operations per gene,
where $p$ is the number of rows, the computational time can be excessive
for large $p$ and a large number of rows with missing values. Our strategy
is to break blocks with more than maxp genes into two smaller blocks using
two-means clustering. This is done recursively until all blocks have fewer
than maxp genes. For each block, $k$-nearest neighbor imputation is done
separately.
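
The recursive splitting can be sketched with stats::kmeans from base R. This is a rough illustration under stated assumptions, not the package's implementation: split_blocks is a hypothetical name, and because kmeans does not accept missing values, the sketch first fills them with block column means, which is an assumption made only for this example.

```r
# Illustrative sketch only; split_blocks is a hypothetical helper.
# Returns a list of row-index vectors, each with at most maxp rows.
split_blocks <- function(x, maxp = 1500, rows = seq_len(nrow(x))) {
  if (length(rows) <= maxp) return(list(rows))   # small enough: stop splitting
  block <- x[rows, , drop = FALSE]
  filled <- block
  if (anyNA(filled)) {
    # Assumption for this sketch: fill NAs with the block's column means
    # so that kmeans() can run.
    col_means <- colMeans(block, na.rm = TRUE)
    na_idx <- which(is.na(filled), arr.ind = TRUE)
    filled[na_idx] <- col_means[na_idx[, "col"]]
  }
  cl <- kmeans(filled, centers = 2)$cluster      # two-means clustering
  c(split_blocks(x, maxp, rows[cl == 1]),
    split_blocks(x, maxp, rows[cl == 2]))
}

set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), nrow = 10),
           matrix(rnorm(40, mean = 10), nrow = 10))
x[1, 1] <- NA
blocks <- split_blocks(x, maxp = 12)
lengths(blocks)
```

Each returned block could then be imputed separately. The sketch assumes every block has at least two distinct rows and no column that is entirely missing.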
We have set the default value of maxp to 1500. Depending on the speed of
the machine and the number of samples, this number might be increased.
Making it too small is counter-productive, because the number of
two-means clustering runs will increase.
For reproducibility, this function reseeds the random number generator using the seed provided or the default seed (362436069).
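
To make the reseeding side effect concrete, here is a minimal base R sketch of a function that records the caller's generator state before calling set.seed. The function reseeding_fn is a hypothetical stand-in written for illustration; it only mimics the rng.seed and rng.state return values described on this page and is not how impute.knn is implemented.

```r
# Hypothetical stand-in for a function that reseeds the RNG internally.
reseeding_fn <- function(seed = 362436069) {
  # Capture the caller's RNG state if one exists, otherwise NULL
  prior <- if (exists(".Random.seed", envir = globalenv())) {
    get(".Random.seed", envir = globalenv())
  } else {
    NULL
  }
  set.seed(seed)                       # side effect: global RNG stream changes
  list(value = runif(1), rng.seed = seed, rng.state = prior)
}

set.seed(12345)
before <- get(".Random.seed", envir = globalenv())
out <- reseeding_fn()

# Undo the side effect by restoring the saved state
if (!is.null(out$rng.state)) {
  assign(".Random.seed", out$rng.state, envir = globalenv())
}
restored <- identical(before, get(".Random.seed", envir = globalenv()))
```

Because the seed is fixed, repeated calls return the same value, and restoring rng.state lets the calling code continue its own random number sequence as if the call had not happened.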
library(impute)
data(khanmiss)
khan.expr <- khanmiss[-1, -(1:2)]
##
## First example
##
if(exists(".Random.seed")) rm(.Random.seed)
khan.imputed <- impute.knn(as.matrix(khan.expr))
##
## khan.imputed$data should now contain the imputed data matrix
## khan.imputed$rng.seed should contain the random number seed used
## in imputation. In the above invocation, it is the default seed.
##
khan.imputed$rng.seed # should be 362436069
khan.imputed$rng.state # should be NULL
##
## Second example
##
set.seed(12345)
saved.state <- .Random.seed
khan.imputed <- impute.knn(as.matrix(khan.expr))
# Assuming all goes well with no guarantees in case of error...
.Random.seed <- khan.imputed$rng.state
sum(saved.state - khan.imputed$rng.state) # should be zero!
save(khan.imputed, file="khanimputation.Rda")