kNN: k-Nearest Neighbour Imputation

Description

k-Nearest Neighbour Imputation based on a variation of the Gower Distance for numerical, categorical, ordered and semi-continous variables.

Usage

kNN(
  data,
  variable = colnames(data),
  metric = NULL,
  k = 5,
  dist_var = colnames(data),
  weights = NULL,
  numFun = median,
  catFun = maxCat,
  makeNA = NULL,
  NAcond = NULL,
  impNA = TRUE,
  donorcond = NULL,
  mixed = vector(),
  mixed.constant = NULL,
  trace = FALSE,
  imp_var = TRUE,
  imp_suffix = "imp",
  addRF = FALSE,
  onlyRF = FALSE,
  addRandom = FALSE,
  useImputedDist = TRUE,
  weightDist = FALSE
)
# S3 method for data.table
kNN(
  data,
  variable = colnames(data),
  metric = NULL,
  k = 5,
  dist_var = colnames(data),
  weights = NULL,
  numFun = median,
  catFun = maxCat,
  makeNA = NULL,
  NAcond = NULL,
  impNA = TRUE,
  donorcond = NULL,
  mixed = vector(),
  mixed.constant = NULL,
  trace = FALSE,
  imp_var = TRUE,
  imp_suffix = "imp",
  addRF = FALSE,
  onlyRF = FALSE,
  addRandom = FALSE,
  useImputedDist = TRUE,
  weightDist = FALSE
)
# S3 method for data.frame
kNN(
  data,
  variable = colnames(data),
  metric = NULL,
  k = 5,
  dist_var = colnames(data),
  weights = NULL,
  numFun = median,
  catFun = maxCat,
  makeNA = NULL,
  NAcond = NULL,
  impNA = TRUE,
  donorcond = NULL,
  mixed = vector(),
  mixed.constant = NULL,
  trace = FALSE,
  imp_var = TRUE,
  imp_suffix = "imp",
  addRF = FALSE,
  onlyRF = FALSE,
  addRandom = FALSE,
  useImputedDist = TRUE,
  weightDist = FALSE
)
# S3 method for survey.design
kNN(
  data,
  variable = colnames(data),
  metric = NULL,
  k = 5,
  dist_var = colnames(data),
  weights = NULL,
  numFun = median,
  catFun = maxCat,
  makeNA = NULL,
  NAcond = NULL,
  impNA = TRUE,
  donorcond = NULL,
  mixed = vector(),
  mixed.constant = NULL,
  trace = FALSE,
  imp_var = TRUE,
  imp_suffix = "imp",
  addRF = FALSE,
  onlyRF = FALSE,
  addRandom = FALSE,
  useImputedDist = TRUE,
  weightDist = FALSE
)
# S3 method for default
kNN(
  data,
  variable = colnames(data),
  metric = NULL,
  k = 5,
  dist_var = colnames(data),
  weights = NULL,
  numFun = median,
  catFun = maxCat,
  makeNA = NULL,
  NAcond = NULL,
  impNA = TRUE,
  donorcond = NULL,
  mixed = vector(),
  mixed.constant = NULL,
  trace = FALSE,
  imp_var = TRUE,
  imp_suffix = "imp",
  addRF = FALSE,
  onlyRF = FALSE,
  addRandom = FALSE,
  useImputedDist = TRUE,
  weightDist = FALSE
)

Arguments

data

data.frame or matrix

variable

variables where missing values should be imputed

metric

metric to be used for calculating the distances between

number of Nearest Neighbours used

dist_var

names or variables to be used for distance calculation

weights

weights for the variables for distance calculation. If weights = "auto" weights will be selected based on variable importance from random forest regression, using function ranger. Weights are calculated for each variable seperately.

numFun

function for aggregating the k Nearest Neighbours in the case of a numerical variable

catFun

function for aggregating the k Nearest Neighbours in the case of a categorical variable

makeNA

list of length equal to the number of variables, with values, that should be converted to NA for each variable

NAcond

list of length equal to the number of variables, with a condition for imputing a NA

impNA

TRUE/FALSE whether NA should be imputed

donorcond

condition for the donors e.g. list(">5"), must be NULL or a list of same length as variable

mixed

names of mixed variables

mixed.constant

vector with length equal to the number of semi-continuous variables specifying the point of the semi-continuous distribution with non-zero probability

trace

TRUE/FALSE if additional information about the imputation process should be printed

imp_var

TRUE/FALSE if a TRUE/FALSE variables for each imputed variable should be created show the imputation status

imp_suffix

suffix for the TRUE/FALSE variables showing the imputation status

addRF

TRUE/FALSE each variable will be modelled using random forest regression (ranger) and used as additional distance variable.

onlyRF

TRUE/FALSE if TRUE only additional distance variables created from random forest regression will be used as distance variables.

addRandom

TRUE/FALSE if an additional random variable should be added for distance calculation

useImputedDist

TRUE/FALSE if an imputed value should be used for distance calculation for imputing another variable. Be aware that this results in a dependency on the ordering of the variables.

weightDist

TRUE/FALSE if the distances of the k nearest neighbours should be used as weights in the aggregation step

Value

the imputed data set.

References

A. Kowarik, M. Templ (2016) Imputation with R package VIM. Journal of Statistical Software, 74(7), 1-16.

Examples

Run this code

# NOT RUN {
data(sleep)
kNN(sleep)
library(laeken)
kNN(sleep, numFun = weightedMean, weightDist=TRUE)

# }

Run the code above in your browser using DataLab