Learn R Programming

NADIA (version 0.4.2)

missMDA_FMAD_MCA_PCA: Perform imputation using MCA, PCA, or FMAD algorithm.

Description

Function use missMDA package to perform data imputation. Function can found the best number of dimensions for this imputation. User can choose whether to return one imputed dataset or list or imputed datasets form Multiple Imputation.

Usage

missMDA_FMAD_MCA_PCA(
  df,
  col_type = NULL,
  percent_of_missing = NULL,
  optimize_ncp = TRUE,
  set_ncp = 2,
  col_0_1 = FALSE,
  ncp.max = 5,
  return_one = TRUE,
  random.seed = 123,
  maxiter = 998,
  coeff.ridge = 1,
  threshold = 1e-06,
  method = "Regularized",
  out_file = NULL,
  return_ncp = FALSE
)

Value

Retrun one imputed data.frame if retrun_one=True or list of imputed data.frames if retrun_one=False.

Arguments

df

data.frame. Df to impute with column names and without target column.

col_type

character vector. Vector containing column type names.

percent_of_missing

numeric vector. Vector contatining percent of missing data in columns for example c(0,1,0,0,11.3,..)

optimize_ncp

logical. If true number of dimensions used to predict the missing entries will be optimized. If False by default ncp = 2 it's used.

set_ncp

intiger >0. Number of dimensions used by algortims. Used only if optimize_ncp = Flase.

col_0_1

Decaid if add bonus column informing where imputation been done. 0 - value was in dataset, 1 - value was imputed. Default False. (Works only for returning one dataset).

ncp.max

integer corresponding to the maximum number of components to test. Default 5.

return_one

One or many imputed sets will be returned. Default True.

random.seed

integer, by default random.seed = NULL implies that missing values are initially imputed by the mean of each variable. Other values leads to a random initialization

maxiter

maximal number of iteration in algortihm.

coeff.ridge

Value use in Regularized method.

threshold

threshold for convergence.

method

method used in imputation algoritm.

out_file

Output log file location if file already exists log message will be added. If NULL no log will be produced.

return_ncp

Function should return used ncp value

Author

Julie Josse, Francois Husson (2016) tools:::Rd_expr_doi("10.18637/jss.v070.i01")

Details

Function use different algorithm to adjust for variable types in df. For only numeric data PCA will be used. MCA for only categorical and FMAD for mixed. If optimize==TRUE function will try to find optimal ncp if its not possible default ncp=2 will be used. In some cases ncp=1 will be used if ncp=2 don't work. For multiple imputations, if set ncp don't work error will be return.

References

Julie Josse, Francois Husson (2016). missMDA: A Package for Handling Missing Values in Multivariate Data Analysis. Journal of Statistical Software, 70(1), 1-31. doi:10.18637/jss.v070.i01

Examples

Run this code
{
  raw_data <- data.frame(
    a = as.factor(sample(c("red", "yellow", "blue", NA), 1000, replace = TRUE)),
    b = as.integer(1:1000),
    c = as.factor(sample(c("YES", "NO", NA), 1000, replace = TRUE)),
    d = runif(1000, 1, 10),
    e = as.factor(sample(c("YES", "NO"), 1000, replace = TRUE)),
    f = as.factor(sample(c("male", "female", "trans", "other", NA), 1000, replace = TRUE)))

  # Prepering col_type
  col_type <- c("factor", "integer", "factor", "numeric", "factor", "factor")

  percent_of_missing <- 1:6
  for (i in percent_of_missing) {
    percent_of_missing[i] <- 100 * (sum(is.na(raw_data[, i])) / nrow(raw_data))
  }


  imp_data <- missMDA_FMAD_MCA_PCA(raw_data, col_type, percent_of_missing, optimize_ncp = FALSE)
  # Check if all missing value was imputed
  sum(is.na(imp_data)) == 0
  # TRUE
}

Run the code above in your browser using DataLab