estim_ncpMCA: Estimate the number of dimensions for the Multiple Correspondence Analysis by cross-validation

Description

Estimate the number of dimensions for the Multiple Correspondence Analysis by cross-validation

Usage

estim_ncpMCA(don, ncp.min=0, ncp.max=5,  method = c("Regularized","EM"), 
     method.cv = c("gcv","loo","Kfold"), nbsim=100, pNA=0.05, threshold=1e-4)

Arguments

don

a data.frame of categorical variables, with or without missing entries

ncp.min

integer corresponding to the minimum number of components to test

ncp.max

integer corresponding to the maximum number of components to test

method

imputation algorithm used to predict the removed entries: "Regularized" (default) or "EM"

method.cv

string: "gcv" for generalised cross-validation, "loo" for leave-one-out cross-validation, or "Kfold" for cross-validation on cells removed at random (see Details)

nbsim

number of simulations, useful only if method.cv="Kfold"

pNA

proportion of missing values added to the data set (0.05 by default, i.e. 5% of the cells), useful only if method.cv="Kfold"

threshold

the threshold for assessing convergence

Value

ncp

the number of components retained for the MCA

criterion

the criterion (the MSEP) calculated for each number of components

Details

For leave-one-out cross-validation (method.cv="loo"), each cell of the data matrix is removed in turn and predicted with an MCA model using ncp.min to ncp.max dimensions. The number of components that leads to the smallest mean square error of prediction (MSEP) is retained.

For Kfold cross-validation (method.cv="Kfold"), a proportion pNA of missing values is inserted at random in the data matrix and predicted with an MCA model using ncp.min to ncp.max dimensions. This process is repeated nbsim times, and the number of components that leads to the smallest MSEP is retained.

For both cross-validation methods, the missing entries are predicted using the imputeMCA function, that is, with the regularized iterative MCA algorithm (method="Regularized") or the iterative MCA algorithm (method="EM"). The regularized version is more appropriate to avoid overfitting issues.

The cross-validation strategy is time-consuming. A less computationally demanding alternative is the generalised cross-validation criterion (method.cv="gcv").
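The random-removal scheme described for method.cv="Kfold" can be sketched in base R as follows. This is only an illustration of the loop structure (remove cells, predict them, average the squared error, keep the minimum), not the package's implementation. The helpers disj and impute_disj and the toy data are hypothetical; in particular, impute_disj is a plain iterative SVD on the indicator matrix and stands in for imputeMCA, which uses (regularized) iterative MCA instead.

```r
set.seed(1)

# Toy categorical data (made up for this sketch)
don <- data.frame(
  a = factor(sample(c("x", "y"), 60, replace = TRUE)),
  b = factor(sample(c("u", "v", "w"), 60, replace = TRUE)),
  c = factor(sample(c("p", "q"), 60, replace = TRUE))
)

# Hypothetical helper: indicator coding, one 0/1 column per category;
# a missing factor value gives NA in all of that variable's columns
disj <- function(df) {
  do.call(cbind, lapply(df, function(f) {
    outer(as.integer(f), seq_along(levels(f)), "==") * 1
  }))
}

# Hypothetical stand-in for imputeMCA: iterative rank-ncp SVD imputation
# of the indicator matrix (ncp = 0 reduces to column-mean imputation)
impute_disj <- function(Z, ncp, maxiter = 50) {
  miss <- is.na(Z)
  Zhat <- Z
  Zhat[miss] <- colMeans(Z, na.rm = TRUE)[col(Z)[miss]]
  if (ncp == 0) return(Zhat)
  k <- seq_len(ncp)
  for (i in seq_len(maxiter)) {
    mu <- colMeans(Zhat)
    s <- svd(sweep(Zhat, 2, mu))
    fit <- s$u[, k, drop = FALSE] %*% diag(s$d[k], ncp) %*%
      t(s$v[, k, drop = FALSE])
    Zhat[miss] <- sweep(fit, 2, mu, "+")[miss]
  }
  Zhat
}

ncp.min <- 0; ncp.max <- 3; nbsim <- 20; pNA <- 0.05
n <- nrow(don); p <- ncol(don)
Z <- disj(don)  # complete indicator matrix, used as the reference
res <- matrix(NA_real_, nbsim, ncp.max - ncp.min + 1,
              dimnames = list(NULL, as.character(ncp.min:ncp.max)))

for (sim in seq_len(nbsim)) {
  # Remove a proportion pNA of the cells at random
  mask <- matrix(FALSE, n, p)
  mask[sample(n * p, round(pNA * n * p))] <- TRUE
  donNA <- don
  donNA[mask] <- NA
  ZNA <- disj(donNA)
  held <- is.na(ZNA)
  # Predict the removed cells for each candidate number of dimensions
  for (k in ncp.min:ncp.max) {
    Zhat <- impute_disj(ZNA, k)
    res[sim, k - ncp.min + 1] <- mean((Zhat[held] - Z[held])^2)
  }
}

criterion <- colMeans(res)                       # MSEP per number of dimensions
ncp <- as.integer(names(which.min(criterion)))   # dimension with smallest MSEP
```

Here criterion and ncp mirror the two components of the returned value; with random toy data like this, no particular number of dimensions is expected to win.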

References

Josse, J., Chavent, M., Liquet, B. and Husson, F. (2010). Handling missing values with Regularized Iterative Multiple Correspondence Analysis, Journal of Classification, 29 (1), pp. 91-116.

Examples

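library(missMDA)   # provides estim_ncpMCA and the vnf data set
data(vnf)
result <- estim_ncpMCA(vnf, ncp.min=0, ncp.max=5)
result$ncp         # number of components retained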