This function performs the MacroPCA algorithm, which can deal with Missing values and Cellwise
and Rowwise Outliers. Note that this function first calls checkDataSet and then analyzes the remaining cleaned data.
MacroPCA(X, k = 0, MacroPCApars = NULL)
X is the input data, and must be an \(n\) by \(d\) matrix or a data frame.
k is the desired number of principal components.
  If k = 0 or k = NULL, the algorithm will compute the percentage
  of explained variability for k up to kmax, show a scree plot,
  and suggest choosing a value of k such that the cumulative percentage of
  explained variability is at least 80%.
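The 80% rule described above can be sketched in base R. This is only an illustration: the `eigenvalues` vector is made up, and the percentages are taken relative to the eigenvalues shown, which is an assumption about how the function normalizes them.

```r
# Hypothetical eigenvalues for kmax = 5 components (illustrative values only)
eigenvalues <- c(5.2, 2.1, 0.9, 0.5, 0.3)

# Cumulative percentage of explained variability for k = 1, ..., kmax
cumpct <- 100 * cumsum(eigenvalues) / sum(eigenvalues)

# Smallest k whose cumulative percentage reaches at least 80%
k <- which(cumpct >= 80)[1]
print(round(cumpct, 1))
print(k)
```

In practice the scree plot should still be inspected, since the 80% threshold is a suggestion rather than a hard rule.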
MacroPCApars is a list of available options, detailed below. If MacroPCApars = NULL the defaults below are used.
DDCpars  A list with parameters for the first step of the MacroPCA
      algorithm (for the complete list see the function
      DDC). Default is NULL.
kmax  The maximal number of principal components to compute. Default
       is kmax = 10. If k is provided kmax does not need to be specified,
       unless k is larger than 10 in which case you need to set kmax
       high enough.
alpha  This is the coverage, i.e. the fraction of rows the algorithm
      should give full weight. alpha should be between 0.50 and 1. The default is
      0.50.
scale  A value indicating whether and how the original variables should
      be scaled. If scale = FALSE or scale = NULL no scaling is
      performed (and a vector of 1s is returned in the $scaleX slot).
      If scale = TRUE (default) the data are scaled by a 1-step M-estimator of scale with the Tukey biweight weight function to have a robust scale of 1.
      Alternatively scale can be a vector of length
      equal to the number of columns of X. The resulting scale estimates are
      returned in the $scaleX slot of the MacroPCA output.
maxdir  The maximal number of random directions to use for computing the
      outlyingness of the data points. Default is maxdir = 250. If the number
      \(n\) of observations is small, all \(n * (n - 1) / 2\) pairs of
      observations are used.
distprob  The quantile determining the cutoff values
      for orthogonal and score distances. Default is 0.99.
silent 
      If TRUE, statements tracking the algorithm's progress will not be printed. Defaults to FALSE.
maxiter  Maximum number of iterations. Default is 20.
tol  Tolerance for iterations. Default is 0.005.
bigOutput  Whether to compute and return NAimp, Cellimp and Fullimp. Defaults to TRUE.
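As a rough sketch, the options above can be collected in a named list and passed through MacroPCApars. The values below simply restate the documented defaults, with silent switched to TRUE; the toy matrix `X` and the object names `pars` and `fit` are illustrative. The call itself is guarded so that the snippet also runs when cellWise is not installed.

```r
# Option list restating the documented defaults (silent changed to TRUE)
pars <- list(
  DDCpars   = NULL,   # parameters for the first step, see DDC
  kmax      = 10,     # maximal number of principal components
  alpha     = 0.50,   # coverage, between 0.50 and 1
  scale     = TRUE,   # robust scaling of the variables
  maxdir    = 250,    # maximal number of random directions
  distprob  = 0.99,   # quantile for the OD and SD cutoffs
  silent    = TRUE,   # suppress progress messages
  maxiter   = 20,     # maximum number of iterations
  tol       = 0.005,  # tolerance for iterations
  bigOutput = TRUE    # compute and return NAimp, Cellimp and Fullimp
)

# Toy data, for illustration only
set.seed(1)
X <- matrix(rnorm(50 * 6), nrow = 50, ncol = 6)

# Fit only if the package is available
if (requireNamespace("cellWise", quietly = TRUE)) {
  fit <- cellWise::MacroPCA(X, k = 2, MacroPCApars = pars)
}
```

If k were larger than 10, kmax would have to be raised accordingly, as noted in the description of kmax.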
A list with components:
the options used in the call.
Cleaned data after checkDataSet.
results of the first step of MacroPCA. These are needed to run MacroPCApredict on new data.
the scales of the columns of X.
the number of principal components.
the columns are the k loading vectors.
the k eigenvalues.
vector with the fitted center.
alpha from the input.
h (computed from alpha).
number of iteration steps.
convergence criterion.
data with all NA's imputed by MacroPCA.
scores of X.NAimp.
orthogonal distances of the rows of X.NAimp.
cutoff value for the OD.
score distances of the rows of X.NAimp.
cutoff value for the SD.
row numbers of rowwise outliers.
scale of the residuals.
standardized residuals. Note that these are NA
  for all missing values of X.
indices of cellwise outliers.
various results for the NA-imputed data.
various results for the cell-imputed data.
various results for the fully imputed data.
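To illustrate how score distances and their cutoff relate, here is a minimal base R sketch. It assumes the usual robust PCA convention, in which the score distance of a row is the square root of the sum of its squared scores divided by the eigenvalues, and the cutoff is the square root of the chi-squared quantile with k degrees of freedom at probability distprob. Whether MacroPCA computes them in exactly this way is an assumption, and the scores, eigenvalues and variable names are made up.

```r
set.seed(1)
k <- 2
distprob <- 0.99

# Hypothetical eigenvalues and scores (not MacroPCA output)
eigenvalues <- c(4, 1)
scores <- cbind(rnorm(100, sd = 2), rnorm(100, sd = 1))
scores[1, ] <- c(12, 0)  # plant one clearly outlying row

# Score distance: scores standardized by the eigenvalues
SD <- sqrt(rowSums(sweep(scores^2, 2, eigenvalues, "/")))

# Assumed cutoff: square root of a chi-squared quantile with k degrees of freedom
cutoffSD <- sqrt(qchisq(distprob, df = k))

# Rows whose score distance exceeds the cutoff
flagged <- which(SD > cutoffSD)
print(cutoffSD)
print(flagged)
```

The planted row has score distance 6, well above the cutoff, so it is flagged. With distprob = 0.99, about 1% of regular rows would also be expected to exceed the cutoff.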
Hubert, M., Rousseeuw, P.J., Van den Bossche, W. (2019). MacroPCA: An all-in-one PCA method allowing for missing values as well as cellwise and rowwise outliers. Technometrics, 61(4), 459-473.
# NOT RUN {
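library(MASS)      # for mvrnorm
library(cellWise)  # for MacroPCA and cellMap
set.seed(12345)
n <- 50; d <- 10
# Correlation matrix: 1 on the diagonal, 0.9 elsewhere
A <- matrix(0.9, d, d); diag(A) <- 1
x <- mvrnorm(n, rep(0, d), A)
# Set 50 randomly chosen cells to NA and 50 to the outlying value 10
x[sample(1:(n * d), 50, FALSE)] <- NA
x[sample(1:(n * d), 50, FALSE)] <- 10
# Prepend a column of case numbers
x <- cbind(1:n, x)
# Fit MacroPCA with k = 2 components
MacroPCA.out <- MacroPCA(x, 2)
cellMap(MacroPCA.out$remX, MacroPCA.out$stdResid,
        columnlabels = 1:d, rowlabels = 1:n)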
# For more examples, we refer to the vignette:
vignette("MacroPCA_examples")
# }