This function aims to detect cellwise outliers in the data. A cellwise outlier is an entry of the data matrix that is substantially higher or lower than what could be expected based on the other cells in its column as well as the other cells in its row, taking the relations between the columns into account. Note that this function first calls checkDataSet and then analyzes the remaining cleaned data.
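To make the cleaning step more concrete, the following base-R sketch mimics the kind of column filtering controlled by the fracNA, numDiscrete and precScale options of DDCpars. It is a simplified illustration under the documented defaults, not the code of checkDataSet itself; the toy data frame and all variable names are made up for this example, and row filtering is left out.

```r
# Illustrative only: a simplified stand-in for the column cleaning step,
# not the implementation used by checkDataSet.
set.seed(1)
X <- data.frame(
  good   = rnorm(10),
  mostNA = c(rnorm(4), rep(NA, 6)),   # 60% missing values
  binary = rep(c(0, 1), 5),           # only 2 distinct values
  const  = rep(4, 10)                 # zero scale (and a single distinct value)
)

# Documented defaults of the corresponding DDCpars options
fracNA <- 0.5; numDiscrete <- 3; precScale <- 1e-12

# Columns that do NOT have fewer NAs than the fraction fracNA
tooManyNA <- colMeans(is.na(X)) >= fracNA
# Columns taking on numDiscrete or fewer distinct values
discrete  <- sapply(X, function(v) length(unique(na.omit(v))) <= numDiscrete)
# Columns whose median absolute deviation is not larger than precScale
zeroScale <- sapply(X, function(v) mad(v, na.rm = TRUE) <= precScale)

keep <- !(tooManyNA | discrete | zeroScale)
remX <- X[, keep, drop = FALSE]
names(remX)
```

Only the column `good` survives all three filters in this toy example.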
DDC(X, DDCpars = list())
X is the input data, and must be an \(n\) by \(d\) matrix or a data frame.
DDCpars is a list of available options:
fracNA 
      Only consider columns and rows whose fraction of NAs (missing
      values) is below this value. Defaults to \(0.5\).
numDiscrete 
      A column that takes on numDiscrete or fewer values will
      be considered discrete and not used in the analysis. Defaults to \(3\).
precScale 
      Only consider columns whose scale is larger than precScale.
      Here scale is measured by the median absolute deviation. Defaults to \(1e-12\).
cleanNAfirst 
      If "columns", first columns then rows are checked for NAs.
      If "rows", first rows then columns are checked for NAs.
      "automatic" checks columns first if \(d \geq 5n\) and rows first otherwise.
      Defaults to "automatic".
tolProb 
      Tolerance probability, with default \(0.99\), which
      determines the cutoff values for flagging outliers in
      several steps of the algorithm.
corrlim 
      When trying to estimate \(z_{ij}\) from other variables \(h\), we 
      will only use variables \(h\) with \(|\rho_{j,h}| \ge corrlim\).
      Variables \(j\) without any correlated variables \(h\) satisfying 
      this are considered standalone, and treated on their own. Defaults to \(0.5\).
combinRule 
      The operation to combine estimates of \(z_{ij}\) coming from
      other variables \(h\): can be "mean", "median",
      "wmean" (weighted mean) or "wmedian" (weighted median).
      Defaults to "wmean".
returnBigXimp 
      If TRUE, the imputed data matrix Ximp in the output
      will include the rows and columns that were not
      part of the analysis (and can still contain NAs). Defaults to FALSE.
silent 
      If TRUE, statements tracking the algorithm's progress will not be printed. Defaults to FALSE.
nLocScale 
    When estimating location or scale from more than nLocScale data values, the computation is based on a random sample of size nLocScale to save time. When nLocScale = 0, all values are used. Defaults to 25000.
fastDDC 
      Whether to use the fastDDC option or not. The fastDDC algorithm uses approximations
      that make it feasible to handle high-dimensional data. Defaults to TRUE for \(d > 750\) and FALSE otherwise.
standType 
      The location and scale estimators used for robust standardization. Should be one of "1stepM", "mcd" or "wrap". See estLocScale for more info. Only used when fastDDC = FALSE. Defaults to "1stepM".
corrType 
      The correlation estimator used to find the neighboring variables. Must be one of "wrap" (wrapping correlation), "rank" (Spearman correlation) or "gkwls" (Gnanadesikan-Kettenring correlation followed by weighting). Only used when fastDDC = FALSE. Defaults to "gkwls".
transFun 
      The transformation function used to compute the robust correlations when fastDDC = TRUE. Can be "wrap" or "rank". Defaults to "wrap".
nbngbrs 
     When fastDDC = TRUE, each column is predicted from at most nbngbrs columns correlated to it.
     Defaults to 100.
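As a rough illustration of how these options are passed, the sketch below builds a DDCpars list that overrides a few of them. It assumes, as the default DDCpars = list() suggests, that options left out of the list keep their defaults. The simulated data and the chosen values are arbitrary, and the call to DDC is guarded because it requires the cellWise package to be installed.

```r
# Only options that should differ from the defaults are listed here
# (assumption: unlisted options fall back to their documented defaults).
pars <- list(
  fracNA     = 0.3,        # stricter NA filter than the default 0.5
  tolProb    = 0.995,      # higher tolerance probability than the default 0.99
  combinRule = "wmedian",  # weighted median instead of the default "wmean"
  silent     = TRUE        # do not print progress statements
)

# The call itself needs the cellWise package, so it is guarded.
if (requireNamespace("cellWise", quietly = TRUE)) {
  set.seed(1)
  n <- 60; d <- 6
  # A shared row effect makes the columns correlated with each other
  X <- matrix(rnorm(n * d), n, d) + 2 * rnorm(n)
  X[3, 2] <- 15                       # plant one outlying cell
  fit <- cellWise::DDC(X, DDCpars = pars)
  str(fit$DDCpars[names(pars)])       # options as recorded in the output
}
```

Keeping the list short makes it easier to see which settings deviate from the defaults; the DDCpars component of the result records the options that were used.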
A list with components:
DDCpars 
    The list of options used.
colInAnalysis 
    The column indices of the columns used in the analysis.
rowInAnalysis 
    The row indices of the rows used in the analysis.
namesNotNumeric 
    The names of the variables which are not numeric.
namesCaseNumber 
    The name(s) of the variable(s) that contained the case numbers and were therefore removed.
namesNAcol 
    Names of the columns left out due to too many NAs.
namesNArow 
    Names of the rows left out due to too many NAs.
namesDiscrete 
    Names of the discrete variables.
namesZeroScale 
    Names of the variables with zero scale.
remX 
    Cleaned data after checkDataSet.
locX 
    Estimated location of X.
scaleX 
    Estimated scales of X.
Z 
    Standardized remX.
nbngbrs 
    Number of neighbors used in estimation.
ngbrs 
    Indicates neighbors of each column, i.e. the columns most correlated with it.
robcors 
    Robust correlations.
robslopes 
    Robust slopes.
deshrinkage 
    The deshrinkage factor used for every connected (i.e. non-standalone) column of X.
Xest 
    Predicted X.
scalestres 
    Scale estimate of the residuals X - Xest.
stdResid 
    Residuals of the original X minus the estimated Xest, standardized by column.
indcells 
    Indices of the cells which were flagged in the analysis.
Ti 
    Outlyingness (test) value of each row.
medTi 
    Median of the Ti values.
madTi 
    MAD (median absolute deviation) of the Ti values.
indrows 
    Indices of the rows which were flagged in the analysis.
indNAs 
    Indices of all NA cells.
indall 
    Indices of all cells which were flagged in the analysis plus all cells in flagged rows plus the indices of the NA cells.
Ximp 
    Imputed X.
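The relation between the index components can be sketched in base R. The example below assumes that the indices are linear (column-major) positions in the cleaned data matrix, and that cells are flagged when the absolute standardized residual exceeds the square root of the chi-squared quantile with one degree of freedom at tolProb, the cutoff described in Rousseeuw and Van den Bossche (2018). The small residual matrix and the flagged row are invented for illustration and do not come from an actual DDC run.

```r
n <- 4; d <- 3   # dimensions of a hypothetical cleaned data matrix

# Made-up standardized residuals; the NA stands in for a missing cell
stdResid <- matrix(c( 0.1, -0.4,  5.2,  0.3,
                      0.2,   NA,  0.5, -0.1,
                     -3.1,  0.6,  0.2,  0.4), nrow = n, ncol = d)

cutoff   <- sqrt(qchisq(0.99, df = 1))      # about 2.576 for tolProb = 0.99
indcells <- which(abs(stdResid) > cutoff)   # linear indices of flagged cells
indNAs   <- which(is.na(stdResid))          # linear indices of the NA cells
indrows  <- 4                               # pretend row 4 was flagged as a whole

# Linear indices of all cells lying in the flagged rows
rowCells <- as.vector(outer(indrows, n * (0:(d - 1)), "+"))

# Union of flagged cells, cells in flagged rows, and NA cells
indall <- sort(unique(c(indcells, rowCells, indNAs)))

# Convert linear indices back to (row, column) pairs
arrayInd(indcells, c(n, d))
```

Here the flagged cells are at positions (3, 1) and (1, 3), and indall additionally contains the three cells of row 4 and the single NA cell.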
Rousseeuw, P.J., Van den Bossche, W. (2018). Detecting Deviating Data Cells. Technometrics, 60(2), 135-145.
Raymaekers, J., Rousseeuw, P.J. (2019). Fast robust correlation for high dimensional data. Technometrics, published online.
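library(cellWise)
library(MASS); set.seed(12345)
n <- 50; d <- 20
# Covariance matrix with unit variances and all correlations equal to 0.9
A <- matrix(0.9, d, d); diag(A) <- 1
x <- mvrnorm(n, rep(0, d), A)
# Contaminate random cells: 50 draws each for NAs, large values and small values
x[sample(1:(n * d), 50, FALSE)] <- NA
x[sample(1:(n * d), 50, FALSE)] <- 10
x[sample(1:(n * d), 50, FALSE)] <- -10
# Prepend a column of case numbers, which the cleaning step should remove
x <- cbind(1:n, x)
DDCx <- DDC(x)
cellMap(DDCx$remX, DDCx$stdResid,
        columnlabels = 1:d, rowlabels = 1:n)
# For more examples, we refer to the vignette:
vignette("DDC_examples")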