A standard tool for detecting multivariate outliers is the Mahalanobis distance. It helps to judge the plausibility of a measurement given the values of the other measurements. In this approach, the Mahalanobis distance is itself treated as a univariate measure, and the same rules are applied to it as for the identification of univariate outliers:
the classical approach from Tukey: values more than \(1.5 \cdot IQR\) below the 1st quartile (\(Q_{25}\)) or above the 3rd quartile (\(Q_{75}\)) are considered outliers.
the \(6 \cdot \sigma\) approach, i.e., any value of the Mahalanobis distance outside the interval \(\bar{x} \pm 3 \cdot \sigma\) is considered an outlier.
the approach from Hubert for skewed distributions, which is implemented in the R package robustbase.
a completely heuristic approach named \(\sigma\)-gap.
For further details, please see the vignette on univariate outliers.
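The first two rules can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the helper names `tukey_flag` and `sixsigma_flag` and the example vector `d` are assumptions made for this sketch, and the Hubert and \(\sigma\)-gap rules are left out.

```r
# Tukey rule: flag values more than 1.5 * IQR below Q25 or above Q75.
# (hypothetical helper, not part of the package)
tukey_flag <- function(x) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  as.integer(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}

# Six-sigma rule: flag values outside mean +/- 3 standard deviations.
# (hypothetical helper, not part of the package)
sixsigma_flag <- function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- stats::sd(x, na.rm = TRUE)
  as.integer(x < m - 3 * s | x > m + 3 * s)
}

# Invented distances: eleven unremarkable values and one extreme value.
d <- c(1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.0, 1.1, 0.9, 1.2, 1.0, 9.0)
flags <- data.frame(d = d, tukey = tukey_flag(d), sixsigma = sixsigma_flag(d))
print(flags)
```

Both helpers return 0/1 codes so that the result has the same shape as the flag columns described for the output of the function.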
acc_multivariate_outlier(
resp_vars,
id_vars = NULL,
label_col,
n_rules = 4,
study_data,
meta_data
)
a list with:
SummaryTable: data.frame underlying the plot
SummaryPlot: ggplot2 outlier plot
FlaggedStudyData: data.frame that contains the original data frame with the additional columns tukey, sixsigma, hubert, and sigmagap. Every observation is coded 0 if no outlier was detected in the respective column and 1 if an outlier was detected. This can be used to exclude observations with outliers.
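As an illustration of how the flag columns might be used to exclude observations, the sketch below builds a small mock of FlaggedStudyData by hand. The columns `id` and `sbp`, all values, and the threshold of four violated rules are assumptions for this example; only the four flag column names are taken from the description above.

```r
# Mock of the FlaggedStudyData element (all values invented for illustration).
flagged <- data.frame(
  id       = 1:5,
  sbp      = c(120, 118, 250, 125, 119),
  tukey    = c(0, 0, 1, 1, 0),
  sixsigma = c(0, 0, 1, 0, 0),
  hubert   = c(0, 0, 1, 0, 0),
  sigmagap = c(0, 0, 1, 0, 0)
)

rule_cols <- c("tukey", "sixsigma", "hubert", "sigmagap")

# Count how many of the four rules flagged each observation.
n_violated <- rowSums(flagged[, rule_cols])

# Keep observations flagged by fewer than four rules
# (the threshold is a choice made for this example).
cleaned <- flagged[n_violated < 4, ]
print(cleaned)
```

A lower threshold removes more observations; in this mock, observation 4 is flagged by the Tukey rule only and would be dropped with a threshold of one.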
resp_vars: variable list the names of the continuous measurement variables
id_vars: variable optional, an ID variable of the study data. If not specified, row numbers are used.
label_col: variable attribute the name of the column in the metadata with labels of variables
n_rules: numeric from=1 to=4. the number of rules that must be violated for an observation to be classified as an outlier
study_data: data.frame the data frame that contains the measurements
meta_data: data.frame the data frame that contains metadata attributes of study data
The implementation is restricted to variables of type float
Missing codes are removed from the study data (if defined in the metadata)
The covariance matrix is estimated for all resp_vars
The Mahalanobis distance of each observation is calculated: \(MD^2_i = (x_i - \mu)^T \Sigma^{-1} (x_i - \mu)\)
The four rules mentioned above are applied to this distance for each observation in the study data
An output data frame is generated that flags each outlier
A parallel coordinate plot indicates the respective outliers
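The covariance and distance steps above can be sketched in base R. This is an illustration under stated assumptions, not the package's code: the simulated matrix `x`, the planted extreme row, and the use of the Tukey rule as the only rule shown are choices made for this example. Note that `stats::mahalanobis()` returns the squared distance \(MD^2_i\), matching the formula above.

```r
set.seed(1)

# Simulated measurements for two variables (invented data).
x <- cbind(a = rnorm(200), b = rnorm(200))
x[1, ] <- c(6, -6)  # plant one clearly implausible combination

# Estimate the mean vector and covariance matrix.
mu <- colMeans(x)
sigma <- stats::cov(x)

# Squared Mahalanobis distance of each observation.
md2 <- stats::mahalanobis(x, center = mu, cov = sigma)

# The same quantity written out as (x_i - mu)^T Sigma^{-1} (x_i - mu).
centered <- sweep(x, 2, mu)
md2_manual <- rowSums((centered %*% solve(sigma)) * centered)

# Apply one univariate rule (Tukey) to the distances.
q <- stats::quantile(md2, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
tukey <- as.integer(md2 < q[1] - 1.5 * iqr | md2 > q[2] + 1.5 * iqr)

flagged <- data.frame(x, md2 = md2, tukey = tukey)
print(head(flagged[order(-flagged$md2), ]))
```

Because the distances are non-negative and right-skewed, the lower Tukey fence rarely triggers; it is kept here only to mirror the two-sided rule as stated.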
List function.