fastmcd(x, cor = FALSE, print.it = TRUE, quan = floor((n + p + 1)/2), ntrial = 500)
Arguments

cor
if cor = TRUE, the estimated correlation matrix will be returned as well.

print.it
if print.it = TRUE, information about the method will be printed.

quan
the number of observations used to compute the estimates. The default is floor((n+p+1)/2), where n is the number of observations and p is the number of variables.

Value

an object of class "mcd" with components:

quan
the default is floor((n+p+1)/2), where n is the number of observations and p the number of variables.

cor
the estimated correlation matrix; returned only if cor = TRUE.

If print.it = TRUE, a message is printed.

Breakdown

The breakdown point of the estimator is (n-quan)/n, which is about 50% for the default quan. That is, the estimate cannot be made arbitrarily bad without changing about half of the data. A covariance matrix is considered to be arbitrarily bad if some eigenvalue goes to infinity or to zero (singular matrix). This is analogous to a univariate scale estimate, which breaks down if the estimate goes either to infinity or to zero.

References

Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association, 79, 871-881.
Rousseeuw, P. J. (1985). Multivariate estimation with high breakdown point. In Mathematical Statistics and Applications. W. Grossmann, G. Pflug, I. Vincze and W. Wertz, eds. Reidel: Dordrecht, 283-297.
Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. Wiley-Interscience, New York. [Chapter 7]
Rousseeuw, P. J. and van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85, 633-639.
Details

Let n be the number of observations and p be the number of variables. The minimum covariance determinant (MCD) estimate is given by the subset of quan observations whose covariance matrix has the smallest determinant. The MCD location estimate is then the mean of those quan points, and the MCD scatter estimate is their covariance matrix. The default value of quan is floor((n+p+1)/2), but the user may choose a larger number. For multivariate data sets, finding the exact estimate takes too much time, so an approximation is computed. A full description of the present algorithm can be found in Rousseeuw and Van Driessen (1997). Major advantages of this algorithm are its precision and the fact that it can deal with very large n.

Although the raw minimum covariance determinant estimate has a high breakdown value, its statistical efficiency is low. A better finite-sample efficiency can be attained, while retaining the high breakdown value, by computing a weighted mean and covariance estimate with weights based on the MCD estimate. By default, fastmcd returns both the raw MCD estimate and the weighted estimate.
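The definition above can be illustrated with a small brute-force computation. This is not the fastmcd algorithm (which uses a fast approximation), just a minimal NumPy sketch of the exact MCD criterion on a tiny data set; the helper name brute_force_mcd is made up for this example.

```python
# Illustrative brute-force MCD (a sketch, not the fastmcd code): among all
# subsets of size quan, pick the one whose covariance matrix has the
# smallest determinant; its mean and covariance are the raw MCD location
# and scatter estimates.
from itertools import combinations
import numpy as np

def brute_force_mcd(x, quan):
    x = np.asarray(x, dtype=float)
    n, p = x.shape
    best_det, best = np.inf, None
    for idx in combinations(range(n), quan):
        sub = x[list(idx)]
        det = np.linalg.det(np.cov(sub, rowvar=False))
        if det < best_det:
            best_det, best = det, sub
    return best.mean(axis=0), np.cov(best, rowvar=False)

# A tiny example: nine well-behaved points plus one gross outlier.
rng = np.random.default_rng(0)
clean = rng.normal(size=(9, 2))
data = np.vstack([clean, [100.0, 100.0]])
n, p = data.shape
quan = (n + p + 1) // 2          # floor((n + p + 1)/2) = 6 here
center, cov = brute_force_mcd(data, quan)
print(center)                    # near the clean cloud, unaffected by the outlier
```

Because every subset containing the outlier has an enormous determinant, the winning subset consists only of clean points, which is exactly why the estimator resists contamination.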
Multivariate outliers can be found by means of the robust distances, as described in Rousseeuw and Leroy (1987) and in Rousseeuw and Van Zomeren (1990). These distances can be calculated by the function mahalanobis, and plotted by applying plot.mcd to a "mcd" object. It is suggested that the number of observations be at least five times the number of variables; with fewer observations, there is not enough information to determine whether outliers exist.
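The robust-distance idea can be sketched outside R as well. The following is a hedged NumPy illustration (not the R mahalanobis function): the stand-ins for the robust center and scatter are crude placeholders for what fastmcd would return, chosen only so the example is self-contained.

```python
import numpy as np

def mahalanobis_sq(x, center, cov):
    # Squared Mahalanobis distance of each row of x from center.
    d = x - center
    return np.einsum('ij,jk,ik->i', d, np.linalg.inv(cov), d)

# Assume `center` and `cov` are robust (e.g. MCD-like) estimates; the
# outlier then gets a much larger robust distance than the clean points.
rng = np.random.default_rng(1)
clean = rng.normal(size=(20, 2))
data = np.vstack([clean, [10.0, 10.0]])
center = np.median(data, axis=0)      # crude robust-location stand-in
cov = np.cov(clean, rowvar=False)     # stand-in for the MCD scatter
rd = np.sqrt(mahalanobis_sq(data, center, cov))
print(rd[-1] > rd[:-1].max())         # True: the outlier stands out
```

With a classical (non-robust) center and scatter, the outlier would inflate the estimates and could mask itself; using robust estimates is what makes the distances diagnostic.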
An important advantage of the present algorithm is that it allows for exact fit situations, where more than quan observations lie on a hyperplane. Then the program still yields the MCD location and scatter matrix, the latter being singular (as it should be), as well as the equation of the hyperplane.
If the classical covariance matrix of the data is already singular, all observations lie on a hyperplane. Then fastmcd will give a message and the equation of the hyperplane. The MCD estimates are then equal to the classical estimates. In this case, you will need to modify your data before applying fastmcd, perhaps by using princomp and deleting columns with zero variance.
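The singular-covariance case can be checked directly. A minimal NumPy sketch (an assumption-laden illustration, not what fastmcd prints): when all points lie on a hyperplane, the classical covariance matrix has a zero eigenvalue, and the corresponding eigenvector is the normal direction of that hyperplane.

```python
import numpy as np

# All four points lie on the line y = x, so the covariance matrix is singular.
x = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
cov = np.cov(x, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
print(np.isclose(eigvals[0], 0.0))       # True: singular covariance
normal = eigvecs[:, 0]                   # normal vector of the hyperplane,
                                         # proportional to (1, -1) here
```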
For univariate data sets, the exact algorithm location.lts is used. See the location.lts help file for more information.
See Also

covRob.

Examples

data(stack.dat)
fastmcd(stack.dat)