pyinit: PY (Pena-Yohai) initial estimates for S-estimates of regression

Description

Computes the PY initial estimates for S-estimates of regression.

Usage

pyinit(
  x,
  y,
  intercept = TRUE,
  delta = 0.5,
  cc,
  maxit = 10,
  psc_keep,
  resid_keep_method = c("threshold", "proportion"),
  resid_keep_prop,
  resid_keep_thresh,
  eps = 1e-08,
  mscale_maxit = 200,
  mscale_tol = eps,
  mscale_rho_fun = c("bisquare", "huber", "gauss")
)

Arguments

a matrix with the data, each observation in a row.

the response vector.

intercept

logical, should an intercept be included in the model? Defaults to TRUE.

delta, cc

parameters for the M-scale estimator equation. If cc is missing it will be set to yield consistency under the Normal model for the given delta (right-hand side of the M-scale equation).

maxit

the maximum number of iterations to perform.

psc_keep

proportion of observations to keep based on PSCs.

resid_keep_method

how to clean the data based on large residuals. If "threshold", all observations with scaled residuals larger than resid_keep_thresh will be removed (resid_keep_thresh corresponds to the constant \(C_1\) from equation (21) in Pena & Yohai (1999). If "proportion", observations with the largest resid_keep_prop residuals will be removed.

resid_keep_prop, resid_keep_thresh

see parameter resid_keep_method for details.

eps

the relative tolerance for convergence. Defaults to 1e-8.

mscale_maxit

maximum number of iterations allowed for the M-scale algorithm. Defaults to 200.

mscale_tol

convergence threshold for the m-scale

mscale_rho_fun

A string containing the name of the rho function to use for the M-scale. Valid options are bisquare, huber and gauss.

Value

coefficients

numeric matrix with coefficient vectors in columns. These are regression estimators based on "cleaned" subsets of the data. The M-scales of the corresponding residuals are returned in the entry objective. The regression coefficients with smallest estimated residual scale is in the first column, but the others need not be ordered.

objective

vector of values of the M-scale estimate of the residuals associated with each vector of regression coefficients in the columns of coefficients.

References

Pena, D., & Yohai, V.. (1999). A Fast Procedure for Outlier Diagnostics in Large Regression Problems. Journal of the American Statistical Association, 94(446), 434-445. <doi:10.2307/2670164>

Examples

Run this code

# NOT RUN {
# generate a simple synthetic data set for a linear regression model
# with true regression coefficients all equal to one "(1, 1, 1, 1, 1)"
set.seed(123)
x <- matrix(rnorm(100*4), 100, 4)
y <- rnorm(100) + rowSums(x) + 1
# add masked outliers
a <- svd(var(x))$v[,4]
x <- rbind(x, t(outer(a, rnorm(20, mean=4, sd=1))))
y <- c(y, rnorm(20, mean=-2, sd=.2))

# these outliers are difficult to find
plot(lm(y~x), ask=FALSE)

# use pyinit to obtain estimated regression coefficients
tmp <- pyinit(x=x, y=y, resid_keep_method='proportion', psc_keep = .5, resid_keep_prop=.5)
# the vector of regression coefficients with smallest residuals scale
# is returned in the first column of the "coefficients" element
tmp$coefficients[,1]
# compare that with the LS estimator on the clean data
coef(lm(y~x, subset=1:100))
# compare it with the LS estimator on the full data
coef(lm(y~x))


# }

Run the code above in your browser using DataLab