
pense (version 1.2.1)

enpy: PY (Peña-Yohai) initial estimates for EN S-estimators

Description

Computes the Peña-Yohai (PY) initial estimates for the EN S-estimator, with different strategies for computing the principal sensitivity components (PSCs).

Usage

enpy(x, y, alpha, lambda, delta, cc, options = initest_options(),
  en_options = en_options_aug_lars())

Arguments

x

data matrix with predictors.

y

response vector.

alpha, lambda

EN penalty parameters (NOT adjusted for the number of observations in x).

delta

desired breakdown point of the resulting estimator.

cc

tuning constant for the S-estimator. The default is chosen based on the breakdown point delta and should never have to be changed.

options

additional options for the initial estimator. See initest_options for details.

en_options

additional options for the EN algorithm. See en_options for details.

Value

coeff

A numeric matrix with one initial coefficient estimate per column.

objF

A vector of objective function values, one for each coefficient column.
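
For example, assuming a result init returned by enpy with x, y, and the penalty parameters set up as in the Examples below, the best candidate can be selected by minimizing the objective function (a minimal sketch, not taken from the package documentation):

init <- enpy(x, y, alpha = 0.8, lambda = 0.1, delta = 0.25)
best <- which.min(init$objF)     # index of the candidate with the smallest objective
beta_init <- init$coeff[, best]  # the corresponding initial coefficient vector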

Details

Two different methods for computing the PSCs are implemented:

"rr"

Approximate the PSCs using the residuals from the elastic net fit and the hat matrix from the ridge regression (see the sketch after this list). This method only works if alpha < 1 or ncol(x) < nrow(x).

"exact"

Calculate the PSCs from the difference between the residuals and leave-one-out residuals from elastic net.
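
The appeal of the "rr" method is that, for a ridge fit, all leave-one-out (LOO) residuals follow from the hat matrix in a single linear-algebra step. The following self-contained base-R sketch (not using pense; the ridge penalty lambda_ridge is a hypothetical value chosen for illustration) shows the identity behind this shortcut:

set.seed(1)
n <- 30; p <- 5; lambda_ridge <- 0.5
X <- matrix(rnorm(n * p), ncol = p)
y <- drop(X %*% rnorm(p)) + rnorm(n)
# Ridge hat matrix H = X (X'X + lambda * I)^{-1} X'
H <- X %*% solve(crossprod(X) + lambda_ridge * diag(p), t(X))
r <- y - drop(H %*% y)        # residuals of the ridge fit
r_loo <- r / (1 - diag(H))    # all LOO residuals at once, without refitting

The "exact" method instead recomputes the elastic net fit with each observation left out, which is why an algorithm supporting warm starts (such as DAL, see the Examples) is usually preferable there.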

References

Peña, D., and Yohai, V.J. (1999). A Fast Procedure for Outlier Diagnostics in Large Regression Problems. Journal of the American Statistical Association, 94(446), 434-445. http://doi.org/10.2307/2670164

Examples

# NOT RUN {
set.seed(12345)
n <- 50
p <- 15
beta <- c(2:5, numeric(p - 4))
x <- 1 + matrix(rnorm(n * p), ncol = p)
y <- x %*% beta + rnorm(n)

# add outliers to y
y[1:5] <- rnorm(5, -100, 0.5)

x_test <- matrix(rnorm(10 * n * p), ncol = p)
y_test <- drop(x_test %*% beta + rnorm(10 * n))

# Compute the PY using the "exact" solutions. Here, the LOO residuals
# are computed using the EN estimator directly. For this method,
# the DAL algorithm is usually faster since it naturally supports
# "warm" starts from a nearby solution vector.
init_exact <- enpy(
    x, y,
    alpha = 0.8,
    lambda = 0.1,
    delta = 0.25,
    options = initest_options(
        psc_method = "exact"
    ),
    en_options = en_options_dal()
)

# Compute the PY using the approximate ("rr") solutions. Here, the LOO residuals
# are approximated by ridge regression, where all the LOO residuals can be
# computed at once using linear algebra.
init_approx <- enpy(
    x, y,
    alpha = 0.8,
    lambda = 0.1,
    delta = 0.25,
    options = initest_options(
        psc_method = "rr"
    ),
    en_options = en_options_aug_lars()
)

# How do the objective functions of the initial estimates compare?
min(init_exact$objF)
min(init_approx$objF)
# The "exact" method finds slightly estimates with slightly lower objective.

# How do the initial estimates perform on an external test set?
rmspe_exact <- sqrt(colMeans((y_test - cbind(1, x_test) %*% init_exact$coeff)^2))
rmspe_approx <- sqrt(colMeans((y_test - cbind(1, x_test) %*% init_approx$coeff)^2))

min(rmspe_exact)
min(rmspe_approx)
# The "exact" method also has better predictive power
# }
