enetLTS (version 0.1.0)

enetLTS: Robust and sparse estimation for linear and logistic regression

Description

Compute fully robust versions of the elastic net estimator, which allow for sparse model estimates, for linear and logistic regression.

Usage

enetLTS(xx, yy, family=c("gaussian","binomial"),
alphas, lambdas, lambdaw, hsize=0.75,
intercept=TRUE, nsamp=500, s1=10, nCsteps=20, nfold=5,
seed=NULL, plot=TRUE, repl=5, para=TRUE, ncores=1,
del=0.0125, tol=-1e6, scal=TRUE, type=c("response","class"))

Arguments

xx

a numeric matrix containing the predictor variables.

yy

response variable. Quantitative for family="gaussian". For family="binomial", it should be a factor with two levels coded as 0 and 1.

family

a description of the error distribution and link function to be used in the model. "gaussian" and "binomial" options are available.

alphas

a user-supplied alpha sequence for the elastic net penalty, i.e., the mixing proportion of the ridge and lasso penalties, which takes values in [0,1]. \(\alpha=1\) gives the lasso penalty and \(\alpha=0\) the ridge penalty. If no sequence is provided, the default is 41 equally spaced values in [0,1].
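
For illustration, an alpha grid equivalent to the default described above could be supplied explicitly (a minimal sketch; the object name alphas simply mirrors the argument name):

## 41 equally spaced values of the mixing parameter in [0, 1]
alphas <- seq(0, 1, length.out = 41)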

lambdas

a user-supplied lambda sequence for the strength of the elastic net penalty. If no sequence is provided, the default is chosen with steps of size 0.025 lambda0 over the range \(0\le\lambda\le lambda0\) for linear regression, and with steps of size 0.025 lambda00 over \(0\le\lambda\le lambda00\) for logistic regression. For linear regression, lambda0 is determined from the Pearson correlations between y and each predictor variable x_j computed on winsorized data. For logistic regression, lambda00 replaces the Pearson correlation by a robustified point-biserial correlation.
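
As a sketch only (not the package's internal code), a lambda grid of this form for family="gaussian" could be built from the lambda0 function in the robustHD package, as also indicated in the Examples below; x and y are assumed to be the predictor matrix and response:

## sketch: lambda grid with steps of size 0.025*lambda0 (x and y assumed given)
l0      <- robustHD::lambda0(x, y)       # robust analogue of the largest useful lambda
lambdas <- seq(l0, 0, by = -0.025 * l0)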

lambdaw

a user-supplied lambda sequence for the reweighting step. If not provided, the default is computed by k-fold cross-validation via the cv.glmnet function.
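
For example, a value for lambdaw could be obtained by k-fold cross-validation with cv.glmnet from the glmnet package; this is only a sketch of the idea, with x, y and a chosen alpha assumed to be available:

## sketch: choose a reweighting lambda by 5-fold CV (x, y and alpha assumed given)
library(glmnet)
cvfit   <- cv.glmnet(x, y, family = "gaussian", alpha = 0.5, nfolds = 5)
lambdaw <- cvfit$lambda.min              # lambda minimizing the CV error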

hsize

a user-supplied numeric value giving the proportion of observations for which the elastic net penalized sum of squared residuals (for linear regression) or the elastic net penalized sum of deviances (for logistic regression) should be minimized. The default is 0.75.

intercept

a logical indicating whether a constant term should be included in the model (the default is TRUE).

scal

a logical indicating whether the predictors should be scaled by their arithmetic means and standard deviations. For family="gaussian", it also indicates whether the response variable should be mean-centered. The default is TRUE. Note that scaling is performed on the subsamples rather than on the full data set.

nsamp

a positive integer giving the number of initial subsamples to be used at the beginning of the algorithm. The default is to first perform C-steps on 500 initial subsamples and then to keep the s1 subsamples with the lowest (or highest, depending on whether family="gaussian" or family="binomial" is used) value of the objective function for additional C-steps until convergence.

s1

the number of subsamples to keep after performing C-steps on the nsamp initial subsets. For these remaining subsets, additional C-steps are performed until convergence. The default is 10.

nCsteps

a positive integer giving the number of C-steps to perform on the s1 retained subsamples (a conceptual sketch of a single C-step is given below). The default is 20.
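
The following is a conceptual sketch of one C-step for family="gaussian" only, not the package's internal implementation; glmnet is used purely for illustration, and x, y, a current subset idx, and values h, alpha and lambda are assumed to exist. The elastic net is fitted on the current subset, residuals are computed for all observations, and the h observations with the smallest squared residuals form the next subset:

## conceptual C-step sketch (family = "gaussian")
library(glmnet)
cstep <- function(x, y, idx, h, alpha, lambda) {
  fit <- glmnet(x[idx, ], y[idx], alpha = alpha, lambda = lambda)
  res <- y - predict(fit, newx = x, s = lambda)   # residuals on all observations
  order(res^2)[1:h]                               # keep the h smallest squared residuals
}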

nfold

a user-supplied number of folds for the k-fold cross-validation used in various parts of the algorithm. The default is 5-fold cross-validation.

seed

optional initial seed for the random number generator (see .Random.seed), used when determining the initial subsets at the beginning of the algorithm. The default is NULL.

plot

a logical indicating whether to produce a plot of the k-fold cross-validation results over the alpha and lambda combinations. The default is TRUE.

repl

a user-supplied positive integer giving the number of times the k-fold CV is repeated; the corresponding evaluation measures are averaged for more stable results. The default is 5.

para

a logical; if TRUE, each fold is fitted in parallel. A parallel backend (such as doMC) must be registered beforehand.
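
For instance, a backend could be registered with the doParallel package before calling enetLTS with para=TRUE; this is only one possible setup, and doMC can be used analogously on Unix-alikes:

## sketch: register a parallel backend before calling enetLTS with para = TRUE
library(doParallel)
registerDoParallel(cores = 2)   # or doMC::registerDoMC(2)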

ncores

a positive integer giving the number of processor cores to be used for parallel computing (the default is 1, i.e., no parallelization). If this is set to NA, all available processor cores are used. For prediction error estimation, parallel computing is implemented at the R level using the package parallel.

del

the constant used in the binary weight function for outlier detection (see Details); under the normal model, about 2.5% of the observations are flagged as outliers. The default is 0.0125.

tol

a small numeric value used as the convergence tolerance. The default is -1e6.

type

type of prediction required. type="response" gives the fitted probabilities for family="binomial" and the fitted values for family="gaussian". type="class" is available only for family="binomial" and produces the class label corresponding to the maximum probability.

Value

objective

a numeric vector giving the respective values of the enetLTS objective function, i.e., the elastic net penalized sums of the \(h\) smallest squared residuals from the raw fits for family="gaussian" and the elastic net penalized sums of the \(h\) deviances from the raw fits for family="binomial".

best

an integer vector containing the respective best subsets of \(h\) observations found and used for computing the raw estimates.

raw.wt

an integer vector containing binary weights that indicate outliers from the respective raw fits, i.e., the weights used for the reweighted fits.

wt

an integer vector containing binary weights that indicate outliers from the respective reweighted fits, i.e., the weights are \(1\) for observations with reasonably small reweighted residuals and \(0\) for observations with large reweighted residuals.

a00

intercept term obtained from the raw fit.

raw.coefficients

a numeric vector containing the respective coefficient estimates from the raw fit.

a0

intercept term obtained from the reweighted fit.

coefficients

a numeric vector containing the respective coefficient estimates from the reweighted fit.

alpha

an optimal elastic net mixing parameter value obtained with k-fold cross-validation.

lambda

an optimal value for the strength of the elastic net penalty obtained with k-fold cross-validation.

lambdaw

an optimal value for the strength of the elastic net penalty, re-obtained with k-fold cross-validation for the reweighted fit.

num.nonzerocoef

the number of nonzero coefficients in the model.

h

the number of observations used to compute the raw estimates.

raw.residuals

a numeric vector containing the respective residuals from the raw fits.

residuals

a numeric vector containing the respective residuals from the reweighted fits.

raw.fitted.values

a numeric vector containing the respective fitted values of the response from the raw fits.

fitted.values

a numeric vector containing the respective fitted values of the response from the reweighted fits.

raw.rmse

root mean squared error of the raw fit, available only for family="gaussian".

rmse

root mean squared error of the reweighted fit, available only for family="gaussian".

classnames

class names for the logistic model, available only for family="binomial".

classize

class sizes for the logistic model, available only for family="binomial".

inputs

all input arguments used in the call to enetLTS.

call

the matched function call.

Details

The idea of repeatedly applying the non-robust classical elastic net estimator to data subsets only is used for both linear and logistic regression. The algorithm starts with 500 elemental subsets for one combination of \(\alpha\) and \(\lambda\) only, and uses a warm-start strategy for subsequent combinations, which saves computation time. The elastic net penalty parameters are chosen by k-fold cross-validation, and a replication option is provided for more stable results. Robustness is achieved by trimming, and a reweighting step is then introduced in order to improve efficiency. Outliers are identified according to the current model: for family="gaussian", standardized residuals are used, while for family="binomial", Pearson residuals, which are approximately standard normally distributed, are used. The weights are then defined by a binary weight function with del=0.0125, which allows 2.5% of the observations to be flagged as outliers under the normal model. The binary weight function therefore produces a clear distinction between the "good observations" and the outliers.
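
As an illustration of this binary weighting (a sketch only, not the package's internal code), weights could be computed from approximately standard normal residuals by cutting at the 1 - del quantile of the standard normal distribution; the vector stdres is assumed to contain these residuals:

## illustrative binary weights from standardized residuals (stdres assumed given)
del    <- 0.0125
cutoff <- qnorm(1 - del)                       # roughly 2.24
wt     <- as.integer(abs(stdres) <= cutoff)    # 1 = "good observation", 0 = outlier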

References

Kurnaz, F.S., Hoffmann, I. and Filzmoser, P. (2017) Robust and sparse estimation methods for high dimensional linear and logistic regression. Chemometrics and Intelligent Laboratory Systems.

See Also

print, predict, coef, nonzeroCoef.enetLTS, plot, plotCoef.enetLTS, plotResid.enetLTS, plotDiagnostic.enetLTS, residuals, fitted, weights

Examples

## for gaussian

set.seed(86)
n <- 100; p <- 25                             # number of observations and variables
beta <- rep(0,p); beta[1:6] <- 1              # first 6 coefficients are nonzero
sigma <- 0.5                                  # controls signal-to-noise ratio
x <- matrix(rnorm(n*p, sigma),nrow=n)
e <- rnorm(n,0,1)                             # error terms
eps <- 0.1                                    # contamination level
m <- ceiling(eps*n)                           # observations to be contaminated
eout <- e; eout[1:m] <- eout[1:m] + 10        # vertical outliers
yout <- c(x %*% beta + sigma * eout)          # response
xout <- x; xout[1:m,] <- xout[1:m,] + 10      # bad leverage points

fit <- enetLTS(xout,yout,alphas=0.5,lambdas=0.05,plot=FALSE)
# determine user supplied alpha and lambda sequences
# alphas=seq(0,1,length=11)
# l0 <- robustHD::lambda0(xout,yout)          # lambda0 function from robustHD package
# lambdas <- seq(l0,0,by=-0.1*l0)
# fit <- enetLTS(xout,yout,alphas=alphas,lambdas=lambdas)
## for binomial

eps <- 0.05                                    # 5% contamination, applied to class 0 only
m <- ceiling(eps*n)
y <- sample(0:1,n,replace=TRUE)
xout <- x
xout[y==0,][1:m,] <- xout[1:m,] + 10;          # class 0
yout <- y                                      # wrong classification for vertical outliers
fit <- enetLTS(xout,yout,family="binomial",alphas=0.5,lambdas=0.05,plot=FALSE)
# determine user supplied alpha and lambda sequences
# l00 <- lambda00(xout,yout,normalize=TRUE,intercept=TRUE)
# lambdas <-  seq(l00,0,by=-0.01*l00)
# fit <- enetLTS(xout,yout,family="binomial",alphas=alphas,lambdas=lambdas)