rlars: Robust least angle regression

Description

Robustly sequence candidate predictors according to their predictive content and find the optimal model along the sequence.

Usage

rlars(x, ...)

  ## S3 method for class 'formula':
rlars(formula, data, ...)

  ## S3 method for class 'default':
rlars(x, y, sMax = NA,
    centerFun = median, scaleFun = mad, winsorize = FALSE,
    pca = FALSE, const = 2, prob = 0.95, fit = TRUE,
    s = c(0, sMax), regFun = lmrob, regArgs = list(),
    crit = c("BIC", "PE"), splits = foldControl(),
    cost = rtmspe, costArgs = list(),
    selectBest = c("hastie", "min"), seFactor = 1,
    ncores = 1, cl = NULL, seed = NULL, model = TRUE,
    tol = .Machine$double.eps^0.5, ...)

Arguments

formula

a formula describing the full model.

data

an optional data frame, list or environment (or object coercible to a data frame by as.data.frame) containing the variables in the model. If not found in data, the variables are taken fro

x

a matrix or data frame containing the candidate predictors.

y

a numeric vector containing the response.

sMax

an integer giving the number of predictors to be sequenced. If it is NA (the default), predictors are sequenced as long as there are twice as many observations as predictors.
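The default stopping rule can be read as capping the sequence length at half the number of observations. The following is a minimal sketch of that reading, assuming the cap is floor(n / 2) and cannot exceed the number of candidate predictors p. The helper name default_smax is hypothetical and not part of the package.

```r
# Hypothetical helper (not a package function): default number of
# predictors to sequence, assuming "twice as many observations as
# predictors" means at most n / 2 predictors.
default_smax <- function(n, p) {
  min(p, floor(n / 2))
}

smax_wide   <- default_smax(n = 100, p = 25)  # all 25 predictors fit under the cap
smax_narrow <- default_smax(n = 40, p = 25)   # capped at 20 by the sample size
```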

centerFun

a function to compute a robust estimate for the center (defaults to median).

scaleFun

a function to compute a robust estimate for the scale (defaults to mad).
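To illustrate why robust center and scale estimates matter, the sketch below standardizes a small made-up vector with the defaults median and mad and compares the result with mean and sd. It is an illustration in base R only, not the package's internal standardization code.

```r
# Made-up data: five regular values and one gross outlier
x <- c(2.1, 1.9, 2.0, 2.2, 1.8, 50)

# Robust standardization with the default centerFun and scaleFun
z_robust <- (x - median(x)) / mad(x)

# Classical standardization for comparison
z_classic <- (x - mean(x)) / sd(x)

# The outlier inflates mean and sd, so it looks far less extreme
# on the classical scale than on the robust scale.
outlier_scores <- c(robust = z_robust[6], classic = z_classic[6])
```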

winsorize

a logical indicating whether to clean the full data set by multivariate winsorization, i.e., to perform data cleaning RLARS instead of plug-in RLARS (defaults to FALSE).

pca

a logical indicating whether a robust PCA step should be performed when computing the data cleaning weights for multivariate winsorization (defaults to FALSE). The distances of the observations are then computed on the PCA scores

const

numeric; tuning constant to be used in the initial correlation estimates based on adjusted univariate winsorization (defaults to 2).
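As a rough sketch of what a winsorization constant does, assume plain univariate winsorization: robustly standardized values are clipped to the interval [-const, const]. The adjusted variant used for the initial correlation estimates is more refined than this, so the function name winsorize_sketch and its logic are illustrative assumptions only.

```r
# Hypothetical helper (not a package function): plain univariate
# winsorization of robustly standardized values.
winsorize_sketch <- function(x, const = 2) {
  z <- (x - median(x)) / mad(x)   # robust standardization
  pmin(pmax(z, -const), const)    # clip to [-const, const]
}

x <- c(2.1, 1.9, 2.0, 2.2, 1.8, 50)  # made-up data with one outlier
z_wins <- winsorize_sketch(x)        # the outlier is pulled back to const
```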

prob

numeric; probability for the quantile of the $\chi^{2}$ distribution to be used in bivariate or multivariate winsorization (defaults to 0.95).
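The role of prob can be sketched as follows, assuming the bivariate case (2 degrees of freedom) and a known correlation matrix: points whose squared Mahalanobis distance exceeds the $\chi^{2}$ quantile are shrunk onto the boundary of the tolerance ellipse. The data and the correlation matrix R are made up, and the shrinkage step is a simplified illustration rather than the package's code.

```r
prob <- 0.95
cutoff <- qchisq(prob, df = 2)  # quantile for the bivariate case

# Made-up standardized observations (rows) and correlation matrix
z <- rbind(c(0.5, -0.3),   # regular point
           c(4.0,  4.0))   # outlying point
R <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)

# Squared Mahalanobis distances from the origin
d2 <- mahalanobis(z, center = c(0, 0), cov = R)

# Shrink only the points outside the tolerance ellipse
shrink <- pmin(1, sqrt(cutoff / d2))
z_wins <- z * shrink  # row i is multiplied by shrink[i]
d2_wins <- mahalanobis(z_wins, center = c(0, 0), cov = R)
```

With 2 degrees of freedom the quantile has the closed form -2 * log(1 - prob), which gives a quick sanity check on the cutoff.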

fit

a logical indicating whether to fit submodels along the sequence (TRUE, the default) or to simply return the sequence (FALSE).

s

an integer vector of length two giving the first and last step along the sequence for which to compute submodels. The default is to start with a model containing only an intercept (step 0) and iteratively add all variables along the sequence

regFun

a function to compute robust linear regressions along the sequence (defaults to lmrob).

regArgs

a list of arguments to be passed to regFun.

crit

a character string specifying the optimality criterion to be used for selecting the final model. Possible values are "BIC" for the Bayes information criterion and "PE" for resampling-based prediction error estimation.

splits

an object giving data splits to be used for prediction error estimation (see perry).

cost

a cost function measuring prediction loss (see perry for some requirements). The default is to use the root trimmed mean squared prediction error (see cos

costArgs

a list of additional arguments to be passed to the prediction loss function cost.
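For intuition about the default cost, here is a minimal sketch of a root trimmed mean squared prediction error in base R. It assumes that a fraction trim of the largest squared errors is discarded before averaging; the function name rtmspe_sketch, the argument name trim and its value 0.25 are assumptions made for illustration, not a statement of the actual implementation.

```r
# Hypothetical helper (not a package function): root trimmed mean
# squared prediction error, discarding the largest squared errors.
rtmspe_sketch <- function(y, yHat, trim = 0.25) {
  sq_err <- sort((y - yHat)^2)       # squared errors, ascending
  n <- length(sq_err)
  h <- n - floor(n * trim)           # number of errors to keep
  sqrt(mean(sq_err[seq_len(h)]))
}

# Made-up response and predictions: six small errors, two large ones
y    <- c(1, 2, 3, 4, 5, 6, 7, 8)
yHat <- y + c(1, -1, 1, -1, 1, -1, 10, -10)

loss_trimmed   <- rtmspe_sketch(y, yHat)            # ignores the two large errors
loss_untrimmed <- rtmspe_sketch(y, yHat, trim = 0)  # ordinary root mean squared error
```

Trimming keeps a few badly predicted outliers from dominating the loss, which is the point of using a robust cost function with a robust fit.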

selectBest,seFactor

arguments specifying a criterion for selecting the best model (see perrySelect). The default is to use a one-standard-error rule.
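A one-standard-error rule can be sketched as follows, assuming the models are ordered from most parsimonious to most complex: find the model with the smallest prediction error, then pick the simplest model whose error is within seFactor standard errors of that minimum. The prediction errors, standard errors and the helper name one_se_rule are made up for illustration.

```r
# Hypothetical helper (not a package function): one-standard-error rule
# for models ordered from most parsimonious to most complex.
one_se_rule <- function(pe, se, seFactor = 1) {
  i_min <- which.min(pe)                       # model with smallest error
  threshold <- pe[i_min] + seFactor * se[i_min]
  # simplest model, no more complex than the minimizer, within the threshold
  which(pe[seq_len(i_min)] <= threshold)[1]
}

pe <- c(2.00, 1.20, 1.05, 1.00, 1.10)  # made-up prediction errors
se <- c(0.20, 0.10, 0.10, 0.10, 0.10)  # made-up standard errors

best_1se <- one_se_rule(pe, se)                # simpler than the minimizer
best_min <- one_se_rule(pe, se, seFactor = 0)  # reduces to the minimizer
```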

ncores

a positive integer giving the number of processor cores to be used for parallel computing (the default is 1 for no parallelization). If this is set to NA, all available processor cores are used. For fitting models along the sequ

cl

a parallel cluster for parallel computing as generated by makeCluster. This is preferred over ncores for tasks that are parallelized on the R level, in which case

seed

optional initial seed for the random number generator (see .Random.seed). This is useful because many robust regression functions (including lmro

model

a logical indicating whether the model data should be included in the returned object.

tol

a small positive numeric value. This is used in bivariate winsorization to determine whether the initial estimate from adjusted univariate winsorization is close to 1 in absolute value. In this case, bivariate winsorization would fail since

...

additional arguments to be passed down. For the default method, additional arguments to be passed down to robStandardize.

Value

If fit is FALSE, an integer vector containing the indices of the sequenced predictors. Else if crit is "PE", an object of class "perryRlars" (inheriting from classes "perrySeqModel" and "perryTuning", see perryTuning). It contains information on the prediction error criterion, and includes the final model as component finalModel. Otherwise an object of class "rlars" (inheriting from class "seqModel") with the following components:
active

an integer vector containing the indices of the sequenced predictors.

s

an integer vector containing the steps for which submodels along the sequence have been computed.

coefficients

a numeric matrix in which each column contains the regression coefficients of the corresponding submodel along the sequence.

fitted.values

a numeric matrix in which each column contains the fitted values of the corresponding submodel along the sequence.

residuals

a numeric matrix in which each column contains the residuals of the corresponding submodel along the sequence.

df

an integer vector containing the degrees of freedom of the submodels along the sequence (i.e., the number of estimated coefficients).

robust

a logical indicating whether a robust fit was computed (TRUE for "rlars" models).

scale

a numeric vector giving the robust residual scale estimates for the submodels along the sequence.

crit

an object of class "bicSelect" containing the BIC values and indicating the final model (only returned if argument crit is "BIC" and argument s indicates more than one step along the sequence).

muX

a numeric vector containing the center estimates of the predictors.

sigmaX

a numeric vector containing the scale estimates of the predictors.

muY

numeric; the center estimate of the response.

sigmaY

numeric; the scale estimate of the response.

x

the matrix of candidate predictors (if model is TRUE).

y

the response (if model is TRUE).

w

a numeric vector giving the data cleaning weights (if winsorize is TRUE).

call

the matched function call.

References

Khan, J.A., Van Aelst, S. and Zamar, R.H. (2007) Robust linear model selection based on least angle regression. Journal of the American Statistical Association, 102(480), 1289–1299.

Examples

## generate data
# example is not high-dimensional to keep computation time low
library("robustHD")  # provides rlars()
library("mvtnorm")   # provides rmvnorm()
set.seed(1234)  # for reproducibility
n <- 100  # number of observations
p <- 25   # number of variables
beta <- rep.int(c(1, 0), c(5, p-5))  # coefficients
sigma <- 0.5      # controls signal-to-noise ratio
epsilon <- 0.1    # contamination level
Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p))
x <- rmvnorm(n, sigma=Sigma)    # predictor matrix
e <- rnorm(n)                   # error terms
i <- 1:ceiling(epsilon*n)       # observations to be contaminated
e[i] <- e[i] + 5                # vertical outliers
y <- c(x %*% beta + sigma * e)  # response
x[i,] <- x[i,] + 5              # bad leverage points

## fit robust LARS model
rlars(x, y, sMax = 10)
