lmSubsets: All-Subsets Regression

Description

All-subsets regression for linear models estimated by ordinary least squares (OLS).

Usage

lmSubsets(formula, ...)
# S3 method for default
lmSubsets(formula, data, subset, weights, na.action, model = TRUE,
          x = FALSE, y = FALSE, contrasts = NULL, offset, ...)
# S3 method for matrix
lmSubsets(formula, y, intercept = TRUE, ...)
lmSubsets_fit(x, y, weights = NULL, offset = NULL, include = NULL,
              exclude = NULL, nmin = NULL, nmax = NULL,
              tolerance = 0, nbest = 1, ..., pradius = NULL)

Arguments

formula, data, subset, weights, na.action, model, x, y, contrasts, offset

Arguments of the standard formula interface; see lm.

intercept

Logical; if TRUE, an intercept term is included in the model.

include, exclude

Regressors to force into (include) or out of (exclude) the regression.

nmin, nmax

Minimum and maximum number of regressors.

tolerance

Approximation tolerance (vector, expanded if necessary); tolerance[j] is applied to subset models of size j.

nbest

Number of best subsets to compute for every subset size.

...

Forwarded to lmSubsets.default and lmSubsets_fit.

pradius

Preordering radius.

Value

An object of class "lmSubsets", i.e., a list with the following components:

nobs, nvar

Number of observations and number of variables, respectively.

intercept

TRUE if model has intercept term; FALSE otherwise.

include, exclude

Included, excluded regressors.

size

Subset sizes.

tolerance

Approximation tolerance (vector).

nbest

Number of best subsets.

submodel

Submodel information.

subset

Selected variables.

Further components include call, na.action, weights, offset, contrasts, xlevels, terms, mf, x, and y. See lm for more information.

Details

The lmSubsets generic provides various methods to conveniently specify the regressor and response variables. The standard formula interface (see lm) can be used, or the information can be extracted from an already fitted "lm" object. The regressor matrix and response variable can also be passed in directly (see Examples).

The call is forwarded to lmSubsets_fit, which provides a low-level matrix interface.

The nbest best subset models for every subset size are computed, where the "best" models are the models with the lowest residual sum of squares (RSS). The scope of the search can be limited to a range of subset sizes by setting nmin and nmax. A tolerance vector (expanded if necessary) may be specified to speed up the search, where tolerance[j] is the tolerance applied to subset models of size j.
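To make the notion of "the nbest best models per subset size" concrete, the following is a minimal brute-force sketch in base R. It is not how lmSubsets searches; it simply enumerates every subset and ranks by RSS. The helper best_subsets_rss and the use of the built-in mtcars data are assumptions made for illustration, and the sketch always adds an intercept that is not counted in the subset size.

```r
## Illustration only: exhaustive enumeration, ranked by RSS.
## `best_subsets_rss` is a hypothetical helper, not part of lmSubsets.
best_subsets_rss <- function(x, y, nbest = 1, nmin = 1, nmax = ncol(x)) {
  out <- list()
  for (k in nmin:nmax) {
    ## each column of `idx` holds the column indices of one size-k subset
    idx <- combn(ncol(x), k)
    rss <- apply(idx, 2, function(j) {
      ## intercept is always included and not counted in `k`
      fit <- lm.fit(cbind(1, x[, j, drop = FALSE]), y)
      sum(fit$residuals^2)
    })
    ## keep the `nbest` subsets with the lowest RSS for this size
    ord <- order(rss)[seq_len(min(nbest, length(rss)))]
    out[[length(out) + 1]] <- data.frame(
      size = k,
      best = seq_along(ord),
      rss  = rss[ord],
      vars = apply(idx[, ord, drop = FALSE], 2,
                   function(j) paste(colnames(x)[j], collapse = " + "))
    )
  }
  do.call(rbind, out)
}

x <- as.matrix(mtcars[, c("cyl", "disp", "hp", "wt", "qsec")])
y <- mtcars$mpg

## two best models for each subset size from 1 to 3
res <- best_subsets_rss(x, y, nbest = 2, nmin = 1, nmax = 3)
res
```

Exhaustive enumeration grows exponentially with the number of regressors, which is why restricting the size range or allowing a nonzero tolerance can matter in practice.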

Variables may be forced into or out of the regression by way of include and exclude, respectively.
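The effect of forcing variables in or out can be pictured as a restriction on the candidate subsets. The sketch below uses a hypothetical helper, candidate_subsets, written in base R; it is not the package's implementation, and it assumes that forced-in variables count toward the subset size. Variable names other than those used in the Examples are placeholders.

```r
## Hypothetical helper (not part of lmSubsets): list the candidate subsets
## of a given size when some variables are forced in and others forced out.
candidate_subsets <- function(vars, size, include = NULL, exclude = NULL) {
  stopifnot(all(include %in% vars), all(exclude %in% vars),
            !any(include %in% exclude))
  free <- setdiff(vars, c(include, exclude))  # variables that remain optional
  k <- size - length(include)                 # slots left after forced-in variables
  if (k < 0 || k > length(free)) return(list())
  if (k == 0) return(list(include))
  ## choose k of the free variables and prepend the forced-in ones
  lapply(combn(length(free), k, simplify = FALSE),
         function(j) c(include, free[j]))
}

## placeholder variable set; "hc" and "income" are illustrative names
vars <- c("nox", "so2", "hc", "income", "whitecollar")
cand <- candidate_subsets(vars, size = 3,
                          include = c("nox", "so2"),
                          exclude = "whitecollar")
cand
```

Every candidate contains the included variables and none of the excluded ones, so the search space shrinks accordingly.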

The extent to which variables are preordered is controlled with the pradius parameter.

A set of standard extractor functions for fitted model objects is available for objects of class "lmSubsets". See methods for more details.

The summary method can be called to obtain summary statistics.

References

Hofmann M, Gatu C, Kontoghiorghes EJ, Colubi A, Zeileis A (2020). lmSubsets: Exact Variable-Subset Selection in Linear Regression for R. Journal of Statistical Software, 93, 1–21. doi:10.18637/jss.v093.i03.

Hofmann M, Gatu C, Kontoghiorghes EJ (2007). Efficient Algorithms for Computing the Best Subset Regression Models for Large-Scale Problems. Computational Statistics & Data Analysis, 52, 16–29. doi:10.1016/j.csda.2007.03.017.

Gatu C, Kontoghiorghes EJ (2006). Branch-and-Bound Algorithms for Computing the Best Subset Regression Models. Journal of Computational and Graphical Statistics, 15, 139–156. doi:10.1198/106186006x100290.

Examples

# NOT RUN {
## load data (with logs for relative potentials)
data("AirPollution", package = "lmSubsets")


###################
##  basic usage  ##
###################

## canonical example: fit all subsets
lm_all <- lmSubsets(mortality ~ ., data = AirPollution, nbest = 5)
lm_all

## plot RSS and BIC
plot(lm_all)

## summary statistics
summary(lm_all)


############################
##  forced in-/exclusion  ##
############################

lm_force <- lmSubsets(lm_all, include = c("nox", "so2"),
                      exclude = "whitecollar")
lm_force


########################
##  matrix interface  ##
########################

## same as above
x <- as.matrix(AirPollution)

lm_mat <- lmSubsets(x, y = "mortality")
lm_mat
# }
