dredge: Automated model selection

Description

Generate a set of models with combinations (subsets) of terms in the global model, with optional rules for model inclusion.

Usage

dredge(global.model, beta = c("none", "sd", "partial.sd"), evaluate = TRUE,
    rank = "AICc", fixed = NULL, m.lim = NULL, m.min, m.max, subset,
    trace = FALSE, varying, extra, ct.args = NULL, ...)
# S3 method for model.selection
print(x, abbrev.names = TRUE, warnings = getOption("warn") != -1L, ...)

Arguments

global.model

a fitted ‘global’ model object. See ‘Details’ for a list of supported types.

beta

indicates whether and how the coefficient estimates should be standardized; must be one of "none", "sd" or "partial.sd". You can specify just the initial letter. "none" corresponds to unstandardized coefficients, "sd" and "partial.sd" to coefficients standardized by SD and partial SD, respectively. For backward compatibility, a logical value is also accepted: TRUE is equivalent to "sd" and FALSE to "none". See std.coef.

evaluate

whether to evaluate and rank the models. If FALSE, a list of unevaluated calls is returned.

rank

optional custom rank function (returning an information criterion) to be used instead of AICc, e.g. AIC, QAIC or BIC. See ‘Details’.

fixed

optional, either a one-sided formula or a character vector giving names of terms to be included in all models. See ‘Subsetting’.

m.lim, m.max, m.min

optionally, the limits c(lower, upper) for number of terms in a single model (excluding the intercept). An NA means no limit. See ‘Subsetting’. Specifying limits as m.min and m.max is allowed for backward compatibility.

subset

logical expression describing models to keep in the resulting set. See ‘Subsetting’.

trace

if TRUE or 1, all calls to the fitting function are printed before actual fitting takes place. If trace > 1, a progress bar is displayed.

varying

optionally, a named list describing the additional arguments to vary between the generated models. Item names correspond to the arguments, and each item provides a list of choices (i.e. list(arg1 = list(choice1, choice2, ...), ...)). Complex elements in the choice list (such as family objects) should be either named (uniquely) or quoted (unevaluated, e.g. using alist, see quote), otherwise the result may be visually unpleasant. See example in Beetle.

extra

optional additional statistics to include in the result, provided as functions, function names or a list of such (best if named or quoted). As with the rank argument, each function must accept a fitted model object as an argument and return (a value coercible to) a numeric vector. These can be, e.g., additional information criteria or goodness-of-fit statistics. The character strings "R^2" and "adjR^2" are treated in a special way, and will add a likelihood-ratio based R² and modified-R² respectively to the result (this is more efficient than using r.squaredLR directly).

x

a model.selection object, returned by dredge.

abbrev.names

should printed term names be abbreviated? (useful with complex models).

warnings

if TRUE, errors and warnings issued during the model fitting are printed below the table (only with pdredge). To permanently remove the warnings, set the object's attribute "warnings" to NULL.

ct.args

optional list of arguments to be passed to coefTable (e.g. dispersion parameter for glm affecting standard errors used in subsequent model averaging).

…

optional arguments for the rank function. Any can be an unevaluated expression, in which case any x within it will be substituted with the current model.

Value

An object of class c("model.selection", "data.frame"), being a data.frame, where each row represents one model. See model.selection.object for its structure.

Details

Models are fitted through repeated evaluation of a modified call extracted from the global.model (in a similar fashion as with update). This approach, while robust in that it can be applied to most model types, is not the most efficient and may be computationally intensive.

Note that the number of combinations grows exponentially with the number of predictors (2ⁿ for n predictors, fewer when interactions are present, see below).

The fitted model objects are not stored in the result. To get (a subset of) models, use get.models on the object returned by dredge.
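As a minimal sketch of retrieving fitted models (assuming the MuMIn package and its Cement data set are available, as in ‘Examples’), get.models fits the models selected by a subset expression that refers to columns of the selection table:

```r
library(MuMIn)

# Global model; na.action is set in the call so all sub-models share one data set
fm1 <- lm(y ~ ., data = Cement, na.action = na.fail)
dd <- dredge(fm1)

# Fit only the models within 4 AICc units of the best one
top <- get.models(dd, subset = delta < 4)

# 'top' is a list of fitted model objects, one per retained row
length(top)
formula(top[[1]])
```

The returned list can be passed on to other functions that expect fitted models, such as summary applied to a single element.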

For a list of model types that can be used as a global.model see list of supported models. Modelling functions not storing call in their result should be evaluated via the wrapper function created by updateable.

Information criterion

rank is found by a call to match.fun and may be specified as a function, a symbol, or a character string naming a function to be searched for from the environment of the call to dredge. The rank function must accept a model object as its first argument and always return a scalar.
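The two ways of supplying rank might look as follows. This is a sketch only: HQC is a hypothetical helper written for this illustration (a Hannan-Quinn style criterion built on AIC with a custom penalty), not a function provided by the package.

```r
library(MuMIn)

fm1 <- lm(y ~ ., data = Cement, na.action = na.fail)

# rank given as a character string naming an existing function
dd_bic <- dredge(fm1, rank = "BIC")

# rank given as a user-defined function (hypothetical helper):
# takes the model as its first argument and returns a single number
HQC <- function(x) AIC(x, k = 2 * log(log(nobs(x))))
dd_hqc <- dredge(fm1, rank = HQC)
```

Both tables contain the same set of models; only the criterion column and the resulting ordering differ.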

Interactions

By default, marginality constraints are respected, so “all possible combinations” include only those containing interactions with their respective main effects and all lower order terms. However, if global.model makes an exception to this principle (e.g. due to a nested design such as a / (b + d)), this will be reflected in the subset models.
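The effect of marginality on the number of candidate models can be checked by hand with base R alone. The sketch below assumes a hypothetical global model y ~ a * b, whose terms are a, b and a:b, and does not call dredge:

```r
# Every term is either in or out of a sub-model, so n terms give 2^n candidates
n_terms <- c(4, 10, 20)
data.frame(n_terms, n_models = 2^n_terms)

# Hypothetical global model y ~ a * b with terms a, b and a:b (column 'ab')
combos <- expand.grid(a = c(FALSE, TRUE), b = c(FALSE, TRUE), ab = c(FALSE, TRUE))
n_all <- nrow(combos)   # all 2^3 combinations, ignoring marginality

# Marginality: a:b may appear only together with both of its main effects
marginal <- combos[!combos$ab | (combos$a & combos$b), ]
n_marginal <- nrow(marginal)

c(unrestricted = n_all, marginality = n_marginal)
```

Of the eight unrestricted combinations, only five remain once the interaction is required to be accompanied by both main effects.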

Subsetting

There are three ways to constrain the resulting set of models: setting limits on the number of terms in a model with m.lim, binding term(s) to all models with fixed, and applying more complex rules with the subset argument. To be included in the selection table, the model formulation must satisfy all these conditions.

subset can take the form of either an expression or a matrix. The latter should be a lower triangular matrix with logical values, where columns and rows correspond to global.model terms. A value of subset["a", "b"] == FALSE will exclude any model containing both terms a and b. demo(dredge.subset) has examples of using the subset matrix in conjunction with correlation matrices to exclude models containing collinear predictors.
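A subset matrix built from pairwise correlations might be sketched as below. The 0.7 cutoff is an arbitrary choice for illustration, and the approach is a simplified version of what the demo shows rather than a copy of it:

```r
library(MuMIn)

fm1 <- lm(y ~ ., data = Cement, na.action = na.fail)

# TRUE where two predictors are weakly enough correlated to appear together
predictors <- c("X1", "X2", "X3", "X4")
smat <- abs(cor(Cement[, predictors])) <= 0.7

# Only the lower triangle is used; blank out the diagonal and upper triangle
smat[!lower.tri(smat)] <- NA
smat

# Models containing any pair marked FALSE are left out of the table
dd_nocollinear <- dredge(fm1, subset = smat)
```

Row and column names of the matrix are the term names, which is how the entries are matched to the terms of global.model.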

In the form of an expression, the argument subset acts in a similar fashion to that in the function subset for data.frames: model terms can be referred to by name as variables in the expression, with the difference being that they are interpreted as logical values (i.e. equal to TRUE if the term exists in the model).

There is also .(x) and .(+x) notation indicating, respectively, any and all interactions including a term x. It is only useful with marginality exceptions.

The expression can contain any of the global.model terms (getAllTerms(global.model) lists them), as well as names of the varying argument items. Names of global.model terms take precedence when identical to names of varying, so to avoid ambiguity varying variables in subset expression should be enclosed in V() (e.g. subset = V(family) == "Gamma" assuming that varying is something like list(family = c(..., "Gamma"))).

If item names in varying are missing, the items themselves are coerced to names. Call and symbol elements are represented as character values (via deparse), and everything except numeric, logical, character and NULL values is replaced by item numbers (e.g. varying = list(family = list(..., Gamma)) should be referred to as subset = V(family) == 2). This can quickly become confusing, so named lists are recommended. demo(dredge.varying) provides examples.

The subset expression can also contain variable `*nvar*` (backtick-quoted), equal to number of terms in the model (not the number of estimated parameters).
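As an illustration, a limit on the number of terms can be written either with `*nvar*` in subset or with m.lim. The sketch assumes the Cement example model used in ‘Examples’:

```r
library(MuMIn)

fm1 <- lm(y ~ ., data = Cement, na.action = na.fail)

# Keep models with at most two terms, written as a subset expression
dd_nvar <- dredge(fm1, subset = `*nvar*` <= 2)

# The same restriction expressed with m.lim
dd_mlim <- dredge(fm1, m.lim = c(0, 2))

# Both tables should list the same number of models
c(nrow(dd_nvar), nrow(dd_mlim))
```

The subset form is mainly useful when the limit has to be combined with other conditions in a single expression.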

To make inclusion of a model term conditional on presence of another model term, the function dc (“dependency chain”) can be used in the subset expression. dc takes any number of term names as arguments, and allows a term to be included only if all preceding ones are also present (e.g. subset = dc(a, b, c) allows for models a, a+b and a+b+c but not b, c, b+c or a+c).

The subset expression can take the form of an unevaluated call, an expression object, or a one-sided formula. See ‘Examples’.

Compound model terms (such as interactions, ‘as-is’ expressions within I() or smooths in gam) should be enclosed within curly brackets (e.g. {s(x,k=2)}), or backticks (like non-syntactic names, e.g. `s(x, k = 2)` ). Backtick-quoted names must match exactly (including whitespace) the term names as given by getAllTerms.

`subset` expression syntax summary

a & b: indicates that model terms a and b must be present (see Logical Operators)
{log(x,2)} or `log(x, 2)`: represent a complex model term log(x, 2)
V(x): represents a varying variable x
.(x): indicates that at least one term containing the term x must be present
.(+x): indicates that all the terms containing the term x must be present
dc(a, b, c,...): ‘dependency chain’: b is allowed only if a is present, and c only if both a and b are present, etc.
`*nvar*`: number of terms.

To simply keep certain terms in all models, using the argument fixed is much more efficient. The fixed formula is interpreted in the same manner as a model formula, so the terms need not be quoted.

Missing values

Use of na.action = "na.omit" (R's default) or "na.exclude" in global.model must be avoided, because if there are missing values it results in sub-models being fitted to different data sets. An error is thrown if this is detected.

It is a common mistake to give na.action as an argument in the call to dredge (typically resulting in an error from the rank function to which the argument is passed through ‘…’), while the correct way is either to pass na.action in the call to the global model or to set it as a global option.
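The two correct ways of setting na.action, and the error raised otherwise, can be sketched as follows (assuming the Cement example data; the option is restored after each step so that the session settings are left unchanged):

```r
library(MuMIn)

# With na.action = "na.omit" in effect, dredge stops with an error
old <- options(na.action = "na.omit")
res <- tryCatch(dredge(lm(y ~ ., data = Cement)), error = function(e) e)
options(old)
inherits(res, "error")

# Correct, option 1: na.action passed in the call to the global model
fm1 <- lm(y ~ ., data = Cement, na.action = na.fail)
dd1 <- dredge(fm1)

# Correct, option 2: na.action set as a global option before fitting
old <- options(na.action = "na.fail")
fm2 <- lm(y ~ ., data = Cement)
dd2 <- dredge(fm2)
options(old)
```

Setting na.action in the model call keeps the requirement attached to the model itself, so it does not depend on session-wide options.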

Methods

There are subset and plot methods; the latter creates a graphical representation of model weights and relative variable importance. Coefficients can be extracted with coef or coefTable.

Examples

# Example from Burnham and Anderson (2002), page 100:

# prevent fitting sub-models to different datasets
options(na.action = "na.fail")

fm1 <- lm(y ~ ., data = Cement)
dd <- dredge(fm1)
subset(dd, delta < 4)

# Visualize the model selection table:
par(mar = c(3,5,6,4))
plot(dd, labAsExpr = TRUE)

# Model average models with delta AICc < 4
model.avg(dd, subset = delta < 4)

#or as a 95% confidence set:
model.avg(dd, subset = cumsum(weight) <= .95) # get averaged coefficients

#'Best' model
summary(get.models(dd, 1)[[1]])

# Examples of using 'subset':
# keep only models containing X3
dredge(fm1, subset = ~ X3) # subset as a formula
dredge(fm1, subset = expression(X3)) # subset as expression object
# the same, but more efficient:
dredge(fm1, fixed = "X3")
# exclude models containing both X1 and X2 at the same time
dredge(fm1, subset = !(X1 && X2))
# Fit only models containing either X3 or X4 (but not both);
# include X3 only if X2 is present, and X2 only if X1 is present.
dredge(fm1, subset = dc(X1, X2, X3) && xor(X3, X4))
# the same as above, without "dc"
dredge(fm1, subset = (X1 | !X2) && (X2 | !X3) && xor(X3, X4))

# Include only models with up to 2 terms (and intercept)
dredge(fm1, m.lim = c(0, 2))
# Add R^2 and F-statistics, use the 'extra' argument
dredge(fm1, m.lim = c(NA, 1), extra = c("R^2", F = function(x)
    summary(x)$fstatistic[[1]]))

# with summary statistics:
dredge(fm1, m.lim = c(NA, 1), extra = list(
    "R^2", "*" = function(x) {
        s <- summary(x)
        c(Rsq = s$r.squared, adjRsq = s$adj.r.squared,
            F = s$fstatistic[[1]])
    })
)

# Add other information criteria (but rank with AICc):
dredge(fm1, m.lim = c(NA, 1), extra = alist(AIC, BIC, ICOMP, Cp))
