earth: Multivariate Adaptive Regression Splines

Description

Build a regression model using the techniques in Friedman's papers "Multivariate Adaptive Regression Splines" and "Fast MARS".

Usage

## S3 method for class 'formula':
earth(formula, data, ...)
## S3 method for class 'default':
earth(x = stop("no 'x' arg"), y = stop("no 'y' arg"),
      weights = NULL, subset = NULL, na.action = na.fail,
      penalty = if(degree > 1) 3 else 2, trace = 0, keepxy = FALSE,
      nk = max(21, 2 * NCOL(x) + 1), degree = 1, 
      linpreds = FALSE, allowed  = NULL,
      thresh = 0.001, minspan = 1, newvar.penalty = 0,
      fast.k = 20, fast.beta = 1,
      pmethod = "backward", ppenalty = penalty, nprune = NULL,
      Object  = NULL, Get.crit = get.gcv,
      Eval.model.subsets = eval.model.subsets,
      Print.pruning.pass = print.pruning.pass,
      Force.xtx.prune = FALSE, Use.beta.cache = TRUE, ...)

Arguments

formula

Model formula.

data

Data frame for formula.

x

Matrix containing the independent variables.

y

Vector containing the response variable, or in the case of multiple responses, a matrix whose columns are the response values for each variable. If the y values are very big or very small you may get better results if you

weights

Weight vector (not yet supported).

subset

Index vector specifying which cases to use, i.e. which rows of x. Default is NULL, meaning all.

na.action

NA action. Default is na.fail, and only na.fail is supported.

penalty

Generalised Cross Validation (GCV) penalty per knot. Default is if(degree>1) 3 else 2. A value of 0 penalises only terms, not knots. The value -1 is a special case, meaning no penalty, so GCV=RSS/n. Theory suggests values in t

trace

Trace earth's execution. Default is 0. Values:

0 none
1 overview
2 forward pass
3 pruning
4 more pruning
5 ...

keepxy

Set to TRUE to retain x, y, and subset in the returned value. Default is FALSE.

The following arguments are for the forward pass.

nk

Maximum number of model terms before pruning. Includes the intercept. Default is max(21,2*NCOL(x)+1). The number of terms created by the forward pass will be less than nk if there are linearly dependent terms

degree

Maximum degree of interaction (Friedman's $mi$). Default is 1, meaning build an additive model (i.e. no interaction terms).

linpreds

Index vector specifying which predictors should enter linearly, as in lm. The default is FALSE, meaning all predictors enter in the standard MARS fashion i.e. in hinge functions. A predictor's index in

allowed

Function specifying which predictors can interact and how. Default is NULL, meaning all standard MARS terms are allowed. Earth calls the allowed function just before adding a term. If allowed returns TRUE the term goes

thresh

Forward stepping threshold. Default is 0.001. This is one of the arguments used to decide when forward stepping should terminate. See the section below on the forward pass.

minspan

Minimum distance between knots. The default value of 1 means consider all knots (which is good if the data are not noisy). The special value of 0 means calculate the minspan internally as per Friedman's MARS paper section 3.8

newvar.penalty

Penalty for adding a new variable in the forward pass (Friedman's $gamma$, equation 74 in the MARS paper). Default is 0, meaning no penalty for adding a new variable. Useful non-zero values range from about 0.01 to 0.2 --- you will nee

fast.k

Maximum number of parent terms considered at each step of the forward pass. Friedman invented this parameter to speed up the forward pass (see the Fast MARS paper section 3.0). Default is 20. Values of 0 or less are equivalent to infin

fast.beta

Fast MARS ageing coefficient, as described in the Fast MARS paper section 3.1. Default is 1. A value of 0 sometimes gives better results.

pmethod

Pruning method. Default is "backward". One of: "backward", "none", "exhaustive", "forward", "seqrep". If y has multiple columns, then only "backward" or "none" is allowed. Pruning can

ppenalty

Like penalty but for the pruning pass. Default is penalty.

nprune

Maximum number of terms (including intercept) in the pruned model. Default is NULL, meaning all terms created by the forward pass (but typically not all terms will remain after pruning). Use this to reduce exhaustive search time, or to

Object

Earth object to be updated, for use by update.earth.

Get.crit

Criterion function for model selection during pruning. By default a function that returns the GCV. See the section below on the pruning pass.

Eval.model.subsets

Function to evaluate model subsets (see notes in source code).

Print.pruning.pass

Function to print pruning pass results (see notes in source code).

Force.xtx.prune

Default is FALSE. This argument pertains to subset evaluation in the pruning pass. By default, if y has a single column then earth calls the leaps routines; if <

Use.beta.cache

Default is TRUE. Using the "beta cache" takes more memory but is faster (by roughly 20% for large models). The beta cache uses nk * nk * ncol(x) * sizeof(double) bytes. Set Use.beta.cache=FALSE to save memory.
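
As a rough illustration of the memory formula above, the following base-R sketch evaluates nk * nk * ncol(x) * sizeof(double). It assumes sizeof(double) is 8 bytes, which holds on common platforms but is an assumption here. The helper name beta.cache.bytes and the 100-column x are hypothetical and not part of earth.

```r
# Hypothetical helper (not part of earth): estimated beta cache size in bytes,
# assuming sizeof(double) == 8.
beta.cache.bytes <- function(nk, npreds, sizeof.double = 8) {
    nk * nk * npreds * sizeof.double
}

# Default nk for a hypothetical x with 100 predictor columns
nk <- max(21, 2 * 100 + 1)
bytes <- beta.cache.bytes(nk, npreds = 100)
bytes / 2^20    # size in MiB, roughly 30.8
```

For small problems such as the trees data (2 predictors, nk = 21) the cache is only a few kilobytes, so Use.beta.cache=FALSE is likely to matter only for wide data or large nk.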

...

earth.formula: arguments passed to earth.default.

earth.default: unused, but provided for generic/method consistency.

Value

An object of class "earth", which is a list with the components listed below. "Term" refers to a term created during the forward pass (each line of the output from format.earth is a term). Term number 1 is always the intercept.

rss

Residual sum-of-squares (RSS) of the model (summed over all responses if y has multiple columns).

rsq

1-rss/rss.null. R-Squared of the model (calculated over all responses if y has multiple columns). A measure of how well the model fits the training data.

gcv

Generalised Cross Validation (GCV) of the model (summed over all responses if y has multiple columns). The GCV is calculated using ppenalty (as are all returned GCVs). For details of the GCV calculation, see equation 30 in Friedman's MARS paper and earth:::get.gcv.

grsq

1-gcv/gcv.null. An estimate of the predictive power of the model (calculated over all responses if y has multiple columns). Unlike rsq, grsq can be negative. A negative grsq would indicate a severely over-parameterised model: one that would not generalise well even though it may be a good fit to the training data. Example of a negative grsq: earth(mpg~., data=mtcars, pmethod="none", trace=4)
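
The gcv and grsq definitions above can be sketched in base R. This is a minimal sketch, not earth's own code: it assumes the effective number of parameters is nterms + penalty * (nterms - 1) / 2 (one penalty unit per knot, with terms normally entering in pairs), which should be checked against earth:::get.gcv. The names gcv.sketch and grsq.sketch are hypothetical.

```r
# Sketch of the GCV criterion; gcv.sketch is a hypothetical name, not earth's API.
gcv.sketch <- function(rss, nterms, penalty, ncases) {
    if (penalty == -1)                  # special case: no penalty, GCV = RSS/n
        return(rss / ncases)
    # assumed effective number of parameters (see lead-in)
    nparams <- nterms + penalty * (nterms - 1) / 2
    if (nparams >= ncases)
        return(Inf)
    rss / (ncases * (1 - nparams / ncases)^2)
}

# GRSq compares the model's GCV with the GCV of an intercept-only model.
grsq.sketch <- function(gcv, gcv.null) 1 - gcv / gcv.null

# Rounded figures from the trees example in the Examples section:
# RSS 196, 4 selected terms, 31 cases, default penalty 2 for degree 1.
gcv <- gcv.sketch(rss = 196, nterms = 4, penalty = 2, ncases = 31)
gcv   # close to the GCV of 11 reported there at two significant digits
```

With penalty = 0 the assumed formula counts only the terms, matching the description of the penalty argument.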

bx

Matrix of basis functions applied to x. Each column corresponds to a selected term. Each row corresponds to a row in the input matrix x, after taking subset. See model.matrix.earth for an example of bx handling. For brevity, "h" is used instead of "pmax" in column names. Example bx:

         (Intercept) h(Girth-12.9) h(12.9-Girth) h(Girth-12.9)*h(...
    [1,]           1           0.0           4.6                   0
    [2,]           1           0.0           4.3                   0
    [3,]           1           0.0           4.1                   0
    ...
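
The first columns of the example bx can be reproduced by hand, which may help when reading the column names. This is a minimal sketch that takes the knot 12.9 from the example above and defines h as pmax(0, .); it does not call earth.

```r
h <- function(z) pmax(0, z)          # hinge function, "h" in the column names

girth <- head(trees$Girth, 3)        # first three cases of the trees data
knot  <- 12.9                        # knot taken from the example column names

bx <- cbind("(Intercept)"   = 1,
            "h(Girth-12.9)" = h(girth - knot),
            "h(12.9-Girth)" = h(knot - girth))
bx
```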

dirs

Matrix with one row per MARS term, and with ij-th element equal to

     0 if predictor j is not in term i
    -1 if a factor of the form pmax(0, c - xj) is in term i
     1 if a factor of the form pmax(0, xj - c) is in term i
     2 if predictor j enters term i linearly.

This matrix includes all terms generated by the forward pass, including those not in selected.terms. Note that the terms may not be in pairs, because the forward pass deletes linearly dependent terms before handing control to the pruning pass. Example dirs:

                               Girth Height
    (Intercept)                    0      0  #no factors in intercept
    h(Girth-12.9)                  1      0  #2nd term uses Girth
    h(12.9-Girth)                 -1      0  #3rd term uses Girth
    h(Girth-12.9)*h(Height-76)     1      1  #4th term uses Girth and Height
    ...

cuts

Matrix with ij-th element equal to the cut point for predictor j in term i. This matrix includes all terms generated by the forward pass, including those not in selected.terms. Note that the terms may not be in pairs, because the forward pass deletes linearly dependent terms before handing control to the pruning pass. Example cuts:

                               Girth Height
    (Intercept)                  0.0      0  #intercept, no cuts
    h(Girth-12.9)               12.9      0  #2nd term has cut at 12.9
    h(12.9-Girth)               12.9      0  #3rd term has cut at 12.9
    h(Girth-12.9)*h(Height-76)  12.9     76  #4th term has two cuts
    ...
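
Taken together, dirs and cuts describe each term as a product of factors. The sketch below shows one way to evaluate those terms on new data; basis.from.dirs.cuts is a hypothetical helper written for illustration, not a function in earth (see model.matrix.earth for the package's own handling).

```r
# Hypothetical helper (not part of earth): evaluate every term described by
# dirs and cuts at the rows of a numeric matrix x.
basis.from.dirs.cuts <- function(x, dirs, cuts) {
    bx <- matrix(1, nrow = nrow(x), ncol = nrow(dirs),
                 dimnames = list(NULL, rownames(dirs)))
    for (i in seq_len(nrow(dirs))) {
        for (j in seq_len(ncol(dirs))) {
            f <- switch(as.character(dirs[i, j]),
                "0"  = 1,                              # predictor j not in term i
                "-1" = pmax(0, cuts[i, j] - x[, j]),   # hinge below the cut
                "1"  = pmax(0, x[, j] - cuts[i, j]),   # hinge above the cut
                "2"  = x[, j])                         # predictor enters linearly
            bx[, i] <- bx[, i] * f
        }
    }
    bx
}

# The example dirs and cuts shown above
dirs <- rbind("(Intercept)"                = c(0, 0),
              "h(Girth-12.9)"              = c(1, 0),
              "h(12.9-Girth)"              = c(-1, 0),
              "h(Girth-12.9)*h(Height-76)" = c(1, 1))
cuts <- rbind(c(0, 0), c(12.9, 0), c(12.9, 0), c(12.9, 76))
colnames(dirs) <- colnames(cuts) <- c("Girth", "Height")

x <- as.matrix(trees[, c("Girth", "Height")])
bx <- basis.from.dirs.cuts(x, dirs, cuts)
head(bx, 3)
```

Indexing the rows of dirs and cuts by selected.terms before calling the helper would restrict the result to the terms in the final model.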

selected.terms

Vector of term numbers in the best model. Can be used as a row index vector into cuts and dirs. The first element selected.terms[1] is always 1, the intercept.

prune.terms

A matrix specifying which terms appear in which subsets. The row index of prune.terms is the model size (the model size is the number of terms in the model). Each row is a vector of term numbers for the best model of that size. An element is 0 if the term is not in the model, thus prune.terms is a lower triangular matrix, with dimensions nprune x nprune. The model selected by the pruning pass is at row length(selected.terms). Example prune.terms:

    [1,] 1 0 0 0 0 0 0  #intercept-only model
    [2,] 1 2 0 0 0 0 0  #best 2 term model uses terms 1,2
    [3,] 1 2 4 0 0 0 0  #best 3 term model uses terms 1,2,4
    [4,] 1 2 9 8 0 0 0  #and so on
    ...

rss.per.response

A vector of the RSS for each response. Length is ncol(y). The rss component above is equal to sum(rss.per.response).

rsq.per.response

A vector of the R-Squared for each response. Length is ncol(y).

gcv.per.response

A vector of the GCV for each response. Length is ncol(y). The gcv component above is equal to sum(gcv.per.response).

grsq.per.response

A vector of the GRSq for each response. Length is ncol(y).

rss.per.subset

A vector of the RSS for each model subset generated by the pruning pass. Length is nprune. If y has multiple columns, the RSS is summed over all responses for each subset. The null RSS (i.e. the RSS of an intercept-only model) is rss.per.subset[1]. The rss above is rss.per.subset[length(selected.terms)].

gcv.per.subset

A vector of the GCV for each model in prune.terms. Length is nprune. If y has multiple columns, the GCV is summed over all responses for each subset. The null GCV (i.e. the GCV of an intercept-only model) is gcv.per.subset[1]. The gcv above is gcv.per.subset[length(selected.terms)].
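
The relationship between prune.terms, gcv.per.subset, and selected.terms can be sketched as follows. The sketch assumes the pruning pass keeps the subset with the lowest criterion value (the GCV by default); the matrix and GCV values are made up for illustration and do not come from a real earth run.

```r
# Hypothetical values, for illustration only
prune.terms <- rbind(c(1, 0, 0, 0),
                     c(1, 2, 0, 0),
                     c(1, 2, 4, 0),
                     c(1, 2, 4, 3))
gcv.per.subset <- c(280, 40, 11, 12)   # made-up GCVs, one per model size

best.size <- which.min(gcv.per.subset)            # model size with the lowest GCV
selected.terms <- prune.terms[best.size, ]
selected.terms <- selected.terms[selected.terms != 0]
selected.terms

# grsq of the selected model relative to the intercept-only model
grsq <- 1 - gcv.per.subset[best.size] / gcv.per.subset[1]
```

Note that best.size equals length(selected.terms), which is why the selected model sits at row length(selected.terms) of prune.terms.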

fitted.values

Fitted values. A matrix with dimensions nrow(y) x ncol(y).

residuals

Residuals. A matrix with dimensions nrow(y) x ncol(y).

coefficients

Regression coefficients. A matrix with dimensions length(selected.terms) x ncol(y). Each column holds the least squares coefficients from regressing that column of y on bx. The first row holds the intercept coefficients.
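
The description of coefficients amounts to an ordinary least squares fit of y on bx. The sketch below illustrates this in base R with a hand-built bx; the knots 14 and 75 are the rounded values printed in the Examples section, used here only for illustration, so the results will not exactly match an earth fit.

```r
h <- function(z) pmax(0, z)   # hinge function

# Hand-built basis matrix with illustrative (rounded) knots
bx <- cbind("(Intercept)"  = 1,
            "h(Girth-14)"  = h(trees$Girth - 14),
            "h(14-Girth)"  = h(14 - trees$Girth),
            "h(Height-75)" = h(trees$Height - 75))
y <- as.matrix(trees["Volume"])

coefficients  <- qr.solve(bx, y)      # least squares fit of each column of y on bx
fitted.values <- bx %*% coefficients
residuals     <- y - fitted.values
rss           <- sum(residuals^2)
dim(coefficients)                     # number of terms x ncol(y), here 4 x 1
```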

ppenalty

The GCV penalty used during pruning. A copy of earth's ppenalty argument.

call

The call used to invoke earth.

terms

Model frame terms. This component exists only if the model was built using earth.formula.

x, y, subset

Copies of the input arguments x, y, and subset. These components exist only if keepxy=TRUE.

concept

regression
mars
Friedman

References

The primary references are the Friedman papers. Readers may find the MARS section in Hastie, Tibshirani, and Friedman a more accessible introduction. Faraway takes a hands-on approach, using the ozone data to compare mda::mars with other techniques. (If you use Faraway's examples with earth instead of mars, use $bx instead of $x). Earth's pruning pass uses the leaps package which is based on techniques in Miller.

Faraway (2005). Extending the Linear Model with R. http://www.maths.bath.ac.uk/~jjf23

Friedman (1991). Multivariate Adaptive Regression Splines (with discussion). Annals of Statistics 19/1, 1-141.

Friedman (1993). Fast MARS. Stanford University Department of Statistics, Technical Report 110. http://www-stat.stanford.edu/research/index.html

Hastie, Tibshirani, and Friedman (2001). The Elements of Statistical Learning. http://www-stat.stanford.edu/~hastie/pub.htm

Miller, Alan (1990, 2nd ed. 2002). Subset Selection in Regression.

Examples

a <- earth(Volume ~ ., data = trees)
summary(a, digits = 2)

# yields:
#    Call:
#    earth(formula = Volume ~ ., data = trees)
#
#    Expression:
#      27
#      +    6 * pmax(0,  Girth -     14)
#      -  3.2 * pmax(0,     14 -  Girth)
#      + 0.61 * pmax(0, Height -     75)
#
#    Number of cases: 31
#    Selected 4 of 5 terms, and 2 of 2 predictors
#    Number of terms at each degree of interaction: 1 3 (additive model)
#    GCV: 11          RSS: 196         GRSq: 0.96      RSq: 0.98
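
The printed expression can also be evaluated by hand, which makes the meaning of the hinge terms concrete. The sketch below uses the rounded (digits = 2) coefficients and knots from the summary above, so its predictions only approximate those of the fitted model; volume.hat is a hypothetical name, not part of earth.

```r
# Evaluate the printed model expression by hand, using the rounded
# coefficients and knots from the summary output.
volume.hat <- function(Girth, Height) {
    27 +
      6    * pmax(0, Girth - 14) -
      3.2  * pmax(0, 14 - Girth) +
      0.61 * pmax(0, Height - 75)
}

volume.hat(Girth = 14, Height = 75)     # all hinges are zero, so this is the intercept
volume.hat(trees$Girth, trees$Height)   # approximate fitted values for the training data
```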
