earth: Multivariate Adaptive Regression Splines

Description

Build a regression model using the techniques in Friedman's papers "Multivariate Adaptive Regression Splines" and "Fast MARS".

Usage

## S3 method for class 'formula':
earth(formula = stop("no 'formula' arg"),
   data, weights = NULL, wp = NULL, scale.y = (NCOL(y)==1), subset = NULL,
   na.action = na.fail, glm = NULL, trace = 0,
   keepxy = FALSE, nfold=0, stratify=TRUE, ...)
## S3 method for class 'default':
earth(x = stop("no 'x' arg"), y = stop("no 'y' arg"),
    weights = NULL, wp = NULL, scale.y = (NCOL(y)==1), subset = NULL,
    na.action = na.fail, glm = NULL, trace = 0,
    keepxy = FALSE, nfold=0, stratify=TRUE, ...)
## S3 method for class 'fit':
earth(x = stop("no 'x' arg"), y = stop("no 'y' arg"),
    weights = NULL, wp = NULL, scale.y = (NCOL(y)==1), subset = NULL,
    na.action = na.fail, glm = NULL, trace = 0,
    nk = max(21, 2 * ncol(x) + 1), degree = 1,
    penalty = if(degree > 1) 3 else 2, thresh = 0.001,
    minspan = 0, newvar.penalty = 0, fast.k = 20, fast.beta = 1,
    linpreds = FALSE, allowed = NULL,
    pmethod = "backward", nprune = NULL, Object = NULL,
    Get.crit = get.gcv, Eval.model.subsets = eval.model.subsets,
    Print.pruning.pass = print.pruning.pass, Force.xtx.prune = FALSE,
    Use.beta.cache = TRUE, ...)

Arguments

formula

Model formula.

data

Data frame for formula.

Matrix or dataframe containing the independent variables.

Vector containing the response variable, or, in the case of multiple responses, a matrix or dataframe whose columns are the values for each response.

subset

Index vector specifying which cases to use, i.e., which rows in x to use. Default is NULL, meaning all.

weights

Weight vector (not yet supported).

Vector of response weights. Default is NULL, meaning no response weights. If specified, wp must have an element for each column of y (after factors, if any, h

scale.y

Scale y in the forward pass for better numeric stability. Scaling here means subtract the mean and divide by the standard deviation. Default is NCOL(y)==1, i.e., sc

na.action

NA action. Default is na.fail, and only na.fail is supported. (Why? Because adding support to earth for other NA actions is easy, but making sure that is handled correctly internally in predict

glm

NULL (default) or a list of arguments to pass on to glm. See the documentation of glm for a description of these arguments (but not all of glm's arguments

trace

Trace earth's execution. Default is 0. Values: 0 none 0.5 cross validation 1 overview 2 forward pass 3 pruning 4 model mats, memory use, more pruning, etc. 5 and higher for internal details of operation...

keepxy

Set to TRUE to retain the following in the returned value: x and y (or data), subset, and weights. Default is FALSE. The function up

Maximum number of model terms before pruning, i.e., the maximum number of terms created by the forward pass. Includes the intercept. Default is max(21,2*NCOL(x)+1). Use trace=1 or more to see why the forward p

degree

Maximum degree of interaction (Friedman's $mi$). Default is 1, meaning build an additive model (i.e., no interaction terms).

penalty

Generalized Cross Validation (GCV) penalty per knot. Default is if(degree>1) 3 else 2. A value of 0 penalizes only terms, not knots. The value -1 is treated specially to mean no penalty, so GCV=RSS/n. Simulation studies have s

thresh

Forward stepping threshold. Default is 0.001. This is one of the arguments used to decide when forward stepping should terminate. See the "Forward pass" section below.

minspan

Minimum distance between knots. The minspan argument is intended to increase resistance to runs of noise in the input data. (Note: predictor value extremes are ineligible for knots regardless of the minspan setting, a

newvar.penalty

Penalty for adding a new variable in the forward pass (Friedman's $gamma$, equation 74 in the MARS paper). Default is 0, meaning no penalty for adding a new variable. Useful non-zero values range from about 0.01 to 0.2 --- you will nee

fast.k

Maximum number of parent terms considered at each step of the forward pass. Friedman invented this parameter to speed up the forward pass (see the Fast MARS paper section 3.0). Default is 20. Values of 0 or less are treated specially

fast.beta

Fast MARS aging coefficient, as described in the Fast MARS paper section 3.1. Default is 1. A value of 0 sometimes gives better results.

linpreds

Index vector specifying which predictors should enter linearly, as in lm. The default is FALSE, meaning all predictors enter in the standard MARS fashion, i.e., in hinge functions. A predictor's index in

allowed

Function specifying which predictors can interact and how. Default is NULL, meaning all standard MARS terms are allowed. Earth calls the allowed function just before adding a term in the forward pass. If allowed re

nfold

Number of cross validation folds. Default is 0, i.e., no cross validation. If greater than 1, earth first builds a standard model as usual, with all the data. It then builds nfold cross validated models, measuring R-Square

stratify

Only applies if nfold>1. Default is TRUE. Stratify the cross validation samples so that an approximately equal number of cases with a non-zero response occur in each cross validation subset. This means that when the r

pmethod

Pruning method. Default is "backward". One of: backward none exhaustive forward seqrep. Use none to retain all the terms created by the forward pass. If y has multiple columns, then o

nprune

Maximum number of terms (including intercept) in the pruned model. Default is NULL, meaning all terms created by the forward pass (but typically not all terms will remain after pruning). Use this to enforce an upper bound on the model size

Object

Earth object to be updated, for use by update.earth.

Get.crit

Criterion function for model selection during pruning. By default a function that returns the GCV. See the "Pruning pass" section below.

Eval.model.subsets

Function to evaluate model subsets --- see notes in source code.

Print.pruning.pass

Function to print pruning pass results --- see notes in source code.

Force.xtx.prune

Default is FALSE. This argument pertains to subset evaluation in the pruning pass. By default, if y has a single column then earth calls the leaps routines; if <

Use.beta.cache

Default is TRUE. Using the "beta cache" takes more memory but is faster (by 20% and often much more for large models). The beta cache uses nk * nk * ncol(x) * sizeof(double) bytes. Set Use.beta.cache=FALSE to save

...

Dots are passed on to earth.fit.

Value

An object of class "earth" which is a list with the components listed below. Term refers to a term created during the forward pass (each line of the output from format.earth is a term, and dirs defines which predictors are in which terms). Term number 1 is always the intercept.
rssResidual sum-of-squares (RSS) of the model (summed over all responses if y has multiple columns).
rsq1-rss/rss.null. R-Squared of the model (calculated over all responses). A measure of how well the model fits the training data. Note that rss.null is sum((y - mean(y))^2).
gcvGeneralized Cross Validation (GCV) of the model (summed over all responses) The GCV is calculated using the penalty argument. For details of the GCV calculation, see equation 30 in Friedman's MARS paper and earth:::get.gcv.
grsq1-gcv/gcv.null. An estimate of the predictive power of the model (calculated over all responses). Unlike rsq, in MARS models grsq can be negative. A negative grsq would indicate a severely over parametrized model --- a model that would not generalize well even though it may be a good fit to the training data. Watch the GRSq take a nose dive in this example: earth(mpg~., data=mtcars, pmethod="none", trace=3)
bxMatrix of basis functions applied to x. Each column corresponds to a selected term. Each row corresponds to a row in in the input matrix x, after taking subset. See model.matrix.earth for an example of bx handling. Example bx:(Intercept) h(Girth-12.9) h(12.9-Girth) h(Girth-12.9)*h(... [1,] 1 0.0 4.6 0 [2,] 1 0.0 4.3 0 [3,] 1 0.0 4.1 0 ...
dirsMatrix with one row per MARS term, and with with ij-th element equal to 0 if predictor j is not in term i -1 if an expression of the form h(const - xj) is in term i 1 if an expression of the form h(xj - const) is in term i 2 if predictor j enters term i linearly. This matrix includes all terms generated by the forward.pass, including those not in selected.terms. Note that the terms may not be in pairs, because the forward pass deletes linearly dependent terms before handing control to the pruning pass. Example dirs:Girth Height (Intercept) 0 0 #intercept h(Girth-12.9) 1 0 #2nd term uses Girth h(12.9-Girth) -1 0 #3rd term uses Girth h(Girth-12.9)*h(Height-76) 1 1 #4th term uses Girth and Height ...
cutsMatrix with ij-th element equal to the cut point for predictor j in term i. This matrix includes all terms generated by the forward.pass, including those not in selected.terms. Note that the terms may not be in pairs, because the forward pass deletes linearly dependent terms before handing control to the pruning pass. Note for programmers: the precedent is to use dirs for term names etc. and to only use cuts where cut information needed. Example cuts:Girth Height (Intercept) 0 0 #intercept, no cuts h(Girth-12.9) 12.9 0 #2nd term has cut at 12.9 h(12.9-Girth) 12.9 0 #3rd term has cut at 12.9 h(Girth-12.9)*h(Height-76) 12.9 76 #4th term has two cuts ...
selected.termsVector of term numbers in the best model. Can be used as a row index vector into cuts and dirs. The first element selected.terms[1] is always 1, the intercept.
prune.termsA matrix specifying which terms appear in which pruning pass subsets. The row index of prune.terms is the model size. (The model size is the number of terms in the model. The intercept is counted as a term.) Each row is a vector of term numbers for the best model of that size. An element is 0 if the term is not in the model, thus prune.terms is a lower triangular matrix, with dimensions nprune x nprune. The model selected by the pruning pass is at row number length(selected.terms). Example prune.terms:[1,] 1 0 0 0 0 0 0 #intercept-only model [2,] 1 2 0 0 0 0 0 #best 2 term model uses terms 1,2 [3,] 1 2 4 0 0 0 0 #best 3 term model uses terms 1,2,4 [4,] 1 2 6 9 0 0 0 #and so on ...
rss.per.responseA vector of the RSS for each response. Length is the number of responses, i.e., ncol(y) after factors in y have been expanded. The rss component above is equal to sum(rss.per.response).
rsq.per.responseA vector of the R-Squared for each response. Length is the number of responses.
gcv.per.responseA vector of the GCV for each response. Length is the number of responses. The gcv component above is equal to sum(gcv.per.response).
grsq.per.responseA vector of the GRSq for each response. Length is the number of responses.
rss.per.subsetA vector of the RSS for each model subset generated by the pruning pass. Length is nprune. For multiple responses, the RSS is summed over all responses for each subset. The null RSS (i.e., the RSS of an intercept only-model) is rss.per.subset[1]. The rss above is rss.per.subset[length(selected.terms)].
gcv.per.subsetA vector of the GCV for each model in prune.terms. Length is nprune. For multiple responses, the GCV is summed over all responses for each subset. The null GCV (i.e., the GCV of an intercept-only model) is gcv.per.subset[1]. The gcv above is gcv.per.subset[length(selected.terms)].
fitted.valuesFitted values. A matrix with dimensions nrow(y) x ncol(y) after factors in y have been expanded.
residualsResiduals. A matrix with dimensions nrow(y) x ncol(y) after factors in y have been expanded.
coefficientsRegression coefficients. A matrix with dimensions length(selected.terms) x ncol(y) after factors in y have been expanded. Each column holds the least squares coefficients from regressing that column of y on bx. The first row holds the intercept coefficient(s).
penaltyThe GCV penalty used during pruning. A copy of earth's penalty argument.
callThe call used to invoke earth.
termsModel frame terms. This component exists only if the model was built using earth.formula.
namesxColumn names of x, generated internally by earth when necessary so each column of x has a name. Used, for example, by predict.earth to name columns if necessary.
namesx.orgOriginal column names of x.
levelsLevels of y if y is a factor c(FALSE,TRUE) if y is logical Else NULL
wpCopy of the wp argument to earth. The following fields appear only if earth's argument keepxy is TRUE.
x
y
data
subset
weightsCopies of the corresponding arguments to earth. Only exist if keepxy=TRUE. The following fields appear only if earth's glm argument is used.
glm.listList of GLM models. Each element is the value returned by earth's internal call to glm for each response. Thus if there is a single response (or a single binomial pair, see the "Binomial pairs" section below) this will be a one element list and you access the GLM model with my.earth.model$glm.list[[1]].
glm.coefficientsGLM regression coefficients. Analogous to the coefficients field described above but for the GLM model(s). A matrix with dimensions length(selected.terms) x ncol(y) after factors in y have been expanded. Each column holds the coefficients from the GLM regression of that column of y on bx. This duplicates, for convenience, information buried in glm.list.
glm.bpairsNULL unless there are paired binomial columns. A logical vector, derived internally by earth, or a copy the bpairs specified by the user in the glm list. See the "Binomial pairs" section below. The following fields appear only if the nfold argument is greater than 1.
cv.rsq.tabMatrix with nfold+1 rows and nresponse+1 columns, where nresponse is the number of responses, i.e., ncol(y) after factors in y have been expanded. The first nresponse elements of a row are the RSq's on the left-out data for each response of the model generated at that row's fold. The final column holds the row mean (a weighted mean if wp if specified). The final row of the table holds the column means. The values in this final row are the CV-RSqs printed by summary.earth. Example for a single response model (where the mean row and columns are redundant but included for uniformity with multiple response models): y mean fold 1 0.909 0.909 fold 2 0.869 0.869 fold 3 0.952 0.952 fold 4 0.157 0.157 fold 5 0.961 0.961 mean 0.769 0.769 Example for a multiple response model: y1 y2 y3 mean fold 1 0.915 0.951 0.944 0.937 fold 2 0.962 0.970 0.970 0.968 fold 3 0.914 0.940 0.942 0.932 fold 4 0.907 0.929 0.925 0.920 fold 5 0.947 0.987 0.979 0.971 mean 0.929 0.955 0.952 0.946
cv.maxerr.tabLike cv.rsq.tab but is the MaxErr at each fold. This is the signed max absolute value at each fold. Also, results are aggregated for the final column and final row using the signed max absolute value instead of the mean. The signed max absolute value is defined here as the maximum of the absolute difference between the predicted and observed response values, multiplied by -1 if the sign of the difference is negative.
cv.deviance.tabLike cv.rsq.tab but is the MeanDev at each fold. Binomial models only.
cv.calib.int.tabLike cv.rsq.tab but is the CalibInt at each fold. Binomial models only.
cv.calib.slope.tabLike cv.rsq.tab but is the CalibSlope at each fold. Binomial models only.
cv.auc.tabLike cv.rsq.tab but is the AUC at each fold. Binomial models only.
cv.cor.tabLike cv.rsq.tab but is the cor at each fold. Poisson models only.
cv.ntermsVector of length nfold+1. Number of MARS terms in the model generated at each cross validation fold, with the final element being the mean of these.
cv.nvarsVector of length nfold+1. Number of predictors in the model generated at each cross validation fold, with the final element being the mean of these.
cv.groupsSpecifies which cases went into which folds. Vector of length equal to the number of cases, nrow(x), with elements taking values in 1:nfold.
cv.listList of earth models, one model for each fold. These fold models have extra fields cv.rsq and cv.rsq.per.response (and, if keepxy is set, also cv.test.y and cv.test.fitted.values). To save memory, lengthy fields in the fold models are removed, unless you use the keepxy argument. The "lengthy fields" are $bx, $fitted.values, and $residuals.

concept

regression
mars
Friedman

References

The primary references are the Friedman papers. Readers may find the MARS section in Hastie, Tibshirani, and Friedman a more accessible introduction. The Wikipedia article is recommended for an elementary introduction. Faraway takes a hands-on approach, using the ozone data to compare mda::mars with other techniques. (If you use Faraway's examples with earth instead of mars, use $bx instead of $x, and make sure you read the book's errata.) Friedman and Silverman is recommended background reading for the MARS paper. Earth's pruning pass uses the leaps package which is based on techniques in Miller.

Faraway (2005) Extending the Linear Model with R http://www.maths.bath.ac.uk/~jjf23

Friedman (1991) Multivariate Adaptive Regression Splines (with discussion) Annals of Statistics 19/1, 1--141 http://www.salfordsystems.com/doc/MARS.pdf

Friedman (1993) Fast MARS Stanford University Department of Statistics, Technical Report 110 http://www.milbo.users.sonic.net/earth/Friedman-FastMars.pdf, http://www-stat.stanford.edu/research/index.html

Friedman and Silverman (1989) Flexible Parsimonious Smoothing and Additive Modeling Technometrics, Vol. 31, No. 1. http://links.jstor.org/sici?sici=0040-1706%28198902%2931%3A1%3C3%3AFPSAAM%3E2.0.CO%3B2-Z

Hastie, Tibshirani, and Friedman (2009) The Elements of Statistical Learning (2nd ed.) http://www-stat.stanford.edu/~hastie/pub.htm

Leathwick, J.R., Rowe, D., Richardson, J., Elith, J., & Hastie, T. (2005) Using multivariate adaptive regression splines to predict the distributions of New Zealand's freshwater diadromous fish Freshwater Biology, 50, 2034-2052 http://www-stat.stanford.edu/~hastie/pub.htm, http://www.botany.unimelb.edu.au/envisci/about/staff/elith.html

Miller, Alan (1990, 2nd ed. 2002) Subset Selection in Regression http://www.cmis.csiro.au/Alan_Miller/index.html

Wikipedia article on MARS http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines

Examples

Run this code

a <- earth(Volume ~ ., data = trees)
summary(a, digits = 2, style = "pmax")

# yields:
# Call: earth(formula=Volume~., data=trees)
#
# Volume =
#   27
#   +  6.2 * pmax(0,  Girth -     14)
#   -  3.3 * pmax(0,     14 -  Girth)
#   + 0.49 * pmax(0, Height -     72)
#
# Selected 4 of 6 terms, and 2 of 2 predictors
# Importance: Girth, Height
# Number of terms at each degree of interaction: 1 3 (additive model)
# GCV 11    RSS 197    GRSq 0.96    RSq 0.98

Run the code above in your browser using DataLab