earth
Multivariate Adaptive Regression Splines
Build a regression model using the techniques in Friedman's papers "Multivariate Adaptive Regression Splines" and "Fast MARS".
See the package vignette
- Keywords
- models, regression, smooth
Usage
## S3 method for class 'formula':
earth(formula = stop("no 'formula' argument"), data = NULL,
weights = NULL, wp = NULL, subset = NULL,
na.action = na.fail,
pmethod = c("backward", "none", "exhaustive", "forward", "seqrep", "cv"),
keepxy = FALSE, trace = 0, glm = NULL, degree = 1, nprune = NULL,
ncross=1, nfold=0, stratify=TRUE,
varmod.method = "none", varmod.exponent = 1,
varmod.conv = 1, varmod.clamp = .1, varmod.minspan = -3,
Scale.y = (NCOL(y)==1), ...)## S3 method for class 'default':
earth(x = stop("no 'x' argument"), y = stop("no 'y' argument"),
weights = NULL, wp = NULL, subset = NULL,
na.action = na.fail,
pmethod = c("backward", "none", "exhaustive", "forward", "seqrep", "cv"),
keepxy = FALSE, trace = 0, glm = NULL, degree = 1, nprune = NULL,
ncross=1, nfold=0, stratify=TRUE,
varmod.method = "none", varmod.exponent = 1,
varmod.conv = 1, varmod.clamp = .1, varmod.minspan = -3,
Scale.y = (NCOL(y)==1), ...)
## S3 method for class 'fit':
earth(x = stop("no 'x' argument"), y = stop("no 'y' argument"),
weights = NULL, wp = NULL, subset = NULL,
na.action = na.fail,
pmethod = c("backward", "none", "exhaustive", "forward", "seqrep", "cv"),
keepxy = FALSE, trace = 0, glm = NULL, degree = 1,
penalty = if(degree > 1) 3 else 2,
nk = min(200, max(20, 2 * ncol(x))) + 1,
thresh = 0.001, minspan = 0, endspan = 0,
newvar.penalty = 0, fast.k = 20, fast.beta = 1,
linpreds = FALSE, allowed = NULL,
nprune = NULL, Object = NULL,
Scale.y = (NCOL(y)==1), Adjust.endspan = 2, Force.weights = FALSE,
Use.beta.cache = TRUE, Force.xtx.prune = FALSE,
Get.leverages = NROW(x) < 1e5, Exhaustive.tol = 1e-10, ...)
Arguments
- formula
- Model formula.
- data
- Data frame for
formula
. - x
- Matrix or dataframe containing the independent variables.
- y
- Vector containing the response variable, or, in the case of multiple responses, a matrix or dataframe whose columns are the values for each response.
- subset
- Index vector specifying which cases to use, i.e., which rows in
x
to use. Default is NULL, meaning all. - weights
- Case weights.
Default is NULL, meaning no case weights.
If specified,
weights
must have length equal tonrow(x)
before applyingsubset
. Zero weights are converted to a very small nonzero value. - wp
- Response weights.
Default is NULL, meaning no response weights.
If specified,
wp
must have an element for each column ofy
(afterfactors
iny
- na.action
- NA action. Default is
na.fail
, and onlyna.fail
is supported. - keepxy
- Default is
FALSE
. Set toTRUE
to retain the following in the returned value:x
andy
(ordata
),subset
, andweights
. The function - trace
- Trace
earth
's execution. Default is0
. Values:0
no tracing.3
variance model (thevarmod.method
arg).5
cross validation (thenfold
arg)1
overview - glm
- NULL (default) or a list of arguments to pass on to
glm
. See the documentation ofglm
for a description of these arguments SeeGeneralized line - degree
- Maximum degree of interaction (Friedman's $mi$).
Default is
1
, meaning build an additive model (i.e., no interaction terms). - penalty
- Generalized Cross Validation (GCV) penalty per knot.
Default is
if(degree>1) 3 else 2
. Simulation studies suggest values in the range of about2
to4
. The FAQ section in the vignette has some information - nk
- Maximum number of model terms before pruning, i.e., the
maximum number of terms created by the forward pass.
Includes the intercept.
The actual number of terms created by the forward pass will often be
less than
nk
because of o - thresh
- Forward stepping threshold.
Default is
0.001
. This is one of the arguments used to decide when forward stepping should terminate: the forward pass terminates if adding a term changes RSq by less thanthresh
. - minspan
- Minimum number of observations between knots.
(This increases resistance to runs of correlated noise in the input data.)
The default
minspan=0
is treated specially and means calculate theminspan
internally, as per Fr - endspan
- Minimum number of observations before the first and after the final knot.
The default
endspan=0
is treated specially and means calculate theminspan
internally, as per the MARS paper equation 45 with $alpha$ = 0.05. S - newvar.penalty
- Penalty for adding a new variable in the forward pass
(Friedman's $gamma$, equation 74 in the MARS paper).
Default is
0
, meaning no penalty for adding a new variable. Useful non-zero values typically range from about0.01
- fast.k
- Maximum number of parent terms considered at each step of the forward pass.
(This speeds up the forward pass. See the Fast MARS paper section 3.0.)
Default is
20
. A value of0
is treated specially (as being equival - fast.beta
- Fast MARS ageing coefficient, as described in the
Fast MARS paper section 3.1.
Default is
1
. A value of0
sometimes gives better results. - linpreds
- Index vector specifying which predictors should enter linearly, as in
lm
. The default isFALSE
, meaning all predictors enter in the standard MARS fashion, i.e., in hinge functions. This d - allowed
- Function specifying which predictors can interact and how.
Default is NULL, meaning all standard MARS terms are allowed.
During the forward pass,
earth
calls theallowed
function before considering a term for inclusion; - pmethod
- Pruning method.
One of:
backward none exhaustive forward seqrep cv
. Default is"backward"
. New in version 4.4.0: Specifypmethod="cv"
to use cross-validation to select the number of terms. This - nprune
- Maximum number of terms (including intercept) in the pruned model. Default is NULL, meaning all terms created by the forward pass (but typically not all terms will remain after pruning). Use this to enforce an upper bound on the model size
- ncross
- Only applies if
nfold>1
. Number of cross-validations. Each cross-validation hasnfold
folds. Default1
. - nfold
- Number of cross-validation folds.
Default is
0
, no cross validation. If greater than1
,earth
first builds a standard model as usual with all the data. It then buildsnfold
cross-validated - stratify
- Only applies if
nfold>1
. Default isTRUE
. Stratify the cross-validation samples so that an approximately equal number of cases with a non-zero response occur in each cross validation subset. So if the res - varmod.method
- Construct a variance model.
For details, see
varmod
and the vignette . Use../doc/earth-varmod.pdf {Variance models in earth}trace=.3
to trace construction of the va - varmod.exponent
- Power transform applied to the rhs before regressing the
absolute residuals with the specified
varmod.method
. Default is1
. For example, withvarmod.method="lm"
, if you expect the standard deviance to increase linear - varmod.conv
- Convergence criterion for the Iteratively Reweighted Least Squares used
when creating the variance model.
Iterations stop when the mean value of the coefficients of the
residual model change by less than
varmod.conv
percent. Default is - varmod.clamp
- The estimated standard deviation of the main model errors
is forced to be at least a small positive value,
which we call
min.sd
. This prevents negative or absurdly small estimated standard deviations. Clamping takes place inpredict.var
- varmod.minspan
- Only applies when
varmod.method="earth"
or"x.earth"
. This is theminspan
used in the internal call toearth
when creating the variance model (not the mainearth
model). Default is-3<
- Object
- Earth object to be updated, for use by
update.earth
. - Scale.y
Scale
y
in the forward pass for better numeric stability. Scaling here means subtract the mean and divide by the standard deviation. Default isNCOL(y)==1
, i.e., scaley<
- Adjust.endspan
- New in version 4.2.0.
In interaction terms,
endspan
gets multiplied by this value. This reduces the possibility of an overfitted interaction term supported by just a few cases on the boundary of the predictor space (as sometimes seen i - Force.weights
- Default is
FALSE
. For testing theweights
argument. Force use of the code for handling weights in theearth
code, even ifweights=NULL
or all the weights are the same. This will not necessarily generate - Use.beta.cache
- Default is
TRUE
. Using thebeta cache takes a little more memory but is faster (by 20% and often much more for large models). The beta cache usesnk * nk * ncol(x) * sizeof(double)
bytes. (The beta cache is an in - Force.xtx.prune
- Default is
FALSE
. This argument pertains to subset evaluation in the pruning pass. By default, ify
has a single column thenearth
calls theleaps
routines; if - Get.leverages
- New in version 4.4.0.
Default is
TRUE
unless the model has more than 100 thousand cases. The leverages are the diagonal hat values for the linear regression ofy
onbx
. The leverages are needed only for certain model - Exhaustive.tol
- Default
1e-10
. Applies only whenpmethod="exhaustive"
. If the reciprocal of the condition number ofbx
is less thanExhaustive.tol
,earth
forcespmethod="backward"
. See - ...
- Dots are passed on to
earth.fit
.
Value
- An object of class
"earth"
which is a list with the components listed below. Term refers to a term created during the forward pass (each line of the output fromformat.earth
is a term). Term number 1 is always the intercept. rss
Residual sum-of-squares (RSS) of the model (summed over all responses, if y
has multiple columns).rsq
1-rss/tss
. R-Squared of the model (calculated over all responses, and calculated using theweights
argument if it was supplied). A measure of how well the model fits the training data. Note thattss
is the total sum-of-squares,sum((y - mean(y))^2)
.gcv
Generalized Cross Validation (GCV) of the model (summed over all responses). The GCV is calculated using the penalty
argument. For details of the GCV calculation, see equation 30 in Friedman's MARS paper andearth:::get.gcv
.grsq
1-gcv/gcv.null
. An estimate of the predictive power of the model (calculated over all responses, and calculated using theweights
argument if it was supplied).gcv.null
is the GCV of an intercept-only model. SeeCan in the vignette.GRSq
be negative?bx
Matrix of basis functions applied to x
. Each column corresponds to a selected term. Each row corresponds to a row in in the input matrixx
, after takingsubset
. Seemodel.matrix.earth
for an example ofbx
handling. Examplebx
:(Intercept) h(Girth-12.9) h(12.9-Girth) h(Girth-12.9)*h(... [1,] 1 0.0 4.6 0 [2,] 1 0.0 4.3 0 [3,] 1 0.0 4.1 0 ...dirs
Matrix with one row per MARS term, and with with ij-th element equal to 0
if predictor j is not in term i-1
if an expression of the formh(const - xj)
is in term i1
if an expression of the formh(xj - const)
is in term i2
if predictor j should enter term i linearly (either because specified by thelinpreds
argument or because earth discovered that a knot was unnecessary). This matrix includes all terms generated by the forward pass, including those not inselected.terms
. Note that here the terms may not all be in pairs, because although the forward pass add terms as hinged pairs (so both sides of the hinge are available as building blocks for further terms), it also deletes linearly dependent terms before handing control to the pruning pass. Exampledirs
:Girth Height (Intercept) 0 0 #intercept h(12.9-Girth) -1 0 #2nd term uses Girth h(Girth-12.9) 1 0 #3rd term uses Girth h(Girth-12.9)*h(Height-76) 1 1 #4th term uses Girth and Height ...cuts
Matrix with ij-th element equal to the cut point for predictor j in term i. This matrix includes all terms generated by the forward pass, including those not in selected.terms
. Note for programmers: the precedent is to usedirs
for term names etc. and to only usecuts
where cut information needed. Examplecuts
:Girth Height (Intercept) 0 0 #intercept, no cuts h(12.9-Girth) 12.9 0 #2nd term has cut at 12.9 h(Girth-12.9) 12.9 0 #3rd term has cut at 12.9 h(Girth-12.9)*h(Height-76) 12.9 76 #4th term has two cuts ...prune.terms
A matrix specifying which terms appear in which pruning pass subsets. The row index of prune.terms
is the model size. (The model size is the number of terms in the model. The intercept is counted as a term.) Each row is a vector of term numbers for the best model of that size. An element is 0 if the term is not in the model, thusprune.terms
is a lower triangular matrix, with dimensionsnprune x nprune
. The model selected by the pruning pass is at row numberlength(selected.terms)
. Exampleprune.terms
:[1,] 1 0 0 0 0 0 0 #intercept-only model [2,] 1 2 0 0 0 0 0 #best 2 term model uses terms 1,2 [3,] 1 2 4 0 0 0 0 #best 3 term model uses terms 1,2,4 [4,] 1 2 6 9 0 0 0 #and so on ...selected.terms
Vector of term numbers in the selected model. Can be used as a row index vector into cuts
anddirs
. The first elementselected.terms[1]
is always 1, the intercept.fitted.values
Fitted values. A matrix with dimensions nrow(y)
xncol(y)
after factors iny
have been expanded.residuals
Residuals. A matrix with dimensions nrow(y)
xncol(y)
after factors iny
have been expanded.coefficients
Regression coefficients. A matrix with dimensions length(selected.terms)
xncol(y)
after factors iny
have been expanded. Each column holds the least squares coefficients from regressing that column ofy
onbx
. The first row holds the intercept coefficient(s).rss.per.response
A vector of the RSS for each response. Length is the number of responses, i.e., ncol(y)
after factors iny
have been expanded. Therss
component above is equal tosum(rss.per.response)
.rsq.per.response
A vector of the R-Squared for each response (where R-Squared is calculated using the weights
argument if it was supplied). Length is the number of responses.gcv.per.response
A vector of the GCV for each response. Length is the number of responses. The gcv
component above is equal tosum(gcv.per.response)
.grsq.per.response
A vector of the GRSq for each response (calculated using the weights
argument if it was supplied). Length is the number of responses.rss.per.subset
A vector of the RSS for each model subset generated by the pruning pass. Length is nprune
. For multiple responses, the RSS is summed over all responses for each subset. Therss
above isrss.per.subset[length(selected.terms)]
. The RSS of an intercept only-model isrss.per.subset[1]
.gcv.per.subset
A vector of the GCV for each model in prune.terms
. Length isnprune
. For multiple responses, the GCV is summed over all responses for each subset. Thegcv
above isgcv.per.subset[length(selected.terms)]
. The GCV of an intercept-only model isgcv.per.subset[1]
.leverages
Diagonal of the hat matrix (from the linear regression of the response on bx
).penalty,nk,thresh
Copies of the corresponding arguments to earth
.pmethod,nprune
Copies of the corresponding arguments to earth
.weights,wp
Copies of the corresponding arguments to earth
.termcond
Reason the forward pass terminated (an integer). call
The call used to invoke earth
.terms
Model frame terms. This component exists only if the model was built using earth.formula
.namesx
Column names of x
, generated internally byearth
when necessary so each column ofx
has a name. Used, for example, bypredict.earth
to name columns if necessary.namesx.org
Original column names of x
.levels
Levels of y
ify
is afactor
c(FALSE,TRUE)
ify
islogical
Else NULL The following fields appear only ifearth
's argumentkeepxy
isTRUE
.x
,y
,data
,subset
Copies of the corresponding arguments to earth
. Only exist ifkeepxy=TRUE
. The following fields appear only ifearth
'sglm
argument is used.glm.list
List of GLM models. Each element is the value returned by earth
's internal call toglm
for each response. Thus if there is a single response (or a single binomial pair, seeBinomial pairs in the vignette) this will be a one element list and you access the GLM model withearth.mod$glm.list[[1]]
.glm.coefficients
GLM regression coefficients. Analogous to the coefficients
field described above but for the GLM model(s). A matrix with dimensionslength(selected.terms)
xncol(y)
after factors iny
have been expanded. Each column holds the coefficients from the GLM regression of that column ofy
onbx
. This duplicates, for convenience, information buried inglm.list
.glm.bpairs
NULL unless there are paired binomial columns. A logical vector, derived internally by earth
, or a copy thebpairs
specified by the user in theglm
list. SeeBinomial pairs in the vignette. The following fields appear only if thenfold
argument is greater than 1.cv.list
List of earth
models, one model for each fold (ncross * nfold
models). The fold models have two extra fields,icross
(an integer from1
toncross
) andifold
(an integer from1
tonfold
). To save memory, lengthy fields in the fold models are removed unless you usekeepxy=TRUE
. Thelengthy fields are$bx
,$fitted.values
, and$residuals
.cv.nterms
Vector of length ncross * nfold + 1
. Number of MARS terms in the model generated at each cross-validation fold, with the final element being the mean of these.cv.nvars
Vector of length ncross * nfold + 1
. Number of predictors in the model generated at each cross-validation fold, with the final element being the mean of these.cv.groups
Specifies which cases went into which folds. Matrix with two columns and number of rows equal to the the number of cases nrow(x)
Elements of the first column specify the cross-validation number,1:ncross
. Elements of the second column specify the fold number,1:nfold
.cv.rsq.tab
Matrix with ncross * nfold + 1
rows andnresponse+1
columns, wherenresponse
is the number of responses, i.e.,ncol(y)
after factors iny
have been expanded. The firstnresponse
elements of a row are thecv.rsq
's on the out-of-fold data for each response of the model generated at that row's fold. (Acv.rsq
is calculated from predictions on the out-of-fold data using the best model built from the in-fold data; wherebest means the model was selected using the in-fold GCV. The R-Squareds are calculated using theweights
argument if it was supplied. The final column holds the row mean (a weighted mean ifwp
if specified)). The final row holds the column means. The values in this final row is the meancv.rsq
printed bysummary.earth
. Example for a single response model (where themean
column is redundant but included for uniformity with multiple response models): y mean fold1 0.909 0.909 fold2 0.869 0.869 fold3 0.952 0.952 fold4 0.157 0.157 fold5 0.961 0.961 mean 0.769 0.769 Example for a multiple response model: y1 y2 y3 mean fold1 0.915 0.951 0.944 0.937 fold2 0.962 0.970 0.970 0.968 fold3 0.914 0.940 0.942 0.932 fold4 0.907 0.929 0.925 0.920 fold5 0.947 0.987 0.979 0.971 mean 0.929 0.955 0.952 0.946cv.class.rate.tab
Like cv.rsq.tab
but is the classification rate at each fold i.e. the fraction of classes correctly predicted. Models with discrete response only. Calculated withthresh=.5
for binary responses. For responses with more than two levels, the final row is the overall classification rate. The other rows are the classification rates for each level (the level versus not-the-level), which are usually higher than the overall classification rate (predicting the level versus not-the-level is easier than correctly predicting one of many levels). Theweights
argument is ignored for all cross-validation stats except R-Squareds.cv.maxerr.tab
Like cv.rsq.tab
but is theMaxErr
at each fold. This is the signed max absolute value at each fold. Results are aggregated for the final column and final row using the signed max absolute value. The signed max absolute value is defined as the maximum of the absolute difference between the predicted and observed response values, multiplied by-1
if the sign of that difference is negative.cv.auc.tab
Like cv.rsq.tab
but is theAUC
at each fold. Binomial models only.cv.cor.tab
Like cv.rsq.tab
but is thecor
at each fold. Poisson models only.cv.deviance.tab
Like cv.rsq.tab
but is theMeanDev
at each fold. Binomial models only.cv.calib.int.tab
Like cv.rsq.tab
but is theCalibInt
at each fold. Binomial models only.cv.calib.slope.tab
Like cv.rsq.tab
but is theCalibSlope
at each fold. Binomial models only.cv.oof.rsq.tab
Generated only if keepxy=TRUE
orpmethod="cv"
. A matrix withncross * nfold + 1
rows andmax.nterms
columns, Each element holds an out-of-fold RSq (oof.rsq
), calculated from predictions from the out-of-fold observations using the model built with the in-fold data. The final row is the mean over all folds. The R-Squareds are calculated using theweights
argument if it was supplied.cv.infold.rsq.tab
Generated only if keepxy=TRUE
. Likecv.oof.rsq.tab
but from predictions made on the in-fold observations.cv.oof.fit.tab
Generated only if the varmod.method
argument is used. Predicted values on the out-of-fold data. Dataframe withnrow(data)
rows andncross
columns. The following field appears only if thevarmod.method
is specified.varmod
An object of class "varmod"
. See thevarmod
help page for a description. Only appears if thevarmod.method
argument is used.
concept
- regression
- mars
- Friedman
References
The primary references are the Friedman papers.
Readers may find the MARS section in Hastie, Tibshirani,
and Friedman a more accessible introduction.
The Wikipedia article is recommended for an elementary introduction.
Faraway takes a hands-on approach,
using the ozone
data to compare mda::mars
with other techniques.
(If you use Faraway's examples with earth
instead of mars
, use $bx
instead of $x
, and check out the book's errata.)
Friedman and Silverman is recommended background reading for the MARS paper.
Earth's pruning pass uses code from the leaps
package
which is based on techniques in Miller.
Faraway (2005) Extending the Linear Model with R
Friedman (1991) Multivariate Adaptive Regression Splines (with discussion)
Annals of Statistics 19/1, 1--141
Friedman and Silverman (1989)
Flexible Parsimonious Smoothing and Additive Modeling
Technometrics, Vol. 31, No. 1.
Hastie, Tibshirani, and Friedman (2009) The Elements of Statistical Learning (2nd ed.)
Leathwick, J.R., Rowe, D., Richardson, J., Elith, J., & Hastie, T. (2005)
Using multivariate adaptive regression splines to predict the distributions
of New Zealand's freshwater diadromous fish Freshwater Biology, 50, 2034-2052
Miller, Alan (1990, 2nd ed. 2002) Subset Selection in Regression
Wikipedia article on MARS
See Also
Start with summary.earth
, plot.earth
,
evimp
, and plotmo
.
Please see the main package vignette
The vignette
earth
models.
Examples
earth.mod <- earth(Volume ~ ., data = trees)
plotmo(earth.mod)
summary(earth.mod, digits = 2, style = "pmax")