See the package vignette
## S3 method for class 'formula':
earth(formula = stop("no 'formula' arg"), data = NULL,
weights = NULL, wp = NULL, subset = NULL,
na.action = na.fail, keepxy = FALSE, trace = 0, glm = NULL,
ncross=1, nfold=0, stratify=TRUE,
varmod.method = "none", varmod.exponent = 1,
varmod.conv = 1, varmod.clamp = .1, varmod.minspan = -5,
Scale.y = (NCOL(y)==1), ...)## S3 method for class 'default':
earth(x = stop("no 'x' arg"), y = stop("no 'y' arg"),
weights = NULL, wp = NULL, subset = NULL,
na.action = na.fail, keepxy = FALSE, trace = 0, glm = NULL,
ncross=1, nfold=0, stratify=TRUE,
varmod.method = "none", varmod.exponent = 1,
varmod.conv = 1, varmod.clamp = .1, varmod.minspan = -5,
Scale.y = (NCOL(y)==1), ...)
## S3 method for class 'fit':
earth(x = stop("no 'x' arg"), y = stop("no 'y' arg"),
weights = NULL, wp = NULL, subset = NULL,
na.action = na.fail, keepxy = FALSE, trace = 0, glm = NULL,
degree = 1, penalty = if(degree > 1) 3 else 2,
nk = min(200, max(20, 2 * ncol(x))) + 1,
thresh = 0.001, minspan = 0, endspan = 0,
newvar.penalty = 0, fast.k = 20, fast.beta = 1,
linpreds = FALSE, allowed = NULL,
pmethod = c("backward", "none", "exhaustive", "forward", "seqrep"),
nprune = NULL,
Object = NULL, Get.crit = get.gcv, Eval.model.subsets = eval.model.subsets,
Scale.y = (NCOL(y)==1), Force.xtx.prune = FALSE, Force.weights = FALSE,
Use.beta.cache = TRUE, Exhaustive.tol = 1e-10, ...)
formula
.x
to use.
Default is NULL, meaning all.trace=-1
.
The current implemwp
must have an element for each column of
y
(after factors
in y
, ina.fail
, and only na.fail
is supported.TRUE
to retain the following in the returned value: x
and y
(or data
),
subset
, and weights
.
Default is FALSE
.
The function
earth
's execution. Default is 0. Values:
0 no tracing
0.3 variance model (the varmod.method
arg)
0.5 cross validation (the nfold
arg)
1 overview
2 forward pass
3 pruning
4 model mats, pruning details
5 internif(degree>1) 3 else 2
.
A value of 0 penalizes only terms, not knots.
The value -1 is treated specially to mean no penalty, so GCV=RSS/n.
Simulation studies have snk
because of othresh
.
See minspan=0
is treated specially and
means calculate the minspan
internally as per
Friendspan=0
is treated specially and
means calculate the minspan
internally as per
the MARS paper equation 45 with $alpha$ = 0.05.
Selm
.
The default is FALSE
, meaning all predictors enter
in the standard MARS fashion, i.e., in hinge functions.
This doearth
calls the allowed
function
before considering a term for inclusion;backward none exhaustive forward seqrep
.
Default is "backward"
.
Use none
to retain all the terms created by the forward pass.
If y
has multiple columns, then only backw
nfold>1
.
Number of cross-validations. Each cross-validation has nfold
folds.
Default 1.earth
first builds a standard model as usual with all the data.
It then builds nfold
cross-validated models,
measuring R-nfold>1
.
Default is TRUE
.
Stratify the cross-validation samples so that
an approximately equal number of cases with a non-zero response
occur in each cross validation subset.
So if the resvarmod
and the vignette
trace=.3
to trace construction of tvarmod.method
.
Default is 1.
For example, with varmod.method="lm"
,
if you expect the standard
deviance of the residuals to increase livarmod.conv
percent.
Default is 1 percmin.sd
.
This prevents negative or absurdly small estimated standard deviations.
Clamping varmod.method="earth"
or "x.earth"
.
This is the minspan
used in the internal call to earth
when creating the variance model (not the main earth
model).
Default is -5, i.e.,update.earth
.Scale
y
in the forward pass for better numeric stability.
Scaling here means subtract the mean and divide by the standard deviation.
Default is NCOL(y)==1
,
i.e., scale y<
FALSE
.
This argument pertains to subset evaluation in the pruning pass.
By default,
if y
has a single column then earth
calls the leaps
routines;
if FALSE
.
For testing the weights
argument.
Force use of the code for handling weights in the earth
code,
even if weights=NULL
or all the weights are the same.TRUE
.
Using the nk * nk * ncol(x) * sizeof(double)
bytes.
Set Use.beta.cache=FALSE
1e-10
.
Applies only when pmethod="exhaustive"
.
If the reciprocal of the condition number of bx
is less than Exhaustive.tol
, earth
forces pmethod="backward"
.
See earth.fit
."earth"
which is a list with the components
listed below.
Term refers to a term created during the forward pass
(each line of the output from format.earth
is a term).
Term number 1 is always the intercept.rss
y
has multiple columns).rsq
1-rss/tss
.
R-Squared of the model (calculated over all responses).
A measure of how well the model fits the training data.
Note that tss
is the total sum-of-squares, sum((y - mean(y))^2)
.gcv
penalty
argument.
For details of the GCV calculation, see
equation 30 in Friedman's MARS paper and earth:::get.gcv
.grsq
1-gcv/gcv.null
.
An estimate of the predictive power of the model (calculated over all responses,
gcv.null
is the GCV of an intercept-only model).
See GRSq
be negative?bx
x
.
Each column corresponds to a selected term.
Each row corresponds to a row in in the input matrix x
,
after taking subset
.
See model.matrix.earth
for an example of bx
handling.
Example bx
:(Intercept) h(Girth-12.9) h(12.9-Girth) h(Girth-12.9)*h(...
[1,] 1 0.0 4.6 0
[2,] 1 0.0 4.3 0
[3,] 1 0.0 4.1 0
...dirs
0
if predictor j is not in term i
-1
if an expression of the form h(const - xj)
is in term i
1
if an expression of the form h(xj - const)
is in term i
2
if predictor j should enter term i linearly
(either because specified by the linpreds
argument or because earth
discovered that a knot was unnecessary).
This matrix includes all terms generated by the forward pass,
including those not in selected.terms
.
Note that here the terms may not all be in pairs, because
although the forward pass add terms as hinged pairs (so both sides of
the hinge are available as building blocks for further terms), it also
deletes linearly dependent terms before handing control to the pruning pass.
Example dirs
:Girth Height
(Intercept) 0 0 #intercept
h(12.9-Girth) -1 0 #2nd term uses Girth
h(Girth-12.9) 1 0 #3rd term uses Girth
h(Girth-12.9)*h(Height-76) 1 1 #4th term uses Girth and Height
...cuts
selected.terms
.
Note for programmers: the precedent is to use dirs
for term names etc. and to only use cuts
where cut information needed.
Example cuts
:Girth Height
(Intercept) 0 0 #intercept, no cuts
h(12.9-Girth) 12.9 0 #2nd term has cut at 12.9
h(Girth-12.9) 12.9 0 #3rd term has cut at 12.9
h(Girth-12.9)*h(Height-76) 12.9 76 #4th term has two cuts
...selected.terms
cuts
and dirs
.
The first element selected.terms[1]
is always 1, the intercept.prune.terms
prune.terms
is the model size.
(The model size is the number of terms in the model.
The intercept is counted as a term.)
Each row is a vector of term numbers for the best model of that size.
An element is 0 if the term is not in the model, thus prune.terms
is a
lower triangular matrix, with dimensions nprune x nprune
.
The model selected by the pruning pass is at row number length(selected.terms)
.
Example prune.terms
:[1,] 1 0 0 0 0 0 0 #intercept-only model
[2,] 1 2 0 0 0 0 0 #best 2 term model uses terms 1,2
[3,] 1 2 4 0 0 0 0 #best 3 term model uses terms 1,2,4
[4,] 1 2 6 9 0 0 0 #and so on
...rss.per.response
ncol(y)
after factors in y
have been expanded.
The rss
component above is equal to sum(rss.per.response)
.rsq.per.response
gcv.per.response
gcv
component above is equal to sum(gcv.per.response)
.grsq.per.response
rss.per.subset
nprune
.
For multiple responses, the RSS is summed over all responses for each subset.
The rss
above is
rss.per.subset[length(selected.terms)]
.
The RSS of an intercept only-model is rss.per.subset[1]
.gcv.per.subset
prune.terms
.
Length is nprune
.
For multiple responses, the GCV is summed over all responses for each subset.
The gcv
above is gcv.per.subset[length(selected.terms)]
.
The GCV of an intercept-only model is gcv.per.subset[1]
.fitted.values
nrow(y)
x ncol(y)
after factors in y
have been expanded.residuals
nrow(y)
x ncol(y)
after factors in y
have been expanded.coefficients
length(selected.terms)
x ncol(y)
after factors in y
have been expanded.
Each column holds the least squares coefficients from regressing that
column of y
on bx
.
The first row holds the intercept coefficient(s).penalty,nk,thresh
earth
.weights,wp
earth
.termcond
call
earth
.terms
earth.formula
.namesx
x
, generated internally by earth
when necessary
so each column of x
has a name.
Used, for example, by predict.earth
to name columns if necessary.namesx.org
x
.levels
y
if y
is a factor
c(FALSE,TRUE)
if y
is logical
Else NULL
The following fields appear only if earth
's argument keepxy
is TRUE
.x
y
data
subset
earth
.
Only exist if keepxy=TRUE
.
The following fields appear only if earth
's glm
argument is used.
}glm.list
earth
's
internal call to glm
for each response.
Thus if there is a single response (or a single binomial pair, see
earth.mod$glm.list[[1]]
.glm.coefficients
coefficients
field described above but for the GLM model(s).
A matrix with dimensions length(selected.terms)
x ncol(y)
after factors in y
have been expanded.
Each column holds the coefficients from the GLM regression of that
column of y
on bx
.
This duplicates, for convenience, information buried in glm.list
.glm.bpairs
earth
, or a copy
the bpairs
specified by the user in the glm
list.
See nfold
argument is greater than 1.cv.list
earth
models, one model for each fold (ncross * nfold
models).
To save memory, lengthy fields
in the fold models are removed unless you use keepxy=TRUE
.
The $bx
, $fitted.values
, and $residuals
.
The fold models have two extra fields,
icross
(the cross-validation index, 1:ncross
)
and ifold
(the fold index, 1:nfold
).cv.nterms
ncross * nfold + 1
.
Number of MARS terms in the model generated at each cross-validation fold,
with the final element being the mean of these.cv.nvars
ncross * nfold + 1
.
Number of predictors in the model generated at each cross-validation fold,
with the final element being the mean of these.cv.groups
nrow(x)
Elements of the first column specify the cross-validation number, 1:ncross
.
Elements of the second column specify the fold number, 1:nfold
.cv.rsq.tab
ncross * nfold + 1
rows and nresponse+1
columns,
where nresponse
is the number of responses,
i.e., ncol(y)
after factors in y
have been expanded.
The first nresponse
elements of a row are the cv.rsq
's on
the out-of-fold data for each response of the model generated at that row's fold.
(A cv.rsq
is calculated from predictions on the out-of-fold data
using the best model built from the in-fold data;
where wp
if specified).
The final row holds the column means.
The values in this final row is the mean cv.rsq
printed by summary.earth
.
Example for a single response model (where the mean
column
is redundant but included for uniformity with multiple response models):
y mean
fold1 0.909 0.909
fold2 0.869 0.869
fold3 0.952 0.952
fold4 0.157 0.157
fold5 0.961 0.961
mean 0.769 0.769
Example for a multiple response model:
y1 y2 y3 mean
fold1 0.915 0.951 0.944 0.937
fold2 0.962 0.970 0.970 0.968
fold3 0.914 0.940 0.942 0.932
fold4 0.907 0.929 0.925 0.920
fold5 0.947 0.987 0.979 0.971
mean 0.929 0.955 0.952 0.946cv.oof.rsq.tab
ncross * nfold + 1
rows and max.nterms
columns,
Only calculated and kept if keepxy=TRUE
.
Each element holds an out-of-fold RSq (oof.rsq
),
calculated from predictions from the out-of-fold observations using
the model built with the in-fold data. The final row is the mean over
all folds.cv.infold.rsq.tab
cv.oof.rsq.tab
but from predictions made on the in-fold observations.cv.class.rate.tab
cv.rsq.tab
but is the classification rate at each fold
i.e. the fraction of classes correctly predicted.
Models with discrete response only.
Calculated with thresh=.5
for binary responses.
For responses with more than two
levels, the final row is the overall classification rate. The other
rows are the classification rates for each level (the level versus
versus not-the-level), which are usually higher than the overall
classification rate (predicting the level versus not-the-level is
easier than correctly predicting one of many levels).cv.maxerr.tab
cv.rsq.tab
but is the MaxErr
at each fold.
This is the signed max absolute value at each fold.
Results are aggregated for the final column and final row
using the signed max absolute value.
The signed max absolute value is defined
as the maximum of the absolute difference
between the predicted and observed response values, multiplied
by -1
if the sign of that difference is negative.cv.auc.tab
cv.rsq.tab
but is the AUC
at each fold.
Binomial models only.cv.cor.tab
cv.rsq.tab
but is the cor
at each fold.
Poisson models only.cv.deviance.tab
cv.rsq.tab
but is the MeanDev
at each fold.
Binomial models only.cv.calib.int.tab
cv.rsq.tab
but is the CalibInt
at each fold.
Binomial models only.cv.calib.slope.tab
cv.rsq.tab
but is the CalibSlope
at each fold.
Binomial models only.cv.oof.fit.tab
varmod.method
argument is used.
Predicted values on the out-of-fold data.
Dataframe with nrow(data)
rows and ncross
columns.
The following field appears only if the varmod.methis
is specified.varmod
"varmod"
.
See the varmod
help page for a description.
Only appears if the varmod.method
argument is used.ozone
data to compare mda::mars
with other techniques.
(If you use Faraway's examples with earth
instead of mars
, use $bx
instead of $x
, and check out the book's errata.)
Friedman and Silverman is recommended background reading for the MARS paper.
Earth's pruning pass uses code from the leaps
package
which is based on techniques in Miller. Faraway (2005) Extending the Linear Model with R
Friedman (1991) Multivariate Adaptive Regression Splines (with discussion)
Annals of Statistics 19/1, 1--141
Friedman and Silverman (1989)
Flexible Parsimonious Smoothing and Additive Modeling
Technometrics, Vol. 31, No. 1.
Hastie, Tibshirani, and Friedman (2009) The Elements of Statistical Learning (2nd ed.)
Leathwick, J.R., Rowe, D., Richardson, J., Elith, J., & Hastie, T. (2005)
Using multivariate adaptive regression splines to predict the distributions
of New Zealand's freshwater diadromous fish Freshwater Biology, 50, 2034-2052
Miller, Alan (1990, 2nd ed. 2002) Subset Selection in Regression
Wikipedia article on MARS
summary.earth
, plot.earth
,
evimp
, and plotmo
.Please see the main package vignette
The vignette
earth
models.
earth.mod <- earth(Volume ~ ., data = trees)
plotmo(earth.mod)
summary(earth.mod, digits = 2, style = "pmax")
Run the code above in your browser using DataLab