## S3 method for class 'formula':
earth(formula = stop("no 'formula' arg"),
data, weights = NULL, wp = NULL, scale.y = (NCOL(y)==1), subset = NULL,
na.action = na.fail, glm = NULL, trace = 0,
keepxy = FALSE, nfold=0, stratify=TRUE, ...)## S3 method for class 'default':
earth(x = stop("no 'x' arg"), y = stop("no 'y' arg"),
weights = NULL, wp = NULL, scale.y = (NCOL(y)==1), subset = NULL,
na.action = na.fail, glm = NULL, trace = 0,
keepxy = FALSE, nfold=0, stratify=TRUE, ...)
## S3 method for class 'fit':
earth(x = stop("no 'x' arg"), y = stop("no 'y' arg"),
weights = NULL, wp = NULL, scale.y = (NCOL(y)==1), subset = NULL,
na.action = na.fail, glm = NULL, trace = 0,
penalty = if(degree > 1) 3 else 2, nk = max(21, 2 * ncol(x) + 1),
degree = 1, linpreds = FALSE, allowed = NULL, thresh = 0.001,
minspan = 0, newvar.penalty = 0, fast.k = 20, fast.beta = 1,
pmethod = "backward", nprune = NULL, Object = NULL,
Get.crit = get.gcv, Eval.model.subsets = eval.model.subsets,
Print.pruning.pass = print.pruning.pass, Force.xtx.prune = FALSE,
Use.beta.cache = TRUE, ...)
formula
.x
to use.
Default is NULL, meaning all.wp
must have an element for each column of
y
(after factors
, if any, hScale
y
in the forward pass for better numeric stability
[added Jan 2009].
Scaling here means subtract the mean and divide by the standard deviation.
Default is NCO
na.fail
, and only na.fail
is supported.
(Why? Because adding support to earth
for other NA actions is easy, but making
sure that is handled correctly internally in predict
glm
.
See the "Generalized linear models" section below. Example:
earth(y~x, glm=list(family=binomial))
0 none .5 cross validation 1 overview 2 forward pass 3 pruning 4 model mats, memory use, more pruning, etc. 5 ...
x
and y
(or data
),
subset
, and weights
.
Default is FALSE.
The function up
if(degree>1) 3 else 2
.
A value of 0 penalizes only terms, not knots.
The value -1 is a special case, meaning no penalty, so GCV=RSS/n.
Theory suggests values in tmax(21,2*NCOL(x)+1)
.
The number of terms created by the forward pass will be
less than nk
if a forward stopping condition is realm
.
The default is FALSE, meaning all predictors enter
in the standard MARS fashion, i.e., in hinge functions.
A predictor's index in allowed
function just before adding a term
in the forward pass.
If allowed
reminspan
internally as per
Frnfold
cross validated models,
measuring R-Squarenfold>1
.
Default is TRUE.
Stratify the cross validation samples so that, for each column of the response
y
(after factors have been expanded),
an approximately equal number of cases with a non-"backward"
.
One of: backward none exhaustive forward seqrep
.
Use none
to retain all the terms created by the forward pass.
If y
has multiple columns, then oupdate.earth
.y
has a single column then earth
calls the leaps
routines;
if <nk * nk * ncol(x) * sizeof(double)
bytes.
Set Use.beta.cache=FALSE
to saveearth.fit
.format.earth
is a term).
Term number 1 is always the intercept.y
has multiple columns).1-rss/rss.null
.
R-Squared of the model (calculated over all responses).
A measure of how well the model fits the training data.
Note that rss.null
is sum((y - mean(y))^2)
.penalty
argument.
For details of the GCV calculation, see
equation 30 in Friedman's MARS paper and earth:::get.gcv
.1-gcv/gcv.null
.
An estimate of the predictive power of the model (calculated over all responses).
Unlike rsq
, in MARS models grsq
can be negative.
A negative grsq
would indicate
a severely over parameterized model --- a model that
would not generalize well
even though it may be a good fit to the training data.
Watch the GRSq take a nose dive in this example:
earth(mpg~., data=mtcars, pmethod="none", trace=3)
x
.
Each column corresponds to a selected term.
Each row corresponds to a row in in the input matrix x
,
after taking subset
.
See model.matrix.earth
for an example of bx
handling.
Example bx
:(Intercept) h(Girth-12.9) h(12.9-Girth) h(Girth-12.9)*h(...
[1,] 1 0.0 4.6 0
[2,] 1 0.0 4.3 0
[3,] 1 0.0 4.1 0
...0
if predictor j is not in term i
-1
if an expression of the form pmax(const - xj)
is in term i
1
if an expression of the form pmax(xj - const)
is in term i
2
if predictor j enters term i linearly.
This matrix includes all terms generated by the forward.pass,
including those not in selected.terms
.
Note that the terms may not be in pairs, because the forward pass
deletes linearly dependent terms before handing control to the pruning pass.
Example dirs
:Girth Height
(Intercept) 0 0 #intercept
h(Girth-12.9) 1 0 #2nd term uses Girth
h(12.9-Girth) -1 0 #3rd term uses Girth
h(Girth-12.9)*h(Height-76) 1 1 #4th term uses Girth and Height
...selected.terms
.
Note that the terms may not be in pairs, because the forward pass
deletes linearly dependent terms before handing control to the pruning pass.
Note for programmers: the precedent is to use dirs
for term names etc. and to only use cuts
where cut information needed.
Example cuts
:Girth Height
(Intercept) 0 0 #intercept, no cuts
h(Girth-12.9) 12.9 0 #2nd term has cut at 12.9
h(12.9-Girth) 12.9 0 #3rd term has cut at 12.9
h(Girth-12.9)*h(Height-76) 12.9 76 #4th term has two cuts
...cuts
and dirs
.
The first element selected.terms[1]
is always 1, the intercept.prune.terms
is the model size.
(The model size is the number of terms in the model.
The intercept is considered to be a term.)
Each row is a vector of term numbers for the best model of that size.
An element is 0 if the term is not in the model, thus prune.terms
is a
lower triangular matrix, with dimensions nprune x nprune
.
The model selected by the pruning pass is at row number length(selected.terms)
.
Example prune.terms
:[1,] 1 0 0 0 0 0 0 #intercept-only model
[2,] 1 2 0 0 0 0 0 #best 2 term model uses terms 1,2
[3,] 1 2 4 0 0 0 0 #best 3 term model uses terms 1,2,4
[4,] 1 2 6 9 0 0 0 #and so on
...ncol(y)
after factors in y
have been expanded.
The rss
component above is equal to sum(rss.per.response)
.gcv
component above is equal to sum(gcv.per.response)
.nprune
.
For multiple responses, the RSS is summed over all responses for each subset.
The null RSS (i.e., the RSS of an intercept only-model) is rss.per.subset[1]
.
The rss
above is
rss.per.subset[length(selected.terms)]
.prune.terms
.
Length is is nprune
.
For multiple responses, the GCV is summed over all responses for each subset.
The null GCV (i.e., the GCV of an intercept-only model) is gcv.per.subset[1]
.
The gcv
above is gcv.per.subset[length(selected.terms)]
.nrow(y)
x ncol(y)
after factors in y
have been expanded.nrow(y)
x ncol(y)
after factors in y
have been expanded.length(selected.terms)
x ncol(y)
after factors in y
have been expanded.
Each column holds the least squares coefficients from regressing that
column of y
on bx
.
The first row holds the intercept coefficient(s).earth
's penalty
argument.earth
.earth.formula
.x
, generated internally by earth when necessary
so each column of x
has a name.
Used, for example, by predict.earth
to name columns if necessary.x
.y
if y
is a factor
c(FALSE,TRUE)
if y
is logical
Else NULLwp
argument to earth.
The following fields appear only if earth's argument keepxy
is TRUE.keepxy=TRUE
.
The following fields appear only if earth's glm
argument is used.glm
for each response.
Thus if there is a single response (or a single binomial pair, see
the "Binomial pairs" section below)
this will be a one element list and you access the GLM model with
my.earth.model$glm.list[[1]].coefficients
field described above but for the GLM model(s).
A matrix with dimensions length(selected.terms)
x ncol(y)
after factors in y
have been expanded.
Each column holds the coefficients from the GLM regression of that
column of y
on bx
.
This duplicates, for convenience, information in glm.list
.bpairs
specified by the user in the glm
list.
See the "Binomial pairs" section below.
The following fields appear only if the nfold
argument is greater than 1.nfold+1
rows and nresponse+1
columns,
where nresponse
is the number of responses,
i.e., ncol(y)
after factors in y
have been expanded.
The first nresponse
elements of a row are the RSq's on
the left-out data for each response of the model generated at that row's fold.
The final column holds the row mean (a weighted mean if wp
if specified).
The final row of the table holds the column means.
The values in this row are the CV-RSqs printed by summary.earth.
Example for a single response model:
y mean
fold 1 0.909 0.909
fold 2 0.869 0.869
fold 3 0.952 0.952
fold 4 0.157 0.157
fold 5 0.961 0.961
mean 0.769 0.769
Example for a multiple response model:
y1 y2 y3 mean
fold 1 0.915 0.951 0.944 0.937
fold 2 0.962 0.970 0.970 0.968
fold 3 0.914 0.940 0.942 0.932
fold 4 0.907 0.929 0.925 0.920
fold 5 0.947 0.987 0.979 0.971
mean 0.929 0.955 0.952 0.946cv.rsq.tab
but is the MaxErr
at each fold.
This is the signed max absolute value at each fold, where
the signed max absolute value is the maximum of the absolute difference
between the predicted and observed response values, multiplied
by -1
if the sign of the difference is negative.
Also, results are aggregrated using the signed max absolute value instead of the mean.cv.rsq.tab
but is the MeanDev
at each fold.
Binomial models only.cv.rsq.tab
but is the CalibInt
at each fold.
Binomial models only.cv.rsq.tab
but is the CalibSlope
at each fold.
Binomial models only.cv.rsq.tab
but is the AUC
at each fold.
Binomial models only.cv.rsq.tab
but is the cor
at each fold.
Poisson models only.nfold
+1.
Number of MARS terms in the model generated at each cross validation fold,
with the final element being the mean of these.nfold
+1.
Number of predictors in the model generated at each cross validation fold,
with the final element being the mean of these.1:nfold
.cv.rsq
and cv.rsq.per.response
(and, if keepxy
is set, also cv.test.y
and cv.test.fitted.values
).
To save memory, lengthy fields
in the fold models are removed, unless you use the keepxy
argument.
The "lengthy fields" are $bx
, $fitted.values
, and $residuals
.ozone
data to compare mda::mars
with other techniques.
(If you use Faraway's examples with earth
instead of mars
, use $bx
instead of $x
.)
Friedman and Silverman is recommended background reading for the MARS paper.
Earth's pruning pass uses the leaps
package which is based on
techniques in Miller. Faraway (2005) Extending the Linear Model with R
Friedman (1991) Multivariate Adaptive Regression Splines (with discussion)
Annals of Statistics 19/1, 1--141
Friedman (1993) Fast MARS
Stanford University Department of Statistics, Technical Report 110
Friedman and Silverman (1989)
Flexible Parsimonious Smoothing and Additive Modeling
Technometrics, Vol. 31, No. 1.
Hastie, Tibshirani, and Friedman (2001) The Elements of Statistical Learning
Leathwick, J.R., Rowe, D., Richardson, J., Elith, J., & Hastie, T. (2005)
Using multivariate adaptive regression splines to predict the distributions
of New Zealand's freshwater diadromous fish Freshwater Biology, 50, 2034-2052
Miller, Alan (1990, 2nd ed. 2002) Subset Selection in Regression
Wikipedia article on MARS
summary.earth
, plot.earth
,
plotmo
, and evimp
. etitanic
,
evimp
,
format.earth
,
mars.to.earth
,
model.matrix.earth
,
ozone1
,
plot.earth.models
,
plot.earth
,
plotd
,
plotmo
,
predict.earth
,
residuals.earth
,
summary.earth
,
update.earth
a <- earth(Volume ~ ., data = trees)
summary(a, digits = 2, style = "pmax")
# yields:
# Call: earth(formula=Volume~., data=trees)
#
# Volume =
# 23
# + 5.7 * pmax(0, Girth - 13)
# - 2.9 * pmax(0, 13 - Girth)
# + 0.72 * pmax(0, Height - 76)
#
# Selected 4 of 5 terms, and 2 of 2 predictors
# Estimated importance: Girth Height
# Number of terms at each degree of interaction: 1 3 (additive model)
# GCV 11 RSS 213 GRSq 0.96 RSq 0.97
Run the code above in your browser using DataLab