See the package vignette “Notes on the earth package”.
earth(formula = stop("no 'formula' argument"), data = NULL,
      weights = NULL, wp = NULL, subset = NULL, na.action = na.fail,
      pmethod = c("backward", "none", "exhaustive", "forward", "seqrep", "cv"),
      keepxy = FALSE, trace = 0, glm = NULL, degree = 1, nprune = NULL,
      ncross = 1, nfold = 0, stratify = TRUE,
      varmod.method = "none", varmod.exponent = 1, varmod.conv = 1,
      varmod.clamp = .1, varmod.minspan = -3,
      Scale.y = (NCOL(y)==1), ...)
earth(x = stop("no 'x' argument"), y = stop("no 'y' argument"),
      weights = NULL, wp = NULL, subset = NULL, na.action = na.fail,
      pmethod = c("backward", "none", "exhaustive", "forward", "seqrep", "cv"),
      keepxy = FALSE, trace = 0, glm = NULL, degree = 1, nprune = NULL,
      ncross = 1, nfold = 0, stratify = TRUE,
      varmod.method = "none", varmod.exponent = 1, varmod.conv = 1,
      varmod.clamp = .1, varmod.minspan = -3,
      Scale.y = (NCOL(y)==1), ...)
earth(x = stop("no 'x' argument"), y = stop("no 'y' argument"),
      weights = NULL, wp = NULL, subset = NULL, na.action = na.fail,
      pmethod = c("backward", "none", "exhaustive", "forward", "seqrep", "cv"),
      keepxy = FALSE, trace = 0, glm = NULL, degree = 1,
      penalty = if(degree > 1) 3 else 2,
      nk = min(200, max(20, 2 * ncol(x))) + 1,
      thresh = 0.001, minspan = 0, endspan = 0, newvar.penalty = 0,
      fast.k = 20, fast.beta = 1, linpreds = FALSE, allowed = NULL,
      nprune = NULL, Object = NULL, Scale.y = (NCOL(y)==1),
      Adjust.endspan = 2, Force.weights = FALSE, Use.beta.cache = TRUE,
      Force.xtx.prune = FALSE, Get.leverages = NROW(x) < 1e5,
      Exhaustive.tol = 1e-10, ...)
formula
.
x
to use.
Default is NULL, meaning all.
weights
must have length equal to nrow(x)
before applying subset
.
Zero weights are converted to a very small nonzero value.
wp
must have an element for each column of
y
(after factors
in
y
, if any, have been expanded).
Zero weights are converted to a very small nonzero value.
na.fail
, and only na.fail
is supported.
FALSE
.
Set to TRUE
to retain the following in the returned value: x
and y
(or data
),
subset
, and weights
.
The function update.earth
and friends will use these
if present instead of searching for them
in the environment at the time update.earth
is invoked.
When the nfold
argument is used with keepxy=TRUE
,
earth
keeps more data and calls predict.earth
multiple
times to generate cv.oof.rsq.tab
and cv.infold.rsq.tab
(see the cv.
arguments in the “Value” section
below).
It therefore makes cross-validation significantly slower.
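A minimal sketch of the keepxy workflow described above, assuming the earth package is installed. The object names fit and fit2 are illustrative only.

```r
library(earth)

# Retain the data in the model object so that update() does not have to
# search the calling environment for it later.
fit <- earth(Volume ~ ., data = trees, keepxy = TRUE)

# Refit with interactions allowed, reusing the data stored in 'fit'.
fit2 <- update(fit, degree = 2)
print(fit2)
```

Without keepxy=TRUE the same update call would have to find trees in the environment at the time update is invoked.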
earth
's execution. Default is 0
. Values:
0
no tracing
.3
variance model (the varmod.method
arg)
.5
cross validation (the nfold
arg)
1
overview
2
forward pass
3
pruning
4
model mats summary, pruning details
5
full model mats, internal details of operation
glm
.
See the documentation of glm
for a description of these arguments.
See “Generalized linear models” in the vignette.
Example:
earth(survived~., data=etitanic, degree=2, glm=list(family=binomial))
The following arguments are for the forward pass.
1
, meaning build an additive model (i.e., no interaction terms).
if(degree>1) 3 else 2
.
Simulation studies suggest values in the range of about 2
to 4
.
The FAQ section in the vignette has some information on GCVs.
Special values (for use by knowledgeable users):
The value 0
penalizes only terms, not knots.
The value -1
means no penalty, so GCV = RSS/n.
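The effect of penalty on the GCV can be sketched in base R. This is not earth's internal code. It assumes the commonly stated MARS form, in which the effective number of parameters is nterms + penalty * (nterms - 1) / 2; the function name gcv_sketch and its arguments are hypothetical.

```r
# Sketch of a MARS-style GCV (assumed formula, not earth's source code).
gcv_sketch <- function(rss, n, nterms, penalty = 2) {
  if (penalty == -1) {
    return(rss / n)                    # special value: no penalty at all
  }
  # penalty = 0 leaves only the per-term charge (no charge for knots)
  eff_params <- nterms + penalty * (nterms - 1) / 2
  rss / (n * (1 - eff_params / n)^2)
}

gcv_sketch(rss = 100, n = 50, nterms = 5, penalty = -1)  # RSS/n
gcv_sketch(rss = 100, n = 50, nterms = 5, penalty = 0)   # terms only
gcv_sketch(rss = 100, n = 50, nterms = 5, penalty = 2)   # terms and knots
```

For a fixed RSS, a larger penalty gives a larger GCV, which pushes the pruning pass toward smaller models.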
nk
because of other stopping conditions.
See “Termination conditions for the forward pass”
in the vignette.
The default is semi-automatically calculated from the number of predictors
but may need adjusting.
0.001
.
This is one of the arguments used to decide when forward stepping
should terminate:
the forward pass terminates if adding a term changes RSq by less than thresh
.
See “Termination conditions for the forward pass” in the vignette.
minspan=0
is treated specially and
means calculate the minspan
internally, as per
Friedman's MARS paper section 3.8 with $\alpha$ = 0.05.
Set trace>=2
to see the calculated value.
Use minspan=1
and endspan=1
to consider all x values.
Negative values of minspan
specify the maximum number of knots
per predictor. These will be equally spaced.
For example, minspan=-3
allows three evenly spaced knots for each predictor.
As always, knots that fall in the endzones specified by endspan
will be ignored.
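The three styles of minspan setting described above might be compared as follows. This is a sketch assuming the earth package is installed; the object names are illustrative.

```r
library(earth)

# minspan = 0 (default): minspan is calculated internally.
fit_auto <- earth(Volume ~ ., data = trees)

# Negative minspan: at most three evenly spaced knots per predictor.
fit_few <- earth(Volume ~ ., data = trees, minspan = -3)

# minspan = 1 and endspan = 1: consider all x values as potential knots.
fit_all <- earth(Volume ~ ., data = trees, minspan = 1, endspan = 1)

# Compare the knots actually chosen in each model.
print(fit_auto$cuts[fit_auto$selected.terms, , drop = FALSE])
print(fit_few$cuts[fit_few$selected.terms, , drop = FALSE])
print(fit_all$cuts[fit_all$selected.terms, , drop = FALSE])
```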
endspan=0
is treated specially and
means calculate the endspan
internally, as per
the MARS paper equation 45 with $\alpha$ = 0.05.
Set trace>=2
to see the calculated value. Be wary of reducing endspan
, especially if you plan to make
predictions beyond or near the limits of the training data.
Overfitting near the edges of training data is much more
likely with a small endspan
.
The model's RSq
and GRSq
won't indicate when this
overfitting is occurring.
(A plotmo
plot can help: look for sharp hinges at the
edges of the data). See also the Adjust.endspan
argument.
0
, meaning no penalty for adding a new variable.
Useful non-zero values typically range from about 0.01
to 0.2
and sometimes higher;
you will need to experiment.
A word of explanation. With the default newvar.penalty=0
,
if two variables have nearly the same effect (e.g. they are
collinear), at any step in the forward pass earth
will
arbitrarily select one or the other (depending on noise in the sample).
Both variables can appear in the
final model, complicating model interpretation. On the other hand
with a non-zero newvar.penalty
, the forward pass will be
reluctant to add a new variable; it will rather try to use a
variable already in the model, if that does not affect RSq too much.
The resulting final model may be easier to interpret, if you are lucky.
There will often be a small performance hit (a worse GCV).
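One way to experiment with newvar.penalty is to fit the same data with and without it and compare the variables used and the GCV. The sketch below assumes the earth package and its ozone1 data; the value 0.1 is just one point in the range suggested above.

```r
library(earth)

fit0 <- earth(O3 ~ ., data = ozone1)                       # default, no penalty
fit1 <- earth(O3 ~ ., data = ozone1, newvar.penalty = 0.1) # reluctant to add variables

# Which variables does each model use?
print(evimp(fit0))
print(evimp(fit1))

# Compare the GCVs of the two final models.
print(c(no.penalty = fit0$gcv, with.penalty = fit1$gcv))
```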
20
.
A value of 0
is treated specially
(as being equivalent to infinity), meaning no Fast MARS.
Typical values, apart from 0
, are 20
, 10
, or 5
.
In general, with a lower fast.k
(say 5
), earth
is faster;
with a higher fast.k
, or with fast.k
disabled (set to 0
),
earth
builds a better model.
However, because of random variation this general rule often doesn't apply.
1
.
A value of 0
sometimes gives better results.
lm
.
The default is FALSE
, meaning all predictors enter
in the standard MARS fashion, i.e., in hinge functions. This does not say that a predictor must enter the model;
only that if it enters, it enters linearly.
See “The linpreds
argument” in the vignette.
A predictor's index in linpreds
is the column number in the input matrix x
(after factors have been expanded).
linpreds=TRUE
makes all predictors enter linearly (the TRUE
gets recycled).
linpreds
may also be a character vector e.g.
linpreds=c("wind", "vis")
. Note: grep
is used
for matching. Thus "wind"
will match all variables that have
"wind"
in their names. Use "^wind$"
to match only the
variable named "wind"
.
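A short sketch of the character form of linpreds, assuming the earth package and its ozone1 data (which has a column named wind). The anchored pattern follows the grep note above.

```r
library(earth)

# Only the variable named exactly "wind" is constrained to enter linearly;
# all other predictors may enter through hinge functions as usual.
fit_lin <- earth(O3 ~ ., data = ozone1, linpreds = "^wind$")
print(summary(fit_lin))
```

If wind is selected at all, it should appear in the summary as a plain linear term rather than inside a hinge function.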
earth
calls the allowed
function
before considering a term for inclusion; the term can go into the
model only if the allowed
function returns TRUE
.
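A minimal sketch of an allowed function, assuming the earth package is installed. The function name no_height is hypothetical; the three arguments shown are the ones this sketch relies on, and the vignette describes the full interface.

```r
library(earth)

# Return TRUE if the candidate term may enter the model.
# 'pred' is the column index of the candidate predictor in x;
# for trees with Volume as the response, column 2 is Height.
no_height <- function(degree, pred, parents) {
  pred != 2   # disallow predictor 2 (Height)
}

fit_allowed <- earth(Volume ~ ., data = trees, allowed = no_height)
print(summary(fit_allowed))
```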
See “The allowed argument” in the vignette.
The following arguments are for the pruning pass.
backward none exhaustive forward seqrep cv
.
Default is "backward"
.
New in version 4.4.0:
Specify pmethod="cv"
to use cross-validation to select the number of terms.
This selects the number of terms that gives the maximum
mean out-of-fold RSq on the fold models.
Requires the nfold
argument.
Use "none"
to retain all the terms created by the forward pass.
If y
has multiple columns, then only "backward"
or "none"
is allowed.
Pruning can take a while if "exhaustive"
is chosen and
the model is big (more than about 30 terms).
The current version of the leaps
package
used during pruning does not allow user interrupts
(i.e., you have to kill your R session to interrupt;
in Windows use the Task Manager or from the command line use taskkill
).
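The difference between pruning and not pruning can be seen by counting the terms retained. This sketch assumes the earth package and its ozone1 data; the object names are illustrative.

```r
library(earth)

fit_backward <- earth(O3 ~ ., data = ozone1)                   # default pruning
fit_none     <- earth(O3 ~ ., data = ozone1, pmethod = "none") # keep all forward-pass terms

# Number of terms in each final model (including the intercept).
print(c(backward = length(fit_backward$selected.terms),
        none     = length(fit_none$selected.terms)))
```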
nk
),
or to reduce exhaustive search time with pmethod="exhaustive"
.
The following arguments are for cross validation.
nfold>1
.
Number of cross-validations. Each cross-validation has nfold
folds.
Default 1
.
0
, no cross validation.
If greater than 1
, earth
first builds a standard model as usual with all the data.
It then builds nfold
cross-validated models,
measuring R-Squared on the out-of-fold (left out) data each time.
The final cross validation R-Squared (CVRSq
) is the mean of these
out-of-fold R-Squareds.
The above process of building nfold
models is repeated
ncross
times (by default, once).
Use trace=.5
to trace cross-validation.
Further statistics are calculated if keepxy=TRUE
or
if the model is binomial or poisson (specified with the glm
argument).
See “Cross validation” in the vignette.
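A small cross-validation sketch, assuming the earth package and its ozone1 data. The seed and the values of nfold and ncross are arbitrary choices for illustration; cv.rsq.tab is the field of the returned object that, as far as this sketch assumes, holds the out-of-fold R-Squareds (see earth.object).

```r
library(earth)

set.seed(2024)  # fold assignment is random, so fix the seed for repeatability
fit_cv <- earth(O3 ~ ., data = ozone1, nfold = 5, ncross = 2)

# The summary includes the cross-validated R-Squared (CVRSq).
print(summary(fit_cv))

# Per-fold out-of-fold R-Squareds.
print(fit_cv$cv.rsq.tab)
```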
nfold>1
.
Default is TRUE
.
Stratify the cross-validation samples so that
an approximately equal number of cases with a non-zero response
occur in each cross validation subset.
So if the response y
is logical, the TRUE
s will be spread
evenly across folds.
And if the response is a multilevel factor, there will be an
approximately equal number of each factor level in each fold
(because a multilevel factor response gets expanded to columns of zeros and ones,
see “Factors” in the vignette).
We say “approximately equal” because the number of occurrences of a factor
level may not be exactly divisible by the number of folds.
The following arguments are for variance models (new in version 4.0.0).
varmod
and the vignette
“Variance models in earth”.
Use trace=.3
to trace construction of the variance model.
This argument requires nfold
and ncross
. (We suggest at least ncross=30
here to properly calculate the variance of the errors, although
you can use a smaller value, say 3
, for debugging.)
The varmod.method
argument should be one of
"none"
Default. Don't build a variance model.
"const"
Assume homoscedastic errors.
"lm"
Use lm
to estimate standard deviation as a
function of the predicted response.
"rlm"
Use rlm
.
"earth"
Use earth
.
"gam"
Use gam
.
This will use either gam
or the mgcv
package, whichever is loaded.
"power"
Estimate standard deviation as
intercept + coef * predicted.response^exponent
,
where
intercept
, coef
, and exponent
will be estimated by nls
.
This is equivalent to varmod.method="lm"
except that exponent
is
automatically estimated instead of being held at the value
set by the varmod.exponent
argument.
"power0"
Same as "power"
but no intercept (offset) term.
"x.lm"
,
"x.rlm"
,
"x.earth"
,
"x.gam"
Like the similarly named options above,
but estimate standard deviation by regressing on the predictors x
(instead of the predicted response).
A current implementation restriction is that "x.gam"
allows only models with one predictor (x
must have only one column).
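A sketch of building a variance model and requesting prediction intervals, assuming the earth package and its ozone1 data. ncross=3 is used here only to keep the run short (the text above suggests at least 30 for real use), and interval="pint" is assumed to be the predict.earth option for prediction intervals.

```r
library(earth)

set.seed(1)  # cross-validation folds are random
fit_var <- earth(O3 ~ temp, data = ozone1,
                 nfold = 10, ncross = 3,   # small ncross: for a quick trial only
                 varmod.method = "lm")     # sd as a linear function of the predicted response

# Prediction intervals on the training data.
pints <- predict(fit_var, interval = "pint")
print(head(pints))
```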
varmod.method
.
Default is 1
.
For example, with varmod.method="lm"
, if you expect the
standard deviation to increase linearly with the mean response, use
varmod.exponent=1
.
If you expect the standard deviation to increase with the square root
of the mean response, use
varmod.exponent=.5
(where negative response values will be treated as 0
,
and you will get an error message if more than 20% of them are negative).
varmod.conv
percent.
Default is 1
percent.
Negative values force the specified number of iterations,
e.g. varmod.conv=-2
means iterate twice.
Positive values are ignored for varmod.method="const"
and also currently ignored for varmod.method="earth"
(these are iterated just once, the same as using varmod.conv=-1
).
min.sd
.
This prevents negative or absurdly small estimated standard deviations.
Clamping takes place in predict.varmod
, which is called
by predict.earth
when estimating prediction intervals.
The value of min.sd
is determined when building the variance
model as min.sd = varmod.clamp * mean(sd(training.residuals))
.
The default varmod.clamp
is 0.1
.
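The clamping rule above can be sketched in base R. This illustrates the stated formula only and is not the code in predict.varmod; the function name clamp_sd and its arguments are hypothetical.

```r
# Clamp estimated standard deviations from below at min.sd,
# where min.sd = varmod.clamp * mean(sd(training.residuals)).
clamp_sd <- function(sd_hat, training_residuals, varmod.clamp = 0.1) {
  min.sd <- varmod.clamp * mean(sd(training_residuals))
  pmax(sd_hat, min.sd)
}

resids <- c(-2, 0, 2)                # sd(resids) is 2, so min.sd is 0.2
clamp_sd(c(-1, 0.05, 3), resids)     # negative and tiny estimates are raised to 0.2
```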
varmod.method="earth"
or "x.earth"
.
This is the minspan
used in the internal call to earth
when creating the variance model (not the main earth
model).
Default is -3
, i.e., three evenly spaced knots per predictor.
Residuals tend to be very noisy, and allowing only this small
number of knots helps prevent overfitting.
The following arguments are for internal or advanced use.
update.earth
.
Scale
y
internally in the forward pass
for better numeric stability.
This is invisible to the user, up to numerical differences.
Scaling here means subtract the mean and divide by the standard
deviation. Default is NCOL(y)==1
,
i.e., scale y
unless y
has multiple columns.
endspan
gets multiplied by this value.
This reduces the possibility of an overfitted interaction term
supported by just a few cases on the boundary of the predictor space
(as sometimes seen in our simulation studies).
The default is 2
.
Use Adjust.endspan=1
for compatibility with old
versions of earth
.
FALSE
.
For testing the weights
argument.
Force use of the code for handling weights in the earth
code,
even if weights=NULL
or all the weights are the same.
This will not necessarily generate an identical model,
primarily because the non-weighted code requires some tests for
numerical stability that can sometimes affect knot selection.
TRUE
.
Using the “beta cache” takes a little more memory but is faster
(by 20% and often much more for large models).
The beta cache uses nk * nk * ncol(x) * sizeof(double)
bytes.
(The beta cache is an innovation in this implementation of MARS
and does not appear in Friedman's papers. It is not related to
the fast.beta
argument. Certain regression coefficients
in the forward pass can be saved and re-used, thus
saving recalculation time.)
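The memory formula above is easy to evaluate for a given problem size. The base R sketch below assumes sizeof(double) is 8 bytes (typical, but platform dependent) and uses the default nk expression from the usage section; the function names are hypothetical.

```r
# Default nk as given in the usage: min(200, max(20, 2 * ncol(x))) + 1
default_nk <- function(ncol_x) min(200, max(20, 2 * ncol_x)) + 1

# Beta cache size in bytes: nk * nk * ncol(x) * sizeof(double)
beta_cache_bytes <- function(ncol_x, nk = default_nk(ncol_x), sizeof_double = 8) {
  nk * nk * ncol_x * sizeof_double
}

beta_cache_bytes(10)          # 10 predictors: nk = 21, so 21 * 21 * 10 * 8 = 35280 bytes
beta_cache_bytes(500) / 2^20  # 500 predictors: size in MiB
```

The cache grows with the square of nk, so it matters mainly for wide data with a large nk.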
FALSE
.
This argument pertains to subset evaluation in the pruning pass.
By default,
if y
has a single column then earth
calls the leaps
routines;
if y
has multiple columns then earth
calls EvalSubsetsUsingXtx
.
The leaps
routines are numerically more stable
but do not support multiple responses
(leaps
is based on the QR decomposition and
EvalSubsetsUsingXtx
is based on the inverse of X'X).
Setting Force.xtx.prune=TRUE
forces use of EvalSubsetsUsingXtx
, even
if y
has a single column.
TRUE
unless the model has more than 100 thousand cases.
The leverages are the diagonal hat values for the linear regression of
y
on bx
.
The leverages are needed only for certain model checks, for example
when plotres
is called with versus=4
.
Details:
This argument was introduced to reduce peak memory usage.
When n >> p
, memory use peaks when earth
is
calculating the leverages.
1e-10
.
Applies only when pmethod="exhaustive"
.
If the reciprocal of the condition number of bx
is less than Exhaustive.tol
, earth
forces pmethod="backward"
.
See “XHAUST returned error code -999” in the vignette.
earth.fit
.
"earth"
.
See earth.object
for a complete description.
ozone
data to compare mda::mars
with other techniques.
(If you use Faraway's examples with earth
instead of mars
, use $bx
instead of $x
, and check out the book's errata.)
Friedman and Silverman is recommended background reading for the MARS paper.
Earth's pruning pass uses code from the leaps
package
which is based on techniques in Miller.
Faraway (2005) Extending the Linear Model with R http://www.maths.bath.ac.uk/~jjf23
Friedman (1991) Multivariate Adaptive Regression Splines (with discussion)
Annals of Statistics 19/1, 1--141
https://statistics.stanford.edu/research/multivariate-adaptive-regression-splines
Friedman (1993) Fast MARS
Stanford University Department of Statistics, Technical Report 110
https://statistics.stanford.edu/research/fast-mars
Friedman and Silverman (1989) Flexible Parsimonious Smoothing and Additive Modeling Technometrics, Vol. 31, No. 1. http://links.jstor.org/sici?sici=0040-1706%28198902%2931%3A1%3C3%3AFPSAAM%3E2.0.CO%3B2-Z
Hastie, Tibshirani, and Friedman (2009) The Elements of Statistical Learning (2nd ed.) http://web.stanford.edu/~hastie/pub.htm
Leathwick, J.R., Rowe, D., Richardson, J., Elith, J., & Hastie, T. (2005) Using multivariate adaptive regression splines to predict the distributions of New Zealand's freshwater diadromous fish Freshwater Biology, 50, 2034-2052 http://web.stanford.edu/~hastie/pub.htm, http://www.botany.unimelb.edu.au/envisci/about/staff/elith.html
Miller, Alan (1990, 2nd ed. 2002) Subset Selection in Regression http://wp.csiro.au/alanmiller/index.html
Wikipedia article on MARS http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines
summary.earth
, plot.earth
,
evimp
, and plotmo
.
Please see the main package vignette “Notes on the earth package”. The vignette can also be downloaded from http://www.milbo.org/doc/earth-notes.pdf.
The vignette
“Variance models in earth”
is also included with the package.
It describes how to build variance models and
generate prediction intervals for earth
models.
earth.mod <- earth(Volume ~ ., data = trees)
plotmo(earth.mod)
summary(earth.mod, digits = 2, style = "pmax")