pre
derives a sparse ensemble of rules and/or linear functions for
prediction of a continuous, binary, or count outcome.
pre(formula, data, family = c("gaussian", "binomial", "poisson"),
use.grad = TRUE, weights, type = "both", sampfrac = 0.5,
maxdepth = 3L, learnrate = 0.01, mtry = Inf, ntrees = 500,
removecomplements = TRUE, removeduplicates = TRUE, winsfrac = 0.025,
normalize = TRUE, standardize = FALSE, nfolds = 10L, verbose = FALSE,
par.init = FALSE, par.final = FALSE, tree.control, ...)
formula: a symbolic description of the model to be fit, of the form
y ~ x1 + x2 + ... + xn. The response (left-hand side of the formula)
should be of class numeric or of class factor (with two levels). If the
response is a factor, an ensemble for binary classification is created;
otherwise, an ensemble for prediction of a numeric response is created. If
the outcome is a non-negative count, this should additionally be specified
by setting family = "poisson". Note that input variables may not have
'rule' as (part of) their name, and the formula may not exclude the intercept
(that is, + 0 or - 1 may not be used in the right-hand side of the formula).
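For instance, a binary classification sketch (using a two-species subset of iris as illustrative data; any two-level factor response would do):

## Sketch: a factor response with two levels yields a classification ensemble
iris2 <- droplevels(iris[iris$Species != "setosa", ])
set.seed(42)
iris.ens <- pre(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                data = iris2, ntrees = 50)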
data: data.frame containing the variables in the model. The response must be a factor for binary classification and numeric for (count) regression. Input variables must be of class numeric, factor, or ordered factor.
family: character. Specification is required only for non-negative count
responses, by setting family = "poisson". Otherwise, family = "gaussian"
is employed if the response specified in formula is numeric, and
family = "binomial" is employed if the response is a binary factor.
Note that if family = "poisson" is specified, glmtree with intercept-only
models in the nodes will be employed for inducing trees, instead of ctree.
Although this yields longer computation times, it also yields better
accuracy for count outcomes.
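As an illustrative count-response sketch (assuming the quine data from package MASS, with days absent from school as the outcome):

## Sketch: non-negative count outcome, so family = "poisson" is specified
library("MASS")   # provides the quine data
set.seed(42)
quine.ens <- pre(Days ~ ., data = quine, family = "poisson", ntrees = 50)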
use.grad: logical. Should binary outcomes use gradient boosting with
regression trees when learnrate > 0? That is, use ctree as in
Friedman (2001), with a second-order Taylor expansion? Defaults to TRUE,
as this yields shorter computation times. If set to FALSE, glmtree with
intercept-only models in the nodes will be employed, which yields
longer computation times (but may increase the likelihood of detecting
interactions).
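A sketch of the glmtree-based alternative (again using an illustrative two-species iris subset):

## Sketch: use.grad = FALSE employs glmtree with intercept-only node models;
## slower, but may help in detecting interactions
iris2 <- droplevels(iris[iris$Species != "setosa", ])
set.seed(42)
iris.glmtree <- pre(Species ~ ., data = iris2, use.grad = FALSE, ntrees = 50)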
weights: an optional vector of observation weights to be used for deriving the ensemble.
type: character. Specifies the type of base learners to be included in the
ensemble. Defaults to "both" (the initial ensemble will include both rules
and linear functions). Other options are "rules" (prediction rules only)
or "linear" (linear functions only).
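For example, a rules-only ensemble for the airquality data used in the Examples section might be requested as follows (a sketch):

## Sketch: prediction rules only, no linear terms
set.seed(42)
airq.rules <- pre(Ozone ~ ., data = airquality[complete.cases(airquality), ],
                  type = "rules")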
sampfrac: numeric. Takes values \(> 0\) and \(\leq 1\), representing the fraction of randomly selected training observations used to produce each tree. Values \(< 1\) will result in sampling without replacement (i.e., subsampling); a value of 1 will result in sampling with replacement (i.e., bootstrapping).
maxdepth: numeric. Maximum number of conditions that can define a rule.
learnrate: numeric. Learning rate or boosting parameter.
mtry: numeric. Number of randomly selected predictor variables for creating each split in each tree.
ntrees: numeric. Number of trees to generate for the initial ensemble.
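These parameters jointly determine the character of the initial ensemble; for instance, a bagging-style ensemble of unboosted, somewhat deeper trees might be requested as follows (a sketch; the parameter values are illustrative):

## Sketch: bagging-style initial ensemble (no boosting, bootstrap sampling,
## deeper trees, random predictor selection per split)
set.seed(42)
airq.bag <- pre(Ozone ~ ., data = airquality[complete.cases(airquality), ],
                learnrate = 0, sampfrac = 1, maxdepth = 4L, mtry = 3,
                ntrees = 100)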
removecomplements: logical. Remove rules from the ensemble which have the same support in the training data as the inverse of other rules?
removeduplicates: logical. Remove rules from the ensemble which have exactly the same support in the training data?
winsfrac: numeric. Quantiles of the data distribution to be used for winsorizing linear terms. If set to 0, no winsorizing is performed. Note that ordinal variables are included as linear terms in estimating the regression model and will also be winsorized.
normalize: logical. Normalize linear variables before estimating the regression model? Normalizing gives linear terms the same a priori influence as a typical rule, by dividing the (winsorized) linear term by 2.5 times its SD.
standardize: logical. Should rules and linear terms be standardized to
have SD equal to 1 before estimating the regression model? This will also
standardize the dummified factors; users are advised to use the default
standardize = FALSE.
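To illustrate, winsorizing and normalization of linear terms can be switched off as follows (a sketch):

## Sketch: include linear terms as-is (no winsorizing, no normalization)
set.seed(42)
airq.raw <- pre(Ozone ~ ., data = airquality[complete.cases(airquality), ],
                winsfrac = 0, normalize = FALSE)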
nfolds: numeric. Number of cross-validation folds to be used for selecting the optimal value of the penalty parameter \(\lambda\) when selecting the final ensemble.
verbose: logical. Should information on the initial and final ensemble be printed to the command line?
par.init: logical. Should parallel foreach be used to generate the initial
ensemble? Only used when learnrate == 0 and family != "poisson".
A parallel backend (e.g., doMC) must be registered beforehand.
par.final: logical. Should parallel foreach be used to perform the cross-validation for selecting the final ensemble? A parallel backend (e.g., doMC) must be registered beforehand.
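A parallel computation sketch, assuming the doMC backend (any foreach-compatible backend can be registered instead; note that par.init additionally requires learnrate == 0):

## Sketch: parallel generation of the initial ensemble and parallel CV
library("doMC")
registerDoMC(cores = 2)
set.seed(42)
airq.par <- pre(Ozone ~ ., data = airquality[complete.cases(airquality), ],
                learnrate = 0, par.init = TRUE, par.final = TRUE)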
tree.control: list with control parameters to be passed to the tree fitting function; see ctree_control.
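For example, control parameters could be passed to the ctree fitting function as follows (a sketch; mincriterion is one of the parameters accepted by partykit::ctree_control):

## Sketch: stricter significance criterion for splits in the grown trees
library("partykit")   # provides ctree_control
set.seed(42)
airq.ctrl <- pre(Ozone ~ ., data = airquality[complete.cases(airquality), ],
                 tree.control = ctree_control(mincriterion = 0.99))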
...: additional arguments to be passed to cv.glmnet.
an object of class pre, which contains the initial ensemble of
rules and/or linear terms and the final ensembles for a wide range of penalty
parameter values. By default, the final ensemble employed by all other
methods and functions in package pre is selected using the 'minimum
cross-validated error plus 1 standard error' criterion. All functions and
methods also take a penalty.parameter.value argument, which can be
used to select a more or less sparse final ensemble. The
penalty.parameter.value argument takes values "lambda.1se"
(the default), "lambda.min", or a numeric value. Users can assess
the trade-off between sparsity and accuracy provided by every possible value
of the penalty parameter (\(\lambda\)) by inspecting object$glmnet.fit
and plot(object$glmnet.fit).
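For instance, assuming the airq.ens fit from the Examples section below (and the penalty.parameter.value argument name as documented above), the \(\lambda\) path can be inspected and a less sparse ensemble selected as follows:

## Sketch: inspect the CV error across penalty parameter values and
## select the ensemble minimizing cross-validated error
plot(airq.ens$glmnet.fit)
coef(airq.ens, penalty.parameter.value = "lambda.min")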
Observations with missing values will be removed prior to analysis.
In rare cases, duplicated variable names may appear in the model. For example, suppose the first variable is a factor named 'V1' and there are also non-factor variables named 'V10' and/or 'V11' and/or 'V12' (etc.). For the binary factor V1, dummy contrast variables will then be created, named 'V10', 'V11', 'V12' (etc.). As this example shows, this yields duplicated variable names, which will lead to warnings, errors, and incorrect results. Users should prevent this by renaming the variables prior to analysis.
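A hypothetical illustration of the clash and its fix (data and variable names are invented for illustration):

## Hypothetical: dummy coding of factor 'V1' (levels '0' and '1') yields
## columns named 'V10' and 'V11', clashing with the existing numeric 'V10';
## renaming the factor before calling pre() avoids this
dat <- data.frame(V1 = factor(c("0", "1")), V10 = c(1.5, 2.5))
names(dat)[names(dat) == "V1"] <- "V1f"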
Input variables can be numeric, ordered, or factor variables. The response can be a numeric, count, or binary categorical variable.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189-1232.
print.pre, plot.pre, coef.pre, importance, predict.pre, interact, cvpre
set.seed(42)
airq.ens <- pre(Ozone ~ ., data = airquality[complete.cases(airquality),], verbose = TRUE)
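Downstream inspection of the fitted ensemble might then proceed as follows (a sketch using functions from the See Also list):

## Sketch: inspect and use the fitted ensemble
print(airq.ens)       # final ensemble of rules and linear terms
importance(airq.ens)  # variable and base-learner importances
predict(airq.ens, newdata = airquality[complete.cases(airquality), ][1:6, ])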