pre
derives a sparse ensemble of rules and/or linear functions for
prediction of a continuous, binary, count, multinomial, multivariate
continuous or survival response.
pre(formula, data, family = gaussian, use.grad = TRUE, weights,
type = "both", sampfrac = 0.5, maxdepth = 3L, learnrate = 0.01,
mtry = Inf, ntrees = 500, confirmatory = NULL,
removecomplements = TRUE, removeduplicates = TRUE,
winsfrac = 0.025, normalize = TRUE, standardize = FALSE,
ordinal = TRUE, nfolds = 10L, tree.control, tree.unbiased = TRUE,
verbose = FALSE, par.init = FALSE, par.final = FALSE,
sparse = FALSE, ...)
a symbolic description of the model to be fit of the form
y ~ x1 + x2 + ... + xn. Response (left-hand side of the formula) should be of
class numeric (for family = "gaussian" or "mgaussian"), integer (for
family = "poisson"), or factor (for family = "binomial" or "multinomial").
See Examples below. Note that the minus sign (-) may not be used in the
formula to omit the intercept or variables in data, and neither should + 0
be used to omit the intercept. To omit the intercept from the final ensemble,
add intercept = FALSE to the call (although omitting the intercept from
the final ensemble will only very rarely be appropriate). To omit variables
from the final ensemble, make sure they are excluded from data.
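Because the minus sign cannot be used in the formula, predictors have to be dropped from the data before the call. A minimal base-R sketch, using the airquality data from the Examples; the choice of dropped columns is purely illustrative and not taken from the original text:

```r
# Keep complete cases only, as in the Examples section
airq <- airquality[complete.cases(airquality), ]

# Drop predictors from the data instead of writing `- Month - Day` in the formula
drop_vars <- c("Month", "Day")  # illustrative choice (assumption)
airq_sub <- airq[, setdiff(names(airq), drop_vars)]

names(airq_sub)

# The reduced data frame would then be passed on, e.g.:
# airq.ens <- pre(Ozone ~ ., data = airq_sub)
```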
data.frame containing the variables in the model. Response must be of class
factor for classification, numeric for (count) regression, Surv for survival
regression. Input variables must be of class numeric, factor or ordered
factor. Otherwise, pre will attempt to recode.
specifies a glm family object. Can be a character string (i.e.,
"gaussian", "binomial", "poisson", "multinomial", "cox" or "mgaussian"), or a
corresponding family object (e.g., gaussian, binomial or poisson, see
family). Specification of argument family is strongly advised but not
required. If family is not specified, the program will try to make an
informed guess, based on the class of the response variable specified in
formula. Also see Examples below.
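The mapping between response class and family described above can be summarized in a small helper. This is a hypothetical function written for illustration (the name `family_for_response` is an assumption), not the logic pre uses internally for its informed guess. The multivariate "mgaussian" case is left out because it depends on the formula rather than on a single response vector:

```r
# Hypothetical helper: documented response class -> family string.
# NOT pre's internal guessing logic.
family_for_response <- function(y) {
  if (inherits(y, "Surv")) return("cox")
  if (is.factor(y)) {
    # two levels: binary classification; more levels: multinomial
    return(if (nlevels(y) == 2L) "binomial" else "multinomial")
  }
  if (is.integer(y)) return("poisson")   # test before is.numeric(), which is also TRUE
  if (is.numeric(y)) return("gaussian")
  stop("no documented family for a response of class ", class(y)[1])
}

family_for_response(airquality$Wind)   # numeric response
family_for_response(iris$Species)      # factor with three levels
family_for_response(c(3L, 0L, 7L))     # integer counts
```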
logical. Should gradient boosting with regression trees be employed when
learnrate > 0? If TRUE, use trees fitted by ctree or rpart as in Friedman
(2001), but without the line search. If use.grad = FALSE, glmtree instead of
ctree will be employed for rule induction, yielding longer computation times,
higher complexity, but possibly higher predictive accuracy. See Details for
supported combinations of family, use.grad and learnrate.
optional vector of observation weights to be used for deriving the ensemble.
character. Specifies type of base learners to include in the ensemble.
Defaults to "both" (initial ensemble will include both rules and linear
functions). Other options are "rules" (prediction rules only) or "linear"
(linear functions only).
numeric value \(> 0\) and \(\le 1\). Specifies the fraction of randomly
selected training observations used to produce each tree. Values \(< 1\) will
result in sampling without replacement (i.e., subsampling), a value of 1 will
result in sampling with replacement (i.e., bootstrap sampling).
Alternatively, a sampling function may be supplied, which should take
arguments n (sample size) and weights.
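A sampling function of the kind described above might look as follows. This is a sketch under the assumption that the function is expected to return the row indices of the observations used for one tree. The name `subsample_75` and the 75% fraction are illustrative:

```r
# Hypothetical sampling function: draws 75% of the observations without
# replacement, with selection probabilities proportional to the weights.
# Assumption: the return value is a vector of row indices.
subsample_75 <- function(n, weights = rep(1, n)) {
  sample(seq_len(n), size = round(0.75 * n), replace = FALSE, prob = weights)
}

set.seed(42)
idx <- subsample_75(n = 100)
length(idx)   # 75 indices, all distinct

# It could then be supplied as: pre(..., sampfrac = subsample_75)
```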
positive integer. Maximum number of conditions in rules. If
length(maxdepth) == 1, it specifies the maximum depth of each tree grown. If
length(maxdepth) == ntrees, it specifies the maximum depth of every
consecutive tree grown. Alternatively, a random sampling function may be
supplied, which takes argument ntrees and returns integer values. See also
maxdepth_sampler.
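As an illustration of the function interface described above, the sketch below defines a hand-written depth sampler. The name `random_depth` and the uniform draw from 1 to 4 are assumptions made for the example. For a principled choice, the package's own maxdepth_sampler is the documented route:

```r
# Hypothetical depth sampler: takes ntrees, returns one integer depth per tree.
# Depths are drawn uniformly from 1..4 (illustrative choice).
random_depth <- function(ntrees) {
  sample(1:4, size = ntrees, replace = TRUE)
}

set.seed(42)
depths <- random_depth(ntrees = 500)
table(depths)

# It could then be supplied as: pre(..., maxdepth = random_depth)
```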
numeric value \(\ge 0\). Learning rate or boosting parameter. See Details for the settings supported with learnrate equal to 0 and greater than 0.
positive integer. Number of randomly selected predictor variables for
creating each split in each tree. Ignored when tree.unbiased = FALSE.
positive integer value. Number of trees to generate for the initial ensemble.
character vector. Specifies one or more confirmatory terms to be included in
the final ensemble. Linear terms can be specified as the name of a predictor
variable included in data, rules can be specified as, for example,
"x1 > 6 & x2 <= 8", where x1 and x2 should be names of variables in data.
Terms thus specified will be included in the final ensemble, as their
coefficients will not be penalized in the estimation.
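To make the meaning of a rule string concrete, the sketch below evaluates one on the airquality data and turns it into a 0/1 variable. The rule itself is made up for the example, and the code only illustrates what such a string expresses. It is not a description of how pre evaluates rules internally:

```r
airq <- airquality[complete.cases(airquality), ]

# Illustrative rule (assumption): both conditions must hold for the rule to apply
rule <- "Temp > 80 & Wind <= 8"

# Evaluate the rule within the data; TRUE/FALSE becomes 1/0
rule_var <- as.integer(eval(parse(text = rule), envir = airq))
table(rule_var)

# A confirmatory call might then look like this (linear term plus rule):
# pre(Ozone ~ ., data = airq, confirmatory = c("Solar.R", rule))
```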
logical. Remove rules from the ensemble which are identical to (1 - an earlier rule)?
logical. Remove rules from the ensemble which are identical to an earlier rule?
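The two checks above can be illustrated with toy rules, assuming rules are compared through their 0/1 values on the training observations (the data and rules below are made up for the example):

```r
x <- c(2, 4, 6, 8, 10)

rule_a <- as.integer(x > 5)    # an earlier rule
rule_b <- as.integer(x <= 5)   # its complement
rule_c <- as.integer(x >= 6)   # differently worded, same values on these data

identical(rule_b, 1L - rule_a)  # TRUE: candidate for removecomplements
identical(rule_c, rule_a)       # TRUE: candidate for removeduplicates
```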
numeric value \(\ge 0\) and \(\le 0.5\). Quantiles of the data distribution to be used for winsorizing linear terms. If set to 0, no winsorizing is performed. Note that ordinal variables are included as linear terms in estimating the regression model and will also be winsorized.
logical. Normalize linear variables before estimating the regression model? Normalizing gives linear terms the same a priori influence as a typical rule, by dividing the (winsorized) linear term by 2.5 times its SD.
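Winsorizing and normalizing a linear term, as described in the two entries above, can be sketched in base R. The quantile definition used here is R's default and is an assumption. The package may compute the limits differently:

```r
winsfrac <- 0.025
x <- airquality$Wind   # a numeric predictor without missing values

# Winsorize: pull values beyond the winsfrac and 1 - winsfrac quantiles
# back to those quantiles
lims <- quantile(x, probs = c(winsfrac, 1 - winsfrac), names = FALSE)
x_wins <- pmin(pmax(x, lims[1]), lims[2])

# Normalize: divide the winsorized term by 2.5 times its SD,
# so that the result has SD 1 / 2.5 = 0.4
x_norm <- x_wins / (2.5 * sd(x_wins))

range(x_wins)
sd(x_norm)
```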
logical. Should rules and linear terms be standardized to have SD equal to 1
before estimating the regression model? This will also standardize the
dummified factors. Users are advised to use the default standardize = FALSE.
logical. Should ordinal variables (i.e., ordered factors) be treated as
continuous for generating rules? TRUE (the default) generally yields simpler
rules, shorter computation times and better generalizability of the final
ensemble.
positive integer. Number of cross-validation folds to be used for selecting the optimal value of the penalty parameter \(\lambda\) in selecting the final ensemble.
list with control parameters to be passed to the tree fitting function,
generated using ctree_control, mob_control (if use.grad = FALSE), or
rpart.control (if tree.unbiased = FALSE).
logical. Should an unbiased tree generation algorithm be employed for rule
generation? Defaults to TRUE. If set to FALSE, rules will be generated
employing the CART algorithm (which suffers from biased variable selection)
as implemented in rpart. See Details below for possible combinations with
family, use.grad and learnrate.
logical. Should progress be printed to the command line?
logical. Should parallel foreach be used to generate the initial ensemble?
Only used when learnrate == 0. Note: a parallel backend must be registered
beforehand, for example with doMC. Furthermore, setting par.init = TRUE will
likely only increase computation time for smaller datasets.
logical. Should parallel foreach be used to perform cross validation for
selecting the final ensemble? A parallel backend must be registered
beforehand, for example with doMC.
logical. Should sparse design matrices be used? Likely improves computation times for large datasets.
Additional arguments to be passed to cv.glmnet.
An object of class pre. It contains the initial ensemble of rules and/or
linear terms and a range of possible final ensembles. By default, the final
ensemble employed by all other methods and functions in package pre is
selected using the 'minimum cross validated error plus 1 standard error'
criterion. All functions and methods for objects of class pre take a
penalty.parameter.val argument, which can be used to select a different
criterion.
Note that observations with missing values will be removed prior to analysis.
In some cases, duplicated variable names may appear in the model. For example, suppose the first variable is a factor named 'V1' and there are also variables named 'V10' and/or 'V11' and/or 'V12' (etc). Then for the factor V1, dummy contrast variables will be created, named 'V10', 'V11', 'V12' (etc). This yields duplicated variable names, which may cause problems later on, for example in the calculation of predictions and importances. This can be prevented by renaming factor variables with numbers in their name, prior to analysis.
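The name clash can be reproduced with base R's model.matrix, under the assumption that dummy names are formed by pasting the factor name and the level, as model.matrix does. The toy data and the replacement name `group` are made up for the example:

```r
# Factor V1 with levels "0" and "1", next to a numeric variable named V11
d <- data.frame(V1  = factor(c(0, 1, 0, 1)),
                V11 = c(0.5, 1.2, 3.1, 0.7))

# The dummy for level "1" of V1 is also called "V11"
colnames(model.matrix(~ V1 + V11, data = d))

# Renaming the factor before the analysis avoids the clash
names(d)[names(d) == "V1"] <- "group"
colnames(model.matrix(~ group + V11, data = d))
```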
The table below provides an overview of combinations of response variable
types, use.grad, tree.unbiased and learnrate settings that are supported, and
the tree induction algorithm that will be employed as a result:
use.grad | tree.unbiased | learnrate | family | tree alg. | Response variable format
TRUE | TRUE | 0 | gaussian | ctree | Single, numeric (non-integer)
TRUE | TRUE | 0 | mgaussian | ctree | Multiple, numeric (non-integer)
TRUE | TRUE | 0 | binomial | ctree | Single, factor with 2 levels
TRUE | TRUE | 0 | multinomial | ctree | Single, factor with >2 levels
TRUE | TRUE | 0 | poisson | ctree | Single, integer
TRUE | TRUE | 0 | cox | ctree | Object of class 'Surv'
TRUE | TRUE | >0 | gaussian | ctree | Single, numeric (non-integer)
TRUE | TRUE | >0 | mgaussian | ctree | Multiple, numeric (non-integer)
TRUE | TRUE | >0 | binomial | ctree | Single, factor with 2 levels
TRUE | TRUE | >0 | multinomial | ctree | Single, factor with >2 levels
TRUE | TRUE | >0 | poisson | ctree | Single, integer
TRUE | TRUE | >0 | cox | ctree | Object of class 'Surv'
FALSE | TRUE | 0 | gaussian | glmtree | Single, numeric (non-integer)
FALSE | TRUE | 0 | binomial | glmtree | Single, factor with 2 levels
FALSE | TRUE | 0 | poisson | glmtree | Single, integer
FALSE | TRUE | >0 | gaussian | glmtree | Single, numeric (non-integer)
FALSE | TRUE | >0 | binomial | glmtree | Single, factor with 2 levels
FALSE | TRUE | >0 | poisson | glmtree | Single, integer
TRUE | FALSE | 0 | gaussian | rpart | Single, numeric (non-integer)
TRUE | FALSE | 0 | binomial | rpart | Single, factor with 2 levels
TRUE | FALSE | 0 | multinomial | rpart | Single, factor with >2 levels
TRUE | FALSE | 0 | poisson | rpart | Single, integer
TRUE | FALSE | 0 | cox | rpart | Object of class 'Surv'
TRUE | FALSE | >0 | gaussian | rpart | Single, numeric (non-integer)
TRUE | FALSE | >0 | binomial | rpart | Single, factor with 2 levels
TRUE | FALSE | >0 | poisson | rpart | Single, integer
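In the table, the tree algorithm depends only on use.grad and tree.unbiased. The lookup below restates that pattern as a hypothetical helper (the name `tree_algorithm` is an assumption, and the function is not part of the package). It does not check whether the family and learnrate are supported for the given combination:

```r
# Hypothetical lookup of the tree algorithm listed in the table above
tree_algorithm <- function(use.grad = TRUE, tree.unbiased = TRUE) {
  if (use.grad && tree.unbiased) return("ctree")
  if (!use.grad && tree.unbiased) return("glmtree")
  if (use.grad && !tree.unbiased) return("rpart")
  stop("use.grad = FALSE with tree.unbiased = FALSE is not listed in the table")
}

tree_algorithm()                        # default settings
tree_algorithm(use.grad = FALSE)        # model-based trees
tree_algorithm(tree.unbiased = FALSE)   # CART
```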
Fokkema, M. (2018). Fitting prediction rule ensembles with R package pre. https://arxiv.org/abs/1707.07149.
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29(5), 1189-1232.
Friedman, J. H., & Popescu, B. E. (2008). Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3), 916-954.
Hothorn, T., & Zeileis, A. (2015). partykit: A modular toolkit for recursive partytioning in R. Journal of Machine Learning Research, 16, 3905-3909.
print.pre, plot.pre, coef.pre, importance, predict.pre, interact, cvpre
# NOT RUN {
## Fit pre to a continuous response:
airq <- airquality[complete.cases(airquality), ]
set.seed(42)
airq.ens <- pre(Ozone ~ ., data = airq)
airq.ens
## Fit pre to a binary response:
airq2 <- airquality[complete.cases(airquality), ]
airq2$Ozone <- factor(airq2$Ozone > median(airq2$Ozone))
set.seed(42)
airq.ens2 <- pre(Ozone ~ ., data = airq2, family = "binomial")
airq.ens2
## Fit pre to a multivariate continuous response:
airq3 <- airquality[complete.cases(airquality), ]
set.seed(42)
airq.ens3 <- pre(Ozone + Wind ~ ., data = airq3, family = "mgaussian")
airq.ens3
## Fit pre to a multinomial response:
set.seed(42)
iris.ens <- pre(Species ~ ., data = iris, family = "multinomial")
iris.ens
## Fit pre to a survival response:
library("survival")
lung <- lung[complete.cases(lung), ]
set.seed(42)
lung.ens <- pre(Surv(time, status) ~ ., data = lung, family = "cox")
lung.ens
## Fit pre to a count response:
## Generate random data (partly based on Dobson (1990) Page 93: Randomized
## Controlled Trial):
counts <- rep(as.integer(c(18, 17, 15, 20, 10, 20, 25, 13, 12)), times = 10)
outcome <- rep(gl(3, 1, 9), times = 10)
treatment <- rep(gl(3, 3), times = 10)
noise1 <- 1:90
set.seed(1)
noise2 <- rnorm(90)
countdata <- data.frame(treatment, outcome, counts, noise1, noise2)
set.seed(42)
count.ens <- pre(counts ~ ., data = countdata, family = "poisson")
count.ens
# }