pre (version 0.7.2)

pre: Derive a prediction rule ensemble

Description

pre derives a sparse ensemble of rules and/or linear functions for prediction of a continuous, binary, count, multinomial, multivariate continuous or survival response.

Usage

pre(formula, data, family = gaussian, use.grad = TRUE, weights,
  type = "both", sampfrac = 0.5, maxdepth = 3L, learnrate = 0.01,
  mtry = Inf, ntrees = 500, confirmatory = NULL,
  removecomplements = TRUE, removeduplicates = TRUE,
  winsfrac = 0.025, normalize = TRUE, standardize = FALSE,
  ordinal = TRUE, nfolds = 10L, tree.control, tree.unbiased = TRUE,
  verbose = FALSE, par.init = FALSE, par.final = FALSE,
  sparse = FALSE, ...)

Arguments

formula

a symbolic description of the model to be fit of the form y ~ x1 + x2 + ...+ xn. Response (left-hand side of the formula) should be of class numeric (for family = "gaussian" or "mgaussian"), integer (for family = "poisson"), factor (for family = "binomial" or "multinomial"). See Examples below. Note that the minus sign (-) may not be used in the formula to omit the intercept or variables in data, and neither should + 0 be used to omit the intercept. To omit the intercept from the final ensemble, add intercept = FALSE to the call (although omitting the intercept from the final ensemble will only very rarely be appropriate). To omit variables from the final ensemble, make sure they are excluded from data.

data

data.frame containing the variables in the model. Response must be of class factor for classification, numeric for (count) regression, Surv for survival regression. Input variables must be of class numeric, factor or ordered factor. Otherwise, pre will attempt to recode.

family

specifies a glm family object. Can be a character string (i.e., "gaussian", "binomial", "poisson", "multinomial", "cox" or "mgaussian"), or a corresponding family object (e.g., gaussian, binomial or poisson, see family). Specification of argument family is strongly advised but not required. If family is not specified, the program will try to make an informed guess, based on the class of the response variable specified in formula. See also Examples below.
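The kind of class-based guess described above can be sketched in base R. This is only an illustration of the mapping between response classes and families listed on this page; guess_family is a hypothetical helper, not a function of package pre, and the package's own logic may differ.

```r
# Hypothetical helper (not part of pre): map a response to a family name,
# following the response classes listed for each family on this page.
guess_family <- function(y) {
  if (inherits(y, "Surv")) return("cox")
  if (is.data.frame(y) || is.matrix(y)) return("mgaussian")
  if (is.factor(y)) {
    return(if (nlevels(y) == 2L) "binomial" else "multinomial")
  }
  if (is.integer(y)) return("poisson")   # checked before is.numeric()
  if (is.numeric(y)) return("gaussian")
  stop("cannot guess a family for a response of class ", class(y)[1])
}

guess_family(airquality$Wind)   # "gaussian"
guess_family(iris$Species)      # "multinomial"
guess_family(airquality$Ozone)  # "poisson": Ozone is stored as integer
```

The last call shows why specifying family explicitly is advisable: a variable stored as integer may be intended as a continuous response.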

use.grad

logical. Should gradient boosting with regression trees be employed when learnrate > 0? If TRUE, use trees fitted by ctree or rpart as in Friedman (2001), but without the line search. If use.grad = FALSE, glmtree instead of ctree will be employed for rule induction, yielding longer computation times, higher complexity, but possibly higher predictive accuracy. See Details for supported combinations of family, use.grad and learnrate.

weights

optional vector of observation weights to be used for deriving the ensemble.

type

character. Specifies the type of base learners to include in the ensemble. Defaults to "both" (initial ensemble will include both rules and linear functions). Other options are "rules" (prediction rules only) or "linear" (linear functions only).

sampfrac

numeric value \(> 0\) and \(\le 1\). Specifies the fraction of randomly selected training observations used to produce each tree. Values \(< 1\) will result in sampling without replacement (i.e., subsampling), a value of 1 will result in sampling with replacement (i.e., bootstrap sampling). Alternatively, a sampling function may be supplied, which should take arguments n (sample size) and weights.
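A custom sampling function of the kind mentioned above might look as follows. This is a minimal sketch: the name subsample_75 is hypothetical, and it assumes that the function is expected to return the row indices of the sampled observations.

```r
# Sketch of a custom sampling function for sampfrac.
# Assumption: the function should return row indices of sampled observations.
subsample_75 <- function(n, weights = rep(1, n)) {
  # Draw 75% of the observations without replacement, proportional to weights
  sample(seq_len(n), size = round(0.75 * n), replace = FALSE, prob = weights)
}

set.seed(1)
idx <- subsample_75(100)
length(idx)          # 75
anyDuplicated(idx)   # 0: sampled without replacement
```

Such a function could then be supplied as the sampfrac argument in place of a numeric fraction.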

maxdepth

positive integer. Maximum number of conditions in rules. If length(maxdepth) == 1, it specifies the maximum depth of each tree grown. If length(maxdepth) == ntrees, it specifies the maximum depth of each consecutive tree grown. Alternatively, a random sampling function may be supplied, which takes argument ntrees and returns integer values. See also maxdepth_sampler.
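A random depth-sampling function as described above can be sketched in base R. The name depth_sampler and the chosen probabilities are illustrative assumptions; see maxdepth_sampler for the sampler provided with the package.

```r
# Hypothetical depth sampler: takes ntrees, returns one integer depth per tree.
depth_sampler <- function(ntrees) {
  # Depths 1 to 4, with most of the mass on depths 2 and 3 (arbitrary choice)
  sample(1:4, size = ntrees, replace = TRUE, prob = c(0.1, 0.3, 0.4, 0.2))
}

set.seed(42)
depths <- depth_sampler(500)
table(depths)   # distribution of maximum depths over the 500 trees
```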

learnrate

numeric value \(\ge 0\). Learning rate or boosting parameter.

mtry

positive integer. Number of randomly selected predictor variables for creating each split in each tree. Ignored when tree.unbiased = FALSE.

ntrees

positive integer value. Number of trees to generate for the initial ensemble.

confirmatory

character vector. Specifies one or more confirmatory terms to be included in the final ensemble. Linear terms can be specified as the name of a predictor variable included in data, rules can be specified as, for example, "x1 > 6 & x2 <= 8", where x1 and x2 should be names of variables in data. Terms thus specified will be included in the final ensemble, as their coefficients will not be penalized in the estimation.
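To make the rule syntax concrete, the sketch below evaluates a rule string against a data frame in base R, yielding the 0/1 variable that such a rule represents. It only illustrates the meaning of the rule text and is not the package's internal code; the rule shown is an arbitrary example using variables from airquality.

```r
airq <- airquality[complete.cases(airquality), ]

# A rule in the same syntax as accepted by the confirmatory argument
rule <- "Wind > 6 & Temp <= 80"

# Evaluate the rule text within the data (base R only, not pre's internals)
rule_var <- as.integer(eval(parse(text = rule), envir = airq))

table(rule_var)   # number of observations for which the rule is 0 or 1
```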

removecomplements

logical. Remove rules from the ensemble which are identical to (1 - an earlier rule)?

removeduplicates

logical. Remove rules from the ensemble which are identical to an earlier rule?
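What counts as a duplicate or a complement can be illustrated with a small base R sketch. The rules and data are made up for illustration, and the loop is one straightforward way to perform the check, not necessarily how the package implements it.

```r
x <- c(1, 3, 5, 7, 9, 11)

# Each column is the 0/1 indicator of one rule, evaluated on the data
rules <- cbind(
  r1 = as.integer(x > 4),
  r2 = as.integer(x <= 4),    # complement of r1 (equals 1 - r1)
  r3 = as.integer(x > 4.5),   # duplicate of r1 on these data
  r4 = as.integer(x > 8)
)

keep <- rep(TRUE, ncol(rules))
for (j in seq_len(ncol(rules))[-1]) {
  for (i in seq_len(j - 1)) {
    if (!keep[i]) next   # compare with retained earlier rules only
    if (all(rules[, j] == rules[, i]) || all(rules[, j] == 1 - rules[, i])) {
      keep[j] <- FALSE
      break
    }
  }
}

colnames(rules)[keep]   # "r1" "r4"
```

Note that r3 has a different definition than r1 but is identical on the observed data, which is why it is treated as a duplicate.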

winsfrac

numeric value \(\ge 0\) and \(\le 0.5\). Quantiles of the data distribution to be used for winsorizing linear terms. If set to 0, no winsorizing is performed. Note that ordinal variables are included as linear terms in estimating the regression model and will also be winsorized.

normalize

logical. Normalize linear variables before estimating the regression model? Normalizing gives linear terms the same a priori influence as a typical rule, by dividing the (winsorized) linear term by 2.5 times its SD.
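The effect of winsfrac and normalize on a single linear term can be sketched in base R. The function name winsorize_normalize is hypothetical, and the sketch assumes winsorizing at the winsfrac and 1 - winsfrac quantiles, followed by division by 2.5 times the SD of the winsorized term.

```r
# Hypothetical helper illustrating winsorizing followed by normalizing
winsorize_normalize <- function(x, winsfrac = 0.025) {
  if (winsfrac > 0) {
    # Pull values beyond the lower and upper quantiles back to those quantiles
    lims <- quantile(x, probs = c(winsfrac, 1 - winsfrac), names = FALSE)
    x <- pmin(pmax(x, lims[1]), lims[2])
  }
  # Divide the (winsorized) term by 2.5 times its SD
  x / (2.5 * sd(x))
}

set.seed(1)
x <- c(rnorm(200), 25)   # one extreme outlier
z <- winsorize_normalize(x)
sd(z)                    # 0.4, i.e. 1 / 2.5
```

Winsorizing first keeps the outlier from inflating the SD that is used for normalizing.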

standardize

logical. Should rules and linear terms be standardized to have SD equal to 1 before estimating the regression model? This will also standardize the dummified factors; users are therefore advised to use the default standardize = FALSE.

ordinal

logical. Should ordinal variables (i.e., ordered factors) be treated as continuous for generating rules? TRUE (the default) generally yields simpler rules, shorter computation times and better generalizability of the final ensemble.

nfolds

positive integer. Number of cross-validation folds to be used for selecting the optimal value of the penalty parameter \(\lambda\) in selecting the final ensemble.

tree.control

list with control parameters to be passed to the tree fitting function, generated using ctree_control, mob_control (if use.grad = FALSE), or rpart.control (if tree.unbiased = FALSE).

tree.unbiased

logical. Should an unbiased tree generation algorithm be employed for rule generation? Defaults to TRUE. If set to FALSE, rules will be generated employing the CART algorithm (which suffers from biased variable selection) as implemented in rpart. See Details for possible combinations with family, use.grad and learnrate.

verbose

logical. Should progress be printed to the command line?

par.init

logical. Should parallel foreach be used to generate the initial ensemble? Only used when learnrate == 0. Note: a parallel backend (such as doMC or others) must be registered beforehand. Furthermore, for smaller datasets, setting par.init = TRUE will likely only increase computation time.

par.final

logical. Should parallel foreach be used to perform cross validation for selecting the final ensemble? A parallel backend (such as doMC or others) must be registered beforehand.

sparse

logical. Should sparse design matrices be used? Likely improves computation times for large datasets.

...

Additional arguments to be passed to cv.glmnet.

Value

An object of class pre. It contains the initial ensemble of rules and/or linear terms and a range of possible final ensembles. By default, the final ensemble employed by all other methods and functions in package pre is selected using the 'minimum cross validated error plus 1 standard error' criterion. All functions and methods for objects of class pre take a penalty.parameter.val argument, which can be used to select a different criterion.

Details

Note that observations with missing values will be removed prior to analysis.

In some cases, duplicated variable names may appear in the model. For example, suppose the first variable is a factor named 'V1' and there are also variables named 'V10' and/or 'V11' and/or 'V12' (etc). Then for the factor V1, dummy contrast variables will be created, named 'V10', 'V11', 'V12' (etc). As should be clear from this example, this yields duplicated variable names, which may cause problems later on, for example in the calculation of predictions and importances. This can be prevented by renaming factor variables with numbers in their name, prior to analysis.
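The name collision can be reproduced with base R's model.matrix, which builds dummy variable names by pasting the factor name and the level. This is an illustrative sketch with made-up data; pre's own dummy coding may differ in detail (for example, in which levels receive a dummy), but the naming problem is the same.

```r
df <- data.frame(
  V1  = factor(c(0, 1, 2, 0, 1, 2)),      # factor with levels "0", "1", "2"
  V11 = c(0.5, 1.2, 0.3, 2.1, 1.7, 0.9)   # numeric variable
)

cn <- colnames(model.matrix(~ V1 + V11, data = df))
cn                      # "(Intercept)" "V11" "V12" "V11"
anyDuplicated(cn) > 0   # TRUE: dummy for level "1" collides with variable V11

# Renaming the factor prior to analysis avoids the collision
names(df)[names(df) == "V1"] <- "group"
cn2 <- colnames(model.matrix(~ group + V11, data = df))
anyDuplicated(cn2) > 0  # FALSE
```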

The table below provides an overview of combinations of response variable types, use.grad, tree.unbiased and learnrate settings that are supported, and the tree induction algorithm that will be employed as a result:

use.grad  tree.unbiased  learnrate  family       tree alg.  Response variable format
TRUE      TRUE           0          gaussian     ctree      Single, numeric (non-integer)
TRUE      TRUE           0          mgaussian    ctree      Multiple, numeric (non-integer)
TRUE      TRUE           0          binomial     ctree      Single, factor with 2 levels
TRUE      TRUE           0          multinomial  ctree      Single, factor with >2 levels
TRUE      TRUE           0          poisson      ctree      Single, integer
TRUE      TRUE           0          cox          ctree      Object of class 'Surv'
TRUE      TRUE           >0         gaussian     ctree      Single, numeric (non-integer)
TRUE      TRUE           >0         mgaussian    ctree      Multiple, numeric (non-integer)
TRUE      TRUE           >0         binomial     ctree      Single, factor with 2 levels
TRUE      TRUE           >0         multinomial  ctree      Single, factor with >2 levels
TRUE      TRUE           >0         poisson      ctree      Single, integer
TRUE      TRUE           >0         cox          ctree      Object of class 'Surv'
FALSE     TRUE           0          gaussian     glmtree    Single, numeric (non-integer)
FALSE     TRUE           0          binomial     glmtree    Single, factor with 2 levels
FALSE     TRUE           0          poisson      glmtree    Single, integer
FALSE     TRUE           >0         gaussian     glmtree    Single, numeric (non-integer)
FALSE     TRUE           >0         binomial     glmtree    Single, factor with 2 levels
FALSE     TRUE           >0         poisson      glmtree    Single, integer
TRUE      FALSE          0          gaussian     rpart      Single, numeric (non-integer)
TRUE      FALSE          0          binomial     rpart      Single, factor with 2 levels
TRUE      FALSE          0          multinomial  rpart      Single, factor with >2 levels
TRUE      FALSE          0          poisson      rpart      Single, integer
TRUE      FALSE          0          cox          rpart      Object of class 'Surv'
TRUE      FALSE          >0         gaussian     rpart      Single, numeric (non-integer)
TRUE      FALSE          >0         binomial     rpart      Single, factor with 2 levels
TRUE      FALSE          >0         poisson      rpart      Single, integer
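The supported combinations can also be expressed as a small lookup. The function below is a hypothetical helper written for illustration only (it is not part of package pre); it encodes the tree algorithm and the families supported for a given combination of use.grad, tree.unbiased and learnrate.

```r
# Hypothetical lookup (not part of pre) encoding the supported combinations
supported_setup <- function(use.grad = TRUE, tree.unbiased = TRUE,
                            learnrate = 0.01) {
  if (use.grad && tree.unbiased) {
    list(alg = "ctree",
         families = c("gaussian", "mgaussian", "binomial",
                      "multinomial", "poisson", "cox"))
  } else if (!use.grad && tree.unbiased) {
    list(alg = "glmtree",
         families = c("gaussian", "binomial", "poisson"))
  } else if (use.grad && !tree.unbiased) {
    # With rpart, the supported families depend on whether learnrate is 0
    fam <- if (learnrate > 0) {
      c("gaussian", "binomial", "poisson")
    } else {
      c("gaussian", "binomial", "multinomial", "poisson", "cox")
    }
    list(alg = "rpart", families = fam)
  } else {
    stop("use.grad = FALSE with tree.unbiased = FALSE is not listed as supported")
  }
}

supported_setup(use.grad = FALSE)$alg                            # "glmtree"
"cox" %in% supported_setup(tree.unbiased = FALSE)$families       # FALSE
"cox" %in% supported_setup(tree.unbiased = FALSE,
                           learnrate = 0)$families               # TRUE
```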

References

Fokkema, M. (2018). Fitting prediction rule ensembles with R package pre. https://arxiv.org/abs/1707.07149.

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29(5), 1189-1232.

Friedman, J. H., & Popescu, B. E. (2008). Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3), 916-954.

Hothorn, T., & Zeileis, A. (2015). partykit: A modular toolkit for recursive partytioning in R. Journal of Machine Learning Research, 16, 3905-3909.

See Also

print.pre, plot.pre, coef.pre, importance, predict.pre, interact, cvpre

Examples

# NOT RUN {
## Fit pre to a continuous response:
airq <- airquality[complete.cases(airquality), ]
set.seed(42)
airq.ens <- pre(Ozone ~ ., data = airq)
airq.ens

## Fit pre to a binary response:
airq2 <- airquality[complete.cases(airquality), ]
airq2$Ozone <- factor(airq2$Ozone > median(airq2$Ozone))
set.seed(42)
airq.ens2 <- pre(Ozone ~ ., data = airq2, family = "binomial")
airq.ens2

## Fit pre to a multivariate continuous response:
airq3 <- airquality[complete.cases(airquality), ] 
set.seed(42)
airq.ens3 <- pre(Ozone + Wind ~ ., data = airq3, family = "mgaussian")
airq.ens3

## Fit pre to a multinomial response:
set.seed(42)
iris.ens <- pre(Species ~ ., data = iris, family = "multinomial")
iris.ens

## Fit pre to a survival response:
library("survival")
lung <- lung[complete.cases(lung), ]
set.seed(42)
lung.ens <- pre(Surv(time, status) ~ ., data = lung, family = "cox")
lung.ens

## Fit pre to a count response:
## Generate random data (partly based on Dobson (1990) Page 93: Randomized 
## Controlled Trial):
counts <- rep(as.integer(c(18, 17, 15, 20, 10, 20, 25, 13, 12)), times = 10)
outcome <- rep(gl(3, 1, 9), times = 10)
treatment <- rep(gl(3, 3), times = 10)
noise1 <- 1:90
set.seed(1)
noise2 <- rnorm(90)
countdata <- data.frame(treatment, outcome, counts, noise1, noise2)
set.seed(42)
count.ens <- pre(counts ~ ., data = countdata, family = "poisson")
count.ens
# }
