pre (version 0.3.0)

pre: Derive a prediction rule ensemble

Description

pre derives a sparse ensemble of rules and/or linear functions for prediction of a continuous or binary outcome.

Usage

pre(formula, data, family = c("gaussian", "binomial", "poisson"),
  use.grad = TRUE, weights, type = "both", sampfrac = 0.5,
  maxdepth = 3L, learnrate = 0.01, mtry = Inf, ntrees = 500,
  removecomplements = TRUE, removeduplicates = TRUE, winsfrac = 0.025,
  normalize = TRUE, standardize = FALSE, nfolds = 10L, verbose = FALSE,
  par.init = FALSE, par.final = FALSE, tree.control, ...)

Arguments

formula

a symbolic description of the model to be fit of the form y ~ x1 + x2 + ... + xn. The response (left-hand side of the formula) should be of class numeric or of class factor (with two levels). If the response is a factor, an ensemble for binary classification is created; otherwise, an ensemble for prediction of a numeric response is created. If the outcome is a non-negative count, this should additionally be specified by setting family = "poisson". Note that input variables may not have 'rule' as (part of) their name, and the formula may not exclude the intercept (that is, + 0 or - 1 may not be used in the right-hand side of the formula).
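
For illustration, a minimal sketch of the two response types (dat and all variable names here are hypothetical placeholders):

# numeric response: ensemble for prediction of a continuous outcome
ens.gaus <- pre(y ~ x1 + x2 + x3, data = dat)
# response is a factor with two levels: ensemble for binary classification
ens.bin <- pre(ybin ~ x1 + x2 + x3, data = dat)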

data

data.frame containing the variables in the model. Response must be a factor for binary classification, numeric for (count) regression. Input variables must be of class numeric, factor or ordered factor.

family

character. Specification is required only for non-negative count responses, by specifying family = "poisson". Otherwise, family = "gaussian" is employed if the response specified in formula is numeric and family = "binomial" is employed if the response is a binary factor. Note that if family = "poisson" is specified, glmtree with intercept-only models in the nodes will be employed for inducing trees, instead of ctree. Although this yields longer computation times, it also yields better accuracy for count outcomes.
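
For a count outcome, the family must be set explicitly; a short sketch (dat and ycount are hypothetical placeholders):

# non-negative count response: trees induced via glmtree, as described above
ens.pois <- pre(ycount ~ x1 + x2 + x3, data = dat, family = "poisson")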

use.grad

logical. For binary outcomes, should gradient boosting with regression trees be employed when learnrate > 0? That is, should ctree be used as in Friedman (2001), with a second-order Taylor expansion? By default set to TRUE, as this yields shorter computation times. If set to FALSE, glmtree with intercept-only models in the nodes will be employed, which yields longer computation times (but may increase the likelihood of detecting interactions).

weights

an optional vector of observation weights to be used for deriving the ensemble.

type

character. Specifies the type of base learners to be included in the ensemble. Defaults to "both" (initial ensemble will include both rules and linear functions). Other options are "rules" (prediction rules only) or "linear" (linear functions only).

sampfrac

numeric. Takes values \(> 0\) and \(\leq 1\), representing the fraction of randomly selected training observations used to produce each tree. Values \(< 1\) will result in sampling without replacement (i.e., subsampling), a value of 1 will result in sampling with replacement (i.e., bootstrapping).

maxdepth

numeric. Maximum number of conditions that can define a rule.

learnrate

numeric. Learning rate or boosting parameter.

mtry

numeric. Number of randomly selected predictor variables for creating each split in each tree.

ntrees

numeric. Number of trees to generate for the initial ensemble.

removecomplements

logical. Remove rules from the ensemble which have the same support in the training data as the inverse of other rules?

removeduplicates

logical. Remove rules from the ensemble which have the exact same support in training data?

winsfrac

numeric. Quantiles of the data distribution to be used for winsorizing linear terms. If set to 0, no winsorizing is performed. Note that ordinal variables are included as linear terms in estimating the regression model and will also be winsorized.

normalize

logical. Normalize linear variables before estimating the regression model? Normalizing gives linear terms the same a priori influence as a typical rule, by dividing the (winsorized) linear term by 2.5 times its SD.
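
As an illustration of the scaling described above (not the package's internal code; x is a hypothetical, already winsorized predictor):

x <- rnorm(100)
# divide by 2.5 times the SD, giving the linear term the a priori
# influence of a typical rule
x.normalized <- x / (2.5 * sd(x))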

standardize

logical. Should rules and linear terms be standardized to have SD equal to 1 before estimating the regression model? Since this will also standardize the dummified factors, users are advised to use the default standardize = FALSE.

nfolds

numeric. Number of cross-validation folds to be used for selecting the optimal value of the penalty parameter \(\lambda\) in selecting the final ensemble.

verbose

logical. Should information on the initial and final ensemble be printed to the command line?

par.init

logical. Should parallel foreach be used to generate the initial ensemble? Only used when learnrate == 0 and family != "poisson". A parallel backend must be registered beforehand, for example using the doMC or doParallel package.

par.final

logical. Should parallel foreach be used to perform the cross-validation for selecting the final ensemble? A parallel backend must be registered beforehand, for example using the doMC or doParallel package, as in the sketch below.
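
A minimal sketch of registering a backend before fitting (the choice of doParallel and of two cores is an assumption; any foreach-compatible backend should work):

library("doParallel")
registerDoParallel(cores = 2)  # register a parallel backend for foreach
ens <- pre(Ozone ~ ., data = airquality[complete.cases(airquality), ],
  par.final = TRUE)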

tree.control

list with control parameters to be passed to the tree fitting function, see ctree_control.

...

Additional arguments to be passed to cv.glmnet.

Value

an object of class pre, which contains the initial ensemble of rules and/or linear terms and the final ensembles for a wide range of penalty parameter values. By default, the final ensemble employed by all of the other methods and functions in package pre is selected using the 'minimum cross-validated error plus 1 standard error' criterion. All functions and methods also take a penalty.parameter.value argument, which can be used to select a more or less sparse final ensemble. The penalty.parameter.value argument takes values "lambda.1se" (the default), "lambda.min", or a numeric value. Users can assess the trade-off between sparsity and accuracy provided by every possible value of the penalty parameter (\(\lambda\)) by inspecting object$glmnet.fit and plot(object$glmnet.fit).
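
For example, assuming a fitted ensemble airq.ens as in the Examples below (a sketch, using the penalty.parameter.value argument as documented above):

plot(airq.ens$glmnet.fit)  # sparsity-accuracy trade-off across lambda values
# coefficients for the (less sparse) ensemble minimizing cross-validated error:
coef(airq.ens, penalty.parameter.value = "lambda.min")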

Details

Observations with missing values will be removed prior to analysis.

In rare cases, duplicated variable names may appear in the model. For example, suppose the first variable is a factor named 'V1' and the data also contain non-factor variables named 'V10', 'V11', and/or 'V12' (etc.). For the factor V1, dummy contrast variables named 'V10', 'V11', 'V12' (etc.) will then be created. As this example shows, such overlap yields duplicated variable names, which will lead to warnings, errors, and incorrect results. Users should prevent this by renaming variables prior to analysis, as in the sketch below.
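
A minimal sketch of such renaming (dat and the variable names are hypothetical):

# rename 'V10' so it cannot clash with dummy variables created for factor 'V1'
names(dat)[names(dat) == "V10"] <- "Var10"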

Inputs can be numeric, ordered or factor variables. Response can be a numeric, count or binary categorical variable.

References

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29(5), 1189-1232.

See Also

print.pre, plot.pre, coef.pre, importance, predict.pre, interact, cvpre

Examples

set.seed(42)
airq.ens <- pre(Ozone ~ ., data = airquality[complete.cases(airquality),], verbose = TRUE)
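
The fitted ensemble can then be inspected and used for prediction with the methods listed under See Also; a hedged continuation of the example above:

print(airq.ens)  # print the final ensemble
importance(airq.ens)  # base learner and input variable importances
predict(airq.ens, newdata = airquality[complete.cases(airquality), ])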
