Conditional Inference Trees: Conditional Inference Trees

Description

Recursive partitioning for continuous, censored, ordered, nominal and multivariate response variables in a conditional inference framework.

Usage

ctree(formula, data, subset = NULL, weights = NULL, 
      controls = ctree_control(), xtrafo = ptrafo, ytrafo = ptrafo, 
      scores = NULL)

Arguments

formula

a symbolic description of the model to be fit. Note that symbols like : and - will not work and the tree will make use of all variables listed on the rhs of formula.

data

a data frame containing the variables in the model.

subset

an optional vector specifying a subset of observations to be used in the fitting process.

weights

an optional vector of weights to be used in the fitting process. Only non-negative integer valued weights are allowed.

controls

an object of class TreeControl, which can be obtained using ctree_control.

xtrafo

a function to be applied to all input variables. By default, the ptrafo function is applied.

ytrafo

a function to be applied to all response variables. By default, the ptrafo function is applied.

scores

an optional named list of scores to be attached to ordered factors.

Value

An object of class BinaryTree-class.

Details

Conditional inference trees estimate a regression relationship by binary recursive partitioning in a conditional inference framework. Roughly, the algorithm works as follows: 1) Test the global null hypothesis of independence between any of the input variables and the response (which may be multivariate as well). Stop if this hypothesis cannot be rejected. Otherwise select the input variable with strongest association to the resonse. This association is measured by a p-value corresponding to a test for the partial null hypothesis of a single input variable and the response. 2) Implement a binary split in the selected input variable. 3) Recursively repeate steps 1) and 2).

The implementation utilizes a unified framework for conditional inference, or permutation tests, developed by Strasser and Weber (1999). The stop criterion in step 1) is either based on multiplicity adjusted p-values (testtype == "Bonferroni" or testtype == "MonteCarlo" in ctree_control), on the univariate p-values (testtype == "Univariate"), or on values of the test statistic (testtype == "Teststatistic"). In both cases, the criterion is maximized, i.e., 1 - p-value is used. A split is implemented when the criterion exceeds the value given by mincriterion as specified in ctree_control. For example, when mincriterion = 0.95, the p-value must be smaller than $0.05$ in order to split this node. This statistical approach ensures that the right sized tree is grown and no form of pruning or cross-validation or whatsoever is needed. The selection of the input variable to split in is based on the univariate p-values avoiding a variable selection bias towards input variables with many possible cutpoints.

Multiplicity-adjusted Monte-Carlo p-values are computed following a "min-p" approach. The univariate p-values based on the limiting distribution (chi-square or normal) are computed for each of the random permutations of the data. This means that one should use a quadratic test statistic when factors are in play (because the evaluation of the corresponding multivariate normal distribution is time-consuming).

By default, the scores for each ordinal factor x are 1:length(x), this may be changed using scores = list(x = c(1,5,6)), for example.

Predictions can be computed using predict or treeresponse. The first function accepts arguments type = c("response", "node", "prob") where type = "response" returns predicted means, predicted classes or median predicted survival times, type = "node" returns terminal node IDs (identical to where) and type = "prob" gives more information about the conditional distribution of the response, i.e., class probabilities or predicted Kaplan-Meier curves and is identical to treeresponse. For observations with zero weights, predictions are computed from the fitted tree when newdata = NULL.

For a general description of the methodology see Hothorn, Hornik and Zeileis (2006) and Hothorn, Hornik, van de Wiel and Zeileis (2006). Introductions for novices can be found in Strobl et al. (2009) and at http://github.com/christophM/overview-ctrees.git.

References

Helmut Strasser and Christian Weber (1999). On the asymptotic theory of permutation statistics. Mathematical Methods of Statistics, 8, 220--250.

Torsten Hothorn, Kurt Hornik, Mark A. van de Wiel and Achim Zeileis (2006). A Lego System for Conditional Inference. The American Statistician, 60(3), 257--263.

Torsten Hothorn, Kurt Hornik and Achim Zeileis (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651--674. Preprint available from http://statmath.wu-wien.ac.at/~zeileis/papers/Hothorn+Hornik+Zeileis-2006.pdf

Carolin Strobl, James Malley and Gerhard Tutz (2009). An Introduction to Recursive Partitioning: Rationale, Application, and Characteristics of Classification and Regression Trees, Bagging, and Random forests. Psychological Methods, 14(4), 323--348.

Examples

Run this code

# NOT RUN {
    set.seed(290875)

    ### regression
    airq <- subset(airquality, !is.na(Ozone))
    airct <- ctree(Ozone ~ ., data = airq, 
                   controls = ctree_control(maxsurrogate = 3))
    airct
    plot(airct)
    mean((airq$Ozone - predict(airct))^2)
    ### extract terminal node ID, two ways
    all.equal(predict(airct, type = "node"), where(airct))

    ### classification
    irisct <- ctree(Species ~ .,data = iris)
    irisct
    plot(irisct)
    table(predict(irisct), iris$Species)

    ### estimated class probabilities, a list
    tr <- treeresponse(irisct, newdata = iris[1:10,])

    ### ordinal regression
    data("mammoexp", package = "TH.data")
    mammoct <- ctree(ME ~ ., data = mammoexp) 
    plot(mammoct)

    ### estimated class probabilities
    treeresponse(mammoct, newdata = mammoexp[1:10,])

    ### survival analysis
    if (require("TH.data") && require("survival")) {
        data("GBSG2", package = "TH.data")
        GBSG2ct <- ctree(Surv(time, cens) ~ .,data = GBSG2)
        plot(GBSG2ct)
        treeresponse(GBSG2ct, newdata = GBSG2[1:2,])        
    }

    ### if you are interested in the internals:
    ### generate doxygen documentation
    
# }
# NOT RUN {
        ### download src package into temp dir
        tmpdir <- tempdir()
        tgz <- download.packages("party", destdir = tmpdir)[2]
        ### extract
        untar(tgz, exdir = tmpdir)
        wd <- setwd(file.path(tmpdir, "party"))
        ### run doxygen (assuming it is there)
        system("doxygen inst/doxygen.cfg")
        setwd(wd)
        ### have fun
        browseURL(file.path(tmpdir, "party", "inst", 
                            "documentation", "html", "index.html")) 
    
# }

Run the code above in your browser using DataLab