ordinalForest (version 1.0)

ordfor: Ordinal forests

Description

Constructs ordinal forests (OF) as presented in Hornung et al. (in prep). The following tasks can be performed using OF: 1) Predicting the values of an ordinal target variable for new observations based on covariate values (see predict.ordfor); 2) Estimating the relative widths of the classes of the ordinal target variable; 3) Ranking the covariates by their importance with respect to predicting the values of the ordinal target variable. The default values for the following hyperparameters are appropriate for most applications: ndiv, ntreeperdiv, ntreefinal, npermtrial, and nbest. Thus, these arguments generally do NOT have to be specified by hand. For details on OF see the 'Details' section below.

Usage

ordfor(depvar, data, ndiv = 1000, ntreeperdiv = 100, ntreefinal = 5000,
  perfmeasure = c("equal", "irrespective", "onecateg", "custom"), categimp,
  categweights, npermtrial = 500, nbest = 10, importance = "permutation")

Arguments

depvar

character. Name of the dependent variable in data.

data

data.frame. Data frame containing the covariates and a factor-valued ordinal target variable.

ndiv

integer. Number of random partitions of [0,1] to consider in the optimization.

ntreeperdiv

integer. Number of trees in the smaller regression forests constructed for the ndiv different partitions.

ntreefinal

integer. Number of trees in the larger regression forest constructed using the optimized partition (i.e., the OF).

perfmeasure

character. Performance measure. The default is "equal". See 'Details' subsection 'Performance measures' below and perfm.

categimp

character. Class to prioritize if perfmeasure="onecateg".

categweights

numeric. Needed if perfmeasure="custom": a vector of length equal to the number of classes, containing the class weights. Classes with higher weights are to be predicted more precisely than those with smaller weights.

npermtrial

integer. Number of permutations to try for each partition considered during the optimization.

nbest

integer. Number of forests to consider when averaging the partitions corresponding to the nbest forests with the highest levels of prediction performance.

importance

character. Variable importance mode, one of 'none', 'impurity', 'permutation'. The 'impurity' measure is the variance of the responses for regression. Passed to the function ranger from the package of the same name.

Value

ordfor returns an object of class "ordfor", a list containing the following components:

forestfinal

object of class "ranger". Regression forest constructed using the optimized partition of [0,1] (i.e., the OF). Used by predict.ordfor.

bordersbest

vector of length J+1. Average over the nbest best partitions of [0,1]. Used by predict.ordfor.

forests

list of length ndiv. The smaller forests constructed during optimizing the partition of [0,1].

perfmeasurevalues

vector of length ndiv. Performance measure values of the ndiv smaller forests stored in forests.

bordersb

matrix of dimension ndiv x (J+1). All ndiv partitions considered.

classes

character vector of length J. Classes of the target variable.

ndiv

integer. Number of random partitions considered.

ntreeperdiv

integer. Number of trees per partition.

ntreefinal

integer. Number of trees of OF (constructed after optimizing the partition).

perfmeasure

character. Performance measure.

categimp

character. If perfmeasure="onecateg": class to prioritize; NA otherwise.

nbest

integer. Number of forests considered in the averaging of the partitions corresponding to the nbest forests with the highest levels of prediction performance.

classfreq

table. Class frequencies.

varimp

vector of length p. Variable importance for each covariate.

Details

Introduction

Ordinal forests (OF) are a method for ordinal regression with high-dimensional and low-dimensional data that is able to predict the values of the ordinal target variable for new observations and, at the same time, to estimate the relative widths of the classes of the ordinal target variable. Using a (permutation-based) variable importance measure, it is moreover possible to rank the covariates with respect to their importance. OF will be presented in an upcoming technical report by Hornung et al.

Methods

The concept of OF is based on the following assumption: There exists a (possibly latent) refined continuous variable y* underlying the observed ordinal target variable y (y in {1,...,J}, J number of classes), where y* determines the values of y. The functional relationship between y* and y takes the form of a monotonically increasing step function. Depending on which of J intervals ]a_1, a_2], ]a_2, a_3], ..., ]a_J, a_{J+1}[ contains the value of y*, the ordinal target variable y takes a different value.

In the context of conditional inference trees, for situations in which the intervals ]a_1, a_2], ..., ]a_J, a_{J+1}[ are available, Hothorn et al. (2006) suggest using the midpoints of these intervals as a metric target variable. OF, however, are concerned with settings in which the intervals ]a_1, a_2], ..., ]a_J, a_{J+1}[ are unknown. The main idea of OF is to estimate the unknown interval borders by maximizing the out-of-bag prediction performance (see section "Performance measures") of standard regression forests that use as a target variable a quantity very similar to the midpoints of the intervals. More precisely, we assume y* to have a standard normal distribution and consider the following score values as the values of the target variable: phi^(-1)((phi(a_j) + phi(a_{j+1}))/2) if y = j, with a_1 = -infinity and a_{J+1} = +infinity, where phi denotes the cumulative distribution function of the standard normal distribution. The normality assumption is made simply because normally distributed target variables can be expected to be well suited for standard regression forests.
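To make the score formula concrete, the following is a minimal sketch in Python (the package itself is implemented in R) that computes the score values from a set of cumulative borders phi(a_j); the borders in `c` are hypothetical values chosen for illustration, not output of the package.

```python
from statistics import NormalDist

# Hypothetical cumulative borders phi(a_j) for J = 4 classes;
# phi(a_1) = 0 and phi(a_{J+1}) = 1 by construction.
c = [0.0, 0.2, 0.5, 0.8, 1.0]

nd = NormalDist()  # standard normal distribution

# Score for class j: phi^(-1)((phi(a_j) + phi(a_{j+1})) / 2),
# i.e., the normal quantile of the midpoint of each interval of [0,1].
scores = [nd.inv_cdf((c[j] + c[j + 1]) / 2) for j in range(len(c) - 1)]
# roughly [-1.28, -0.39, 0.39, 1.28]
```

Because the interval midpoints here are symmetric around 0.5, the resulting scores are symmetric around 0.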

The estimation of the optimal partition is based on constructing many regression forests (b in 1,...,ndiv) of smaller size, where each of these uses as the values of the target variable a score set obtained using random sampling: {phi^(-1)((phi(a_{b,j}) + phi(a_{b,j+1}))/2) : j in 1,...,J}, where phi(a_{b,1})=0 < phi(a_{b,2}) <... < phi(a_{b,J}) < phi(a_{b,J+1})=1 is a random partition of [0,1].
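The sampling of one such random partition of [0,1] and its associated score set can be sketched as follows (Python used for illustration; `J = 5` and the seed are arbitrary choices):

```python
import random
from statistics import NormalDist

J = 5            # assumed number of classes
random.seed(1)   # reproducibility only

# Draw J - 1 uniform random numbers, sort them, and fix the outer
# borders at 0 and 1: a random partition of [0, 1].
inner = sorted(random.random() for _ in range(J - 1))
c = [0.0] + inner + [1.0]   # c[j] plays the role of phi(a_{b,j})

# Scores used as target-variable values for the b-th small forest:
nd = NormalDist()
scores = [nd.inv_cdf((c[j] + c[j + 1]) / 2) for j in range(J)]
```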

After estimating the intervals, a larger regression forest is built using as values of the target variable the score values phi^(-1)((phi(a_{opt,j}) + phi(a_{opt,j+1}))/2), where the a_{opt,j} values are the optimized borders.

An important by-product of the construction of OFs are the estimated interval borders a_{opt,1},...,a_{opt,J+1}. The optimized intervals ]phi(a_{opt,1}), phi(a_{opt,2})], ..., ]phi(a_{opt,J}), phi(a_{opt,J+1})[ can be interpreted intuitively as the estimated relative widths of the classes 1,...,J.

Further details are given in the sections "Hyperparameters" and "Performance measures".

Hyperparameters

There are several hyperparameters, which, however, generally do not have to be optimized by the user. Except in the case of nbest, the larger these values are, the better; the default values should be large enough for most applications.

These hyperparameters are described in the following:

  • ndiv Default value: 1000. The default number of partitions considered in the estimation of the optimal partition is quite large. Such a large number is necessary to attain a high chance that some of the partitions are close enough to the optimal partition, that is, the partition that leads to the optimal performance with respect to the considered performance measure (provided via perfmeasure).

  • ntreeperdiv Default value: 100. A smaller number of trees per tried partition might lead to too much variability in the assessment of the performance achieved for the individual partitions.

  • ntreefinal Default value: 5000. The number of trees ntreefinal plays the same role as in conventional regression forests. For very high-dimensional datasets the default number 5000 might be too small.

  • npermtrial Default value: 500. As stated above, it is necessary to consider a large number ndiv of tried partitions in the optimization in order to increase the chance that the best of the considered partitions are close to the optimal partition. For a larger number of classes, it is in addition necessary to use an efficient algorithm for sampling the considered partitions. This algorithm should ensure that the partitions are heterogeneous enough to guarantee that the best of the tried partitions are close enough to the optimal partition. In OF we use an algorithm that aims to make the rankings of the widths of the intervals corresponding to the respective classes as dissimilar as possible between successive partitions.

    Denote d_{b,j} := phi(a_{b,j}). Then the b-th partition is generated as follows:

    If b = 1, then J-1 instances of a U(0,1) distributed random variable are drawn and the values are sorted. The sorted values are designated as d_{1,2},...,d_{1,J}. Moreover, set d_{1,1} := 0 and d_{1,J+1} := 1. The first considered partition is now { d_{1,1}, ... ,d_{1,J+1} }.

    If b > 1, the following procedure is performed:

    1. Let dist_{b-1,j} = d_{b-1,j+1} - d_{b-1,j} (j in {1,...,J}) denote the width of the j-th interval of the (b-1)-th partition and r_{b-1,j} the rank of dist_{b-1,j} among dist_{b-1,1},...,dist_{b-1,J}. First, npermtrial random permutations of 1,...,J are drawn, where r*_{l,1},...,r*_{l,J} denotes the l-th drawn permutation. Subsequently, the permutation r*_{l*,1},...,r*_{l*,J} (l* in {1,...,npermtrial}) is determined that features the greatest quadratic distance to r_{b-1,1},...,r_{b-1,J}, that is, r*_{l*,1},...,r*_{l*,J} = argmax_{r*_{l,1},...,r*_{l,J}} sum_{j=1}^J (r*_{l,j} - r_{b-1,j})^2.

    2. A partition is generated in the same way as in the case of b = 1.

    3. The intervals of the partition generated in 2. are re-ordered in such a way that the width of the j-th interval (for j in {1,...,J}) has rank r*_{l*,j} among the widths of the intervals ]d_{b,1}, d_{b,2}], ..., ]d_{b,J}, d_{b,J+1}]. The b-th considered partition is now { d_{b,1}, ... ,d_{b,J+1} }.

    The default value 500 for npermtrial should be large enough to ensure sufficiently heterogeneous partitions, independently of the number of classes J.

  • nbest Default value: 10. After ndiv regression forests have been constructed, each using a particular partition, the nbest partitions associated with the highest levels of prediction performance are considered: For each j in {1,...,J} the average of the phi(a_j) values across these nbest partitions is taken. Then phi^(-1)(.) is applied to the obtained averages, resulting in the estimated optimal borders a_{opt,1},...,a_{opt,J+1}. The latter are then used to calculate the scores for the OF via the formula phi^(-1)((phi(a_{opt,j}) + phi(a_{opt,j+1}))/2) (see above). It is important that the number nbest of best partitions considered is not strongly misspecified. Too large a value of nbest could lead to averaging over a too heterogeneous set of partitions; conversely, too small a value could lead to too large a variance in the estimation of the optimal partition. For larger numbers of tried partitions ndiv, the results are less sensitive to the choice of nbest, because the number of tried partitions that are close to the optimal partition grows with the number of partitions considered. With ndiv set to its default of 1000, the default value 10 for nbest was seen to lead to a good trade-off between the heterogeneity of the averaged partitions and the variance in the estimation.
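The permutation-based sampling step for b > 1 described above can be sketched as follows. This is an illustrative Python reimplementation of the algorithm (the function name `next_partition` is made up, not part of the package):

```python
import random

def next_partition(prev, npermtrial=500, rng=random):
    """Sketch of step b > 1: draw a fresh random partition of [0,1] and
    re-order its interval widths so that their ranks are maximally
    dissimilar (in squared distance) to the previous partition's ranks."""
    J = len(prev) - 1

    # Ranks r_{b-1,j} of the previous partition's interval widths
    widths_prev = [prev[j + 1] - prev[j] for j in range(J)]
    order = sorted(range(J), key=lambda j: widths_prev[j])
    r_prev = [0] * J
    for rank, j in enumerate(order, start=1):
        r_prev[j] = rank

    # Among npermtrial random permutations of 1,...,J, pick the one with
    # the greatest quadratic distance to r_prev
    best, best_dist = None, -1
    for _ in range(npermtrial):
        perm = list(range(1, J + 1))
        rng.shuffle(perm)
        d = sum((perm[j] - r_prev[j]) ** 2 for j in range(J))
        if d > best_dist:
            best, best_dist = perm, d

    # Fresh partition, generated as in the b = 1 case
    inner = sorted(rng.random() for _ in range(J - 1))
    c = [0.0] + inner + [1.0]
    widths = sorted(c[j + 1] - c[j] for j in range(J))  # ascending

    # Re-order: the j-th interval receives the width of rank best[j]
    new_widths = [widths[best[j] - 1] for j in range(J)]
    out = [0.0]
    for w in new_widths:
        out.append(out[-1] + w)
    out[-1] = 1.0  # guard against floating-point drift
    return out
```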

Performance measures

As noted above, the different partitions tried during the estimation of the optimal partition are assessed with respect to their (out-of-bag) prediction performance. The choice of the specific performance measure used here determines the specific kind of performance the ordinal forest should feature:

  • perfmeasure="equal" This choice should be made if it is of interest to predict each class with the same precision, irrespective of the numbers of observations representing the individual classes. Youden's J statistic is calculated with respect to each class ("observation/prediction in class j" vs. "observation/prediction NOT in class j", j=1,...,J) and the simple average of the J results is taken.

  • perfmeasure="irrespective" This choice should be made if it is of interest to predict all observations with the same precision, independent of their classes. Youden's J statistic is calculated with respect to each class and subsequently a weighted average of these values is taken, with weights proportional to the number of observations representing the respective classes in the training data.

  • perfmeasure="onecateg" This choice should be made if it is merely relevant that one particular class of the ordinal target variable is predicted as precisely as possible. This class must be provided via the argument categimp. Youden's J statistic is calculated with respect to the class categimp ("observation/prediction in class categimp" vs. "observation/prediction NOT in class categimp").

  • perfmeasure="custom" This choice should be made if there is a particular ranking of the classes with respect to their importance. Youden's J statistic is calculated with respect to each class. Subsequently, a weighted average with user-specified weights (provided via the argument categweights) is taken. Classes with higher weights are to be predicted more precisely than those with smaller weights.
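The class-wise Youden's J computation underlying all four performance measures can be sketched as follows (an illustrative Python version; the helper names `youden_j` and `perf` are made up for this sketch):

```python
def youden_j(y_true, y_pred, cls):
    """Youden's J for "in class cls" vs. "NOT in class cls":
    sensitivity + specificity - 1."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    tn = sum(t != cls and p != cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens + spec - 1.0

def perf(y_true, y_pred, classes, weights=None):
    """Weighted average of class-wise Youden's J values.
    weights=None mimics "equal"; class frequencies mimic "irrespective";
    user-chosen weights mimic "custom"."""
    js = [youden_j(y_true, y_pred, c) for c in classes]
    if weights is None:
        return sum(js) / len(js)
    return sum(w * j for w, j in zip(weights, js)) / sum(weights)
```

For "onecateg", only `youden_j(y_true, y_pred, categimp)` for the prioritized class would be used.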

References

Hothorn, T., Hornik, K., Zeileis, A. (2006) Unbiased recursive partitioning: a conditional inference framework. Journal of Computational and Graphical Statistics, 15, 651--674.

Examples

library(ordinalForest)
data(hearth)

set.seed(123)
hearthsubset <- hearth[sort(sample(1:nrow(hearth), size=floor(nrow(hearth)*(1/2)))),]
ordforres <- ordfor(depvar="Class", data=hearthsubset, ndiv=80, nbest=5, ntreeperdiv=100, 
  ntreefinal=1000, perfmeasure = "equal")
# NOTE: ndiv=80 is not enough!! In practice, ndiv=1000 (default value) or a higher
# number should be used.

ordforres

sort(ordforres$varimp, decreasing=TRUE)

