OptPrInDT: Optimisation of undersampling percentages for classification

Description

The function OptPrInDT applies an iterative technique, a nested grid search, to find optimal undersampling percentages 'percl' for the larger class and 'percs' for the smaller class. These percentages are intended for use with the function PrInDT, which models the relationship between the two-class factor variable 'classname' and all other factor and numerical variables in the data frame 'data' by means of 'N' repetitions of undersampling. The optimization criterion is the balanced accuracy on the validation sample 'valdat' (default = full sample 'data'). The trees generated from undersampling can be restricted by rejecting trees that include split results specified in the character strings of the vector 'ctestv'.
The inputs 'plmax' and 'psmax' determine the maximal values of the percentages, and the inputs 'distl' and 'dists' determine the distances to the next smaller percentage to be tried.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
The parameter 'steps' controls how many of the 3 possible optimization steps are carried out; default = 3.
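
To make the two central notions of the description concrete, the following base R sketch shows what undersampling with percentages 'percl' and 'percs' and the balanced accuracy criterion could look like. It is an illustration under stated assumptions, not the package's implementation: the helpers 'undersample' and 'balanced_accuracy' and the data frame 'toy' are hypothetical names introduced here, and balanced accuracy is taken to be the mean of the class-wise recalls.

```r
# Toy two-class data; 'real' mimics the class variable name used in the Examples.
set.seed(1)
toy <- data.frame(
  real = factor(c(rep("zero", 900), rep("non-zero", 100))),
  x    = rnorm(1000)
)

# Hypothetical helper (not part of PrInDT): keep a fraction 'percl' of the
# larger class and 'percs' of the smaller class, sampling without replacement.
undersample <- function(data, classname, percl, percs) {
  counts  <- table(data[[classname]])
  larger  <- names(counts)[which.max(counts)]
  smaller <- names(counts)[which.min(counts)]
  idx_l <- which(data[[classname]] == larger)
  idx_s <- which(data[[classname]] == smaller)
  keep <- c(
    idx_l[sample.int(length(idx_l), round(percl * length(idx_l)))],
    idx_s[sample.int(length(idx_s), round(percs * length(idx_s)))]
  )
  data[keep, , drop = FALSE]
}

# Hypothetical helper: balanced accuracy as the mean of the class-wise recalls.
balanced_accuracy <- function(truth, predicted) {
  truth <- factor(truth)
  predicted <- factor(predicted, levels = levels(truth))
  recalls <- sapply(levels(truth),
                    function(cl) mean(predicted[truth == cl] == cl))
  mean(recalls)
}

sub <- undersample(toy, "real", percl = 0.09, percs = 0.9)
table(sub$real)  # 81 of the larger class, 90 of the smaller class
```

With the default maxima plmax = 0.09 and psmax = 0.9, the subsample of this toy data is roughly balanced, which is why the percentage for the larger class is typically much smaller than the one for the smaller class.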

Usage

OptPrInDT(data,classname,ctestv=NA,N=99,plmax=0.09,psmax=0.9,
               distl=0.01,dists=0.1,conf.level=0.95,minsplit=NA,minbucket=NA,
               valdat=data,steps=3)

Value

besttree: best tree on full sample
bestba: balanced accuracy of best tree on full sample
percl: undersampling percentage of large class of best tree on full sample
percs: undersampling percentage of small class of best tree on full sample

Arguments

data: Input data frame with class factor variable 'classname' and the influential variables, which need to be factors or numerical variables (transform logical and character variables to factors)
classname: Name of class variable (character)
ctestv: Vector of character strings of forbidden split results;
Example: ctestv <- rbind('variable1 == {value1, value2}','variable2 <= value3'), where the character strings specified in 'value1' and 'value2' are not allowed as results of a splitting operation in 'variable1' in a tree.
For restrictions of the type 'variable <= xxx', all split results 'variable <= yyy' with yyy <= xxx are excluded from a tree.
Trees with split results specified in 'ctestv' are not accepted during optimization.
A concrete example is: ctestv <- rbind('ETH == {C2a, C1a}','AGE <= 20') for variables 'ETH' and 'AGE' and values 'C2a', 'C1a', and '20'.
If no restrictions exist, the default = NA is used.
N: Number (> 7) of repetitions (integer)
plmax: Maximal undersampling percentage of larger class (numerical, > 0 and <= 1);
default = 0.09
psmax: Maximal undersampling percentage of smaller class (numerical, > 0 and <= 1);
default = 0.9
distl: Distance to the next lower undersampling percentage of larger class (numerical, > 0 and < 1);
default = 0.01
dists: Distance to the next lower undersampling percentage of smaller class (numerical, > 0 and < 1);
default = 0.1
conf.level: (1 - significance level) in function ctree (numerical, > 0 and <= 1);
default = 0.95
minsplit: Minimum number of elements in a node to be split;
default = 20
minbucket: Minimum number of elements in a node;
default = 7
valdat: Validation data; default = data
steps: Number of optimization steps (1, 2, or 3); default = 3
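
The rule for restrictions of the type 'variable <= xxx' described under 'ctestv' can be illustrated with a small base R sketch. The helper 'is_forbidden_split' and the list 'restriction' are hypothetical and only restate the rule; they do not show how the package itself checks trees against 'ctestv'.

```r
# Restriction of the form 'variable <= xxx', as in the concrete ctestv example.
restriction <- list(variable = "AGE", threshold = 20)

# Hypothetical helper (not part of PrInDT): a split 'variable <= yyy' is
# forbidden when it is on the restricted variable and yyy <= xxx.
is_forbidden_split <- function(variable, yyy, restriction) {
  variable == restriction$variable && yyy <= restriction$threshold
}

is_forbidden_split("AGE", 18, restriction)  # TRUE: 18 <= 20
is_forbidden_split("AGE", 25, restriction)  # FALSE: 25 > 20
is_forbidden_split("MLU", 18, restriction)  # FALSE: different variable
```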

Details

See help("RePrInDT") and help("PrInDT") for further information.

Standard output can be produced by means of print(name$besttree) or just name$besttree, as well as plot(name$besttree), where 'name' is the output object of the function.

Examples


datastrat <- PrInDT::data_zero
data <- na.omit(datastrat) # cleaned full data: no NAs
# interpretation restrictions (split exclusions)
ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions
# call of OptPrInDT
out <- OptPrInDT(data,"real",ctestv,N=24,conf.level=0.99,steps=1) # unstratified
out # print best model and ensembles as well as all observations
plot(out)
