The function OptPrInDT applies an iterative technique to find the optimal undersampling percentages
'percl' for the larger class and 'percs' for the smaller class by a nested grid search, for use with the function PrInDT,
which models the relationship between the two-class factor variable 'classname' and all other factor and numerical variables
in the data frame 'data' by means of 'N' repetitions of undersampling. The optimization criterion is the balanced accuracy
on the validation sample 'valdat' (default = full sample 'data'). The trees generated from undersampling can be restricted by rejecting trees
that include split results specified in the character strings of the vector 'ctestv'.
The inputs 'plmax' and 'psmax' determine the maximal values of the percentages, and the inputs 'distl' and 'dists' determine
the distances to the next smaller percentage to be tried.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
The parameter 'steps' controls how many of the 3 possible optimization steps are carried out; default = 3.
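As a rough illustration of the search grid implied by 'plmax', 'psmax', 'distl', and 'dists', the following base R sketch enumerates candidate percentage pairs. It assumes that each grid starts at the maximum and steps down by the given distance while the value stays positive. The helper candidate_grid is hypothetical and not part of PrInDT, and the actual function may try fewer candidates or refine the grid in later optimization steps.

```r
# Hypothetical helper (not part of PrInDT): candidate percentages from a
# maximum 'pmax' downwards in steps of 'dist', keeping only positive values.
candidate_grid <- function(pmax, dist) {
  n <- floor(pmax / dist + 1e-8)  # small fuzz guards against rounding error
  round(pmax - dist * (seq_len(n) - 1), 10)
}

percl_grid <- candidate_grid(0.09, 0.01)  # default plmax and distl
percs_grid <- candidate_grid(0.9, 0.1)    # default psmax and dists

# All (percl, percs) pairs a full nested search would visit under this assumption
grid <- expand.grid(percl = percl_grid, percs = percs_grid)
nrow(grid)
```

Under this assumption, smaller distances or larger maxima quickly enlarge the grid, which is worth keeping in mind because every pair requires 'N' repetitions of undersampling.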
OptPrInDT(data,classname,ctestv=NA,N=99,plmax=0.09,psmax=0.9,
distl=0.01,dists=0.1,conf.level=0.95,minsplit=NA,minbucket=NA,
valdat=data,steps=3)
best tree on full sample
balanced accuracy of best tree on full sample
undersampling percentage of large class of best tree on full sample
undersampling percentage of small class of best tree on full sample
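The balanced accuracy reported here is commonly defined as the mean of the class-wise accuracies, so that both classes count equally regardless of their size. A minimal base R sketch, assuming this standard definition (the helper balanced_accuracy and the toy vectors are illustrative and not part of PrInDT):

```r
# Illustrative helper (not part of PrInDT): mean of the per-class accuracies
balanced_accuracy <- function(observed, predicted) {
  correct <- as.character(observed) == as.character(predicted)
  # share of correctly predicted cases within each observed class
  per_class <- tapply(correct, factor(observed), mean)
  mean(per_class)
}

# Toy data: 8 cases of the larger class "no", 2 cases of the smaller class "yes"
observed  <- c(rep("no", 8), rep("yes", 2))
predicted <- c(rep("no", 7), "yes", "yes", "no")

mean(observed == predicted)             # plain accuracy
balanced_accuracy(observed, predicted)  # balanced accuracy
```

In the toy data the plain accuracy is dominated by the larger class, whereas the balanced accuracy weights the error rate in the smaller class just as heavily.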
data: Input data frame with the class factor variable 'classname' and the
influential variables, which need to be factors or numericals (transform logicals and character variables to factors)
classname: Name of class variable (character)
ctestv: Vector of character strings of forbidden split results;
Example: ctestv <- rbind('variable1 == {value1, value2}','variable2 <= value3'), where
the character strings 'value1' and 'value2' are not allowed as results of a splitting operation in 'variable1' in a tree.
For restrictions of the type 'variable <= xxx', all split results of the form 'variable <= yyy' with yyy <= xxx are excluded from a tree.
Trees with split results specified in 'ctestv' are not accepted during optimization.
A concrete example is: ctestv <- rbind('ETH == {C2a, C1a}','AGE <= 20') for variables 'ETH' and 'AGE' and values 'C2a', 'C1a', and '20'.
If no restrictions exist, the default = NA is used.
N: Number (> 7) of repetitions (integer)
plmax: Maximal undersampling percentage of larger class (numerical, > 0 and <= 1);
default = 0.09
psmax: Maximal undersampling percentage of smaller class (numerical, > 0 and <= 1);
default = 0.9
distl: Distance to the next lower undersampling percentage of larger class (numerical, > 0 and < 1);
default = 0.01
dists: Distance to the next lower undersampling percentage of smaller class (numerical, > 0 and < 1);
default = 0.1
conf.level: (1 - significance level) in function ctree
(numerical, > 0 and <= 1);
default = 0.95
minsplit: Minimum number of elements in a node to be split;
default = 20
minbucket: Minimum number of elements in a node;
default = 7
valdat: Validation data; default = data
steps: Number of optimization steps = 1, 2, 3; default = 3
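The rule for restrictions of the type 'variable <= xxx' among the 'ctestv' entries can be illustrated with a small base R sketch. The helper is_forbidden_split is hypothetical and only handles numeric restrictions written exactly in the form 'variable <= value'; how PrInDT parses and matches split results internally is not shown here.

```r
# Hypothetical helper (not part of PrInDT): TRUE if a split result such as
# "AGE <= 18" is excluded by a restriction such as "AGE <= 20".
is_forbidden_split <- function(split, restriction) {
  parse_rule <- function(x) {
    parts <- trimws(strsplit(x, "<=", fixed = TRUE)[[1]])
    list(variable = parts[1], value = as.numeric(parts[2]))
  }
  s <- parse_rule(split)
  r <- parse_rule(restriction)
  # same variable, and the split threshold yyy does not exceed the restriction xxx
  s$variable == r$variable && s$value <= r$value
}

is_forbidden_split("AGE <= 18", "AGE <= 20")  # TRUE: 18 <= 20
is_forbidden_split("AGE <= 25", "AGE <= 20")  # FALSE: 25 > 20
is_forbidden_split("MLU <= 18", "AGE <= 20")  # FALSE: different variable
```

A tree containing at least one split for which such a check returns TRUE would, following the description of 'ctestv', not be accepted during optimization.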
See help("RePrInDT") and help("PrInDT") for further information.
Standard output can be produced by means of print(name$besttree)
or just name$besttree,
as well as plot(name$besttree),
where 'name' is the object returned by the function.
datastrat <- PrInDT::data_zero
data <- na.omit(datastrat) # cleaned full data: no NAs
# interpretation restrictions (split exclusions)
ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions
# call of OptPrInDT
out <- OptPrInDT(data,"real",ctestv,N=24,conf.level=0.99,steps=1) # unstratified
out # print best model and ensembles as well as all observations
plot(out)