PrInDTRstruc: Structured subsampling for regression

Description

The function PrInDTRstruc applies structured subsampling to find an optimal subsample for modeling the relationship between the continuous variable 'regname' and all other factor and numerical variables in the data frame 'datain'. It uses 'M' repetitions of subsampling from a substructure and 'N' repetitions of subsampling from the predictors. The optimization criterion is the goodness of fit R2 on the validation sample 'valdat' (default = 'datain'). The trees generated from subsampling can be restricted by not accepting trees that include split results specified in the character strings of the vector 'ctestv'.
The substructure of the observations used for subsampling is specified by the list 'Struc' which consists of the variable 'name' representing the substructure, the name 'check' of the variable with the information about the categories of the substructure, and the matrix 'labs' which specifies the values of 'check' corresponding to two categories in its rows, i.e. in 'labs[1,]' and 'labs[2,]'. The names of the categories have to be specified by rownames(labs).
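As a minimal sketch of how such a list might be assembled, the following uses a small made-up data frame. The data frame 'toy', the variables 'child' and 'group', and all category labels are hypothetical; passing 'check' as a character string follows the convention of the example at the end of this page.

```r
# Hypothetical toy data: six children, each belonging to one group
toy <- data.frame(
  child = factor(c("a", "b", "c", "d", "e", "f")),
  group = factor(c("G1a", "G1b", "G1a", "G2a", "G2b", "G2a")),
  y     = c(1.2, 0.8, 1.5, 2.1, 1.9, 2.4)
)

name  <- toy$child      # variable representing the substructure
check <- "toy$group"    # name of the variable holding the category information

# One row per category; each row lists the values of 'check' in that category
labs <- matrix(c("G1a", "G1b",
                 "G2a", "G2b"),
               nrow = 2, byrow = TRUE)
rownames(labs) <- c("group 1", "group 2")

Struc <- list(name = name, check = check, labs = labs)

# Elements of the substructure that fall into each category (illustration only)
in_cat1 <- unique(name[toy$group %in% labs[1, ]])
in_cat2 <- unique(name[toy$group %in% labs[2, ]])
```

The last two lines are not required by the function; they only show which elements of 'name' each row of 'labs' selects.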
The number of predictors 'Pit' to be included in the model and the number of elements of the substructure 'Mit' have to be specified (lists allowed).
The percentages of involved observations and predictors can be controlled by the parameters 'pobs' and 'ppre', respectively.
The parameter 'Struc' is needed for all versions of subsampling except "b". Four different versions of structured subsampling exist:
a) just of the elements in the substructure with parameters 'M' and 'Mit',
b) just of the predictors with parameters 'N' and 'Pit',
c) of the predictors and for each subset of predictors subsampling of the elements of the substructure with parameters 'M', 'N', 'Mit', 'Pit', 'pobs', and 'ppre', and
d) of the elements of the substructure and for each of these subsets subsampling of the predictors with the same parameters as version c).
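The nesting order of version "d" can be illustrated with a rough base R sketch. This is not the package's implementation: the element and predictor names are hypothetical, drawing 'Mit' elements from each category is an assumption, and the roles of 'pobs' and 'ppre' are left out.

```r
set.seed(1)  # reproducibility of this illustration only

elements_cat1 <- c("a", "b", "c", "d")   # hypothetical elements, category 1
elements_cat2 <- c("e", "f", "g", "h")   # hypothetical elements, category 2
predictors    <- c("x1", "x2", "x3", "x4", "x5")

M   <- 3   # repetitions of subsampling from the substructure
N   <- 2   # repetitions of subsampling from the predictors
Mit <- 3   # ASSUMPTION: number of elements drawn per category
Pit <- 4   # number of predictors drawn

candidates <- list()
for (m in seq_len(M)) {
  # outer loop: subsample elements of the substructure
  sub_elements <- c(sample(elements_cat1, Mit), sample(elements_cat2, Mit))
  for (n in seq_len(N)) {
    # inner loop: subsample predictors for this subset of elements
    sub_predictors <- sample(predictors, Pit)
    candidates[[length(candidates) + 1]] <-
      list(elements = sub_elements, predictors = sub_predictors)
  }
}

length(candidates)  # M * N candidate subsamples
```

Version "c" would swap the two loops, drawing predictors first and substructure elements within each predictor subset. Each candidate would then be used to fit a tree that is evaluated on 'valdat'.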
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
Besides the maximal R2, the minimal MAE (Mean Absolute Error) is reported.
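For orientation, both measures can be computed by hand on a validation sample. The sketch below uses made-up numbers and assumes the common definition R2 = 1 - SS_res / SS_tot; the package may compute R2 differently.

```r
observed  <- c(2.0, 3.5, 4.0, 5.5, 7.0)   # hypothetical validation responses
predicted <- c(2.5, 3.0, 4.5, 5.0, 6.5)   # hypothetical tree predictions

# MAE: mean of the absolute deviations
mae <- mean(abs(observed - predicted))

# R2 (ASSUMED definition): 1 - residual sum of squares / total sum of squares
ss_res <- sum((observed - predicted)^2)
ss_tot <- sum((observed - mean(observed))^2)
r2 <- 1 - ss_res / ss_tot

mae  # 0.5
r2
```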

Repeated measurements can also be handled by this function (indrep=1). They are multiple measurements of the same variable taken on the same subjects (or objects) either under different conditions or over two or more time periods.
The variable with the repeatedly observed subjects (or objects) is assumed to be 'name' in 'Struc'.
The measure MAE is split according to the values of 'name'.
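A per-subject split of the MAE can be sketched in base R as follows. The subjects and numbers are hypothetical, and averaging the per-subject values is only one plausible reading of how the split results are summarized.

```r
subject   <- factor(c("s1", "s1", "s2", "s2", "s2", "s3"))  # hypothetical subjects
observed  <- c(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
predicted <- c(1.5, 2.5, 2.0, 4.0, 6.0, 6.0)

abs_err <- abs(observed - predicted)

# MAE within each subject
mae_by_subject <- tapply(abs_err, subject, mean)

# Overall MAE versus the mean of the per-subject MAEs
mae_overall          <- mean(abs_err)
mean_of_subject_maes <- mean(mae_by_subject)
```

The two summaries differ whenever subjects contribute different numbers of observations, which is typical for repeated measurements.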

Usage

PrInDTRstruc(datain,regname,ctestv=NA,Struc=NA,vers="d",M=NA,Mit=NA,N=99,Pit=NA,
               pobs=c(0.9,0.7),ppre=c(0.9,0.7),conf.level=0.95,minsplit=NA,minbucket=NA,
               valdat=datain,indrep=0)

Value

outmax: Best tree
interp: Number of interpretable trees, overall number of trees
ntmax: Size of training set for best tree
R2max: R squared of best tree
R2sub: Mean R squared of objects in substructure
MAEmax: MAE (Mean Absolute Error) of best tree
MAEsub: Mean MAE of objects in substructure
ind1max: Elements of 1st category of substructure used by best tree
ind2max: Elements of 2nd category of substructure used by best tree
indmax: Predictors used by best tree
gmaxTrain: Training set for best tree
labs: labs from Struc
vers: Version used for structured subsampling
lMit: Number of different numbers of substructure elements
lPit: Number of different numbers of predictors
M: Number of repetitions of selection of substructure elements
N: Number of repetitions of selection of predictors
indrep: Indicator of repeated measurements: indrep=1

Arguments

datain: Input data frame with continuous target variable 'regname' and the influential variables, which need to be factors or numericals (transform logicals and character variables to factors)
regname: Name of target variable (character)
ctestv: Vector of character strings of forbidden split results.
Example: ctestv <- rbind('variable1 == {value1, value2}','variable2 <= value3'), where the character strings 'value1' and 'value2' are not allowed as results of a split on variable1 in a tree.
For restrictions of the type 'variable <= xxx', all split results of the form 'variable <= yyy' with yyy <= xxx are excluded.
Trees with split results specified in 'ctestv' are not accepted during optimization.
A concrete example is ctestv <- rbind('ETH == {C2a, C1a}','AGE <= 20') for the variables 'ETH' and 'AGE' and the values 'C2a', 'C1a', and '20'.
If no restrictions exist, the default = NA is used.
Struc: = list(name,check,labs), cf. description for explanations; Struc not needed for vers="b"
vers: Version of structured subsampling: "a", "b", "c", "d", cf. description;
default = "d"
M: Number of repetitions of subsampling from substructure (integer) in versions "a" and "d";
default = 99
Mit: List of numbers of elements of substructure (integers);
default = c((Cl-4):Cl), Cl = maximum number of elements in both categories
N: Number of repetitions of subsampling from predictors (integer) in versions "b" and "c";
default = 99
Pit: List of numbers of predictors (integers);
default = c(max(1,(D-3)):D), D = maximum number of predictors
pobs: Percentage(s) of observations for subsampling in versions "c" and "d";
default=c(0.9,0.7)
ppre: Percentage(s) of predictors for subsampling in versions "c" and "d";
default=c(0.9,0.7)
conf.level: (1 - significance level) in function ctree (numerical, > 0 and <= 1);
default = 0.95
minsplit: Minimum number of elements in a node to be split;
default = 20
minbucket: Minimum number of elements in a node;
default = 7
valdat: Validation data; default = datain
indrep: Indicator for repeated measurements, i.e. more than one observation with the same class for each element;
indrep=1: Struc=list(name) only; default = 0
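The default ranges for 'Mit' and 'Pit' and the shape of 'ctestv' can be checked with a few lines of base R. The values of 'Cl' and 'D' below are hypothetical and chosen only for illustration.

```r
Cl <- 21   # hypothetical: maximum number of elements in both categories
D  <- 6    # hypothetical: maximum number of predictors

Mit_default <- c((Cl - 4):Cl)          # 17, 18, 19, 20, 21
Pit_default <- c(max(1, (D - 3)):D)    # 3, 4, 5, 6

# With few predictors, max(1, .) keeps the lower bound at 1
Pit_small <- c(max(1, (2 - 3)):2)      # 1, 2

# ctestv built with rbind(): a one-column character matrix,
# one row per restriction
ctestv <- rbind('ETH == {C2a, C1a}', 'AGE <= 20')
dim(ctestv)
```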

Details

See Buschfeld & Weihs (2025), Optimizing decision trees for the analysis of World Englishes and sociolinguistic data. Cambridge University Press, section 4.5.4, for further information.

Standard output can be produced by means of print(name$besttree) or just name$besttree as well as plot(name$besttree) where 'name' is the output data frame of the function.

Examples

data <- PrInDT::data_vowel
data <- na.omit(data)
CHILDvowel <- data$Nickname
data$Nickname <- NULL
data$syllables <- 3 - data$syllables
data$speed <- data$word_duration / data$syllables  # word duration per syllable
names(data)[names(data) == "target"] <- "vowel_length"
# interpretation restrictions (split exclusions)
ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}')
name <- CHILDvowel
check <- "data$ETH"
labs <- matrix(1:6,nrow=2,ncol=3)
labs[1,] <- c("C1a","C1b","C1c")
labs[2,] <- c("C2a","C2b","C2c")
rownames(labs) <- c("children 1","children 2")
Struc <- list(name=name,check=check,labs=labs)
outstruc <- PrInDTRstruc(data,"vowel_length",ctestv=ctestv,Struc=Struc,vers="d",
                  M=3,Mit=21,N=9,pobs=c(0.95,0.7),ppre=c(1,0.7),conf.level=0.99)
outstruc
plot(outstruc)
