Mix2SPrInDT: Two-stage estimation for classification-regression mixtures

Description

The function Mix2SPrInDT applies 'N' repetitions of subsampling for finding an optimal subsample to model the relationship between the dependent variables specified in the sublist 'targets' of the list 'datalist' and all other factor and numerical variables in the corresponding data frame specified in the sublist 'datanames' of the list 'datalist' in the same order as 'targets'.
The function is prepared to handle classification tasks with 2 or more classes and regression tasks. At first stage, the targets are estimated only on the basis of the exogenous predictors. At second stage, summaries of the predictions of the endogenous features from the first stage models are added as new predictors.
Subsampling of observations and predictors uses the percentages specified in the matrix 'percent', one row per estimation taks. For classification tasks, percentages 'percl' and 'percs' for the larger and the smaller class have to be specified. For regression tasks, 'pobs' and 'ppre' have to specified for observations and predictors, respectively.
For generating summaries, the variables representing the substructure have to be specified in the sublist 'datastruc' of 'datalist'.
The sublist 'summ' of 'datalist' includes for each discrete target the classes for which you want to calculate the summary percentages and for each continuous target just NA (in the same order as in the sublist 'targets'). For a discrete target in this list, you can provide a sublist of classes to be combined (for an example see the below example).
The optimization citerion is the balanced accuracy for classification and R2 for regression, both evaluated on the full sample.
The trees generated from undersampling can be restricted by not accepting trees including split results specified in the character strings of the vector 'ctestv'.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.

Usage

Mix2SPrInDT(datalist,ctestv=NA,N=99,percent=NA,conf.level=0.95,minsplit=NA,minbucket=NA)

Value

models1: Best trees at stage 1
models2: Best trees at stage 2
depnames: names of dependent variables

nmod

number of models of tasks

nlev

levels of tasks

accAll

accuracies of best trees at both stages

Arguments

datalist: list(datanames,targets,datastruc,summ) Input data: For specification see the above description
ctestv: Vector of character strings of forbidden split results;
Example: ctestv <- rbind('variable1 == {value1, value2}','variable2 <= value3'), where character strings specified in 'value1', 'value2' are not allowed as results of a splitting operation in variable 1 in a tree.
For restrictions of the type 'variable <= xxx', all split results in a tree are excluded with 'variable <= yyy' and yyy <= xxx.
Trees with split results specified in 'ctestv' are not accepted during optimization.
A concrete example is: 'ctestv <- rbind('ETH == {C2a, C1a}','AGE <= 20')' for variables 'ETH' and 'AGE' and values 'C2a','C1a', and '20';
If no restrictions exist, the default = NA is used.
N: Number of repetitions of subsampling for predictors (integer); default = 99
percent: matrix of percent spefications: For specification see the above description; default: 'percent = NA' meaning default values for percentages.
conf.level: (1 - significance level) in function ctree (numerical, > 0 and <= 1); default = 0.95
minsplit: Minimum number of elements in a node to be splitted; default = 20
minbucket: Minimum number of elements in a node; default = 7

Details

See Buschfeld & Weihs (2025), Optimizing decision trees for the analysis of World Englishes and sociolinguistic data. Cambridge University Press, section 4.5.6.1, for further information.

Standard output can be produced by means of print(name) or just name as well as plot(name where 'name' is the output data frame of the function.

Examples

Run this code

# zero data
datazero <- PrInDT::data_zero
datazero <- na.omit(datazero) # cleaned full data: no NAs
names(datazero)[names(datazero)=="real"] <- "zero"
CHILDzero <- PrInDT::participant_zero
# interpretation restrictions (split exclusions)
ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}') # split exclusions
##
# multi-level data
datastrat <- PrInDT::data_zero
datamult <- na.omit(datastrat)
# ctestv <- NA
datamult$mult[datamult$ETH %in% c("C1a","C1b","C1c") & datamult$real == "zero"] <- "zero1"
datamult$mult[datamult$ETH %in% c("C2a","C2b","C2c") & datamult$real == "zero"] <- "zero2"
datamult$mult[datamult$real == "realized"] <- "real"
datamult$mult <- as.factor(datamult$mult) # mult is new class variable
datamult$real <- NULL # remove old class variable
CHILDmult <- CHILDzero
##
# vowel data
data <- PrInDT::data_vowel
data <- na.omit(data)
CHILDvowel <- data$Nickname
data$Nickname <- NULL
syllable <- 3 - data$syllables
data$syllabels <- NULL
data$syllables <- syllable
data$speed <- data$word_duration / data$syllables
names(data)[names(data) == "target"] <- "vowel"
datavowel <- data
##
# function preparation and call
datanames <- list("datazero","datamult","datavowel")
targets <- c("zero","mult","vowel")
datastruc <- list(CHILDzero,CHILDmult,CHILDvowel)
summult <- paste("zero1","zero2",sep=",")  
summ <- c("zero",summult,NA)
datalist <- list(datanames=datanames,targets=targets,datastruc=datastruc,summ=summ)
percent <- matrix(NA,nrow=3,ncol=2)
percent[1,] <- c("percl=0.075","percs=0.9") # percentages for datazero
# no percentages needed for datapast
percent[3,] <- c("pobs=0.9","ppre=c(0.9,0.8)") # percentages for datavowel
out2SMix <- Mix2SPrInDT(datalist,ctestv=ctestv,N=19,percent=percent,conf.level=0.99)
out2SMix
plot(out2SMix)

Run the code above in your browser using DataLab