PrInDT (version 2.0.1)

R2SPrInDT: Two-stage estimation for regression

Description

The function R2SPrInDT applies 'N' repetitions of subsampling to find an optimal subsample for modeling the relationship between the continuous variables with column indices 'inddep' and all other factor and numerical variables in the data frame 'data'.
Subsampling of observations and predictors uses the percentages 'pobs1' and 'ppre1' at stage 1, and the percentages 'pobs2' and 'ppre2' at stage 2. The optimization criterion is the goodness of fit R2 on the full sample.
The trees generated from subsampling can be restricted by rejecting trees that include split results specified in the character strings of the vector 'ctestv'.
The parameters 'conf.level', 'minsplit', and 'minbucket' can be used to control the size of the trees.
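
To make the optimization criterion concrete, the following is a minimal sketch of a single stage of subsampling in base R. It is not the package's implementation: lm() stands in for the conditional inference trees, the helper name best_subsample_fit() is hypothetical, and both the rounding of the subsample sizes and the R2 formula (1 - SSE/SST) are assumptions. The only point it illustrates is that each model is fitted on a subsample of observations and predictors but scored by R2 on the full sample.

```r
# Illustrative sketch only: lm() is a stand-in for the trees fitted by R2SPrInDT,
# and best_subsample_fit() is a hypothetical helper, not part of PrInDT.
best_subsample_fit <- function(data, target, N = 9, pobs = 0.9, ppre = 0.7, seed = 1) {
  set.seed(seed)
  predictors <- setdiff(names(data), target)
  n_obs <- max(2, round(pobs * nrow(data)))          # assumed rounding rule
  n_pre <- max(1, round(ppre * length(predictors)))  # assumed rounding rule
  y <- data[[target]]
  best <- list(R2 = -Inf, model = NULL, predictors = NULL)
  for (i in seq_len(N)) {
    rows <- sample(nrow(data), n_obs)   # subsample of observations
    vars <- sample(predictors, n_pre)   # subsample of predictors
    fit <- lm(reformulate(vars, response = target),
              data = data[rows, , drop = FALSE])
    pred <- predict(fit, newdata = data)               # predict on the FULL sample
    R2 <- 1 - sum((y - pred)^2) / sum((y - mean(y))^2) # assumed R2 definition
    if (R2 > best$R2) best <- list(R2 = R2, model = fit, predictors = vars)
  }
  best
}

res <- best_subsample_fit(mtcars, target = "mpg", N = 9, pobs = 0.9, ppre = 0.7)
res$R2          # full-sample R2 of the best subsample model
res$predictors  # predictors drawn for that model
```

Scoring on the full sample rather than on the subsample means that a model is only preferred if it also describes the observations that were left out of its fit.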

Usage

R2SPrInDT(data,ctestv=NA,inddep,N=99,pobs1=c(0.90,0.70),ppre1=c(0.90,0.70),
                pobs2=pobs1,ppre2=ppre1,conf.level=0.95,minsplit=NA,minbucket=NA)

Value

models1

Best trees at stage 1

models2

Best trees at stage 2

depnames

names of dependent variables

R2both

R2s of best trees at both stages

Arguments

data

Input data frame with the continuous target variables (columns 'inddep') and the
influential variables, which need to be factors or numericals (transform logical and character variables to factors)

ctestv

Vector of character strings of forbidden split results;
Example: ctestv <- rbind('variable1 == {value1, value2}','variable2 <= value3'), where the character strings specified in 'value1' and 'value2' are not allowed as results of a splitting operation in 'variable1' in a tree.
For restrictions of the type 'variable <= xxx', all split results of the form 'variable <= yyy' with yyy <= xxx are excluded from a tree.
Trees with split results specified in 'ctestv' are not accepted during optimization.
A concrete example is: ctestv <- rbind('ETH == {C2a, C1a}','AGE <= 20') for variables 'ETH' and 'AGE' and values 'C2a', 'C1a', and '20'.
If no restrictions exist, the default = NA is used.
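
As a small illustration of the format described above, the snippet below builds the restriction vector exactly as in the concrete example and spells out the '<=' rule. The helper is_excluded_leq() is hypothetical and only mirrors the rule as worded here; it is not a function of the package.

```r
# Forbidden split results, written as in the argument description
ctestv <- rbind('ETH == {C2a, C1a}', 'AGE <= 20')
dim(ctestv)           # 2 1: rbind() of two strings gives a one-column character matrix
is.character(ctestv)  # TRUE

# Hypothetical helper illustrating the '<=' rule: a split 'variable <= yyy'
# counts as excluded when yyy <= xxx for a restriction 'variable <= xxx'.
is_excluded_leq <- function(split_value, restricted_value) {
  split_value <= restricted_value
}
is_excluded_leq(18, 20)  # TRUE:  'AGE <= 18' falls under 'AGE <= 20'
is_excluded_leq(25, 20)  # FALSE: 'AGE <= 25' is not covered by the restriction
```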

inddep

Column indices of the target variables in 'data'
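
Because the targets are given by position, it can be safer to derive the indices from column names than to hard-code them. The following sketch uses a made-up toy data frame (the names 'toy', 'y1' and 'y2' are purely illustrative) and base R's match():

```r
# Toy data frame; all column names here are hypothetical
toy <- data.frame(a  = 1:3,
                  f  = factor(c("x", "y", "x")),
                  y1 = c(0.1, 0.5, 0.9),
                  y2 = c(2.0, 1.5, 1.0))

targets <- c("y2", "y1")              # desired target variables, in this order
inddep  <- match(targets, names(toy)) # positions of the targets in the data frame
stopifnot(!anyNA(inddep))             # fail early if a name is misspelled
inddep                                # 4 3
names(toy)[inddep]                    # "y2" "y1"
```

Looking up the indices this way keeps them correct if columns are added or removed before the call.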

N

Number of repetitions of subsampling from predictors (integer) in versions "b" and "c";
default = 99

pobs1

Percentage(s) of observations for subsampling at stage 1;
default=c(0.9,0.7)

ppre1

Percentage(s) of predictors for subsampling at stage 1;
default=c(0.9,0.7)

pobs2

Percentage(s) of observations for subsampling at stage 2;
default=pobs1

ppre2

Percentage(s) of predictors for subsampling at stage 2;
default=ppre1

conf.level

(1 - significance level) in function ctree (numerical, > 0 and <= 1);
default = 0.95

minsplit

Minimum number of elements in a node to be split;
default = 20

minbucket

Minimum number of elements in a node;
default = 7

Details

See Buschfeld & Weihs (2025), Optimizing decision trees for the analysis of World Englishes and sociolinguistic data. Cambridge University Press, section 4.5.6.1, for further information.

Standard output can be produced by means of print(name) or just name, as well as plot(name), where 'name' is the output data frame of the function.

Examples

data <- PrInDT::data_vowel
data <- na.omit(data)
CHILDvowel <- data$Nickname
data$Nickname <- NULL
# recode the syllable count in place; the column keeps its position,
# which the column indices in 'inddep' below rely on
syllable <- 3 - data$syllables
data$syllables <- syllable
data$speed <- data$word_duration / data$syllables
names(data)[names(data) == "target"] <- "vowel_length"
# interpretation restrictions (split exclusions)
ctestv <- rbind('ETH == {C2a, C1a}','MLU == {1, 3}')
inddep <- c(13,9) # column indices of the target variables
out2SR <- R2SPrInDT(data,ctestv=ctestv,inddep=inddep,N=9,conf.level=0.99)
out2SR
plot(out2SR)