FRESA.CAD (version 2.2.0)

ForwardSelection.Model.Res: NeRI-based feature selection procedure for linear, logistic, or Cox proportional hazards regression models

Description

This function uses bootstrap sampling to rank candidate variables by how frequently they yield a statistically significant reduction of the model residuals. It then applies a forward selection procedure over the frequency-ranked variables to build a final model in which every term makes a significant contribution to the net residual improvement (NeRI).
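The idea of testing whether an added term improves the residuals can be sketched in base R. This is only an illustration on simulated data, not the FRESA.CAD implementation: it assumes the NeRI is taken as the proportion of observations whose absolute residual shrinks minus the proportion whose absolute residual grows, and it uses a binomial (sign) test as the significance check. All variable names (x1, x2, y, base_fit, cand_fit, neri, sign_test) are hypothetical.

```r
# Illustrative sketch only; this is NOT the FRESA.CAD code.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)

# Baseline model and a candidate model with one extra term
base_fit <- lm(y ~ x1)
cand_fit <- lm(y ~ x1 + x2)

# Absolute residuals of both models on the same observations
abs_base <- abs(residuals(base_fit))
abs_cand <- abs(residuals(cand_fit))

improved <- sum(abs_cand < abs_base)
worsened <- sum(abs_cand > abs_base)

# Assumed definition: net proportion of observations with smaller residuals
neri <- (improved - worsened) / n

# Binomial (sign) test: is improvement more frequent than worsening?
sign_test <- binom.test(improved, improved + worsened,
                        p = 0.5, alternative = "greater")

print(neri)
print(sign_test$p.value)
```

In this sketch, a candidate term would be kept only if sign_test$p.value falls below the chosen pvalue threshold.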

Usage

ForwardSelection.Model.Res(size = 100, fraction = 1, pvalue = 0.05,
                           loops = 100, covariates = "1", Outcome,
                           variableList, data, maxTrainModelSize = 10,
                           type = c("LM", "LOGIT", "COX"),
                           testType = c("Binomial", "Wilcox", "tStudent", "Ftest"),
                           timeOutcome = "Time", loop.threshold = 20,
                           interaction = 1, cores = 4)

Arguments

size
The number of candidate variables to be tested (the first size variables from variableList)
fraction
The fraction of the data (sampled with replacement) to be used as the training set
pvalue
The maximum p-value, associated with the NeRI, allowed for a term in the model (controls the false selection rate)
loops
The number of bootstrap loops
covariates
A string of the type "1 + var1 + var2" that defines which variables will always be included in the models (as covariates)
Outcome
The name of the column in data that stores the variable to be predicted by the model
variableList
A data frame with two columns: the first must hold the names of the candidate variables and the second their descriptions
data
A data frame where all variables are stored in different columns
maxTrainModelSize
Maximum number of terms that can be included in the model
type
Fit type: Logistic ("LOGIT"), linear ("LM"), or Cox proportional hazards ("COX")
testType
Type of statistical test to be evaluated by the improvedResiduals function: Binomial test ("Binomial"), Wilcoxon rank-sum test ("Wilcox"), Student's t-test ("tStudent"), or F-test ("Ftest")
timeOutcome
The name of the column in data that stores the time to event (needed only for a Cox proportional hazards regression model fitting)
loop.threshold
After loop.threshold cycles, only variables that were already selected in previous cycles remain candidates for selection in subsequent cycles
interaction
Set to 1 for first-order models or to 2 for second-order models
cores
The number of cores to be used for parallel processing
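As a minimal sketch of how the variableList and covariates arguments might be prepared, the snippet below builds a two-column data frame and a covariates string in base R. The column names (Name, Description), the descriptions, and the choice of age as a forced covariate are assumptions made for illustration; only the two-column layout and the "1 + var1 + var2" string format come from the argument descriptions above. Dropping forced covariates from the candidate list is a design choice of this sketch, not a documented requirement.

```r
# Hypothetical candidate variables; names and descriptions are illustrative
variable_list <- data.frame(
  Name = c("age", "g2", "grade"),
  Description = c("Age in years",
                  "Percent of cells in G2 phase",
                  "Tumor grade"),
  stringsAsFactors = FALSE
)

# Terms that should always be present in the models (assumed choice)
forced <- c("1", "age")

# Build a string of the form "1 + var1 + var2"
covariates <- paste(forced, collapse = " + ")

# Keep only variables that are not already forced into the model
candidates <- variable_list[!(variable_list$Name %in% forced), ]

print(covariates)
print(candidates)
```

The resulting candidates and covariates objects could then be passed as the variableList and covariates arguments, together with a data frame that contains all the named columns.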

Value

final.model
An object of class lm, glm, or coxph containing the final model
var.names
A vector with the names of the features that were included in the final model
formula
An object of class formula with the formula used to fit the final model
ranked.var
An array with the ranked frequencies of the features
z.NeRIs
A vector in which each element represents the z-score of the NeRI, associated with the testType, for each feature found in the final model
formula.list
A list containing objects of class formula with the formulas used to fit the models found at each cycle
variableList
A list of variables used in the forward selection

See Also

ForwardSelection.Model.Bin

Examples

## Not run:
# Load the package that provides the functions and the cancerVarNames data
library(FRESA.CAD)
# Start the graphics device driver to save all plots in a pdf format
pdf(file = "Example.pdf")
# Get the stage C prostate cancer data from the rpart package
library(rpart)
data(stagec)
# Split the stages into several columns
dataCancer <- cbind(stagec[,c(1:3,5:6)],
                    gleason4 = 1*(stagec[,7] == 4),
                    gleason5 = 1*(stagec[,7] == 5),
                    gleason6 = 1*(stagec[,7] == 6),
                    gleason7 = 1*(stagec[,7] == 7),
                    gleason8 = 1*(stagec[,7] == 8),
                    gleason910 = 1*(stagec[,7] >= 9),
                    eet = 1*(stagec[,4] == 2),
                    diploid = 1*(stagec[,8] == "diploid"),
                    tetraploid = 1*(stagec[,8] == "tetraploid"),
                    notAneuploid = 1-1*(stagec[,8] == "aneuploid"))
# Remove the incomplete cases
dataCancer <- dataCancer[complete.cases(dataCancer),]
# Load a pre-established data frame with the names and descriptions of all variables
data(cancerVarNames)
# Rank the variables:
# - Analyzing the raw data
# - Using a Cox proportional hazards fitting
# - According to the NeRI
rankedDataCancer <- univariateRankVariables(variableList = cancerVarNames,
                                            formula = "Surv(pgtime, pgstat) ~ 1",
                                            Outcome = "pgstat",
                                            data = dataCancer,
                                            categorizationType = "Raw",
                                            type = "COX",
                                            rankingTest = "NeRI",
                                            description = "Description")
# Get a Cox proportional hazards model using:
# - 10 bootstrap loops
# - The ranked variables
# - The Wilcoxon rank-sum test as the feature inclusion criterion
cancerModel <- ForwardSelection.Model.Res(loops = 10,
                                          Outcome = "pgstat",
                                          variableList = rankedDataCancer,
                                          data = dataCancer,
                                          type = "COX",
                                          testType = "Wilcox",
                                          timeOutcome = "pgtime")
# Shut down the graphics device driver
dev.off()
## End(Not run)