This function performs a cross-validation analysis of a feature selection algorithm based on net residual improvement (NeRI) and returns a predictive model. The algorithm consists of a NeRI-based feature selection, followed by an update procedure and a bootstrapped backward feature elimination. The user can control how many train and blind test sets will be evaluated.
crossValidationFeatureSelection_Res(size = 10,
                                    fraction = 1.0,
                                    pvalue = 0.05,
                                    loops = 100,
                                    covariates = "1",
                                    Outcome,
                                    timeOutcome = "Time",
                                    variableList,
                                    data,
                                    maxTrainModelSize = 20,
                                    type = c("LM", "LOGIT", "COX"),
                                    testType = c("Binomial",
                                                 "Wilcox",
                                                 "tStudent",
                                                 "Ftest"),
                                    startOffset = 0,
                                    elimination.bootstrap.steps = 100,
                                    trainFraction = 0.67,
                                    trainRepetition = 9,
                                    setIntersect = 1,
                                    unirank = NULL,
                                    print = TRUE,
                                    plots = TRUE,
                                    lambda = "lambda.1se",
                                    equivalent = FALSE,
                                    bswimsCycles = 10,
                                    usrFitFun = NULL,
                                    featureSize = 0)
size: The number of candidate variables to be tested (the first size variables from variableList)
fraction: The fraction of data (sampled with replacement) to be used for training
pvalue: The maximum p-value, associated with the NeRI, allowed for a term in the model
loops: The number of bootstrap loops
covariates: A string of the type "1 + var1 + var2" that defines which variables will always be included in the models (as covariates)
Outcome: The name of the column in data that stores the variable to be predicted by the model
timeOutcome: The name of the column in data that stores the time to event (needed only for fitting a Cox proportional hazards regression model)
variableList: A data frame with two columns. The first must hold the names of the candidate variables and the second the descriptions of those variables
data: A data frame where all variables are stored in different columns
maxTrainModelSize: The maximum number of terms that can be included in the model
type: Fit type: logistic ("LOGIT"), linear ("LM"), or Cox proportional hazards ("COX")
testType: Type of statistical test to be evaluated by the improvedResiduals function: Binomial test ("Binomial"), Wilcoxon rank-sum test ("Wilcox"), Student's t-test ("tStudent"), or F-test ("Ftest")
startOffset: Only terms whose position in the model is larger than startOffset are candidates to be removed
elimination.bootstrap.steps: The number of bootstrap loops for the backwards elimination procedure
trainFraction: The fraction of data (sampled with replacement) to be used for training in the cross-validation procedure
setIntersect: The intercept of the model (to force a zero intercept, set this value to 0)
trainRepetition: The number of cross-validation folds (it should be at least equal to \(1/trainFraction\) for a complete cross-validation)
unirank: A list with the results yielded by the uniRankVar function, required only if the rank needs to be updated during the cross-validation procedure
print: Logical. If TRUE, information will be displayed
plots: Logical. If TRUE, plots are displayed
lambda: The value passed to the s parameter when extracting coefficients from the glmnet cross-validation fit
equivalent: Logical. If set to TRUE, the cross-validation will compute the equivalent model
bswimsCycles: The maximum number of models to be returned by BSWiMS.model
usrFitFun: A user fitting function to be evaluated by the cross-validation procedure
featureSize: The original number of features to be explored in the data frame
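As a rough illustration of how the inputs described above fit together, the following sketch builds a toy data frame and a matching variableList, then checks the fold-count rule as it is stated for trainRepetition. The toy data, the column names (y, x1, x2, x3, Name, Description), and the minFolds helper are all assumptions made up for this example. The call itself is wrapped in if (FALSE) so that it is not run, and it assumes the FRESA.CAD package is installed.

```r
# Toy inputs (all names and values here are illustrative assumptions).
set.seed(42)
n <- 60
toy <- data.frame(
  y  = rnorm(n),
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n)
)

# variableList: first column holds candidate names, second their descriptions.
varList <- data.frame(
  Name        = c("x1", "x2", "x3"),
  Description = c("First marker", "Second marker", "Third marker"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: smallest fold count that satisfies the stated rule
# trainRepetition >= 1 / trainFraction.
minFolds <- function(trainFraction) ceiling(1 / trainFraction)

trainFraction   <- 0.67
trainRepetition <- 9
foldsOK <- trainRepetition >= minFolds(trainFraction)

# Every candidate variable must exist as a column of the data frame.
namesOK <- all(varList$Name %in% colnames(toy))

# Not run: the call itself, using only arguments listed in the usage above.
if (FALSE) {
  cv <- FRESA.CAD::crossValidationFeatureSelection_Res(
    size = 3, loops = 20, Outcome = "y",
    variableList = varList, data = toy, type = "LM",
    trainFraction = trainFraction, trainRepetition = trainRepetition
  )
}
```

Checking the inputs before the call is cheap compared with the cross-validation itself, which repeats the full feature selection at every fold.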
A list containing objects of class formula with the formulas used to fit the models found at each cycle
A data frame with the blind test set predictions made at each fold of the cross-validation (Full B:SWiMS, Median, Bagged, Forward, Backward Elimination), where the models used to generate such predictions (formula.list) were generated via a feature selection process that included only the train set. It also includes a column with the Outcome of each prediction and a column with the number of the fold at which the prediction was made.
A data frame similar to Models.testPrediction, but where the model used to generate the predictions was the Full model, generated via a feature selection process that included all data.
A list containing the values returned by bootstrapVarElimination_Res using all data and the model from updatedforwardModel
A list containing the values returned by ForwardSelection.Model.Res using all data
A list containing the values returned by updateModel.Res using all data and the model from forwardSelection
The global blind test root-mean-square error (RMSE) of the cross-validation procedure
The global blind test Pearson r product-moment correlation coefficient of the cross-validation procedure
The global blind test Spearman \(\rho\) rank correlation coefficient of the cross-validation procedure
The global blind test RMSE of the Full model
The global blind test Pearson r product-moment correlation coefficient of the Full model
The global blind test Spearman \(\rho\) rank correlation coefficient of the Full model
The train RMSE at each fold of the cross-validation procedure
The train Pearson r product-moment correlation coefficient at each fold of the cross-validation procedure
The train Spearman \(\rho\) rank correlation coefficient at each fold of the cross-validation procedure
The train RMSE of the Full model at each fold of the cross-validation procedure
The train Pearson r product-moment correlation coefficient of the Full model at each fold of the cross-validation procedure
The train Spearman \(\rho\) rank correlation coefficient of the Full model at each fold of the cross-validation procedure
The blind test RMSE at each fold of the cross-validation procedure
The blind test RMSE of the Full model at each fold of the cross-validation procedure
An object of class cv.glmnet containing the results of an elastic net cross-validation fit
A data frame similar to Models.testPrediction, but where the predictions were made by the elastic net model
A list with the elastic net Full model and the models found at each cross-validation fold
A vector with the mean square error for each blind fold
A vector with the Spearman correlation between prediction and outcome for each blind fold
A vector with the Pearson correlation between prediction and outcome for each blind fold
A vector with the C-index (Somers' Dxy rank correlation: rcorr.cens) between prediction and outcome for each blind fold
A vector with the Pearson correlation between the outcome and prediction for each repeated experiment
A vector with the Spearman correlation between the outcome and prediction for each repeated experiment
A vector with the root-mean-square (RMS) error between the outcome and prediction for each repeated experiment
A data frame with the outcome and the train prediction of every model
A data frame with the outcome and the train prediction at each CV fold for the main model
A data frame with the outcome and the prediction of each elastic net (LASSO) model
A data frame with the mean square of the train residuals from the univariate models of the model terms
A data frame with the mean square of the test residuals from the univariate models of the model terms
The ensemble prediction by all models on the test data
The list of formulas with "optimal" performance
The list of formulas produced by the forward procedure
The list of the bagged models
The list of variables used by LASSO fitting
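The error and correlation values listed above can be reproduced from any table of blind test predictions. The sketch below computes them with base R on toy data; the data frame pred, its column names (Outcome, Prediction, fold), and the rmse helper are assumptions made for this illustration and are not the names used in the returned object.

```r
# Toy blind-test predictions (column names are illustrative assumptions).
set.seed(1)
pred <- data.frame(
  Outcome = rnorm(30),
  fold    = rep(1:3, each = 10)
)
pred$Prediction <- pred$Outcome + rnorm(30, sd = 0.5)

# Root-mean-square error between observed and predicted values.
rmse <- function(obs, est) sqrt(mean((obs - est)^2))

# Global blind-test metrics: all folds pooled together.
globalRMSE     <- rmse(pred$Outcome, pred$Prediction)
globalPearson  <- cor(pred$Outcome, pred$Prediction, method = "pearson")
globalSpearman <- cor(pred$Outcome, pred$Prediction, method = "spearman")

# Per-fold blind-test RMSE: one value for each cross-validation fold.
foldRMSE <- sapply(split(pred, pred$fold),
                   function(d) rmse(d$Outcome, d$Prediction))
```

Comparing the per-fold values with the pooled ones gives a quick sense of how much the blind test performance varies from fold to fold.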
This function produces a set of data and plots that can be used to inspect the degree of over-fitting or shrinkage of a model. It uses bootstrapped data, cross-validation data, and, if possible, retrain data.
See also: crossValidationFeatureSelection_Bin, improvedResiduals, bootstrapVarElimination_Res