blkbox (version 1.0)

blkboxCV: k-fold cross validation with blkbox.

Description

A function that builds upon the blkbox function, performing k-fold cross validation and returning the votes for each fold as well as the importance of each feature in the models.

Usage

blkboxCV(data, labels, folds = 10, seed, ntrees, mTry, repeats = 1, Kernel, Gamma, max.depth, xgtype = "binary:logistic", exclude = c(0), Method = "GLM", AUC = "NA")

Arguments

data
A data.frame where the columns correspond to features and the rows to samples. The data.frame will be shuffled and split into k folds for downstream analysis.
labels
A character or numeric vector of the class identifiers to which each sample belongs.
folds
The number of folds (k) into which the data set will be split; each fold contains approximately (number of samples) / k samples, and if the division is not exact the groups will be as close in size as possible. Each fold is used once as the holdout portion. default = 10.
seed
A numeric value. Defaults to a randomly generated set of seeds that are printed when the run starts.
ntrees
The number of trees used in the ensemble-based learners (randomforest, bigrf, party, bartmachine). default = 500.
mTry
The number of features sampled at each node in the trees of the ensemble-based learners (randomforest, bigrf, party, bartmachine). default = sqrt(number of features).
repeats
The number of times the cross validation process is repeated. default = 1.
Kernel
The type of kernel used in the support vector machine algorithm (linear, radial, sigmoid, polynomial). default = "linear".
Gamma
Advanced parameter that defines how far the influence of a single training example reaches. A low Gamma produces an SVM with softer boundaries; as Gamma increases, the boundaries eventually become restricted to the individual support vectors. default = 1/(ncol - 1).
max.depth
The maximum depth of the trees in the xgboost model. default = sqrt(ncol(data)).
xgtype
either "binary:logistic" or "reg:linear" for logistic regression or linear regression respectively.
exclude
Removes certain algorithms from the analysis; for example, to exclude random forest set exclude = "randomforest". Each algorithm has its own character identifier: randomforest = "randomforest", knn = "kknn", bartmachine = "bartmachine", party = "party", glmnet = "GLM", pam = "PamR", nnet = "nnet", svm = "SVM", xgboost = "xgboost".
Method
The algorithm used for feature selection. Its feature importance scores are used to rank the features and remove anything below the AUC threshold. Default is "GLM".
AUC
Area under the curve selection measure. The relative importance of each feature is calculated and ranked; the AUC value is the cumulative-importance percentile above which features are kept. For example, 0.5 keeps the highest-ranked features responsible for 50 percent of the cumulative importance. Default is NA, meaning features are not selected after CV. Will default to 1.0 if Method is "xgboost".
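
To illustrate how these arguments fit together, here is a minimal sketch. my_data and my_labels are hypothetical placeholders for a feature data.frame and matching label vector, the specific values chosen are assumptions rather than recommended settings, and passing a character vector of identifiers to exclude is also assumed here.

# my_data / my_labels are hypothetical user-supplied objects, not shipped with blkbox.
cv_run <- blkboxCV(data    = my_data,
                   labels  = my_labels,
                   folds   = 5,                            # split the data into 5 folds
                   repeats = 2,                            # repeat the whole CV process twice
                   ntrees  = 1000,                         # trees for the ensemble-based learners
                   exclude = c("bartmachine", "xgboost"),  # drop these algorithms from the analysis
                   Method  = "GLM",                        # rank features by GLM importance
                   AUC     = 0.5)                          # keep features covering 50% of cumulative importance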

Examples


model_2 <- blkboxCV(data = my_data, labels = my_labels)
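
To make this example self-contained, a toy data set can stand in for my_data and my_labels; the objects constructed below are illustrative assumptions, not data shipped with blkbox.

# Illustrative toy data: 40 samples, 20 numeric features, two classes "A" and "B".
set.seed(1)
my_data   <- as.data.frame(matrix(rnorm(40 * 20), nrow = 40,
                                  dimnames = list(NULL, paste0("feature_", 1:20))))
my_labels <- rep(c("A", "B"), each = 20)
model_2   <- blkboxCV(data = my_data, labels = my_labels)  # same call as the example above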
