blkbox (version 1.0)

blkboxCV: k-fold cross validation with blkbox.

Description

A function that builds upon the blkbox function, performing k-fold cross validation and returning the votes for each fold as well as the importance of each feature in the models.

Usage

blkboxCV(data, labels, folds = 10, seed, ntrees, mTry, repeats = 1, Kernel, Gamma, max.depth, xgtype = "binary:logistic", exclude = c(0), Method = "GLM", AUC = "NA")

Arguments

data
A data.frame where the columns correspond to features and the rows to samples. The data.frame will be shuffled and split into k folds for downstream analysis.
labels
A character or numeric vector of the class identifiers to which each sample belongs.
folds
The number of folds (k) into which the data set will be split; each fold contains approximately (number of samples) / k samples, and if the division is not exact the groups will be as close in size as possible. Each fold is used once as the holdout portion. default = 10.
seed
A numeric value. Defaults to a randomly generated set of seeds that are printed when the run starts.
ntrees
The number of trees used in the ensemble-based learners (randomforest, bigrf, party, bartmachine). default = 500.
mTry
The number of features sampled at each node in the trees of the ensemble-based learners (randomforest, bigrf, party, bartmachine). default = sqrt(number of features).
repeats
The number of times the cross validation process is repeated. default = 1.
Kernel
The type of kernel used in the support vector machine algorithm (linear, radial, sigmoid, polynomial). default = "linear".
Gamma
Advanced parameter that defines how far the influence of a single training example reaches. A low Gamma produces an SVM with softer boundaries; as Gamma increases, the boundaries eventually become restricted to the individual support vectors. default = 1/(ncol - 1).
max.depth
The maximum depth of the trees in the xgboost model. default = sqrt(ncol(data)).
xgtype
either "binary:logistic" or "reg:linear" for logistic regression or linear regression respectively.
exclude
Removes certain algorithms from the analysis; for example, to exclude random forest set exclude = "randomforest". Each algorithm has its own character identifier: randomforest = "randomforest", knn = "kknn", bartmachine = "bartmachine", party = "party", glmnet = "GLM", pam = "PamR", nnet = "nnet", svm = "SVM", xgboost = "xgboost".
Method
The algorithm used for feature selection. Its feature importance scores are used to rank the features and remove anything below the AUC threshold. Default is "GLM".
AUC
Area under the curve selection measure. The relative importance of each feature is calculated and ranked; the AUC value is the cumulative-importance percentile above which features are kept. For example, 0.5 keeps the highest-ranked features responsible for 50 percent of the cumulative importance. Default is NA, meaning features are not selected after CV. Will default to 1.0 if Method is "xgboost".
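
To illustrate how these arguments fit together, here is a minimal sketch. my_data and my_labels are hypothetical placeholders for a feature data.frame and matching label vector, the specific values chosen are assumptions rather than recommended settings, and passing a character vector of identifiers to exclude is also assumed here.

# my_data / my_labels are hypothetical user-supplied objects, not shipped with blkbox.
cv_run <- blkboxCV(data    = my_data,
                   labels  = my_labels,
                   folds   = 5,                            # split the data into 5 folds
                   repeats = 2,                            # repeat the whole CV process twice
                   ntrees  = 1000,                         # trees for the ensemble-based learners
                   exclude = c("bartmachine", "xgboost"),  # drop these algorithms from the analysis
                   Method  = "GLM",                        # rank features by GLM importance
                   AUC     = 0.5)                          # keep features covering 50% of cumulative importance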

Examples


model_2 <- blkboxCV(data = my_data, labels = my_labels)
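
To make this example self-contained, a toy data set can stand in for my_data and my_labels; the objects constructed below are illustrative assumptions, not data shipped with blkbox.

# Illustrative toy data: 40 samples, 20 numeric features, two classes "A" and "B".
set.seed(1)
my_data   <- as.data.frame(matrix(rnorm(40 * 20), nrow = 40,
                                  dimnames = list(NULL, paste0("feature_", 1:20))))
my_labels <- rep(c("A", "B"), each = 20)
model_2   <- blkboxCV(data = my_data, labels = my_labels)  # same call as the example above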
