MachineShop: Machine Learning Models and Tools
Overview
MachineShop is a meta-package for statistical and machine learning with a unified interface for model fitting, prediction, performance assessment, and presentation of results. Support is provided for predictive modeling of numerical, categorical, and censored time-to-event outcomes, including those listed in the table below, and for resample (bootstrap, cross-validation, and split training-test sets) estimation of model performance.
Features
- Unified and concise interface for model fitting, prediction, and performance assessment.
- Currently supports 49 established models from 25 R packages.
- Ensemble modeling with stacked regression and super learners.
- Modeling of response variable types: binary factors, multi-class nominal and ordinal factors, numeric vectors and matrices, and censored time-to-event survival.
- Model specification with traditional formulas and with flexible pre-processing recipes.
- Resample estimation of predictive performance, including cross-validation, bootstrap resampling, and split training-test set validation.
- Parallel execution of resampling algorithms.
- Choices of performance metrics: accuracy, areas under ROC and precision-recall curves, Brier score, coefficient of determination (R squared), concordance index, cross entropy, F-score, Gini coefficient, unweighted and weighted Cohen’s kappa, mean absolute error, mean squared error, mean squared log error, positive and negative predictive values, and precision and recall.
- Tabular and graphical performance summaries: calibration curves, confusion matrices, partial dependence plots, lift curves, and variable importance.
- Model tuning over automatically generated grids of parameter values and random sampling of grid points.
- Model selection and comparisons for any combination of models and model parameter values.
- User-definable models and performance metrics.
Models
Response variable types supported by each model are indicated in the Categorical, Continuous, and Survival columns.

| Method | Constructor | Categorical¹ | Continuous² | Survival³ |
|:-------|:------------|:------------:|:-----------:|:---------:|
| Bagging with Classification Trees | AdaBagModel | f | | |
| Boosting with Classification Trees | AdaBoostModel | f | | |
| Bayesian Additive Regression Trees | BARTModel | f | n | S |
| Bayesian Additive Regression Trees | BARTMachineModel | b | n | |
| Gradient Boosting with Regression Trees | BlackBoostModel | b | n | S |
| C5.0 Classification | C50Model | f | | |
| Conditional Random Forests | CForestModel | f | n | S |
| Cox Regression | CoxModel | | | S |
| Cox Regression (Stepwise) | CoxStepAICModel | | | S |
| Multivariate Adaptive Regression Splines | EarthModel | f | n | |
| Flexible Discriminant Analysis | FDAModel | f | | |
| Gradient Boosting with Additive Models | GAMBoostModel | b | n | S |
| Generalized Boosted Regression | GBMModel | f | n | S |
| Gradient Boosting with Linear Models | GLMBoostModel | b | n | S |
| Generalized Linear Models | GLMModel | b | n | |
| Generalized Linear Models (Stepwise) | GLMStepAICModel | b | n | |
| Lasso and Elastic-Net | GLMNetModel | f | m, n | S |
| K-Nearest Neighbors Model | KNNModel | f, o | n | |
| Least Angle Regression | LARSModel | | n | |
| Linear Discriminant Analysis | LDAModel | f | | |
| Linear Model | LMModel | f | m, n | |
| Mixture Discriminant Analysis | MDAModel | f | | |
| Naive Bayes Classifier | NaiveBayesModel | f | | |
| Feed-Forward Neural Networks | NNetModel | f | n | |
| Penalized Discriminant Analysis | PDAModel | f | | |
| Partial Least Squares | PLSModel | f | n | |
| Ordered Logistic Regression | POLRModel | o | | |
| Quadratic Discriminant Analysis | QDAModel | f | | |
| Random Forests | RandomForestModel | f | n | |
| Fast Random Forests | RangerModel | f | n | S |
| Recursive Partitioning and Regression Trees | RPartModel | f | n | S |
| Stacked Regression | StackedModel | f, o | m, n | S |
| Super Learner | SuperModel | f, o | m, n | S |
| Parametric Survival | SurvRegModel | | | S |
| Parametric Survival (Stepwise) | SurvRegStepAICModel | | | S |
| Support Vector Machines | SVMModel | f | n | |
| Support Vector Machines (ANOVA) | SVMANOVAModel | f | n | |
| Support Vector Machines (Bessel) | SVMBesselModel | f | n | |
| Support Vector Machines (Laplace) | SVMLaplaceModel | f | n | |
| Support Vector Machines (Linear) | SVMLinearModel | f | n | |
| Support Vector Machines (Poly) | SVMPolyModel | f | n | |
| Support Vector Machines (Radial) | SVMRadialModel | f | n | |
| Support Vector Machines (Spline) | SVMSplineModel | f | n | |
| Support Vector Machines (Tanh) | SVMTanhModel | f | n | |
| Regression and Classification Trees | TreeModel | f | n | |
| Extreme Gradient Boosting | XGBModel | f | n | |
| Extreme Gradient Boosting (DART) | XGBDARTModel | f | n | |
| Extreme Gradient Boosting (Linear) | XGBLinearModel | f | n | |
| Extreme Gradient Boosting (Tree) | XGBTreeModel | f | n | |

¹ b = binary, f = factor, o = ordered
² m = matrix, n = numeric
³ S = Surv
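The entries in the response type columns can be checked programmatically: given a prototype of the response variable, modelinfo() lists the applicable models (the factor case is shown in the Example section below). A minimal sketch for the numeric and survival columns, assuming Surv from the survival package as the time-to-event prototype:

library(MachineShop)
library(survival)  # provides the Surv response prototype

## Models applicable to numeric responses
names(modelinfo(numeric(0)))

## Models applicable to censored time-to-event responses
names(modelinfo(Surv(0)))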
Installation
# Current release from CRAN
install.packages("MachineShop")
# Development version from GitHub
# install.packages("devtools")
devtools::install_github("brian-j-smith/MachineShop", ref = "develop")
# Development version with vignettes
devtools::install_github("brian-j-smith/MachineShop", ref = "develop", build_vignettes = TRUE)
Documentation
Once the package is installed, general documentation on its usage can be viewed with the following console commands.
library(MachineShop)
# Package help summary
?MachineShop
# Vignette
RShowDoc("Introduction", package = "MachineShop")
Parallel Computing
Resampling algorithms will be executed in parallel automatically if a parallel backend for the foreach package, such as doParallel, is loaded.
library(doParallel)
registerDoParallel(cores = 4)
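A registered backend stays in effect for subsequent calls. To return to sequential execution, use foreach's registerDoSEQ().

## Revert to sequential execution when parallelism is no longer needed
registerDoSEQ()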
Example
The following is a brief example illustrating use of the package to predict the species of flowers in Edgar Anderson’s iris data set.
Training and Test Set Analysis
## Load the package
library(MachineShop)
library(magrittr)
## Iris flower species (3 level response) data set
head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
## Training and test sets
set.seed(123)
trainindices <- sample(nrow(iris), nrow(iris) * 2 / 3)
train <- iris[trainindices, ]
test <- iris[-trainindices, ]
## Model formula
fo <- Species ~ .
## Models by response type
modelinfo(factor(0)) %>% names
#> [1] "AdaBagModel" "AdaBoostModel" "BARTModel"
#> [4] "C50Model" "CForestModel" "EarthModel"
#> [7] "FDAModel" "GBMModel" "GLMNetModel"
#> [10] "KNNModel" "LDAModel" "LMModel"
#> [13] "MDAModel" "NaiveBayesModel" "NNetModel"
#> [16] "PDAModel" "PLSModel" "QDAModel"
#> [19] "RandomForestModel" "RangerModel" "RPartModel"
#> [22] "StackedModel" "SuperModel" "SVMModel"
#> [25] "SVMANOVAModel" "SVMBesselModel" "SVMLaplaceModel"
#> [28] "SVMLinearModel" "SVMPolyModel" "SVMRadialModel"
#> [31] "SVMSplineModel" "SVMTanhModel" "TreeModel"
#> [34] "XGBModel" "XGBDARTModel" "XGBLinearModel"
#> [37] "XGBTreeModel"
## Model-specific information
modelinfo(GBMModel)
#> $GBMModel
#> $GBMModel$label
#> [1] "Generalized Boosted Regression"
#>
#> $GBMModel$packages
#> [1] "gbm"
#>
#> $GBMModel$types
#> [1] "factor" "numeric" "Surv"
#>
#> $GBMModel$arguments
#> function (distribution = NULL, n.trees = 100, interaction.depth = 1,
#> n.minobsinnode = 10, shrinkage = 0.1, bag.fraction = 0.5)
#> NULL
#>
#> $GBMModel$grid
#> [1] TRUE
#>
#> $GBMModel$varimp
#> [1] TRUE
## Generalized boosted model fit to training set
gbmfit <- fit(fo, data = train, model = GBMModel)
## Variable importance
(vi <- varimp(gbmfit))
#> Overall
#> Petal.Length 100.000000
#> Petal.Width 14.601856
#> Sepal.Width 1.438558
#> Sepal.Length 0.000000
plot(vi)
## Test set predicted probabilities
predict(gbmfit, newdata = test, type = "prob") %>% head
#> setosa versicolor virginica
#> [1,] 0.9999737 2.627994e-05 4.493784e-08
#> [2,] 0.9999154 8.456383e-05 4.205211e-09
#> [3,] 0.9999154 8.456383e-05 4.205211e-09
#> [4,] 0.9999737 2.627994e-05 4.493784e-08
#> [5,] 0.9998807 1.192834e-04 2.987679e-09
#> [6,] 0.9999024 9.764241e-05 2.445639e-09
## Test set predicted classifications
predict(gbmfit, newdata = test) %>% head
#> [1] setosa setosa setosa setosa setosa setosa
#> Levels: setosa versicolor virginica
## Test set performance
obs <- response(fo, data = test)
pred <- predict(gbmfit, newdata = test, type = "prob")
performance(obs, pred)
#> Accuracy Kappa ROCAUC Brier
#> 0.9200000 0.8793727 0.9837585 0.1502442
Resampling
## Resample estimation of model performance
(res <- resample(fo, data = iris, model = GBMModel, control = CVControl))
#> An object of class "Resamples"
#>
#> Models: GBMModel
#> Stratification variable: (strata)
#>
#> An object from class "MLControl"
#>
#> Name: CVControl
#> Label: K-Fold Cross-Validation
#> Folds: 10
#> Repeats: 1
#> Seed: 253452646
summary(res)
#> Mean Median SD Min Max NA
#> Accuracy 0.95333333 0.9333333 0.03220306 9.333333e-01 1.0000000 0
#> Kappa 0.93000000 0.9000000 0.04830459 9.000000e-01 1.0000000 0
#> ROCAUC 0.98800000 1.0000000 0.02305120 9.333333e-01 1.0000000 0
#> Brier 0.08476969 0.1133233 0.05621648 5.140496e-05 0.1372037 0
plot(res)
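Ten-fold cross-validation is the default shown above. The bootstrap and split training-test set methods from the features list have analogous control constructors; a hedged sketch (the BootControl and SplitControl constructors and their samples and prop arguments are assumed here; see ?MLControl to confirm):

## Bootstrap resampling ('samples' assumed to set the number of bootstrap samples)
res_boot <- resample(fo, data = iris, model = GBMModel,
                     control = BootControl(samples = 100))

## Split-sample validation ('prop' assumed to set the training proportion)
res_split <- resample(fo, data = iris, model = GBMModel,
                      control = SplitControl(prop = 2/3))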
Performance Metrics
## Default performance metrics
performance(res) %>% summary
#> Mean Median SD Min Max NA
#> Accuracy 0.95333333 0.9333333 0.03220306 9.333333e-01 1.0000000 0
#> Kappa 0.93000000 0.9000000 0.04830459 9.000000e-01 1.0000000 0
#> ROCAUC 0.98800000 1.0000000 0.02305120 9.333333e-01 1.0000000 0
#> Brier 0.08476969 0.1133233 0.05621648 5.140496e-05 0.1372037 0
## All available metric functions
metricinfo() %>% names
#> [1] "accuracy" "brier" "cindex"
#> [4] "cross_entropy" "f_score" "gini"
#> [7] "kappa2" "mae" "mse"
#> [10] "msle" "npv" "ppv"
#> [13] "pr_auc" "precision" "r2"
#> [16] "recall" "rmse" "rmsle"
#> [19] "roc_auc" "roc_index" "sensitivity"
#> [22] "specificity" "weighted_kappa2"
## Metrics available for resample output
metricinfo(res) %>% names
#> [1] "accuracy" "brier" "cross_entropy" "kappa2"
#> [5] "pr_auc" "roc_auc"
## User-specified metrics
performance(res, c(accuracy, kappa2)) %>% summary
#> Mean Median SD Min Max NA
#> accuracy 0.9533333 0.9333333 0.03220306 0.9333333 1 0
#> kappa2 0.9300000 0.9000000 0.04830459 0.9000000 1 0
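Per the user-definable metrics feature, metrics may also be supplied as plain functions of the observed and predicted responses. A hedged sketch with a hypothetical metric, assuming that for this three-level factor response the predictions arrive as a matrix of class probabilities with columns in factor-level order:

## Hypothetical metric: mean probability assigned to the observed class
mean_obs_prob <- function(observed, predicted, ...) {
  mean(predicted[cbind(seq_along(observed), as.integer(observed))])
}

performance(res, mean_obs_prob) %>% summary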
Model Tuning
## Tune over a grid of model parameters
gbmtune <- tune(fo, data = iris, model = GBMModel,
                grid = expand.grid(n.trees = c(25, 50, 100),
                                   interaction.depth = 1:3,
                                   n.minobsinnode = c(5, 10)))
plot(gbmtune, type = "line")
## Fit the selected model
gbmtunefit <- fit(fo, data = iris, model = gbmtune)
varimp(gbmtunefit)
#> Overall
#> Petal.Length 100.00000
#> Petal.Width 28.00921
#> Sepal.Width 2.30804
#> Sepal.Length 0.00000
Model Comparisons
## Model comparisons
control <- CVControl(folds = 10, repeats = 5)
gbmres <- resample(fo, data = iris, model = GBMModel(n.trees = 50), control = control)
rfres <- resample(fo, data = iris, model = RandomForestModel(ntree = 50), control = control)
nnetres <- resample(fo, data = iris, model = NNetModel(size = 5), control = control)
res <- Resamples(GBM = gbmres, RF = rfres, NNet = nnetres)
summary(res)
#> , , Accuracy
#>
#> Mean Median SD Min Max NA
#> GBM 0.9466667 0.9333333 0.05387480 0.8 1 0
#> NNet 0.9373333 1.0000000 0.11773370 0.4 1 0
#> RF 0.9560000 0.9666667 0.05321415 0.8 1 0
#>
#> , , Kappa
#>
#> Mean Median SD Min Max NA
#> GBM 0.920 0.90 0.08081220 0.7 1 0
#> NNet 0.902 1.00 0.21236280 -0.3 1 0
#> RF 0.934 0.95 0.07982123 0.7 1 0
#>
#> , , ROCAUC
#>
#> Mean Median SD Min Max NA
#> GBM 0.9897333 1 0.01962043 0.9200000 1 0
#> NNet 0.9661333 1 0.10392602 0.3000000 1 0
#> RF 0.9939333 1 0.01405674 0.9266667 1 0
#>
#> , , Brier
#>
#> Mean Median SD Min Max NA
#> GBM 0.08282749 0.063761882 0.08832055 2.740079e-05 0.3850368 0
#> NNet 0.09151662 0.005049339 0.13523087 1.367731e-51 0.6666668 0
#> RF 0.06990080 0.048453333 0.07184654 1.600000e-04 0.3176000 0
plot(res)
## Pairwise model differences and t-tests
perfdiff <- diff(res)
summary(perfdiff)
#> , , Accuracy
#>
#> Mean Median SD Min Max NA
#> GBM - NNet 0.008 0 0.09955912 -0.06666667 0.53333333 0
#> GBM - RF -0.008 0 0.03198639 -0.06666667 0.06666667 0
#> NNet - RF -0.016 0 0.11070584 -0.60000000 0.06666667 0
#>
#> , , Kappa
#>
#> Mean Median SD Min Max NA
#> GBM - NNet 0.014 0 0.16538144 -0.1 0.9 0
#> GBM - RF -0.012 0 0.04797959 -0.1 0.1 0
#> NNet - RF -0.026 0 0.18273433 -1.0 0.1 0
#>
#> , , ROCAUC
#>
#> Mean Median SD Min Max NA
#> GBM - NNet 0.0236 0 0.10167049 -0.06666667 0.67333333 0
#> GBM - RF -0.0042 0 0.01243231 -0.05333333 0.01666667 0
#> NNet - RF -0.0278 0 0.10178366 -0.68000000 0.05333333 0
#>
#> , , Brier
#>
#> Mean Median SD Min Max NA
#> GBM - NNet -0.008689127 0.0007342389 0.1032869 -0.49047781 0.1318379 0
#> GBM - RF 0.012926694 0.0037022372 0.0339893 -0.04799376 0.1196612 0
#> NNet - RF 0.021615821 -0.0011220887 0.1108912 -0.13237306 0.5181868 0
t.test(perfdiff)
#> An object of class "HTestPerformanceDiff"
#>
#> Upper diagonal: mean differences (row - column)
#> Lower diagonal: p-values
#> P-value adjustment method: holm
#>
#> , , Accuracy
#>
#> GBM NNet RF
#> GBM NA 0.0080000 -0.008
#> NNet 0.6236368 NA -0.016
#> RF 0.2495969 0.6236368 NA
#>
#> , , Kappa
#>
#> GBM NNet RF
#> GBM NA 0.0140000 -0.012
#> NNet 0.6386268 NA -0.026
#> RF 0.2495969 0.6386268 NA
#>
#> , , ROCAUC
#>
#> GBM NNet RF
#> GBM NA 0.0236000 -0.0042
#> NNet 0.11847985 NA -0.0278
#> RF 0.06239545 0.1184798 NA
#>
#> , , Brier
#>
#> GBM NNet RF
#> GBM NA -0.008689127 0.01292669
#> NNet 0.55467315 NA 0.02161582
#> RF 0.02928367 0.348712945 NA
plot(perfdiff)
Ensemble Models
## Stacked regression
stackedres <- resample(fo, data = iris, model = StackedModel(GBMModel, RandomForestModel, NNetModel))
summary(stackedres)
#> Mean Median SD Min Max NA
#> Accuracy 0.95333333 0.96666667 0.05488484 8.666667e-01 1.0000000 0
#> Kappa 0.93000000 0.95000000 0.08232726 8.000000e-01 1.0000000 0
#> ROCAUC 0.99866667 1.00000000 0.00421637 9.866667e-01 1.0000000 0
#> Brier 0.07104383 0.04671106 0.07601605 4.280279e-05 0.1982373 0
## Super learner
superres <- resample(fo, data = iris, model = SuperModel(GBMModel, RandomForestModel, NNetModel))
summary(superres)
#> Mean Median SD Min Max NA
#> Accuracy 0.95333333 0.9666667 0.06324555 8.000000e-01 1.0000000 0
#> Kappa 0.93000000 0.9500000 0.09486833 7.000000e-01 1.0000000 0
#> ROCAUC 0.98133333 1.0000000 0.03155243 9.200000e-01 1.0000000 0
#> Brier 0.06817479 0.0322495 0.09179246 4.466662e-06 0.2782804 0
Calibration Curves
cal <- calibration(res)
plot(cal, se = TRUE)
Confusion Matrices
(conf <- confusion(gbmres, cutoff = NULL))
#> GBMModel :
#> Observed
#> Predicted setosa versicolor virginica
#> setosa 249.24113696 0.25037765 0.09683350
#> versicolor 0.74235768 228.85744589 23.18256928
#> virginica 0.01650536 20.89217646 226.72059721
summary(conf)
#> GBMModel :
#> Number of responses: 750
#> Accuracy (SE): 0.9397589 (0.008688084)
#> Majority class: 0.3333333
#> Kappa: 0.9096384
#>
#> setosa versicolor virginica
#> Observed 0.3333333 0.3333333 0.3333333
#> Predicted 0.3327845 0.3370432 0.3301724
#> Agreement 0.3323215 0.3051433 0.3022941
#> Sensitivity 0.9969645 0.9154298 0.9068824
#> Specificity 0.9993056 0.9521501 0.9581826
#> PPV 0.9986089 0.9053537 0.9155646
#> NPV 0.9984835 0.9574783 0.9536609
plot(conf)
Partial Dependence Plots
pd <- dependence(gbmfit, select = c(Petal.Length, Petal.Width))
plot(pd)
Lift Curves
## Requires a binary outcome
fo_versicolor <- factor(Species == "versicolor") ~ .
control <- CVControl()
gbmres_versicolor <- resample(fo_versicolor, data = iris, model = GBMModel, control = control)
lf <- lift(gbmres_versicolor)
plot(lf)
rfres_versicolor <- resample(fo_versicolor, data = iris, model = RandomForestModel, control = control)
nnetres_versicolor <- resample(fo_versicolor, data = iris, model = NNetModel, control = control)
res_versicolor <- Resamples(GBM = gbmres_versicolor, RF = rfres_versicolor, NNet = nnetres_versicolor)
lf <- lift(res_versicolor)
plot(lf, find = 75)
Preprocessing Recipes
library(recipes)
rec <- recipe(fo, data = iris) %>%
  add_role(Species, new_role = "case_strata") %>%  # stratify resampling by the response
  step_center(all_predictors()) %>%                # center predictors
  step_scale(all_predictors()) %>%                 # scale predictors to unit variance
  step_pca(all_predictors())                       # replace predictors with principal components
fit_rec <- fit(rec, model = GBMModel)
varimp(fit_rec)
#> Overall
#> PC1 100.000000
#> PC3 4.946313
#> PC2 1.089649
#> PC4 0.000000
res_rec <- resample(rec, model = GBMModel, control = CVControl)
summary(res_rec)
#> Mean Median SD Min Max NA
#> Accuracy 0.96000000 0.96666667 0.04661373 0.866666667 1.0000000 0
#> Kappa 0.94000000 0.95000000 0.06992059 0.800000000 1.0000000 0
#> ROCAUC 0.99866667 1.00000000 0.00421637 0.986666667 1.0000000 0
#> Brier 0.06446083 0.04011043 0.05647578 0.003933401 0.1399515 0