MachineShop: Machine Learning Models and Tools

Overview

MachineShop is a meta-package for statistical and machine learning with a unified interface for model fitting, prediction, performance assessment, and presentation of results. Support is provided for predictive modeling of numerical, categorical, and censored time-to-event outcomes with the models listed in the table below, and for resample (bootstrap, cross-validation, and split training-test set) estimation of model performance.
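
In practice, fitting, prediction, and performance estimation all go through the same small set of verbs regardless of the chosen model. A minimal sketch of that workflow, using calls demonstrated at length in the Example section below:

## Unified workflow sketch: one model constructor, three verbs
library(MachineShop)
fo <- Species ~ .
model_fit <- fit(fo, data = iris, model = GBMModel)
head(predict(model_fit, newdata = iris))
summary(resample(fo, data = iris, model = GBMModel, control = CVControl))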

Features

  • Unified and concise interface for model fitting, prediction, and performance assessment.
  • Currently supports 49 established models from 25 R packages.
  • Ensemble modeling with stacked regression and super learners.
  • Modeling of response variable types: binary factors, multi-class nominal and ordinal factors, numeric vectors and matrices, and censored time-to-event survival outcomes.
  • Model specification with traditional formulas and with flexible pre-processing recipes.
  • Resample estimation of predictive performance, including cross-validation, bootstrap resampling, and split training-test set validation.
  • Parallel execution of resampling algorithms.
  • Choices of performance metrics: accuracy, areas under ROC and precision-recall curves, Brier score, coefficient of determination (R squared), concordance index, cross entropy, F-score, Gini coefficient, unweighted and weighted Cohen’s kappa, mean absolute error, mean squared error, mean squared log error, positive and negative predictive values, and precision and recall.
  • Tabular and graphical performance summaries: calibration curves, confusion matrices, partial dependence plots, lift curves, and variable importance.
  • Model tuning over automatically generated grids of parameter values and random sampling of grid points.
  • Model selection and comparisons for any combination of models and model parameter values.
  • User-definable models and performance metrics.

Models

| Method | Constructor | Categorical¹ | Continuous² | Survival³ |
|---|---|---|---|---|
| Bagging with Classification Trees | AdaBagModel | f | | |
| Boosting with Classification Trees | AdaBoostModel | f | | |
| Bayesian Additive Regression Trees | BARTModel | f | n | S |
| Bayesian Additive Regression Trees | BARTMachineModel | b | n | |
| Gradient Boosting with Regression Trees | BlackBoostModel | b | n | S |
| C5.0 Classification | C50Model | f | | |
| Conditional Random Forests | CForestModel | f | n | S |
| Cox Regression | CoxModel | | | S |
| Cox Regression (Stepwise) | CoxStepAICModel | | | S |
| Multivariate Adaptive Regression Splines | EarthModel | f | n | |
| Flexible Discriminant Analysis | FDAModel | f | | |
| Gradient Boosting with Additive Models | GAMBoostModel | b | n | S |
| Generalized Boosted Regression | GBMModel | f | n | S |
| Gradient Boosting with Linear Models | GLMBoostModel | b | n | S |
| Generalized Linear Models | GLMModel | b | n | |
| Generalized Linear Models (Stepwise) | GLMStepAICModel | b | n | |
| Lasso and Elastic-Net | GLMNetModel | f | m, n | S |
| K-Nearest Neighbors Model | KNNModel | f, o | n | |
| Least Angle Regression | LARSModel | | n | |
| Linear Discriminant Analysis | LDAModel | f | | |
| Linear Model | LMModel | f | m, n | |
| Mixture Discriminant Analysis | MDAModel | f | | |
| Naive Bayes Classifier | NaiveBayesModel | f | | |
| Feed-Forward Neural Networks | NNetModel | f | n | |
| Penalized Discriminant Analysis | PDAModel | f | | |
| Partial Least Squares | PLSModel | f | n | |
| Ordered Logistic Regression | POLRModel | o | | |
| Quadratic Discriminant Analysis | QDAModel | f | | |
| Random Forests | RandomForestModel | f | n | |
| Fast Random Forests | RangerModel | f | n | S |
| Recursive Partitioning and Regression Trees | RPartModel | f | n | S |
| Stacked Regression | StackedModel | f, o | m, n | S |
| Super Learner | SuperModel | f, o | m, n | S |
| Parametric Survival | SurvRegModel | | | S |
| Parametric Survival (Stepwise) | SurvRegStepAICModel | | | S |
| Support Vector Machines | SVMModel | f | n | |
| Support Vector Machines (ANOVA) | SVMANOVAModel | f | n | |
| Support Vector Machines (Bessel) | SVMBesselModel | f | n | |
| Support Vector Machines (Laplace) | SVMLaplaceModel | f | n | |
| Support Vector Machines (Linear) | SVMLinearModel | f | n | |
| Support Vector Machines (Poly) | SVMPolyModel | f | n | |
| Support Vector Machines (Radial) | SVMRadialModel | f | n | |
| Support Vector Machines (Spline) | SVMSplineModel | f | n | |
| Support Vector Machines (Tanh) | SVMTanhModel | f | n | |
| Regression and Classification Trees | TreeModel | f | n | |
| Extreme Gradient Boosting | XGBModel | f | n | |
| Extreme Gradient Boosting (DART) | XGBDARTModel | f | n | |
| Extreme Gradient Boosting (Linear) | XGBLinearModel | f | n | |
| Extreme Gradient Boosting (Tree) | XGBTreeModel | f | n | |

¹ b = binary, f = factor, o = ordered

² m = matrix, n = numeric

³ S = Surv
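
The response types in the table can also be queried programmatically. A short sketch, assuming the prototype-based modelinfo() queries (the factor(0) form is demonstrated in the Example section) extend to numeric and Surv prototypes:

## Models applicable to numeric and censored survival responses
library(MachineShop)
library(survival)
names(modelinfo(numeric(0)))
names(modelinfo(Surv(0)))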

Installation

# Current release from CRAN
install.packages("MachineShop")

# Development version from GitHub
# install.packages("devtools")
devtools::install_github("brian-j-smith/MachineShop", ref = "develop")

# Development version with vignettes
devtools::install_github("brian-j-smith/MachineShop", ref = "develop", build_vignettes = TRUE)

Documentation

Once the package is installed, general documentation on its usage can be viewed with the following console commands.

library(MachineShop)

# Package help summary
?MachineShop

# Vignette
RShowDoc("Introduction", package = "MachineShop")

Parallel Computing

Resampling algorithms will be executed in parallel automatically if a parallel backend for the foreach package, such as doParallel, is loaded.

library(doParallel)
registerDoParallel(cores = 4)
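
When the parallel backend is no longer needed, foreach can be returned to sequential execution:

## Revert foreach to sequential execution
registerDoSEQ()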

Example

The following is a brief example illustrating use of the package to predict the species of flowers in Edgar Anderson’s iris data set.

Training and Test Set Analysis

## Load the package
library(MachineShop)
library(magrittr)

## Iris flower species (3 level response) data set
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

## Training and test sets
set.seed(123)
trainindices <- sample(nrow(iris), nrow(iris) * 2 / 3)
train <- iris[trainindices, ]
test <- iris[-trainindices, ]

## Model formula
fo <- Species ~ .

## Models by response type
modelinfo(factor(0)) %>% names
#>  [1] "AdaBagModel"       "AdaBoostModel"     "BARTModel"        
#>  [4] "C50Model"          "CForestModel"      "EarthModel"       
#>  [7] "FDAModel"          "GBMModel"          "GLMNetModel"      
#> [10] "KNNModel"          "LDAModel"          "LMModel"          
#> [13] "MDAModel"          "NaiveBayesModel"   "NNetModel"        
#> [16] "PDAModel"          "PLSModel"          "QDAModel"         
#> [19] "RandomForestModel" "RangerModel"       "RPartModel"       
#> [22] "StackedModel"      "SuperModel"        "SVMModel"         
#> [25] "SVMANOVAModel"     "SVMBesselModel"    "SVMLaplaceModel"  
#> [28] "SVMLinearModel"    "SVMPolyModel"      "SVMRadialModel"   
#> [31] "SVMSplineModel"    "SVMTanhModel"      "TreeModel"        
#> [34] "XGBModel"          "XGBDARTModel"      "XGBLinearModel"   
#> [37] "XGBTreeModel"

## Model-specific information
modelinfo(GBMModel)
#> $GBMModel
#> $GBMModel$label
#> [1] "Generalized Boosted Regression"
#> 
#> $GBMModel$packages
#> [1] "gbm"
#> 
#> $GBMModel$types
#> [1] "factor"  "numeric" "Surv"   
#> 
#> $GBMModel$arguments
#> function (distribution = NULL, n.trees = 100, interaction.depth = 1, 
#>     n.minobsinnode = 10, shrinkage = 0.1, bag.fraction = 0.5) 
#> NULL
#> 
#> $GBMModel$grid
#> [1] TRUE
#> 
#> $GBMModel$varimp
#> [1] TRUE

## Generalized boosted model fit to training set
gbmfit <- fit(fo, data = train, model = GBMModel)

## Variable importance
(vi <- varimp(gbmfit))
#>                 Overall
#> Petal.Length 100.000000
#> Petal.Width   14.601856
#> Sepal.Width    1.438558
#> Sepal.Length   0.000000

plot(vi)

## Test set predicted probabilities
predict(gbmfit, newdata = test, type = "prob") %>% head
#>         setosa   versicolor    virginica
#> [1,] 0.9999737 2.627994e-05 4.493784e-08
#> [2,] 0.9999154 8.456383e-05 4.205211e-09
#> [3,] 0.9999154 8.456383e-05 4.205211e-09
#> [4,] 0.9999737 2.627994e-05 4.493784e-08
#> [5,] 0.9998807 1.192834e-04 2.987679e-09
#> [6,] 0.9999024 9.764241e-05 2.445639e-09

## Test set predicted classifications
predict(gbmfit, newdata = test) %>% head
#> [1] setosa setosa setosa setosa setosa setosa
#> Levels: setosa versicolor virginica

## Test set performance
obs <- response(fo, data = test)
pred <- predict(gbmfit, newdata = test, type = "prob")
performance(obs, pred)
#>  Accuracy     Kappa    ROCAUC     Brier 
#> 0.9200000 0.8793727 0.9837585 0.1502442

Resampling

## Resample estimation of model performance
(res <- resample(fo, data = iris, model = GBMModel, control = CVControl))
#> An object of class "Resamples"
#> 
#> Models: GBMModel
#> Stratification variable: (strata) 
#> 
#> An object from class "MLControl"
#> 
#> Name: CVControl
#> Label: K-Fold Cross-Validation
#> Folds: 10
#> Repeats: 1
#> Seed: 253452646

summary(res)
#>                Mean    Median         SD          Min       Max NA
#> Accuracy 0.95333333 0.9333333 0.03220306 9.333333e-01 1.0000000  0
#> Kappa    0.93000000 0.9000000 0.04830459 9.000000e-01 1.0000000  0
#> ROCAUC   0.98800000 1.0000000 0.02305120 9.333333e-01 1.0000000  0
#> Brier    0.08476969 0.1133233 0.05621648 5.140496e-05 0.1372037  0

plot(res)
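
CVControl is one of several resampling controls; bootstrap and split-sample estimation follow the same pattern. A sketch, with the BootControl and SplitControl constructors and their arguments assumed from the MLControl help page rather than demonstrated here:

## Bootstrap and split-sample alternatives to cross-validation
## (constructors and arguments assumed; see ?MLControl)
bootres <- resample(fo, data = iris, model = GBMModel,
                    control = BootControl(samples = 25))
splitres <- resample(fo, data = iris, model = GBMModel,
                     control = SplitControl(prop = 2/3))
summary(bootres)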

Performance Metrics

## Default performance metrics
performance(res) %>% summary
#>                Mean    Median         SD          Min       Max NA
#> Accuracy 0.95333333 0.9333333 0.03220306 9.333333e-01 1.0000000  0
#> Kappa    0.93000000 0.9000000 0.04830459 9.000000e-01 1.0000000  0
#> ROCAUC   0.98800000 1.0000000 0.02305120 9.333333e-01 1.0000000  0
#> Brier    0.08476969 0.1133233 0.05621648 5.140496e-05 0.1372037  0

## All available metric functions
metricinfo() %>% names
#>  [1] "accuracy"        "brier"           "cindex"         
#>  [4] "cross_entropy"   "f_score"         "gini"           
#>  [7] "kappa2"          "mae"             "mse"            
#> [10] "msle"            "npv"             "ppv"            
#> [13] "pr_auc"          "precision"       "r2"             
#> [16] "recall"          "rmse"            "rmsle"          
#> [19] "roc_auc"         "roc_index"       "sensitivity"    
#> [22] "specificity"     "weighted_kappa2"

## Metrics available for resample output
metricinfo(res) %>% names
#> [1] "accuracy"      "brier"         "cross_entropy" "kappa2"       
#> [5] "pr_auc"        "roc_auc"

## User-specified metrics
performance(res, c(accuracy, kappa2)) %>% summary
#>               Mean    Median         SD       Min Max NA
#> accuracy 0.9533333 0.9333333 0.03220306 0.9333333   1  0
#> kappa2   0.9300000 0.9000000 0.04830459 0.9000000   1  0
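
Metrics are also user-definable (see the MLMetric constructor in the function index at the end of this document). A minimal sketch of the idea, applying a hand-rolled misclassification rate to the test-set objects created earlier; whether such bare functions can be passed straight to performance() should be verified against ?MLMetric:

## Hand-rolled misclassification rate (illustrative helper, not a
## package function), applied to the earlier test-set predictions
misclass <- function(observed, predicted) mean(observed != predicted)
misclass(obs, predict(gbmfit, newdata = test))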

Model Tuning

## Tune over a grid of model parameters
gbmtune <- tune(fo, data = iris, model = GBMModel,
                grid = expand.grid(n.trees = c(25, 50, 100),
                                   interaction.depth = 1:3,
                                   n.minobsinnode = c(5, 10)))

plot(gbmtune, type = "line")

## Fit the selected model
gbmtunefit <- fit(fo, data = iris, model = gbmtune)
varimp(gbmtunefit)
#>                Overall
#> Petal.Length 100.00000
#> Petal.Width   28.00921
#> Sepal.Width    2.30804
#> Sepal.Length   0.00000
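
Tuning grids need not be fully enumerated: the Grid constructor (see the function index) controls automatically generated grids and random sampling of grid points. A sketch, with the length and random arguments assumed rather than confirmed:

## Random sampling of an automatically generated grid
## (length and random are assumed Grid arguments; see ?Grid)
gbmtune2 <- tune(fo, data = iris, model = GBMModel,
                 grid = Grid(length = 3, random = 10))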

Model Comparisons

## Model comparisons
control <- CVControl(folds = 10, repeats = 5)

gbmres <- resample(fo, data = iris, model = GBMModel(n.trees = 50), control = control)
rfres <- resample(fo, data = iris, model = RandomForestModel(ntree = 50), control = control)
nnetres <- resample(fo, data = iris, model = NNetModel(size = 5), control = control)

res <- Resamples(GBM = gbmres, RF = rfres, NNet = nnetres)
summary(res)
#> , , Accuracy
#> 
#>           Mean    Median         SD Min Max NA
#> GBM  0.9466667 0.9333333 0.05387480 0.8   1  0
#> NNet 0.9373333 1.0000000 0.11773370 0.4   1  0
#> RF   0.9560000 0.9666667 0.05321415 0.8   1  0
#> 
#> , , Kappa
#> 
#>       Mean Median         SD  Min Max NA
#> GBM  0.920   0.90 0.08081220  0.7   1  0
#> NNet 0.902   1.00 0.21236280 -0.3   1  0
#> RF   0.934   0.95 0.07982123  0.7   1  0
#> 
#> , , ROCAUC
#> 
#>           Mean Median         SD       Min Max NA
#> GBM  0.9897333      1 0.01962043 0.9200000   1  0
#> NNet 0.9661333      1 0.10392602 0.3000000   1  0
#> RF   0.9939333      1 0.01405674 0.9266667   1  0
#> 
#> , , Brier
#> 
#>            Mean      Median         SD          Min       Max NA
#> GBM  0.08282749 0.063761882 0.08832055 2.740079e-05 0.3850368  0
#> NNet 0.09151662 0.005049339 0.13523087 1.367731e-51 0.6666668  0
#> RF   0.06990080 0.048453333 0.07184654 1.600000e-04 0.3176000  0

plot(res)

## Pairwise model differences and t-tests
perfdiff <- diff(res)
summary(perfdiff)
#> , , Accuracy
#> 
#>              Mean Median         SD         Min        Max NA
#> GBM - NNet  0.008      0 0.09955912 -0.06666667 0.53333333  0
#> GBM - RF   -0.008      0 0.03198639 -0.06666667 0.06666667  0
#> NNet - RF  -0.016      0 0.11070584 -0.60000000 0.06666667  0
#> 
#> , , Kappa
#> 
#>              Mean Median         SD  Min Max NA
#> GBM - NNet  0.014      0 0.16538144 -0.1 0.9  0
#> GBM - RF   -0.012      0 0.04797959 -0.1 0.1  0
#> NNet - RF  -0.026      0 0.18273433 -1.0 0.1  0
#> 
#> , , ROCAUC
#> 
#>               Mean Median         SD         Min        Max NA
#> GBM - NNet  0.0236      0 0.10167049 -0.06666667 0.67333333  0
#> GBM - RF   -0.0042      0 0.01243231 -0.05333333 0.01666667  0
#> NNet - RF  -0.0278      0 0.10178366 -0.68000000 0.05333333  0
#> 
#> , , Brier
#> 
#>                    Mean        Median        SD         Min       Max NA
#> GBM - NNet -0.008689127  0.0007342389 0.1032869 -0.49047781 0.1318379  0
#> GBM - RF    0.012926694  0.0037022372 0.0339893 -0.04799376 0.1196612  0
#> NNet - RF   0.021615821 -0.0011220887 0.1108912 -0.13237306 0.5181868  0

t.test(perfdiff)
#> An object of class "HTestPerformanceDiff"
#> 
#> Upper diagonal: mean differences (row - column)
#> Lower diagonal: p-values
#> P-value adjustment method: holm
#> 
#> , , Accuracy
#> 
#>            GBM      NNet     RF
#> GBM         NA 0.0080000 -0.008
#> NNet 0.6236368        NA -0.016
#> RF   0.2495969 0.6236368     NA
#> 
#> , , Kappa
#> 
#>            GBM      NNet     RF
#> GBM         NA 0.0140000 -0.012
#> NNet 0.6386268        NA -0.026
#> RF   0.2495969 0.6386268     NA
#> 
#> , , ROCAUC
#> 
#>             GBM      NNet      RF
#> GBM          NA 0.0236000 -0.0042
#> NNet 0.11847985        NA -0.0278
#> RF   0.06239545 0.1184798      NA
#> 
#> , , Brier
#> 
#>             GBM         NNet         RF
#> GBM          NA -0.008689127 0.01292669
#> NNet 0.55467315           NA 0.02161582
#> RF   0.02928367  0.348712945         NA

plot(perfdiff)

Ensemble Models

## Stacked regression
stackedres <- resample(fo, data = iris, model = StackedModel(GBMModel, RandomForestModel, NNetModel))
summary(stackedres)
#>                Mean     Median         SD          Min       Max NA
#> Accuracy 0.95333333 0.96666667 0.05488484 8.666667e-01 1.0000000  0
#> Kappa    0.93000000 0.95000000 0.08232726 8.000000e-01 1.0000000  0
#> ROCAUC   0.99866667 1.00000000 0.00421637 9.866667e-01 1.0000000  0
#> Brier    0.07104383 0.04671106 0.07601605 4.280279e-05 0.1982373  0

## Super learner
superres <- resample(fo, data = iris, model = SuperModel(GBMModel, RandomForestModel, NNetModel))
summary(superres)
#>                Mean    Median         SD          Min       Max NA
#> Accuracy 0.95333333 0.9666667 0.06324555 8.000000e-01 1.0000000  0
#> Kappa    0.93000000 0.9500000 0.09486833 7.000000e-01 1.0000000  0
#> ROCAUC   0.98133333 1.0000000 0.03155243 9.200000e-01 1.0000000  0
#> Brier    0.06817479 0.0322495 0.09179246 4.466662e-06 0.2782804  0

Calibration Curves

cal <- calibration(res)
plot(cal, se = TRUE)

Confusion Matrices

(conf <- confusion(gbmres, cutoff = NULL))
#> GBMModel :
#>             Observed
#> Predicted          setosa   versicolor    virginica
#>   setosa     249.24113696   0.25037765   0.09683350
#>   versicolor   0.74235768 228.85744589  23.18256928
#>   virginica    0.01650536  20.89217646 226.72059721

summary(conf)
#> GBMModel :
#> Number of responses: 750
#> Accuracy (SE): 0.9397589 (0.008688084)
#> Majority class: 0.3333333
#> Kappa: 0.9096384
#> 
#>                setosa versicolor virginica
#> Observed    0.3333333  0.3333333 0.3333333
#> Predicted   0.3327845  0.3370432 0.3301724
#> Agreement   0.3323215  0.3051433 0.3022941
#> Sensitivity 0.9969645  0.9154298 0.9068824
#> Specificity 0.9993056  0.9521501 0.9581826
#> PPV         0.9986089  0.9053537 0.9155646
#> NPV         0.9984835  0.9574783 0.9536609

plot(conf)

Partial Dependence Plots

pd <- dependence(gbmfit, select = c(Petal.Length, Petal.Width))
plot(pd)

Lift Curves

## Requires a binary outcome
fo_versicolor <- factor(Species == "versicolor") ~ .
control <- CVControl()

gbmres_versicolor <- resample(fo_versicolor, data = iris, model = GBMModel, control = control)
lf <- lift(gbmres_versicolor)
plot(lf)

rfres_versicolor <- resample(fo_versicolor, data = iris, model = RandomForestModel, control = control)
nnetres_versicolor <- resample(fo_versicolor, data = iris, model = NNetModel, control = control)

res_versicolor <- Resamples(gbmres_versicolor, rfres_versicolor, nnetres_versicolor)
lf <- lift(res_versicolor)
plot(lf, find = 75)

Preprocessing Recipes

library(recipes)

rec <- recipe(fo, data = iris) %>%
  add_role(Species, new_role = "case_strata") %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(all_predictors())

fit_rec <- fit(rec, model = GBMModel)
varimp(fit_rec)
#>        Overall
#> PC1 100.000000
#> PC3   4.946313
#> PC2   1.089649
#> PC4   0.000000

res_rec <- resample(rec, model = GBMModel, control = CVControl)
summary(res_rec)
#>                Mean     Median         SD         Min       Max NA
#> Accuracy 0.96000000 0.96666667 0.04661373 0.866666667 1.0000000  0
#> Kappa    0.94000000 0.95000000 0.06992059 0.800000000 1.0000000  0
#> ROCAUC   0.99866667 1.00000000 0.00421637 0.986666667 1.0000000  0
#> Brier    0.06446083 0.04011043 0.05647578 0.003933401 0.1399515  0

Package Details

Version: 1.1.0
License: GPL-3
Maintainer: Brian Smith
Last Published: January 23rd, 2019

Functions in MachineShop (1.1.0)

  • KNNModel: Weighted k-Nearest Neighbor Model
  • BlackBoostModel: Gradient Boosting with Regression Trees
  • C50Model: C5.0 Decision Trees and Rule-Based Model
  • ModelFrame: ModelFrame Class
  • MachineShop-package: MachineShop: Machine Learning Models and Tools
  • LARSModel: Least Angle Regression, Lasso and Infinitesimal Forward Stagewise Models
  • calibration: Model Calibration
  • TreeModel: Classification and Regression Tree Models
  • confusion: Confusion Matrix
  • EarthModel: Multivariate Adaptive Regression Splines Model
  • FDAModel: Flexible and Penalized Discriminant Analysis Models
  • GLMNetModel: GLM Lasso or Elasticnet Model
  • XGBModel: Extreme Gradient Boosting Models
  • .: Quote Operator
  • expand.model: Model Expansion Over a Grid of Tuning Parameters
  • Grid: Tuning Grid Control
  • modelinfo: Display Model Information
  • performance: Model Performance Metrics
  • MLMetric: MLMetric Class Constructor
  • BARTMachineModel: Bayesian Additive Regression Trees Model
  • MLModel: MLModel Class Constructor
  • QDAModel: Quadratic Discriminant Analysis Model
  • BARTModel: Bayesian Additive Regression Trees Model
  • GAMBoostModel: Gradient Boosting with Additive Models
  • RPartModel: Recursive Partitioning and Regression Tree Models
  • GBMModel: Generalized Boosted Regression Model
  • metricinfo: Display Performance Metric Information
  • NNetModel: Neural Network Model
  • NaiveBayesModel: Naive Bayes Classifier Model
  • metrics: Performance Metrics
  • RandomForestModel: Random Forest Model
  • RangerModel: Fast Random Forest Model
  • tune: Model Tuning and Selection
  • varimp: Variable Importance
  • CForestModel: Conditional Random Forest Model
  • fit: Model Fitting
  • CoxModel: Proportional Hazards Regression Model
  • GLMBoostModel: Gradient Boosting with Linear Models
  • lift: Model Lift
  • resample: Resample Estimation of Model Performance
  • response: Extract Response Variable
  • GLMModel: Generalized Linear Model
  • MDAModel: Mixture Discriminant Analysis Model
  • MLControl: Resampling Controls
  • SuperModel: Super Learner Model
  • SurvRegModel: Parametric Survival Model
  • summary: Model Performance Summary
  • t.test: Paired t-Tests for Model Comparisons
  • AdaBagModel: Bagging with Classification Trees
  • AdaBoostModel: Boosting with Classification Trees
  • LDAModel: Linear Discriminant Analysis Model
  • LMModel: Linear Models
  • PLSModel: Partial Least Squares Model
  • POLRModel: Ordered Logistic or Probit Regression Model
  • StackedModel: Stacked Regression Model
  • SVMModel: Support Vector Machine Models
  • dependence: Partial Dependence
  • plot: Model Performance Plots
  • diff: Model Performance Differences
  • predict: Model Prediction