healthcareai (version 1.2.4)

XGBoostDevelopment: Compare predictive models created on your data

Description

This step allows you to create an XGBoost classification model based on your data. Use model type 'multiclass' with two or more classes. XGBoost is a fast ensemble model that is well suited to non-linear data; performance can depend on parameter tuning.

Usage

XGBoostDevelopment(type, df, grainCol, predictedCol, 
impute, debug, cores, modelName, xgb_params, xgb_nrounds)

Arguments

type

The type of model. Currently requires 'multiclass'.

df

Dataframe whose columns are used for the calculation.

grainCol

Optional. The dataframe column containing IDs at the grain of the prediction. ID columns are not required for this step.

predictedCol

Column that you want to predict. For multiclass models, this should be a categorical column with two or more classes.

impute

Set all-column imputation to TRUE or FALSE. If TRUE, numeric columns are imputed with the column mean and factor columns with the most frequent level; if FALSE, rows containing NULLs are removed. Imputed values are saved for the deployment step.
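The two imputation behaviors can be sketched in base R (this illustrates the documented behavior, not the package's internal code):

# A toy dataframe with missing values (illustrative only)
df <- data.frame(age = c(30, NA, 50),
                 sex = factor(c("F", "F", NA)))

# impute = TRUE: mean replacement for numeric columns,
# most frequent level for factor columns
df_imputed <- df
df_imputed$age[is.na(df_imputed$age)] <- mean(df$age, na.rm = TRUE)
mode_level <- names(which.max(table(df$sex)))
df_imputed$sex[is.na(df_imputed$sex)] <- mode_level

# impute = FALSE: rows containing missing values are removed
df_dropped <- na.omit(df)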

debug

Provides extended output to the console so you can monitor the calculations as they run. Use TRUE or FALSE.

cores

Number of cores you'd like to use. Defaults to 2.

modelName

Optional string specifying the model name. If used, you must load the same name in the deploy step.

xgb_params

A list containing optional xgboost parameters. The full list of parameters can be found at http://xgboost.readthedocs.io/en/latest/parameter.html.

xgb_nrounds

Number of rounds to use for boosting.

Format

An object of class R6ClassGenerator of length 24.

Methods

The arguments above are the parameters used to initialize a new XGBoostDevelopment class with $new(). Individual methods are documented below.

<code>$new()</code>

Initializes a new XGBoost development class using the parameters saved in p, documented above. This method loads, cleans, and prepares data for model training. Usage: $new(p)

<code>$run()</code>

Trains the model and displays predictions and class-wise performance. Usage: $run()

<code>$getPredictions()</code>

Returns the predictions from test data. Usage: $getPredictions()

<code>$generateConfusionMatrix()</code>

Returns the confusion matrix and statistics generated during model development. Usage: $generateConfusionMatrix()
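
Taken together, a typical call sequence looks like this (a sketch assuming a params object p built as shown in the Examples below):

boost <- XGBoostDevelopment$new(p)     # load, clean, and prepare data
boost$run()                            # train and display class-wise performance
preds <- boost$getPredictions()        # predictions on the test data
cm <- boost$generateConfusionMatrix()  # confusion matrix and statistics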

References

http://healthcareai-r.readthedocs.io

See Also

Information on the example dataset can be found at: http://archive.ics.uci.edu/ml/datasets/dermatology/

Information on the xgboost parameters can be found at: https://github.com/dmlc/xgboost/blob/master/doc/parameter.md

selectData

Examples

#### Example using csv dataset ####
ptm <- proc.time()
library(healthcareai)

# 1. Load data. Categorical columns should be characters.
csvfile <- system.file("extdata", 
                      "dermatology_multiclass_data.csv", 
                      package = "healthcareai")

# Replace csvfile with 'path/file'
df <- read.csv(file = csvfile,
               header = TRUE,
               stringsAsFactors = FALSE,
               na.strings = c("NULL", "NA", "", "?"))

str(df) # check the types of columns

# 2. Develop and save model
set.seed(42)
p <- SupervisedModelDevelopmentParams$new()
p$df <- df
p$type <- "multiclass"
p$impute <- TRUE
p$grainCol <- "PatientID"
p$predictedCol <- "target"
p$debug <- FALSE
p$cores <- 1
# xgb_params must be a list containing all of the parameters below.
# If you would like to tweak max_depth, eta, silent, or nthread, go for it!
# Leave objective and eval_metric as they are.
p$xgb_params <- list("objective" = "multi:softprob",
                     "eval_metric" = "mlogloss",
                     "max_depth" = 6, # max depth of each learner
                     "eta" = 0.1,     # learning rate
                     "silent" = 0,    # verbose output when set to 1
                     "nthread" = 2)   # number of processors to use

# Run model
boost <- XGBoostDevelopment$new(p)
boost$run()

# Get output data 
outputDF <- boost$getPredictions()
head(outputDF)
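
# The confusion matrix and class-wise statistics can also be
# retrieved directly, per the Methods section above:
cm <- boost$generateConfusionMatrix()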

print(proc.time() - ptm)

