C5_rules: General Interface for C5.0 Rule-Based Classification Models

Description

C5_rules() is a way to generate a specification of a model before fitting. The main arguments for the model are:

trees: The number of sequential models included in the ensemble (rules are derived from an initial set of boosted trees).
min_n: The minimum number of data points in a node that are required for the node to be split further.

These arguments are converted to their engine-specific names when the model is fit. Other options and arguments can be set using parsnip::set_engine(). If left at their defaults here (NULL), the values are taken from the underlying model functions. If parameters need to be modified, update() can be used in lieu of recreating the object from scratch.

Usage

C5_rules(mode = "classification", trees = NULL, min_n = NULL)
# S3 method for C5_rules
update(
  object,
  parameters = NULL,
  trees = NULL,
  min_n = NULL,
  fresh = FALSE,
  ...
)

Arguments

mode

A single character string for the type of model. The only possible value for this model is "classification".

trees

A non-negative integer (no greater than 100) for the number of members of the ensemble.

min_n

An integer between zero and nine for the minimum number of data points in a node that are required for the node to be split further.

object

A C5_rules model specification.

parameters

A 1-row tibble or named list with main parameters to update. If the individual arguments are used, these will supersede the values in parameters. Also, using engine arguments in this object will result in an error.

fresh

A logical for whether the arguments should be modified in-place or replaced wholesale.

...

Not used for update().

Value

An updated parsnip model specification.
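
As a rough illustration of the update() arguments described above, the sketch below changes a specification through a named list passed to parameters and then through fresh = TRUE. The object names (spec, upd_list, upd_fresh) are hypothetical, and the code assumes the rules package is installed.

```r
library(rules)  # also attaches parsnip

spec <- C5_rules(trees = 10, min_n = 2)

# Update a main parameter through a named list; min_n is left as it was
upd_list <- update(spec, parameters = list(trees = 5))
upd_list

# With fresh = TRUE the arguments are replaced wholesale, so min_n is
# no longer carried over from the original specification
upd_fresh <- update(spec, trees = 1, fresh = TRUE)
upd_fresh
```

Printing each object shows which main arguments remain set after the update.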

Details

C5.0 is a classification model that is an extension of the C4.5 model of Quinlan (1993). It has tree- and rule-based versions that also include boosting capabilities. C5_rules() enables the version of the model that uses a series of rules (see the examples below). To make a set of rules, an initial C5.0 tree is created and flattened into rules. The rules are pruned, simplified, and ordered. Rule sets are created within each iteration of boosting.
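
To make the difference between the tree- and rule-based versions concrete, the sketch below calls the underlying C50 package directly. The iris data set is an assumption used only for illustration and is not part of this page; the rules = TRUE argument of C50::C5.0() is what requests the rule-based version.

```r
library(C50)

# Assumption: iris is a stand-in data set chosen for this illustration
tree_mod <- C5.0(Species ~ ., data = iris)                # tree-based version
rule_mod <- C5.0(Species ~ ., data = iris, rules = TRUE)  # rule-based version

# The summary of the rule-based fit lists the individual rules
summary(rule_mod)
```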

The two main tuning parameters are the number of trees in the boosting ensemble (trees) and the number of samples required to continue splitting when creating a tree (min_n). There are no arguments to control the total number of rules in the ensemble.

C5_rules() does not require categorical predictors to be converted to numeric indicator values. However, parsnip::fit() always creates dummy variables, so if the categorical predictors should keep their original format, parsnip::fit_xy() is the better choice. When using the tune package, a recipe for pre-processing gives more control over how such predictors are encoded, since recipes do not automatically create dummy variables.
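
A minimal sketch of the fit_xy() route, assuming the modeldata package is installed. The names predictors, outcome, and xy_rules are hypothetical; the data set and tuning values mirror the ones used in the Examples section.

```r
library(rules)  # also attaches parsnip

data(ad_data, package = "modeldata")

# Separate predictors from the outcome; Genotype stays a factor column
predictors <- ad_data[, setdiff(names(ad_data), "Class")]
outcome <- ad_data$Class

set.seed(282782)
xy_rules <-
  C5_rules(trees = 1, min_n = 10) %>%
  fit_xy(x = predictors, y = outcome)

xy_rules
```

Because no formula is involved, the factor predictor is handed to the model without being expanded into indicator columns.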

Note that C5.0 has a tool for early stopping during boosting where fewer iterations of boosting are performed than the number requested. C5_rules() turns this feature off (although it can be re-enabled using C50::C5.0Control()).
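
One way this might look in practice is sketched below. It assumes that a control object supplied through parsnip::set_engine() is forwarded to the underlying C50::C5.0() call; the object name boosted_rules is hypothetical.

```r
library(rules)  # also attaches parsnip

# Assumption: `control` is passed through set_engine() to C50::C5.0()
boosted_rules <-
  C5_rules(trees = 20) %>%
  set_engine("C5.0", control = C50::C5.0Control(earlyStopping = TRUE))

boosted_rules
```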

References

Quinlan R (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.

Examples

# NOT RUN {
library(rules)  # also attaches parsnip

C5_rules()
# Main arguments can be set when the specification is created:
C5_rules(trees = 7)

# ------------------------------------------------------------------------------

data(ad_data, package = "modeldata")

set.seed(282782)
class_rules <-
  C5_rules(trees = 1, min_n = 10) %>%
  fit(Class ~ ., data = ad_data)

summary(class_rules$fit)

# ------------------------------------------------------------------------------

model <- C5_rules(trees = 10, min_n = 2)
model
update(model, trees = 1)
update(model, trees = 1, fresh = TRUE)
# }
