Cubist is a rule-based ensemble regression model. A basic model tree
(Quinlan, 1992) is created that has a separate linear regression model
for each terminal node. The paths along the model tree are flattened into
rules, and these rules are then simplified and pruned. The parameter min_n
is the primary method for controlling the size of each tree, while
max_rules controls the number of rules.

Cubist ensembles are created using committees, which are similar to
boosting. After the first model in the committee is created, the second
model uses a modified version of the outcome data based on whether the
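previous model under- or over-predicted the outcome. For iteration m, the
new outcome y* is computed using

$$y^*_{(m)} = y - \left(\hat{y}_{(m-1)} - y\right)$$

where $\hat{y}_{(m-1)}$ is the prediction from the previous iteration.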
If a sample is under-predicted on the previous iteration, the outcome is
adjusted so that the next time it is more likely to be over-predicted to
compensate. This adjustment continues for each ensemble iteration. See
Kuhn and Johnson (2013) for details.
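
The outcome adjustment can be illustrated with a few made-up numbers. This is only a sketch in base R, assuming the adjustment takes the form y* = y - (y_prev - y) described in Kuhn and Johnson (2013); the vectors y and y_prev are hypothetical and nothing here calls the Cubist code itself.

```r
# Observed outcomes and the previous committee member's predictions
# (hypothetical values chosen for illustration)
y      <- c(10, 20, 30)
y_prev <- c( 8, 23, 30)

# Assumed adjustment: y* = y - (y_prev - y)
y_star <- y - (y_prev - y)

# Sample 1 was under-predicted (8 < 10), so its target moves up;
# sample 2 was over-predicted (23 > 20), so its target moves down;
# sample 3 was predicted exactly, so its target is unchanged
y_star
```

The next committee member would then be trained on y_star rather than y.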
After the model is created, there is also an option for a post-hoc
adjustment that uses the training set (Quinlan, 1993). When a new sample is
predicted by the model, it can be modified by its nearest neighbors in the
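original training set. For K neighbors, the model-based predicted value is
adjusted by the neighbors using:

$$\frac{1}{K}\sum_{\ell=1}^{K} w_\ell \left[ t_\ell + \left(\hat{y} - \hat{t}_\ell\right) \right]$$

where $\hat{y}$ is the model prediction for the new sample, $t_\ell$ is the
observed outcome for the $\ell$-th training set neighbor, $\hat{t}_\ell$ is
the model prediction for that neighbor, and $w_\ell$ is a weight that is
inversely related to the distance to the neighbor.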
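
A small numeric sketch of the neighbor adjustment may help. It assumes the weighted form from Kuhn and Johnson (2013), in which each neighbor contributes its observed outcome plus the difference between the new sample's prediction and the neighbor's own prediction. All values are hypothetical, and the way the weights are scaled here (inverse distance, normalized to sum to K) is an assumption for illustration, not necessarily what the Cubist code does internally.

```r
# Hypothetical values for one new sample and K = 3 training set neighbors
y_hat <- 25               # model prediction for the new sample
t_obs <- c(24, 27, 30)    # observed outcomes of the neighbors
t_hat <- c(23, 28, 29)    # model predictions for the neighbors
d     <- c(1, 2, 4)       # distances from the new sample to each neighbor

K <- length(d)

# Inverse-distance weights, scaled to sum to K (an assumption) so that
# (1 / K) * sum(w * ...) is a weighted average of the neighbor terms
w <- (1 / d) / sum(1 / d) * K

# Each neighbor's term: its outcome, shifted by how the model's prediction
# for the new sample differs from its prediction for the neighbor
neighbor_terms <- t_obs + (y_hat - t_hat)

adjusted <- sum(w * neighbor_terms) / K
adjusted
```

Closer neighbors receive larger weights, so the adjusted value is pulled most strongly toward the term from the nearest neighbor.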

Note that cubist_rules() does not require that categorical predictors be
converted to numeric indicator values. However, parsnip::fit() will always
create dummy variables, so parsnip::fit_xy() is a better choice when the
categorical predictors should be kept in their original format. When using
the tune package, a recipe for pre-processing gives more control over how
such predictors are encoded, since recipes do not automatically create
dummy variables.
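
As a minimal sketch of the fit_xy() route, assuming the rules and Cubist packages are installed: the iris data, the choice of outcome, and the object names are only illustrative and are not taken from the original text.

```r
library(parsnip)
library(rules)   # provides the Cubist engine for cubist_rules()

# Illustrative data: one factor predictor (Species) and three numeric ones
predictors <- iris[, c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")]
outcome    <- iris$Sepal.Length

spec <- set_engine(cubist_rules(committees = 2), "Cubist")

# fit_xy() takes the predictors and outcome directly rather than a formula,
# so Species is supplied as a factor column
cubist_fit <- fit_xy(spec, x = predictors, y = outcome)

# Predictions come back as a tibble with a .pred column
preds <- predict(cubist_fit, new_data = head(predictors))
```

With a formula interface the equivalent call would be fit(spec, Sepal.Length ~ ., data = iris).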

The only available engine is "Cubist".
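
For reference, a model specification that sets the engine explicitly might look like the sketch below. It assumes the rules package is installed, and the parameter values are arbitrary placeholders rather than recommendations.

```r
library(parsnip)
library(rules)

# committees: number of ensemble members
# neighbors:  number of training set neighbors used for the post-hoc adjustment
# max_rules:  upper limit on the number of rules
spec <- cubist_rules(committees = 5, neighbors = 3, max_rules = 20)
spec <- set_engine(spec, "Cubist")

spec
```

Printing spec shows the model type, the main arguments, and the computational engine.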