get_pdp_predictions: get predictions compatible with the partial dependence plotting method

Description

Alluvial plots can display higher-dimensional data on a plane and therefore lend themselves to plotting the response of a statistical model to changes in the input data across multiple dimensions. The practical limit is 4 dimensions, whereas conventional partial dependence plots are limited to 2.

Briefly, the 4 variables with the highest feature importance for a given model are selected, and for each of them 5 values spread over the variable's range are chosen. A grid of all possible combinations of these values is then created. All non-plotted variables are set to the values found in the first row of the training data set, and model predictions are generated for this artificial data space. This process is repeated for each row in the training data set, and the model response is averaged over all rows at the end. Each of the possible combinations is plotted as a flow, coloured by the bin corresponding to the average model response generated by that particular combination.
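The procedure described above can be sketched in base R. This is a minimal illustration and not the package's implementation: `pdp_sketch` is a hypothetical helper, it assumes all selected variables are numeric, and it assumes the grid values are evenly spaced between each variable's minimum and maximum (the actual choice of values may differ).

```r
# Hypothetical helper, NOT part of the package: averages model predictions
# over a grid of values for the selected variables.
pdp_sketch <- function(df, vars, m, bins = 5, f_predict = predict) {
  # Assumption: `bins` evenly spaced values across each variable's range
  values <- lapply(df[vars], function(x) seq(min(x), max(x), length.out = bins))

  # All possible combinations of the selected values
  grid <- expand.grid(values)
  others <- setdiff(names(df), vars)

  # One column of grid predictions per training row
  preds <- sapply(seq_len(nrow(df)), function(i) {
    # Hold all non-plotted variables at the values found in row i
    fixed <- df[rep(i, nrow(grid)), others, drop = FALSE]
    newdata <- cbind(grid, fixed)
    f_predict(m, newdata = newdata)
  })

  # Average the model response over all training rows
  cbind(grid, pred = rowMeans(preds))
}

# Usage with a simple linear model on the built-in mtcars data
m_lm <- lm(mpg ~ wt + hp + disp, data = mtcars)
res <- pdp_sketch(mtcars, vars = c("wt", "hp"), m = m_lm, bins = 5)
head(res)
```

With 2 variables and 5 values each, the grid has 25 combinations; with 4 variables it would have 625, which is why larger values of `degree` and `bins` quickly produce too many flows.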

Usage

get_pdp_predictions(df, imp, m, degree = 4, bins = 5, .f_predict = predict)

Arguments

df

dataframe, training data

imp

dataframe, with no more than two columns: one numeric column containing importance measures and one character or factor column containing the corresponding variable names as found in the training data.

m

model object

degree

integer, number of most important variables to select. For plotting, more than 4 will result in too many flows and the alluvial plot will not be very readable. Default: 4

bins

integer, number of bins for numeric variables; increasing this number might result in too many flows. Default: 5

.f_predict

corresponding model predict() function. Needs to accept `m` as the first parameter and use the `newdata` parameter. Supply a wrapper for predict functions with x-y syntax.
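As an illustration of the `.f_predict` requirement, the sketch below wraps a predict function with x-y syntax so that it accepts the model first and a `newdata` dataframe. The toy model, `predict_xy` and `f_predict_wrapper` are hypothetical names used only for this example; a real x-y interface will have its own argument names.

```r
# Toy model with an x-y interface (hypothetical): fitted from a matrix
x <- as.matrix(mtcars[, c("wt", "hp")])
fit <- lm.fit(cbind(1, x), mtcars$mpg)

# Its predict function expects a matrix `newx`, not a `newdata` dataframe
predict_xy <- function(fit, newx) {
  drop(cbind(1, newx) %*% fit$coefficients)
}

# Wrapper: model first, then `newdata`; converts the dataframe to the
# matrix layout the x-y predict function expects
f_predict_wrapper <- function(m, newdata) {
  newx <- as.matrix(newdata[, c("wt", "hp"), drop = FALSE])
  predict_xy(m, newx)
}

p <- f_predict_wrapper(fit, newdata = mtcars)
```

A wrapper of this shape could then be passed as `.f_predict = f_predict_wrapper`.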

Value

vector, predictions

Details

see https://christophm.github.io/interpretable-ml-book/pdp.html

Examples

# NOT RUN {
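 # drop the id column so it is not used as a predictor
 df <- mtcars2[, !names(mtcars2) %in% 'ids']
 # fit a random forest predicting disp from all other variables
 m <- randomForest::randomForest(disp ~ ., df)
 imp <- m$importance

 # predictions over the grid built from the 3 most important variables
 pred <- get_pdp_predictions(df, imp, m, degree = 3, bins = 5)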

# }