light_profile: Partial Dependence and other Profiles

Description

Calculates different types of profiles across covariable values. By default, partial dependence profiles [1] are calculated. Other options are profiles of ALE (accumulated local effects, see [2]), response, predicted values ("M plots" or "marginal plots", see [2]) and residuals. The results are aggregated either by (weighted) means or by (weighted) quartiles. Note that ALE profiles are calibrated by (weighted) average predictions. In contrast to the suggestions in [2], ALE profiles of factors are calculated in the order of the factor levels; the levels are not reordered based on similarity of other variables.
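To make the default profile type concrete, the following base R sketch computes a partial dependence profile by hand: the variable of interest is overwritten with each evaluation value in turn and the predictions are averaged. The helper partial_dependence is hypothetical and only illustrates the idea; it is not the package's implementation and ignores case weights, by groups and row sampling.

```r
# Hand-rolled partial dependence profile (illustration only)
fit <- lm(Sepal.Length ~ ., data = iris)

# Hypothetical helper: average prediction with v fixed at each grid value
partial_dependence <- function(model, data, v, grid) {
  vapply(grid, function(g) {
    data_g <- data
    # Overwrite v in every row, keeping the factor levels if v is a factor
    data_g[[v]] <- if (is.factor(data[[v]])) {
      factor(rep(g, nrow(data)), levels = levels(data[[v]]))
    } else {
      rep(g, nrow(data))
    }
    mean(predict(model, newdata = data_g))
  }, numeric(1))
}

pd_species <- partial_dependence(fit, iris, "Species", levels(iris$Species))
pd_width <- partial_dependence(fit, iris, "Petal.Width", c(0.5, 1.5, 2.5))

pd_species
pd_width
```

Because the model in this sketch is additive and linear, differences between profile values equal the corresponding model coefficients.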

Usage

light_profile(x, ...)
# S3 method for default
light_profile(x, ...)
# S3 method for flashlight
light_profile(x, v = NULL, data = NULL,
  by = x$by, type = c("partial dependence", "ale", "predicted",
  "response", "residual"), stats = c("mean", "quartiles"),
  breaks = NULL, n_bins = 11, cut_type = c("equal", "quantile"),
  use_linkinv = TRUE, value_name = "value", q1_name = "q1",
  q3_name = "q3", label_name = "label", type_name = "type",
  counts_name = "counts", counts = TRUE, counts_weighted = FALSE,
  v_labels = TRUE, pred = NULL, pd_evaluate_at = NULL,
  pd_grid = NULL, pd_indices = NULL, pd_n_max = 1000,
  pd_seed = NULL, pd_center = FALSE, ale_two_sided = FALSE, ...)
# S3 method for multiflashlight
light_profile(x, v = NULL, data = NULL,
  breaks = NULL, n_bins = 11, cut_type = c("equal", "quantile"),
  pd_evaluate_at = NULL, pd_grid = NULL, ...)

Arguments

x

An object of class flashlight or multiflashlight.

...

Further arguments passed to cut3 or formatC when forming the cut breaks of the v variable. Not relevant for partial dependence and ALE profiles.

v

The variable to be profiled.

data

An optional data.frame.

by

An optional vector of column names used to additionally group the results.

type

Type of the profile: Either "partial dependence", "ale", "predicted", "response", or "residual".

stats

Statistic to calculate: "mean" or "quartiles". For ALE profiles, only "mean" makes sense.

breaks

Cut breaks for a numeric v.

n_bins

Maximum number of unique values to evaluate for numeric v. Only used if neither pd_grid nor pd_evaluate_at is specified.

cut_type

For the default "equal", bins of equal width are created for v by pretty. Choose "quantile" to create quantile bins.

use_linkinv

Should the retransformation function be applied? Default is TRUE.

value_name

Column name in resulting data containing the profile value. Defaults to "value".

q1_name

Name of the resulting column with first quartile values. Only relevant for stats "quartiles".

q3_name

Name of the resulting column with third quartile values. Only relevant for stats "quartiles".

label_name

Column name in resulting data containing the label of the flashlight. Defaults to "label".

type_name

Column name in the resulting data with the plot type.

counts_name

Name of the column containing counts if counts is TRUE.

counts

Should counts be added?

counts_weighted

If counts is TRUE: Should counts be weighted by the case weights? If TRUE, the sum of w is returned by group.

v_labels

If FALSE, return group centers of v instead of labels. Only relevant for types "response", "predicted" or "residual" and if v is being binned. Useful, for example, if different flashlights use different data sets and bin labels would not match.

pred

Optional vector with predictions (after application of inverse link). Can be used to avoid recalculating predictions if the function is called repeatedly for different v and predictions are computationally expensive to make. Only relevant for type = "predicted" and type = "ale".

pd_evaluate_at

Vector with values of v used to evaluate the profile. Only relevant for type = "partial dependence" and "ale".

pd_grid

A data.frame with grid values, e.g. generated by expand.grid. Only used for type = "partial dependence".

pd_indices

A vector of row numbers to consider in calculating partial dependence profiles. Only used for type = "partial dependence" and "ale".

pd_n_max

Maximum number of ICE profiles to calculate (will be randomly picked from data). Only used for type = "partial dependence" and "ale".

pd_seed

Integer random seed used to select ICE profiles. Only used for type = "partial dependence" and "ale".

pd_center

Should ICE curves be centered within by subsets before calculating partial dependence profiles? This option is interesting together with stats = "quartiles" in order to visualize interaction strength.

ale_two_sided

If TRUE, v is continuous, and breaks are passed or calculated, then two-sided derivatives are calculated for ALE instead of left derivatives. More specifically: usually, local effects at value x are calculated using points between x-e and x. Set ale_two_sided = TRUE to use points between x-e/2 and x+e/2.
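The difference between the two windows can be sketched in base R. The helper local_effect below is hypothetical and not part of the package; it assumes half-open windows (lower, upper] and reports the mean prediction difference per unit of v, which may differ from what the package does internally.

```r
# Left versus two-sided local effects (illustration only)
fit <- lm(Sepal.Length ~ ., data = iris)

# Hypothetical helper: average local effect of v at x0 with window width e
local_effect <- function(model, data, v, x0, e, two_sided = FALSE) {
  # Left window: (x0 - e, x0]; two-sided window: (x0 - e/2, x0 + e/2]
  lower <- if (two_sided) x0 - e / 2 else x0 - e
  upper <- lower + e
  in_window <- data[[v]] > lower & data[[v]] <= upper
  window <- data[in_window, , drop = FALSE]
  at_lower <- at_upper <- window
  at_lower[[v]] <- lower
  at_upper[[v]] <- upper
  # Mean prediction difference across the window, per unit of v
  mean(predict(model, newdata = at_upper) - predict(model, newdata = at_lower)) / e
}

left <- local_effect(fit, iris, "Petal.Width", x0 = 1.5, e = 0.5)
centered <- local_effect(fit, iris, "Petal.Width", x0 = 1.5, e = 0.5,
                         two_sided = TRUE)
c(left = left, centered = centered)
```

For the linear model used here both windows recover the Petal.Width coefficient. With a non-linear model the two choices would generally select different observations and give different local effects.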

Value

An object of classes light_profile, light (and a list) with the following elements.

data A tibble containing results. Can be used to build fully customized visualizations. Its column names are specified by all other items in this list.
by Names of group by variable.
v The variable(s) evaluated.
type Same as input type. For information only.
stats Same as input stats.
value_name Same as input value_name.
q1_name Same as input q1_name.
q3_name Same as input q3_name.
label_name Same as input label_name.
type_name Same as input type_name.
counts_name Same as input counts_name.

Methods (by class)

default: Default method not implemented yet.
flashlight: Profiles for flashlight.
multiflashlight: Profiles for multiflashlight.

Details

For numeric covariables v with more than n_bins distinct values, the values are binned. Alternatively, breaks can be provided to specify the binning. For partial dependence profiles (and partly also ALE profiles), this behaviour can be overwritten either by providing a vector of evaluation points (pd_evaluate_at) or an evaluation pd_grid. By the latter we mean a data frame with column name(s) with a (multi-)variate evaluation grid. For partial dependence, ALE and prediction profiles, "model", "predict_function", "linkinv" and "data" are required. For response profiles, only "y", "linkinv" and "data" are needed. "data" can be passed on the fly for both types.
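The two binning strategies selected by cut_type can be approximated with base R as follows. This is a sketch only: the package forms its breaks through its own cut3 helper, so the exact breaks and labels may differ from what pretty, quantile and cut produce here.

```r
# Equal-width versus quantile binning of a numeric covariable (illustration only)
v <- iris$Petal.Width
n_bins <- 3

# Binning only applies if v has more than n_bins distinct values
needs_binning <- length(unique(v)) > n_bins

# "equal": breaks of equal width chosen by pretty()
breaks_equal <- pretty(v, n = n_bins)

# "quantile": breaks at equally spaced quantiles, duplicates removed
breaks_quantile <- unique(
  quantile(v, probs = seq(0, 1, length.out = n_bins + 1), names = FALSE)
)

bins_equal <- cut(v, breaks = breaks_equal, include.lowest = TRUE)
bins_quantile <- cut(v, breaks = breaks_quantile, include.lowest = TRUE)

table(bins_equal)
table(bins_quantile)
```

Equal-width bins can hold very different numbers of observations, whereas quantile bins hold roughly equal numbers but vary in width.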

References

[1] Friedman J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29:1189–1232.

[2] Apley D. W. (2016). Visualizing the effects of predictor variables in black box supervised learning models. ArXiv <arXiv:1612.08468>.

Examples

# NOT RUN {
library(flashlight)

fit_full <- lm(Sepal.Length ~ ., data = iris)
fit_part <- lm(Sepal.Length ~ Petal.Length, data = iris)
mod_full <- flashlight(model = fit_full, label = "full", data = iris, y = "Sepal.Length")
mod_part <- flashlight(model = fit_part, label = "part", data = iris, y = "Sepal.Length")
mods <- multiflashlight(list(mod_full, mod_part))

light_profile(mod_full, v = "Species")
light_profile(mod_full, v = "Species", counts = FALSE)
light_profile(mod_full, v = "Species", type = "response")
light_profile(mod_full, v = "Species", type = "ale")
light_profile(mod_full, v = "Species", stats = "quartiles")

light_profile(mod_full, v = "Petal.Width")
light_profile(mod_full, v = "Petal.Width", type = "residual")
light_profile(mod_full, v = "Petal.Width", type = "residual", v_labels = FALSE)
light_profile(mod_full, v = "Petal.Width", type = "residual", dig.lab = 1)
light_profile(mod_full, v = "Petal.Width", stats = "quartiles")
light_profile(mod_full, v = "Petal.Width", n_bins = 3)
light_profile(mod_full, v = "Petal.Width", pd_evaluate_at = 2:4)
light_profile(mod_full, pd_grid = data.frame(Petal.Width = 2:4))

light_profile(mod_full, v = "Petal.Width", by = "Species")

light_profile(mods, v = "Petal.Width")
light_profile(mods, v = "Petal.Width", by = "Species")
light_profile(mods, v = "Petal.Width", by = "Species", type = "predicted")
light_profile(mods, v = "Petal.Width", by = "Species",
  type = "predicted", stats = "quartiles")

light_profile(mods, v = "Petal.Width", by = "Species", stats = "quartiles",
  value_name = "pd", q1_name = "p25", q3_name = "p75", label_name = "model",
  type_name = "visualization", counts_name = "n")
# }
