ggaverage: Get marginal effects from model terms

Description

ggpredict() computes predicted (fitted) values for the response, at the margin of specific values from certain model terms, where additional model terms indicate the grouping structure. ggeffect() computes marginal effects by internally calling Effect. ggaverage() computes the average predicted values. The result is returned as tidy data frame.

Usage

ggaverage(model, terms, ci.lvl = 0.95, type = c("fe", "re", "fe.zi",
  "re.zi"), typical = "mean", ppd = FALSE, x.as.factor = FALSE,
  condition = NULL, ...)
ggeffect(model, terms, ci.lvl = 0.95, x.as.factor = FALSE, ...)
ggpredict(model, terms, ci.lvl = 0.95, type = c("fe", "re", "fe.zi",
  "re.zi", "surv", "cumhaz", "debug"), typical = "mean",
  condition = NULL, ppd = FALSE, x.as.factor = FALSE,
  full.data = FALSE, vcov.fun = NULL, vcov.type = NULL,
  vcov.args = NULL, ...)

Arguments

model

A fitted model object, or a list of model objects. Any model that supports common methods like predict(), family() or model.frame() should work. For ggeffect(), any model that is supported by the effects-package should work.

terms

Character vector (or a formula) with the names of those terms from model, for which marginal effects should be displayed. At least one term is required to calculate effects for certain terms, maximum length is three terms, where the second and third term indicate the groups, i.e. predictions of first term are grouped by the levels of the second (and third) term. If terms is missing or NULL, marginal effects for each model term are calculated. It is also possible to define specific values for terms, at which marginal effects should be calculated (see 'Details'). All remaining covariates that are not specified in terms are held constant (if full.data = FALSE, the default) or are set to the values from the observations (i.e. are kept as they happen to be; see 'Details'). See also argument condition and typical.

ci.lvl

Numeric, the level of the confidence intervals. For ggpredict(), use ci.lvl = NA, if confidence intervals should not be calculated (for instance, due to computation time).

type

Character, only applies for mixed effects models and/or models with zero-inflation.

"fe": Predicted values are conditioned on the fixed effects or conditional model only. For instance, for models fitted with zeroinfl from pscl, this would return the predicted mean from the count component (without zero-inflation). For models of class glmmTMB, this type calls predict(..., type = "link").
"re": Predicted values are conditioned on the random effects. This only applies to mixed models, and type = "re" does not condition on the zero-inflation component of the model, nor on different group levels. type = "re" uses the reference level in the random effects groups (except for glmmTMB models, see below), and prediction intervals also consider the uncertainty in the variance parameters. For models from glmmTMB, this type calls predict(..., type = "link"). Note: For glmmTMB models, the random effect variances only affect the confidence intervals of predictions, not the predicted values themselves (because this is currently not implemented in glmmTMB), i.e. predicted values are on population-level. To get predicted values for each level of the random effects groups, add the name of the related random effect term to the terms-argument (for more details, see this vignette).
"fe.zi": Predicted values are conditioned on the fixed effects and the zero-inflation component. For instance, for models fitted with zeroinfl from pscl, this would return the predicted response and for glmmTMB, this would return the expected value mu*(1-p) without conditioning on random effects. For models of class glmmTMB, this type calls predict(..., type = "response").
"re.zi": Predicted values are conditioned on the random effects and the zero-inflation component. For models fitted with glmmTMB, this would return the expected value mu*(1-p), conditioned on random effects. Prediction intervals also consider the uncertainty in the variance parameters. For models from glmmTMB, this type calls simulate(), because conditioning on random effects is not yet implemented in predict.glmmTMB().
"surv" and "cumhaz": Applies only to coxph-objects from the survial-package and calculates the survival probability or the cumulative hazard of an event.
"debug": Only used internally.

typical

Character vector, naming the function to be applied to the covariates over which the effect is "averaged". The default is "mean". See typical_value for options.

ppd

Logical, if TRUE, predictions for Stan-models are based on the posterior predictive distribution (posterior_predict). If FALSE (the default), predictions are based on posterior draws of the linear predictor (posterior_linpred).

x.as.factor

Logical, if TRUE, preserves factor-class as x-column in the returned data frame. By default, the x-column is always numeric.

condition

Named character vector, which indicates covariates that should be held constant at specific values. Unlike typical, which applies a function to the covariates to determine the value that is used to hold these covariates constant, condition can be used to define exact values, for instance condition = c(covariate1 = 20, covariate2 = 5). See 'Examples'.

...

For ggpredict(), further arguments passed down to predict(), and for ggeffect(), further arguments passed down to Effect. If model is of class glmmTMB, ... may also be used to set the number of simulation for bootstrapped confidence intervals, e.g. nsim = 500.

full.data

Logical, if TRUE, the returned data frame contains predictions for all observations. This data frame also has columns for residuals and observed values, and can also be used to plot a scatter plot of all data points or fitted values. If FALSE (the default), the returned data frame only contains predictions for all combinations of unique values of the model's predictors. Residuals and observed values are set to NA. Usually, this argument is only used internally by ggaverage().

vcov.fun

String, indicating the name of the vcov*()-function from the sandwich-package, e.g. vcov.fun = "vcovCL", which is used to compute robust standard errors for predictions. If NULL, standard errors (and confidence intervals) for predictions are based on the standard errors as returned by the predict()-function. Note that probably not all model objects that work with ggpredict() are also supported by the sandwich-package.

vcov.type

Character vector, specifying the estimation type for the robust covariance matrix estimation (see vcovHC for details).

vcov.args

List of named vectors, used as additional arguments that are passed down to vcov.fun.

Value

A data frame (with ggeffects class attribute) with consistent data columns:

x: the values of the first term in terms, used as x-position in plots.
predicted: the predicted values of the response, used as y-position in plots.
std.error: the standard error of the predictions.
conf.low: the lower bound of the confidence interval for the predicted values.
conf.high: the upper bound of the confidence interval for the predicted values.
observed: if full.data = TRUE, this columns contains the observed values (the response vector).
residuals: if full.data = TRUE, this columns contains residuals.
group: the grouping level from the second term in terms, used as grouping-aesthetics in plots.
facet: the grouping level from the third term in terms, used to indicate facets in plots.

For proportional odds logistic regression (see polr) resp. cumulative link models (e.g., see clm), an additional column response.level is returned, which indicates the grouping of predictions based on the level of the model's response.

Details

Supported Models

Currently supported model-objects are: lm, glm, glm.nb, lme, lmer, glmer, glmer.nb, nlmer, glmmTMB, gam, vgam, gamm, gamm4, multinom, betareg, gls, gee, plm, lrm, polr, clm, clm2, hurdle, zeroinfl, svyglm, svyglm.nb, truncreg, coxph, stanreg, brmsfit, lmRob, glmRob, brglm and rlm. Other models not listed here are passed to a generic predict-function and might work as well, or maybe with ggeffect(), which effectively does the same as ggpredict(). The main difference is that ggpredict() calls predict(), while ggeffect() calls Effect to compute marginal effects.

Difference between ggpredict() and ggeffect()

ggpredict() and ggeffect() differ in how factors are held constant: ggpredict() uses the reference level, while ggeffect() computes a kind of "average" value, which represents the proportions of each factor's category.

Marginal Effects at Specific Values

Specific values of model terms can be specified via the terms-argument. Indicating levels in square brackets allows for selecting only specific groups or values resp. value ranges. Term name and levels in brackets must be separated by a whitespace character, e.g. terms = c("age", "education [1,3]"). Numeric ranges, separated with colon, are also allowed: terms = c("education", "age [30:60]").

The terms-argument also supports the same shortcuts as the values-argument in rprs_values(). So terms = "age [meansd]" would return predictions for the values one standard deviation below the mean age, the mean age and one SD above the mean age. terms = "age [quart2]" would calculate predictions at the value of the lower, median and upper quartile of age.

Furthermore, it is possible to specify a function name. Values for predictions will then be transformed, e.g. terms = "income [exp]". This is useful when model predictors were transformed for fitting the model and should be back-transformed to the original scale for predictions.

You can take a random sample of any size with sample=n, e.g terms = "income [sample=8]", which will sample eight values from all possible values of the variable income. This option is especially useful for plotting marginal effects at certain levels of random effects group levels, where the group factor has many levels that can be completely plotted. For more details, see this vignette.

Finally, numeric vectors for which no specific values are given, a "pretty range" is calculated (see pretty_range), to avoid memory allocation problems for vectors with many unique values. If a numeric vector is specified as second or third term (i.e. if this vector represents a grouping structure), representative values (see rprs_values) are chosen. If all values for a numeric vector should be used to compute predictions, you may use e.g. terms = "age [all]". See also package vignettes.

To create a pretty range that should be smaller or larger than the default range (i.e. if no specific values would be given), use the n-tag, e.g. terms="age [n=5]" or terms="age [n=12]". Larger values for n return a larger range of predicted values.

Holding covariates at constant values

For ggpredict(), if full.data = FALSE, expand.grid() is called on all unique combinations of model.frame(model)[, terms] and used as newdata-argument for predict(). In this case, all remaining covariates that are not specified in terms are held constant: Numeric values are set to the mean (unless changed with the condition or typical-argument), factors are set to their reference level (may also be changed with condition) and character vectors to their mode (most common element).

ggaverage() computes the average predicted values, by calling ggpredict() with full.data = TRUE, where argument newdata = model.frame(model) is used in predict(). Hence, predictions are made on the model data. In this case, all remaining covariates that are not specified in terms are not held constant, but vary between observations (and are kept as they happen to be). The predicted values are then averaged for each group (if any). Thus, ggpredict() can be considered as calculating marginal effects at the mean, while ggaverage() computes average marginal effects.

ggeffect(), by default, sets remaining numeric covariates to their mean value, while for factors, a kind of "average" value, which represents the proportions of each factor's category, is used.

Bayesian Regression Models

ggpredict() also works with Stan-models from the rstanarm or brms-package. The predicted values are the median value of all drawn posterior samples. The confidence intervals for Stan-models are actually high density intervals, computed by hdi, unless ppd = TRUE. If ppd = TRUE, predictions are based on draws of the posterior predictive distribution and the uncertainty interval is computed using predictive_interval. By default (i.e. ppd = FALSE), the predictions are based on posterior_linpred and hence have some limitations: the uncertainty of the error term is not taken into account. The recommendation is to use the posterior predictive distribution (posterior_predict). Note that for binomial models, the newdata-argument used in posterior_predict() must also contain the vector with the number of trials. In this case, a dummy-vector is used, where all values for the response are set to 1.

Zero-Inflated Mixed Models with glmmTMB

If model is of class glmmTMB, bootstrapped confidence intervals are calculated for predictions conditioned on the zero-inflated part of the model, when the uncertainty in the random-effect paramters is ignored (i.e. when type = "fe.zi", see Brooks et al. 2017, pp.391-392 for details). type = "fe.zi" returns predicted values at population mode, not mean. If predictions are also conditioned on random effects (i.e. type = "re.zi"), predicted values are based on simulations (see Brooks et al. 2017, pp.392-393 for details).

References

Brooks ME, Kristensen K, Benthem KJ van, Magnusson A, Berg CW, Nielsen A, et al. glmmTMB Balances Speed and Flexibility Among Packages for Zero-inflated Generalized Linear Mixed Modeling. The R Journal. 2017;9: 378<U+2013>400.

Examples

Run this code

# NOT RUN {
data(efc)
fit <- lm(barthtot ~ c12hour + neg_c_7 + c161sex + c172code, data = efc)

ggpredict(fit, terms = "c12hour")
ggpredict(fit, terms = "c12hour", full.data = TRUE)
ggpredict(fit, terms = c("c12hour", "c172code"))
ggpredict(fit, terms = c("c12hour", "c172code", "c161sex"))

# specified as formula
ggpredict(fit, terms = ~ c12hour + c172code + c161sex)

# only range of 40 to 60 for variable 'c12hour'
ggpredict(fit, terms = "c12hour [40:60]")

# using "summary()" shows that covariate "neg_c_7" is held
# constant at a value of 11.84 (its mean value). To use a
# different value, use "condition"
ggpredict(fit, terms = "c12hour [40:60]", condition = c(neg_c_7 = 20))

# to plot ggeffects-objects, you can use the 'plot()'-function.
# the following examples show how to build your ggplot by hand.

# plot predicted values, remaining covariates held constant
library(ggplot2)
mydf <- ggpredict(fit, terms = "c12hour")
ggplot(mydf, aes(x, predicted)) +
  geom_line() +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .1)

# with "full.data = TRUE", remaining covariates vary between
# observations, so fitted values can be plotted
mydf <- ggpredict(fit, terms = "c12hour", full.data = TRUE)
ggplot(mydf, aes(x, predicted)) + geom_point()

# you can add a smoothing-geom to show the linear trend of fitted values
ggplot(mydf, aes(x, predicted)) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point()

# three variables, so we can use facets and groups
mydf <- ggpredict(
  fit,
  terms = c("c12hour", "c161sex", "c172code"),
  full.data = TRUE
)
ggplot(mydf, aes(x = x, y = predicted, colour = group)) +
  stat_smooth(method = "lm", se = FALSE) +
  facet_wrap(~facet, ncol = 2)

# average marginal effects
mydf <- ggaverage(fit, terms = c("c12hour", "c172code"))
ggplot(mydf, aes(x = x, y = predicted, colour = group)) +
  stat_smooth(method = "lm", se = FALSE)

# select specific levels for grouping terms
mydf <- ggpredict(fit, terms = c("c12hour", "c172code [1,3]", "c161sex"))
ggplot(mydf, aes(x = x, y = predicted, colour = group)) +
  stat_smooth(method = "lm", se = FALSE) +
  facet_wrap(~facet) +
  labs(
    y = get_y_title(mydf),
    x = get_x_title(mydf),
    colour = get_legend_title(mydf)
  )

# level indication also works for factors with non-numeric levels
# and in combination with numeric levels for other variables
library(sjlabelled)
data(efc)
efc$c172code <- as_label(efc$c172code)
fit <- lm(barthtot ~ c12hour + neg_c_7 + c161sex + c172code, data = efc)
ggpredict(fit, terms = c("c12hour",
  "c172code [low level of education, high level of education]",
  "c161sex [1]"))

# use categorical value on x-axis, use axis-labels, add error bars
dat <- ggpredict(fit, terms = c("c172code", "c161sex"))
ggplot(dat, aes(x, predicted, colour = group)) +
  geom_point(position = position_dodge(.1)) +
  geom_errorbar(
    aes(ymin = conf.low, ymax = conf.high),
    position = position_dodge(.1)
  ) +
  scale_x_continuous(breaks = 1:3, labels = get_x_labels(dat))

# 3-way-interaction with 2 continuous variables
data(efc)
# make categorical
efc$c161sex <- as_factor(efc$c161sex)
fit <- lm(neg_c_7 ~ c12hour * barthtot * c161sex, data = efc)
# select only levels 30, 50 and 70 from continuous variable Barthel-Index
dat <- ggpredict(fit, terms = c("c12hour", "barthtot [30,50,70]", "c161sex"))
ggplot(dat, aes(x = x, y = predicted, colour = group)) +
  stat_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
  facet_wrap(~facet) +
  labs(
    colour = get_legend_title(dat),
    x = get_x_title(dat),
    y = get_y_title(dat),
    title = get_title(dat)
  )

# or with ggeffects' plot-method
# }
# NOT RUN {
plot(dat, ci = FALSE)
# }
# NOT RUN {
# use factor levels as x-column in returned data frame
data(efc)
efc$c161sex <- as_label(efc$c161sex)
fit <- lm(neg_c_7 ~ c12hour + c161sex, data = efc)
ggpredict(fit, terms = "c161sex", x.as.factor = TRUE)

# marginal effects for polynomial terms
data(efc)
fit <- glm(
  tot_sc_e ~ c12hour + e42dep + e17age + I(e17age^2) + I(e17age^3),
  data = efc,
  family = poisson()
)
ggeffect(fit, terms = "e17age")

# }

Run the code above in your browser using DataLab