lm_tidiers: Tidying methods for a linear model

Description

These methods tidy the coefficients of a linear model into a summary, augment the original data with information on the fitted values and residuals, and construct a one-row glance of the model's statistics.

Usage

# S3 method for lm
tidy(x, conf.int = FALSE, conf.level = 0.95,
  exponentiate = FALSE, quick = FALSE, ...)
# S3 method for summary.lm
tidy(x, ...)
# S3 method for lm
augment(x, data = stats::model.frame(x), newdata, type.predict,
  type.residuals, ...)
# S3 method for lm
glance(x, ...)
# S3 method for summary.lm
glance(x, ...)

Arguments

x

lm object

conf.int

whether to include a confidence interval

conf.level

confidence level of the interval, used only if conf.int=TRUE

exponentiate

whether to exponentiate the coefficient estimates and confidence intervals (typical for logistic regression)

quick

whether to compute a smaller and faster version, containing only the term and estimate columns.

...

extra arguments (not used)

data

Original data; defaults to extracting it from the model

newdata

If provided, performs predictions on the new data

type.predict

Type of prediction to compute for a GLM; passed on to predict.glm

type.residuals

Type of residuals to compute for a GLM; passed on to residuals.glm

Value

All tidying methods return a data.frame without rownames. The structure depends on the method chosen.

tidy.lm returns one row for each coefficient, with five columns:

term

The term in the linear model being estimated and tested

estimate

The estimated coefficient

std.error

The standard error from the linear model

statistic

t-statistic

p.value

two-sided p-value

If the linear model is an "mlm" object (multiple linear model), there is an additional column:

response

Which response column the coefficients correspond to (typically Y1, Y2, etc.)

If conf.int=TRUE, it also includes columns for conf.low and conf.high, computed with confint.
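The columns listed above can be reproduced with base R alone, which may help when checking what each one means. The following is a minimal sketch, assuming the columns correspond to the coefficient matrix from summary() and the bounds from confint(); it is not the package's implementation, and the name coef_table is purely illustrative.

```r
# Base R sketch of a tidy-like coefficient table (illustrative only).
mod <- lm(mpg ~ wt + qsec, data = mtcars)

co <- summary(mod)$coefficients    # columns: Estimate, Std. Error, t value, Pr(>|t|)
ci <- confint(mod, level = 0.95)   # one row per coefficient: lower and upper bound

coef_table <- data.frame(
  term      = rownames(co),
  estimate  = unname(co[, "Estimate"]),
  std.error = unname(co[, "Std. Error"]),
  statistic = unname(co[, "t value"]),
  p.value   = unname(co[, "Pr(>|t|)"]),
  conf.low  = unname(ci[, 1]),
  conf.high = unname(ci[, 2])
)
rownames(coef_table) <- NULL       # the result carries no rownames
coef_table
```

Changing the level argument of confint() plays the role that conf.level plays in tidy.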

When newdata is not supplied, augment.lm returns one row for each observation, with seven columns added to the original data:

.hat

Diagonal of the hat matrix

.sigma

Estimate of residual standard deviation when corresponding observation is dropped from model

.cooksd

Cook's distance, as computed by cooks.distance

.fitted

Fitted values of model

.se.fit

Standard errors of fitted values

.resid

Residuals

.std.resid

Standardised residuals

(Some unusual "lm" objects, such as "rlm" from MASS, may omit .cooksd and .std.resid. "gam" from mgcv omits .sigma)
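Each of the seven columns has a close counterpart in base R's stats package. The sketch below assembles them by hand for an ordinary lm fit; it assumes the columns match these base R quantities, and the name aug is hypothetical.

```r
# Base R sketch of the per-observation columns (illustrative only).
mod  <- lm(mpg ~ wt + qsec, data = mtcars)
pred <- predict(mod, se.fit = TRUE)      # fitted values and their standard errors

aug <- cbind(
  mtcars,
  .hat       = hatvalues(mod),           # diagonal of the hat matrix
  .sigma     = influence(mod)$sigma,     # residual sd with each observation dropped
  .cooksd    = cooks.distance(mod),      # Cook's distance
  .fitted    = pred$fit,
  .se.fit    = pred$se.fit,
  .resid     = residuals(mod),
  .std.resid = rstandard(mod)            # standardised residuals
)
head(aug)
```

For a plain lm fit, rstandard() equals the residual divided by sigma * sqrt(1 - hat), where sigma is the overall residual standard deviation.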

When newdata is supplied, augment.lm returns one row for each observation, with three columns added to the new data:

.fitted

Fitted values of model

.se.fit

Standard errors of fitted values

.resid

Residuals of fitted values on the new data
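The new-data case can be sketched the same way with predict(). This assumes the residual on new data is the observed response minus the prediction, as the description above suggests; aug_new is an illustrative name.

```r
# Base R sketch of the three columns added when new data are supplied.
mod <- lm(mpg ~ wt + qsec, data = mtcars)

newdata <- head(mtcars, 6)
newdata$wt <- newdata$wt + 1             # shift a predictor to get new inputs

pred <- predict(mod, newdata = newdata, se.fit = TRUE)

aug_new <- cbind(
  newdata,
  .fitted = pred$fit,
  .se.fit = pred$se.fit,
  .resid  = newdata$mpg - pred$fit       # requires the response column in newdata
)
aug_new
```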

glance.lm returns a one-row data.frame with the columns

r.squared

The proportion of variance explained by the model

adj.r.squared

r.squared adjusted based on the degrees of freedom

sigma

The square root of the estimated residual variance

statistic

F-statistic

p.value

p-value from the F test, describing whether the full regression is significant

df

Degrees of freedom used by the coefficients

logLik

the data's log-likelihood under the model

AIC

the Akaike Information Criterion

BIC

the Bayesian Information Criterion

deviance

deviance of the model (for an lm fit, the residual sum of squares)

df.residual

residual degrees of freedom
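Most of these one-row statistics are available directly from summary() and the standard extractor functions. The sketch below builds a comparable row in base R; it leaves out the coefficient degrees-of-freedom column because its exact definition is not spelled out here, and the name stats_row is illustrative.

```r
# Base R sketch of a one-row model summary (illustrative only).
mod <- lm(mpg ~ wt + qsec, data = mtcars)
s   <- summary(mod)
f   <- s$fstatistic                      # named vector: value, numdf, dendf

stats_row <- data.frame(
  r.squared     = s$r.squared,
  adj.r.squared = s$adj.r.squared,
  sigma         = s$sigma,
  statistic     = unname(f["value"]),
  # upper-tail probability of the F statistic
  p.value       = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)),
  logLik        = as.numeric(logLik(mod)),
  AIC           = AIC(mod),
  BIC           = BIC(mod),
  deviance      = deviance(mod),         # residual sum of squares for lm
  df.residual   = df.residual(mod)
)
stats_row
```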

Details

If you have missing values in your model data, you may need to refit the model with na.action = na.exclude.

If conf.int=TRUE, the confidence interval is computed with the confint function.

While tidy is supported for "mlm" objects, augment and glance are not.

When the modeling was performed with na.action = "na.omit" (as is the typical default), rows with NA in the initial data are omitted entirely from the augmented data frame. When the modeling was performed with na.action = "na.exclude", one should provide the original data as a second argument, at which point the augmented data will contain those rows (typically with NAs in place of the new columns). If the original data is not provided to augment and na.action = "na.exclude", a warning is raised and the incomplete rows are dropped.
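The difference between the two na.action settings is visible in base R before any tidier is involved. A short sketch, using a copy of mtcars with two values set to NA for illustration:

```r
# How na.omit and na.exclude differ in the length of per-observation output.
d <- mtcars
d$wt[c(2, 5)] <- NA                      # introduce missing predictor values

fit_omit <- lm(mpg ~ wt + qsec, data = d, na.action = na.omit)
fit_excl <- lm(mpg ~ wt + qsec, data = d, na.action = na.exclude)

length(residuals(fit_omit))              # 30: incomplete rows are dropped
length(residuals(fit_excl))              # 32: padded back to the original rows
which(is.na(residuals(fit_excl)))        # positions 2 and 5 hold NA
```

Both fits have identical coefficients; only the alignment of residuals and fitted values with the original rows changes, which is why na.exclude lets the augmented data keep the incomplete rows.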

Code and documentation for augment.lm originated in the ggplot2 package, where it was called fortify.lm.

Examples

# NOT RUN {
library(broom)    # provides tidy(), augment() and glance()
library(ggplot2)
library(dplyr)

mod <- lm(mpg ~ wt + qsec, data = mtcars)

tidy(mod)
glance(mod)

# coefficient plot
d <- tidy(mod) %>% mutate(low = estimate - std.error,
                          high = estimate + std.error)
ggplot(d, aes(estimate, term, xmin = low, xmax = high, height = 0)) +
     geom_point() +
     geom_vline(xintercept = 0) +
     geom_errorbarh()

head(augment(mod))
head(augment(mod, mtcars))

# predict on new data
newdata <- mtcars %>% head(6) %>% mutate(wt = wt + 1)
augment(mod, newdata = newdata)

au <- augment(mod, data = mtcars)

plot(mod, which = 1)
qplot(.fitted, .resid, data = au) +
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE)
qplot(.fitted, .std.resid, data = au) +
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE)
qplot(.fitted, .std.resid, data = au,
  colour = factor(cyl))
qplot(mpg, .std.resid, data = au, colour = factor(cyl))

plot(mod, which = 2)
# qplot() no longer accepts a stat argument, so build the Q-Q plot with geom_qq()
ggplot(au, aes(sample = .std.resid)) +
    geom_qq() +
    geom_abline()

plot(mod, which = 3)
qplot(.fitted, sqrt(abs(.std.resid)), data = au) + geom_smooth(se = FALSE)

plot(mod, which = 4)
qplot(seq_along(.cooksd), .cooksd, data = au)

plot(mod, which = 5)
qplot(.hat, .std.resid, data = au) + geom_smooth(se = FALSE)
ggplot(au, aes(.hat, .std.resid)) +
  geom_vline(size = 2, colour = "white", xintercept = 0) +
  geom_hline(size = 2, colour = "white", yintercept = 0) +
  geom_point() + geom_smooth(se = FALSE)

qplot(.hat, .std.resid, data = au, size = .cooksd) +
  geom_smooth(se = FALSE, size = 0.5)

plot(mod, which = 6)
ggplot(au, aes(.hat, .cooksd)) +
  geom_vline(xintercept = 0, colour = NA) +
  geom_abline(slope = seq(0, 3, by = 0.5), colour = "white") +
  geom_smooth(se = FALSE) +
  geom_point()
qplot(.hat, .cooksd, size = .cooksd / .hat, data = au) + scale_size_area()

# column-wise models
a <- matrix(rnorm(20), nrow = 10)
b <- a + rnorm(length(a))
result <- lm(b ~ a)
tidy(result)
# }