lm2: Enhanced alternative to lm()

Description

Runs a linear regression with better defaults (robust SE), and richer & better formatted output than lm. For robust and clustered errors it relies on lm_robust. The output reports classical and robust errors, number of missing observations per variable, an effect size column (standardized regression coefficient), and a red.flag column per variable flagging the need to conduct specific diagnostics. It relies by default on HC3 for standard errors; lm_robust relies on HC2 (and Stata's 'reg y x, robust' on HC1), which can have inflated false-positive rates in smaller samples (Long & Ervin, 2000).

Value

An object of class c("lm2", "lm_robust", "lm"). This inherits from lm_robust and can be used with packages like

marginaleffects. The object contains all components of an lm_robust

object plus additional attributes:

statuser_table: A data frame with columns: term, estimate, SE.robust, SE.classical, t, df, p.value, B (standardized coefficient), and optionally SE.cluster when clustered standard errors are used.
classical_fit: The underlying lm object with classical standard errors.
na_counts: Integer vector of missing value counts per variable.
n_missing: Total number of observations excluded due to missing values.
has_clusters: Logical indicating whether clustered standard errors were used.

When printed, displays a formatted regression table with robust and classical standard errors, effect sizes, and diagnostic red flags.

Arguments

se_type: The type of standard error to use. Default is "HC3". Without clusters: "HC0", "HC1", "HC2", or "HC3". When clusters is specified, se_type is automatically set to "CR2".
notes: Logical. If TRUE (default), print explanatory notes below the table when the result is printed.
clusters: An optional variable indicating clusters for cluster-robust standard errors. When specified, se_type is automatically set to "CR2" (bias-reduced cluster-robust estimator). Passed to lm_robust.
fixed_effects: An optional right-sided formula containing the fixed effects to be projected out (absorbed) before estimation. Useful for models with many fixed effect groups (e.g., ~ firm_id or ~ firm_id + year). Passed to lm_robust.
...: Additional arguments passed to lm_robust.

Details

Robust standard errors and clustered standard errors are computed using lm_robust; see the documentation of that function for details (using by default CR2 errors) The output shows both standard errors and when clustering errors it reports all three. The red.flag column is based on the difference between robust and classical standard errors.

The red.flag column provides diagnostic warnings:

!, !!, !!!: Robust and classical standard errors differ by more than 25%, 50%, or 100%, respectively. Large differences may suggest model misspecification or outliers (but they may also be benign). When encountering a red flag, authors should plot the distributions to look for outliers or skewed data, and use scatter.gam to look for possible nonlinearities in the relevant variables. King & Roberts (2015) propose a higher cutoff, at 100%, and a bootstrapped significance test; statuser does not follow either recommendation. The former seems too liberal, the latter too time consuming to include in every regression, plus the focus here is on individual variables rather than joint tests.
X: For interaction terms, the component variables are correlated (|r| > 0.3 or p < .05), which means the interaction term is likely to be biased. See Simonsohn (2024) "Interacting with curves" tools:::Rd_expr_doi("10.1177/25152459231207787").

References

King, G., & Roberts, M. E. (2015). How robust standard errors expose methodological problems they do not fix, and what to do about it. Political Analysis, 23(2), 159-179.

Long, J. S., & Ervin, L. H. (2000). Using heteroscedasticity consistent standard errors in the linear regression model. The American Statistician, 54(3), 217-224.

Simonsohn, U. (2024). Interacting with curves: How to validly test and probe interactions in the real (nonlinear) world. Advances in Methods and Practices in Psychological Science, 7(1), 1-22.

Examples

Run this code

# Basic usage with data argument
lm2(mpg ~ wt + hp, data = mtcars)

# Without data argument (variables from environment)
y <- mtcars$mpg
x1 <- mtcars$wt
x2 <- mtcars$hp
lm2(y ~ x1 + x2)

# RED FLAG EXAMPLES

# Example 1: red flag catches a nonlinearity
# True model is quadratic: y = x^2
set.seed(123)
x <- runif(200, -3, 3)
y <- x^2 + rnorm(200, sd = 2)

# lm2() shows red flag due to misspecification
lm2(y ~ x)

# Follow up with scatter.gam() to diagnose it
scatter.gam(x, y)

# Example 2: red flag catches an outlier in y
# True model is y = x, but one observation has a very large y value
set.seed(123)
x <- sort(rnorm(200))
y <- round(x + rnorm(200, sd = 2), 1)
y[200] <- 100  # Outlier

# lm2() flags x
lm2(y ~ x)

# Look at distribution of y to spot the outlier
plot_freq(y)

# Example 3: red flag catches an outlier in one predictor
# True model is y = x1 + x2, but x2 has an extreme value
set.seed(123)
x1 <- round(rnorm(200),.1)
x2 <- round(rnorm(200),.1)
y <- x1 + x2 + rnorm(200, sd = 0.5)
x2[200] <- 50  # Outlier in x2

# lm2() flags x2 (but not x1)
lm2(y ~ x1 + x2)

# Look at distribution of x2 to spot the outlier
plot_freq(x2)

# CLUSTERED STANDARD ERRORS
# When observations are grouped (e.g., students within schools),
# use clusters to account for within-group correlation
set.seed(123)
n_clusters <- 20
n_per_cluster <- 15
cluster_id <- rep(1:n_clusters, each = n_per_cluster)
cluster_effect <- rnorm(n_clusters, sd = 2)[cluster_id]
x <- rnorm(n_clusters * n_per_cluster)
y <- 1 + 0.5 * x + cluster_effect + rnorm(n_clusters * n_per_cluster)
mydata <- data.frame(y = y, x = x, cluster_id = cluster_id)

# Clustered SE (CR2) - note the SE.cluster column in output
lm2(y ~ x, data = mydata, clusters = cluster_id)

# FIXED EFFECTS
# Use fixed_effects to absorb group-level variation (e.g., firm or year effects)
# This is useful for panel data or when you have many fixed effect levels
set.seed(456)
n_firms <- 30
n_years <- 5
firm_id <- rep(1:n_firms, each = n_years)
year <- rep(2018:2022, times = n_firms)
firm_effect <- rnorm(n_firms, sd = 3)[firm_id]
x <- rnorm(n_firms * n_years)
y <- 2 + 0.8 * x + firm_effect + rnorm(n_firms * n_years)
panel <- data.frame(y = y, x = x, firm_id = factor(firm_id), year = factor(year))

# Absorb firm fixed effects (coefficient on x is estimated, firm dummies are not shown)
lm2(y ~ x, data = panel, fixed_effects = ~ firm_id)

# Two-way fixed effects (firm and year)
lm2(y ~ x, data = panel, fixed_effects = ~ firm_id + year)