This function calculates leave-one-out (LOO) p-values for all data points and identifies those whose removal results in "significance reversal", i.e. in the p-value of the model's slope traversing the user-defined \(\alpha\)-level. It also extends the classical influence measures from influence.measures with a few newer ones (e.g. 'Hadi's measure', 'Coefficient of determination ratio' and 'Pena's Si') in an output format where each outlier is marked when it exceeds the measure's specific threshold, as defined in the literature. Belsley, Kuh & Welsch's dfstat criterion is also included.
lmInfl(model, alpha = 0.05, cutoff = c("BKW", "R"), verbose = TRUE, ...)

A list with the following items:
the original model with all data points.
a list of final models with the influencer(s) removed.
a matrix with the original data, classical influence.measures, studentized residuals, leverages, dfstat, LOO-p-values, LOO-slopes/intercepts and their \(\Delta\)'s, LOO-standard errors and \(R^2\)s. Influence measures that exceed their specific threshold (see inflPlot) are marked with asterisks.
same as infl, but with pure numeric data.
a vector with the influencers' indices.
the selected \(\alpha\)-level.
the original model's p-value.
Andrej-Nikolai Spiess
The algorithm
1) calculates the p-value of the full model (all points),
2) calculates a LOO-p-value with each point removed in turn,
3) checks all data points for significance reversal, and
4) returns all models as well as classical influence.measures with LOO-p-values, \(\Delta\)p-values, slopes and standard errors attached.
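A minimal sketch of steps 1) to 3) in R, assuming a simple linear regression fitted with lm (illustrative code, not the package's internal implementation; the function name looP is hypothetical):

looP <- function(model, alpha = 0.05) {
  dat <- model.frame(model)
  pFull <- summary(model)$coefficients[2, 4]     ## slope p-value, full data
  pLOO <- sapply(seq_len(nrow(dat)), function(i) {
    fit <- update(model, data = dat[-i, ])       ## refit without point i
    summary(fit)$coefficients[2, 4]              ## LOO slope p-value
  })
  reversal <- (pFull < alpha) != (pLOO < alpha)  ## alpha is traversed
  data.frame(pLOO = pLOO, deltaP = pFull - pLOO, reversal = reversal)
}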
The idea of p-value influencers was first introduced by Belsley, Kuh & Welsch, who described an influence measure pertaining directly to the change in t-statistics that will "show whether the conclusions of hypothesis testing would be affected", termed dfstat in [1, 2, 3] or dfstud in [4]:
$$\rm{dfstat}_{ij} \equiv \frac{\hat{\beta}_j}{s\sqrt{(X'X)^{-1}_{jj}}}-\frac{\hat{\beta}_{j(i)}}{s_{(i)}\sqrt{(X'_{(i)}X_{(i)})^{-1}_{jj}}}$$
where \(\hat{\beta}_j\) is the j-th estimate, s is the residual standard error, X is the design matrix and (i) denotes the i-th observation deleted.
dfstat, which for the regression's slope \(\beta_1\) is the difference of t-statistics
$$\Delta t = t_{\beta_1} - t_{\beta_1(i)} = \frac{\beta_1}{\mathrm{s.e.}(\beta_1)} - \frac{\beta_{1(i)}}{\mathrm{s.e.}(\beta_{1(i)})}$$
is inextricably linked to the changes in p-value \(\Delta p\), calculated from
$$\Delta p = p_{\beta_1} - p_{\beta_1(i)} = 2\left(1-P_t(t_{\beta_1}, \nu)\right) - 2\left(1-P_t(t_{\beta_1(i)}, \nu-1)\right)$$
where \(P_t\) is the Student's t cumulative distribution function with \(\nu\) degrees of freedom, and where significance reversal is attained when \(\alpha \in [p_{\beta_1}, p_{\beta_1(i)}]\).
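As a numerical illustration (the t-statistics and sample size below are assumed values, not package output), with \(\nu = n - 2\) for the full model and \(\nu - 1\) after deletion:

n <- 21                                  ## assumed sample size
t.full <- 2.9; t.loo <- 1.7              ## assumed slope t-statistics
p.full <- 2 * (1 - pt(abs(t.full), df = n - 2))  ## full-model p-value
p.loo  <- 2 * (1 - pt(abs(t.loo),  df = n - 3))  ## LOO p-value, nu - 1
p.full - p.loo                           ## delta-p
## reversal at alpha = 0.05 if alpha lies between the two p-values
0.05 >= min(p.full, p.loo) & 0.05 <= max(p.full, p.loo)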
Interestingly, this seemingly mandatory check of the influence of single data points on statistical inference has fallen into oblivion: apart from [1-4], there is, to the best of our knowledge, no reference to dfstat or \(\Delta p\) in the current literature on influence measures.
By default (cutoff = "BKW"), the cut-off values for the different influence measures are those defined in Belsley, Kuh & Welsch (1980) and additional literature:
dfbeta slope: \(| \Delta\beta1_i | > 2/\sqrt{n}\) (page 28)
dffits: \(| \mathrm{dffits}_i | > 2\sqrt{2/n}\) (page 28)
covratio: \(|\mathrm{covr}_i - 1| > 3k/n\) (page 23)
Cook's D: \(D_i > Q_F(0.5, k, n - k)\) (Cook & Weisberg, 1982)
leverage: \(h_{ii} > 2k/n\) (page 17)
studentized residual: \(t_i > Q_t(0.975, n - k - 1)\) (page 20)
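These thresholds are simple functions of the number of observations \(n\) and model coefficients \(k\); a short sketch computing them in R (the values of n and k are assumed for illustration):

n <- 21; k <- 2                          ## e.g. simple linear regression
c(dfbeta   = 2 / sqrt(n),                ## |dfbeta slope| cutoff
  dffits   = 2 * sqrt(2 / n),            ## |dffits| cutoff
  covratio = 3 * k / n,                  ## |covratio - 1| cutoff
  cooksD   = qf(0.5, k, n - k),          ## Cook's D cutoff
  leverage = 2 * k / n,                  ## leverage cutoff
  rstudent = qt(0.975, n - k - 1))       ## studentized residual cutoff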
If cutoff = "R", the criteria from influence.measures are employed:
dfbeta slope: \(| \Delta\beta1_i | > 1\)
dffits: \(| \mathrm{dffits}_i | > 3\sqrt{(k/(n - k))}\)
covratio: \(|1 - \mathrm{covr}_i| > 3k/(n - k)\)
Cook's D: \(D_i > Q_F(0.5, k, n - k)\)
leverage: \(h_{ii} > 3k/n\)
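For comparison, base R's influence.measures can be called directly on a fitted model; its summary method prints only the observations flagged by these criteria (LM1 as fitted in the examples below):

im <- influence.measures(LM1)            ## classical measures, R's cutoffs
summary(im)                              ## show only flagged observations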
The influence output also includes the following more "recent" measures:
Hadi's measure (column "hadi"):
$$H_i^2 = \frac{h_{ii}}{1 - h_{ii}} + \frac{p}{1 - h_{ii}}\frac{d_i^2}{(1-d_i^2)}$$
where \(h_{ii}\) are the diagonals of the hat matrix (leverages), \(p = 2\) in univariate linear regression and \(d_i = e_i/\sqrt{\mathrm{SSE}}\), with threshold value \(\mathrm{Med}(H_i^2) + 2 \cdot \mathrm{MAD}(H_i^2)\).
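A minimal sketch of this computation for an lm fit (illustrative; note that R's mad() applies a 1.4826 scaling constant by default, which may differ from the MAD used in the original publication):

hadiMeasure <- function(model) {
  h <- hatvalues(model)                  ## leverages h_ii
  e <- residuals(model)
  p <- length(coef(model))               ## p = 2 for intercept + slope
  d <- e / sqrt(sum(e^2))                ## normalized residuals d_i
  H2 <- h / (1 - h) + p / (1 - h) * d^2 / (1 - d^2)
  cutoff <- median(H2) + 2 * mad(H2)     ## Med(H^2) + 2 * MAD(H^2)
  list(H2 = H2, cutoff = cutoff, flagged = which(H2 > cutoff))
}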
Coefficient of Determination Ratio (column "cdr"): $$\mathrm{CDR}_i = \frac{R_{(i)}^2}{R^2}$$ with \(R_{(i)}^2\) being the coefficient of determination without value i, and threshold $$\frac{B_{\alpha,p/2,(n-p-2)/2}}{B_{\alpha,p/2,(n-p-1)/2}}$$
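A sketch of the CDR computation, under the assumption that \(B_{\alpha,a,b}\) denotes the \(\alpha\)-quantile of a Beta(\(a\), \(b\)) distribution (qbeta in R):

cdr <- function(model, alpha = 0.05) {
  dat <- model.frame(model)
  n <- nrow(dat); p <- length(coef(model))
  R2 <- summary(model)$r.squared
  R2i <- sapply(seq_len(n), function(i)  ## R^2 with observation i removed
    summary(update(model, data = dat[-i, ]))$r.squared)
  cutoff <- qbeta(alpha, p/2, (n - p - 2)/2) /
            qbeta(alpha, p/2, (n - p - 1)/2)
  list(CDR = R2i / R2, cutoff = cutoff)
}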
Pena's Si (column "Si"): $$S_i = \frac{\mathbf{s}'_i\mathbf{s}_i}{p\widehat{\mathrm{var}}(\hat{y}_i)}$$ where \(\mathbf{s}_i\) is the vector of differences between the i-th fitted value of the original model, \(\hat{y}_i\), and the corresponding fitted values after single-point deletion, \(\hat{y}_i - \hat{y}_{i(-1)}, \ldots, \hat{y}_i - \hat{y}_{i(-n)}\), \(p\) is the number of parameters, and \(\widehat{\mathrm{var}}(\hat{y}_i) = s^2h_{ii}\) with \(s^2 = (\mathbf{e}'\mathbf{e})/(n - p)\), \(\mathbf{e}\) being the residuals. In this package, a cutoff value of 0.9 is used, as the published criterion of \(|S_i - \mathrm{Med}(\mathbf{S})| \ge 4.5\,\mathrm{MAD}(\mathbf{S})\) seemed too conservative. Results from this function were verified by Prof. Daniel Pena through personal communication.
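A brute-force sketch along these definitions (the package may use a more efficient route; this version simply refits the model n times):

penaS <- function(model) {
  dat <- model.frame(model)
  n <- nrow(dat); p <- length(coef(model))
  yhat <- fitted(model)
  ## column j holds the fitted values at all points after deleting point j
  Fmat <- sapply(seq_len(n), function(j)
    predict(update(model, data = dat[-j, ]), newdata = dat))
  h <- hatvalues(model)
  s2 <- sum(residuals(model)^2) / (n - p) ## s^2 = e'e / (n - p)
  sapply(seq_len(n), function(i)          ## S_i = s_i's_i / (p * s^2 * h_ii)
    sum((yhat[i] - Fmat[i, ])^2) / (p * s2 * h[i]))
}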
For dfstat / dfstud:
[1] Regression diagnostics: Identifying influential data and sources of collinearity.
Belsley DA, Kuh E, Welsch RE.
John Wiley, New York, USA (2004).
[2] Econometrics, 5ed.
Baltagi B.
Springer-Verlag, Berlin, Germany (2011).
[3] Growth regressions and what the textbooks don't tell you.
Temple J.
Bull Econom Res, 52, 2000, 181-205.
[4] Robust Regression and Outlier Detection.
Rousseeuw PJ & Leroy AM.
John Wiley & Sons, New York, NY (1987).
Hadi's measure:
A new measure of overall potential influence in linear regression.
Hadi AS.
Comp Stat & Data Anal, 14, 1992, 1-27.
Coefficient of determination ratio:
On the detection of influential outliers in linear regression analysis.
Zakaria A, Howard NK, Nkansah BK.
Am J Theor Appl Stat, 3, 2014, 100-106.
On the Coefficient of Determination Ratio for Detecting Influential Outliers in Linear Regression Analysis.
Zakaria A, Gordor BK, Nkansah BK.
Am J Theor Appl Stat, 11, 2022, 27-35.
Pena's measure:
A New Statistic for Influence in Linear Regression.
Pena D.
Technometrics, 47, 2005, 1-12.
## Example #1 with single influencer and significant model (p = 0.0089).
## Removal of #21 results in p = 0.115!
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
a <- c(a, 25); b <- c(b, 10)
LM1 <- lm(b ~ a)
lmInfl(LM1)
## Example #2 with single influencer and insignificant model (p = 0.115).
## Removal of #18 results in p = 0.0227!
set.seed(123)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
LM2 <- lm(b ~ a)
lmInfl(LM2)
## Example #3 with multiple influencers and significant model (p = 0.0269).
## Removal of #2, #17, #18 or #20 results in crossing p = 0.05!
set.seed(125)
a <- 1:20
b <- 5 + 0.08 * a + rnorm(20, 0, 1)
LM3 <- lm(b ~ a)
lmInfl(LM3)
## Large Example #4 with top 10 influencers and significant model (p = 6.72E-8).
## Not possible to achieve a crossing of alpha with any point despite strong noise.
set.seed(123)
a <- 1:100
b <- 5 + 0.08 * a + rnorm(100, 0, 5)
LM4 <- lm(b ~ a)
lmInfl(LM4)