tsls: Two-stage least squares estimation of the causal exposure effect in instrumental variables scenarios

Description

tsls computes the two-stage least squares (aka Wald) estimate of the causal exposure effect in instrumental variables scenarios. Let $Y$, $X$, and $Z$ be the outcome, exposure, and instrument, respectively. Let $L$ be a vector of covariates that we wish to control for in the analysis. The user supplies a fitted generalized linear model (GLM) for $E(X|Z,L)$ and a fitted GLM for $E(Y|X,L)$. tsls uses the GLM for $E(X|Z,L)$ to construct predictions $\hat{X}$. These predictions are subsequently used to re-fit the GLM for $E(Y|X,L)$, with $X$ replaced by $\hat{X}$. The coefficient(s) obtained for $X$ in the re-fitted model are the estimated causal effect.
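
The two stages described above can be sketched by hand with base R. This is only an illustration, not the internals of tsls: the simulated data and the names dat, Xhat, psi_hat and psi_naive are assumptions made for the example. The standard errors printed by the second-stage glm fit ignore the fact that $\hat{X}$ is itself estimated, so they should not be used for inference; tsls reports a sandwich variance instead.

```r
# Hand-rolled two-stage fit with stats::glm (illustrative sketch only).
set.seed(1)
n   <- 5000
psi <- 0.5
U <- rnorm(n)                          # unmeasured confounder of X and Y
L <- rnorm(n)                          # measured covariate
Z <- rnorm(n)                          # instrument
X <- rnorm(n, mean = Z + L + U)        # exposure
Y <- rnorm(n, mean = psi * X + L + U)  # outcome
dat <- data.frame(L, Z, X, Y)

# Stage 1: fit the GLM for E(X|Z,L) and construct the predictions Xhat.
fitX <- glm(X ~ Z + L, data = dat)
dat$Xhat <- predict(fitX, newdata = dat, type = "response")

# Stage 2: re-fit the GLM for E(Y|X,L) with X replaced by Xhat.
fitY2   <- glm(Y ~ Xhat + L, data = dat)
psi_hat <- unname(coef(fitY2)["Xhat"])

# For comparison: the ordinary fit of Y on X and L, which is confounded by U.
psi_naive <- unname(coef(glm(Y ~ X + L, data = dat))["X"])

print(c(two_stage = psi_hat, naive = psi_naive))
```

In this simulation the naive coefficient is pulled away from psi by the unmeasured confounder U, while the two-stage coefficient should lie close to psi.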

Usage

tsls(fitX, fitY, control=FALSE, data, clusterid)

Arguments

fitX

an object of class "glm", as returned by the glm function in the stats package. This is the fitted GLM for $E(X|Z,L)$.

fitY

an object of class "glm", as returned by the glm function in the stats package. This is the fitted GLM for $E(Y|X,L)$. The model is assumed to have a specific form; see Details.

control

should the control function $R=X-\hat{X}$ be used when re-fitting the GLM for $E(Y|X,L)$?

data

a data frame containing the variables in the model. The outcome, exposure, instrument and covariates can have arbitrary names, e.g. they don't need to be called Y, X, Z and L.

clusterid

an optional string containing the name of a cluster identification variable when data are clustered.

Value

An object of class "tsls" is a list containing

est

a vector containing the two-stage least squares estimate $\hat{\psi}_{tsls}$.

vcov

the variance-covariance matrix for the two-stage least squares estimate $\hat{\psi}_{tsls}$, obtained with the sandwich formula.

Details

Let $\eta$ be the link function in the GLM for $E(Y|X,L)$. The model should be of the form $$\eta\{E(Y|X,L)\}=m(L;\psi)X+g(L;\beta),$$ e.g. $E(Y|X,L)=\psi X+\beta_0+\beta_1 L$ or $E(Y|X,L)=\psi_0 X+\psi_1 XL+\beta_0+\beta_1 L$. Let $\hat{\psi}_{tsls}$ be the two-stage least squares estimator of $\psi$, i.e. the MLE of $\psi$ when $X$ is replaced by $\hat{X}$ in the model. Let $Y_x$ be the potential outcome for a given subject under exposure level $X=x$.

$\hat{\psi}_{tsls}$ can be interpreted in at least two different ways. If $\eta$ is the identity link, $Z$ is a valid instrument, and the causal (structural nested) model $$A: \eta\{E(Y|X,Z,L)\}-\eta\{E(Y_0|X,Z,L)\}=m(L;\psi^*)X$$ holds, then $\hat{\psi}_{tsls}$ is consistent for $\psi^*$ in this model. Further, let $U$ be the set of all confounders for the exposure-outcome association. If $\eta$ is the identity link, $Z$ is a valid instrument, and the causal model $$B: \eta\{E(Y_x|L,U)\}-\eta\{E(Y_0|L,U)\}=m(L;\psi^{**})X$$ holds, then $\hat{\psi}_{tsls}$ is consistent for $\psi^{**}$ in this model. When $\eta$ is the identity link, model B implies model A, but not the other way around.

When $\eta$ is not the identity link, $\hat{\psi}_{tsls}$ is generally inconsistent for both $\psi^*$ and $\psi^{**}$, even if $Z$ is a valid instrument and models A and B hold. The bias is often reduced by using the control function $R=X-\hat{X}$ as an additional regressor when re-fitting the GLM for $E(Y|X,L)$. We refer to Vansteelandt et al. (2011) for a thorough review of the underlying assumptions, the interpretation, and the asymptotic properties of $\hat{\psi}_{tsls}$.
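
The control function idea can be sketched by hand for a logit link. The code below is an illustration under assumed simulated data, and the names dat, Xhat, R, fit_plain, fit_cf and fit_ri are hypothetical; how tsls adds $R$ internally is not shown here. The sketch also checks a simple algebraic fact: since $X=\hat{X}+R$, the coefficient of $\hat{X}$ in a fit on $(\hat{X}, L, R)$ equals the coefficient of $X$ in a fit on $(X, L, R)$.

```r
# Control function R = X - Xhat as an extra regressor (illustrative sketch only).
set.seed(2)
n <- 5000
U <- rnorm(n)                                   # unmeasured confounder of X and Y
L <- rnorm(n)                                   # measured covariate
Z <- rnorm(n)                                   # instrument
X <- rnorm(n, mean = Z + L + U)                 # exposure
Y <- rbinom(n, 1, plogis(0.5 * X + L + U))      # binary outcome
dat <- data.frame(L, Z, X, Y)

# First stage: predictions Xhat and control function R.
fitX     <- glm(X ~ Z + L, data = dat)
dat$Xhat <- predict(fitX, newdata = dat, type = "response")
dat$R    <- dat$X - dat$Xhat

# Second stage without and with the control function.
fit_plain <- glm(Y ~ Xhat + L,     family = binomial, data = dat)
fit_cf    <- glm(Y ~ Xhat + L + R, family = binomial, data = dat)

# Same linear predictor, parameterised with X instead of Xhat.
fit_ri <- glm(Y ~ X + L + R, family = binomial, data = dat)

print(c(plain            = unname(coef(fit_plain)["Xhat"]),
        control_function = unname(coef(fit_cf)["Xhat"]),
        same_with_X      = unname(coef(fit_ri)["X"])))
```

The two parameterisations with $R$ give the same exposure coefficient, so the choice between them is a matter of convenience.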

References

Vansteelandt S., Bowden J., Babanezhad M., Goetghebeur E. (2011). On instrumental variables estimation of causal odds ratios. Statistical Science 26(3), 403-422.

Examples

# NOT RUN {
##Example 1: identity link and no interaction
n <- 1000
psi <- 0.5
U <- rnorm(n) #confounder for X and Y
L <- rnorm(n) #confounder for Z and Y
Z <- rnorm(n) #instrument
X <- rnorm(n, mean=Z+L+U) #exposure 
Y <- rnorm(n, mean=psi*X+L+U) #outcome
data <- data.frame(L, Z, X, Y)
fitX <- glm(X~Z+L, data=data)
fitY <- glm(Y~X+L, data=data)
fitIV <- tsls(fitX=fitX, fitY=fitY, data=data)
summary(fitIV)

##Example 2: logit link and interaction between X and L
n <- 1000
psi0 <- 1
psi1 <- 0.5
U <- rnorm(n) #confounder for X and Y
L <- rnorm(n) #confounder for Z and Y
Z <- rnorm(n) #instrument
X <- rnorm(n, mean=Z+L+U) #exposure
Y <- rbinom(n, 1, plogis(psi0*X+psi1*X*L+L+U)) #outcome
data <- data.frame(L, Z, X, Y)
fitX <- glm(X~Z+L, data=data)
fitY <- glm(Y~X+L+X*L, data=data, family="binomial")
fitIV <- tsls(fitX=fitX, fitY=fitY, data=data, control=TRUE)
summary(fitIV)


# }
