
ivtools (version 1.0.0)

tsls: Two-stage least squares estimation of the causal exposure effect in instrumental variables scenarios

Description

tsls computes the two-stage least squares (also known as Wald) estimate of the causal exposure effect in instrumental variables scenarios. Let \(Y\), \(X\), and \(Z\) be the outcome, exposure, and instrument, respectively. Let \(L\) be a vector of covariates that we wish to control for in the analysis. The user supplies a fitted generalized linear model (GLM) for \(E(X|Z,L)\) and a fitted GLM for \(E(Y|X,L)\). tsls uses the GLM for \(E(X|Z,L)\) to construct predictions \(\hat{X}\). These predictions are then used to re-fit the GLM for \(E(Y|X,L)\), with \(X\) replaced by \(\hat{X}\). The resulting coefficient(s) for \(X\) are the estimated causal effect.
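The two stages described above can be sketched by hand with stats::glm alone. This is only an illustration of the idea on simulated data, not the ivtools implementation; the names d, stage1, stage2, psi_hat and naive, and the seed, are assumptions made for this sketch.

```r
# Manual two-stage sketch using only stats::glm (not the ivtools code).
set.seed(1)                                # arbitrary seed, for reproducibility only
n <- 1000
psi <- 0.5                                 # true causal effect in the simulation
U <- rnorm(n)                              # unmeasured confounder of X and Y
L <- rnorm(n)                              # measured covariate
Z <- rnorm(n)                              # instrument
X <- rnorm(n, mean = Z + L + U)            # exposure
Y <- rnorm(n, mean = psi * X + L + U)      # outcome
d <- data.frame(L, Z, X, Y)

# Stage 1: fit the model for E(X|Z,L) and form the predictions Xhat.
stage1 <- glm(X ~ Z + L, data = d)
d$Xhat <- predict(stage1, type = "response")

# Stage 2: re-fit the outcome model with X replaced by Xhat.
stage2 <- glm(Y ~ Xhat + L, data = d)
psi_hat <- unname(coef(stage2)["Xhat"])

# Ordinary regression of Y on X and L, for comparison; it ignores U.
naive <- unname(coef(glm(Y ~ X + L, data = d))["X"])

print(c(two_stage = psi_hat, naive = naive))
```

The standard errors reported by the second-stage glm fit in this sketch do not account for the estimation of \(\hat{X}\), so they should not be used for inference; tsls returns a sandwich variance estimate instead.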

Usage

tsls(fitX, fitY, control=FALSE, data, clusterid)

Arguments

fitX

an object of class "glm", as returned by the glm function in the stats package. This is the fitted GLM for \(E(X|Z,L)\).

fitY

an object of class "glm", as returned by the glm function in the stats package. This is the fitted GLM for \(E(Y|X,L)\). The model is assumed to have a specific form; see Details.

control

should the control function \(R=X-\hat{X}\) be used when re-fitting the GLM for \(E(Y|X,L)\)?

data

a data frame containing the variables in the model. The outcome, exposure, instrument and covariates can have arbitrary names; they don't need to be called Y, X, Z and L.

clusterid

an optional string containing the name of a cluster identification variable when data are clustered.

Value

An object of class "tsls" is a list containing

est

a vector containing the two-stage least squares estimate \(\hat{\psi}_{tsls}\).

vcov

the variance-covariance matrix for the two-stage least squares estimate \(\hat{\psi}_{tsls}\), obtained with the sandwich formula.
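Given the est and vcov components, Wald-type confidence intervals follow from the usual normal approximation. The helper below, wald_ci, is hypothetical and not part of ivtools, and the numbers passed to it are illustrative placeholders rather than output from tsls.

```r
# Hypothetical helper (not part of ivtools): Wald intervals from an
# estimate vector and its variance-covariance matrix.
wald_ci <- function(est, vcov, level = 0.95) {
  se <- sqrt(diag(as.matrix(vcov)))        # standard errors from the diagonal
  z <- qnorm(1 - (1 - level) / 2)          # normal quantile for the chosen level
  cbind(estimate = est, se = se,
        lower = est - z * se, upper = est + z * se)
}

# Illustrative placeholder inputs, NOT results produced by tsls.
est <- c(X = 0.5)
V <- matrix(0.0036, nrow = 1, ncol = 1, dimnames = list("X", "X"))
ci <- wald_ci(est, V)
print(ci)
```

With a fitted object fitIV as in the Examples, the same helper could be called as wald_ci(fitIV$est, fitIV$vcov).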

Details

Let \(\eta\) be the link function in the GLM for \(E(Y|X,L)\). The model should be of the form $$\eta\{E(Y|X,L)\}=m(L;\psi)X+g(L;\beta),$$ e.g. \(E(Y|X,L)=\psi X+\beta_0+\beta_1 L\) or \(E(Y|X,L)=\psi_0 X+\psi_1 XL+\beta_0+\beta_1 L\). Let \(\hat{\psi}_{tsls}\) be the two-stage least squares estimator of \(\psi\), i.e. the MLE of \(\psi\) when \(X\) is replaced by \(\hat{X}\) in the model. Let \(Y_x\) be the potential outcome for a given subject under exposure level \(X=x\).

\(\hat{\psi}_{tsls}\) can be interpreted in at least two different ways. First, if \(\eta\) is the identity link, \(Z\) is a valid instrument, and the causal (structural nested) model $$A: \eta\{E(Y|X,Z,L)\}-\eta\{E(Y_0|X,Z,L)\}=m(L;\psi^*)X$$ holds, then \(\hat{\psi}_{tsls}\) is consistent for \(\psi^*\) in this model. Second, let \(U\) be the set of all confounders for the exposure-outcome association. If \(\eta\) is the identity link, \(Z\) is a valid instrument, and the causal model $$B: \eta\{E(Y_x|L,U)\}-\eta\{E(Y_0|L,U)\}=m(L;\psi^{**})x$$ holds, then \(\hat{\psi}_{tsls}\) is consistent for \(\psi^{**}\) in this model. When \(\eta\) is the identity link, model B implies model A, but not the other way around.

When \(\eta\) is not the identity link, \(\hat{\psi}_{tsls}\) is generally inconsistent for both \(\psi^*\) and \(\psi^{**}\), even if \(Z\) is a valid instrument and models A and B hold. The bias is often reduced by using the control function \(R=X-\hat{X}\) as an additional regressor when refitting the GLM for \(E(Y|X,L)\). We refer to Vansteelandt et al. (2011) for a thorough review of the underlying assumptions, the interpretation, and the asymptotic properties of \(\hat{\psi}_{tsls}\).
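The control function adjustment can be sketched by hand for a logit link as follows. This is an assumption about what the adjustment amounts to, based only on the description above, and is not taken from the ivtools source; the names d, stage1, fit_plain and fit_cf, the seed, and the simulated data are made up for illustration.

```r
# Hand-rolled sketch of the control function idea with a logit link
# (not the ivtools implementation).
set.seed(2)                                # arbitrary seed, for reproducibility only
n <- 1000
U <- rnorm(n)                              # unmeasured confounder of X and Y
L <- rnorm(n)                              # measured covariate
Z <- rnorm(n)                              # instrument
X <- rnorm(n, mean = Z + L + U)            # exposure
Y <- rbinom(n, 1, plogis(X + L + U))       # binary outcome
d <- data.frame(L, Z, X, Y)

# First stage: predictions Xhat and control function R = X - Xhat.
stage1 <- glm(X ~ Z + L, data = d)
d$Xhat <- predict(stage1, type = "response")
d$R <- d$X - d$Xhat

# Plain two-stage refit: X replaced by Xhat.
fit_plain <- glm(Y ~ Xhat + L, family = binomial, data = d)

# Refit with the control function R as an additional regressor.
fit_cf <- glm(Y ~ Xhat + L + R, family = binomial, data = d)

print(rbind(plain = coef(fit_plain)["Xhat"],
            control_function = coef(fit_cf)["Xhat"]))
```

Because R is the first-stage residual, it is uncorrelated with Z and L in the sample by construction; including it is intended to absorb part of the unmeasured confounding carried by X.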

References

Vansteelandt S., Bowden J., Babanezhad M., Goetghebeur E. (2011). On instrumental variables estimation of causal odds ratios. Statistical Science 26(3), 403-422.

Examples

# NOT RUN {
library(ivtools)

##Example 1: identity link and no interaction
n <- 1000
psi <- 0.5
U <- rnorm(n) #confounder for X and Y
L <- rnorm(n) #measured covariate affecting X and Y
Z <- rnorm(n) #instrument
X <- rnorm(n, mean=Z+L+U) #exposure 
Y <- rnorm(n, mean=psi*X+L+U) #outcome
data <- data.frame(L, Z, X, Y)
fitX <- glm(X~Z+L, data=data)
fitY <- glm(Y~X+L, data=data)
fitIV <- tsls(fitX=fitX, fitY=fitY, data=data)
summary(fitIV)

##Example 2: logit link and interaction between X and L
n <- 1000
psi0 <- 1
psi1 <- 0.5
U <- rnorm(n) #confounder for X and Y
L <- rnorm(n) #measured covariate affecting X and Y
Z <- rnorm(n) #instrument
X <- rnorm(n, mean=Z+L+U) #exposure
Y <- rbinom(n, 1, plogis(psi0*X+psi1*X*L+L+U)) #outcome
data <- data.frame(L, Z, X, Y)
fitX <- glm(X~Z+L, data=data)
fitY <- glm(Y~X+L+X*L, data=data, family="binomial")
fitIV <- tsls(fitX=fitX, fitY=fitY, data=data, control=TRUE)
summary(fitIV)


# }
