copulaIMA: Copula-based Instrumental Variable Model (IMA)

Description

Fits the copula-based Instrumental Variable Model (IMA) of Haschka (2025), which uses Gaussian copulas to address endogeneity without external instruments while allowing dependence between endogenous and exogenous regressors.

The copula-based IMA approach corrects endogeneity by constructing control functions from the marginal distributions of the endogenous regressors and incorporating them into the estimation equation. Inference is based on bootstrapping.

The key contribution of IMA over the original Park and Gupta (2012) copula correction (copulaCorrection) is that it does not require the endogenous and exogenous regressors to be independent. When this independence assumption is violated, which is common in practice, the original copula correction yields severely biased estimates of all model coefficients. IMA corrects for this by exploiting the information in the exogenous variables through a first-stage auxiliary regression, in a way similar to instrumental variable identification.

The IMA method requires a normally distributed structural error and at least one endogenous regressor $P_{i,k}$ that is continuously and non-normally distributed. It does not require the exogenous regressors $X_i$ to be independent of the endogenous regressors $P_{i,k}$ or to satisfy any distributional requirement, i.e., $X_i$ may be binary or otherwise non-continuous. The method supports only continuous endogenous regressors.

Usage

copulaIMA(
  formula,
  data,
  cdf = c("adj.ecdf", "resc.ecdf", "ecdf", "kde"),
  num.boots = 1000,
  verbose = TRUE
)

Value

An object of class rendo.copula.ima which is a list that contains:

formula: The formula given to specify the fitted model.
model: The model.frame used for model fitting.
terms: The terms object used for model fitting.
coefficients: A named vector of all coefficients resulting from model fitting.
names.main.coefs: A vector specifying which coefficients are from the model. For internal usage.
fitted.values: Fitted values of the structural model.
residuals: The structural residuals.
boots.params: The bootstrapped coefficients.
n.boots.attempted: The number of bootstrap iterations that were attempted.
n.boots.failed: The number of bootstrap iterations resulting in a failed fit.
cdf: The CDF estimation method that was used.
names.endo.regs: The names of the continuous endogenous regressors.
res.lm.augmented: The fitted augmented regression model, including the control function terms.

Arguments

formula

A symbolic description of the model to be fitted. See the "Details" section for the exact notation.

data

A data.frame containing the data of all parts specified in the formula parameter.

cdf

Character string specifying the method used to estimate the marginal distribution functions of the endogenous regressors. One of "adj.ecdf", "resc.ecdf", "ecdf", or "kde".

"adj.ecdf"

Adjusted empirical CDF with midrank correction (Liengaard et al., 2025). Keeps values strictly inside (0,1).

"resc.ecdf"

Rescaled empirical CDF via copula::pobs (Qian et al., 2024).

"ecdf"

Empirical CDF with boundary replacement (Becker et al., 2022).

"kde"

Integral of a kernel density estimate via ks::kcde, as used in Park and Gupta (2012).
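The reason the cdf options differ mainly at the boundaries can be sketched in base R. The snippet below is an illustration only and does not reproduce the package's internal code: the boundary tolerance `eps` and the rank-based rescaling `rank(p) / (n + 1)` are assumptions chosen for the sketch, and `p` is simulated data.

```r
# Illustrative stand-ins for the ideas behind the cdf options
# (assumptions, not the package's implementations).
set.seed(1)
p <- rexp(200)            # a continuous, non-normal regressor
n <- length(p)

# Plain empirical CDF: the sample maximum is mapped to exactly 1,
# so its normal score is infinite.
u_ecdf <- ecdf(p)(p)
z_ecdf <- qnorm(u_ecdf)

# Boundary replacement: pull values away from 0 and 1
# (eps is an assumed tolerance for this sketch).
eps <- 1e-7
u_repl <- pmin(pmax(u_ecdf, eps), 1 - eps)
z_repl <- qnorm(u_repl)

# Rescaled ECDF: ranks divided by n + 1 lie strictly inside (0, 1).
u_resc <- rank(p) / (n + 1)
z_resc <- qnorm(u_resc)

c(max_plain = max(z_ecdf), max_replaced = max(z_repl), max_rescaled = max(z_resc))
```

The plain ECDF produces a non-finite normal score at the sample maximum, which is why each option applies some form of boundary handling before the normal scores are computed.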

num.boots

Positive integer giving the number of bootstrap replications used for standard error estimation. A minimum of 1000 is recommended.

verbose

Logical; whether to show details about the running of the function.

Details

Model

Consider the structural regression model with $K$ endogenous regressors:

$$Y_i = \mu + \sum_{k=1}^{K} P_{i,k} \alpha_k + X_i' \beta + \varepsilon_i$$

where $i = 1, \ldots, n$ indexes the observations, $Y_i$ is the dependent variable, $P_{i,k}$ are continuous endogenous regressors that may be correlated with both the structural error $\varepsilon_i$ and the exogenous regressors $X_i$, $X_i$ is a vector of exogenous regressors uncorrelated with $\varepsilon_i$, and $\mu, \alpha_k, \beta$ are the structural model parameters.

IMA is an augmented OLS estimator. The estimation proceeds in three steps:

1. For every explanatory variable $W \in \{X_1, \ldots, X_J, P_1, \ldots, P_K\}$, compute the normal score $W^* = \Phi^{-1}(\hat{F}_W(W))$, where $\hat{F}_W$ is the estimated marginal CDF of $W$ and $\Phi^{-1}$ is the standard normal quantile function.
2. For each endogenous regressor $P_k$, regress its normal score $P_k^*$ on the normal scores of the exogenous regressors $X_1^*, \ldots, X_J^*$ without an intercept and retain the residuals $\hat{\varepsilon}_k$ as the copula correction terms. This first-stage regression exploits the dependence between endogenous and exogenous regressors in a way analogous to instrumental variable identification.
3. Augment the structural model with $\hat{\varepsilon}_1, \ldots, \hat{\varepsilon}_{K}$ as additional regressors and estimate by OLS.
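The three steps can be sketched in base R on simulated data. This is a minimal illustration under stated assumptions, not the package's implementation: the data-generating process, the helper `normal_score()` and its rank-based CDF estimate are all hypothetical choices made for the sketch.

```r
# Minimal base-R sketch of the three estimation steps
# (simulated data and helper names are assumptions).
set.seed(42)
n   <- 2000
x   <- rnorm(n)                              # exogenous regressor
u   <- rnorm(n)                              # shared shock that creates endogeneity
eps <- 0.7 * u + sqrt(1 - 0.49) * rnorm(n)   # normally distributed structural error
p   <- exp(0.5 * x + u)                      # non-normal regressor, correlated with x and eps
y   <- 1 * p + 1 * x + eps                   # alpha = 1, beta = 1, no intercept

# Step 1: normal scores from an estimated marginal CDF
# (rank / (n + 1) is an assumed, simple CDF estimate).
normal_score <- function(w) qnorm(rank(w) / (length(w) + 1))
x_star <- normal_score(x)
p_star <- normal_score(p)

# Step 2: first-stage regression of the endogenous normal score on the
# exogenous normal score, without intercept; keep the residuals.
first_stage <- lm(p_star ~ x_star - 1)
cf <- residuals(first_stage)

# Step 3: augment the structural model with the residuals and fit by OLS.
naive     <- lm(y ~ x + p - 1)
augmented <- lm(y ~ x + p + cf - 1)
rbind(naive = coef(naive), augmented = coef(augmented)[c("x", "p")])
```

The last line places the uncorrected and the augmented estimates of the structural coefficients side by side. Standard errors from the augmented `lm` fit are not valid for inference because `cf` is itself estimated, which is why bootstrapping is used.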

Simulation results in Haschka (2025) show that the copula-based IMA estimator may exhibit a slightly larger bias than alternative two-stage approaches when the sample size is very small (e.g., n = 100) and the model includes an intercept. However, the bias decreases rapidly as the sample size increases and becomes negligible for moderate sample sizes (around n >= 600). The estimator appears to be asymptotically unbiased.

Bootstrap inference is performed by resampling the data with replacement. Degenerate bootstrap samples (e.g. singular design matrices or failed model estimation) are discarded and resampled until the requested number of valid bootstrap replications is obtained. The percentage of discarded samples is reported as a warning. Confidence intervals are computed using only successful bootstrap replications.
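The discard-and-resample scheme can be sketched as follows for a plain `lm` fit. The function `boot_lm()`, its `max.attempts` safeguard, and the percentile intervals at the end are hypothetical choices for this illustration. The source does not say which interval type the package computes.

```r
# Sketch of a bootstrap that discards failed fits and keeps resampling
# until num.boots valid replications are collected (hypothetical helper).
boot_lm <- function(formula, data, num.boots = 200, max.attempts = 10 * num.boots) {
  coef_names <- names(coef(lm(formula, data = data)))
  draws <- matrix(NA_real_, nrow = num.boots, ncol = length(coef_names),
                  dimnames = list(NULL, coef_names))
  attempted <- 0L
  kept <- 0L
  while (kept < num.boots && attempted < max.attempts) {
    attempted <- attempted + 1L
    idx <- sample.int(nrow(data), replace = TRUE)   # resample rows with replacement
    est <- tryCatch(coef(lm(formula, data = data[idx, , drop = FALSE])),
                    error = function(e) NULL)
    # Discard failed fits and fits with aliased (NA) coefficients.
    if (is.null(est) || anyNA(est)) next
    kept <- kept + 1L
    draws[kept, ] <- est
  }
  list(draws = draws[seq_len(kept), , drop = FALSE],
       n.attempted = attempted,
       n.failed = attempted - kept)
}

set.seed(7)
d <- data.frame(x = rnorm(100))
d$y <- 2 + 3 * d$x + rnorm(100)
b <- boot_lm(y ~ x, data = d, num.boots = 200)

# Percentile intervals from the successful replications only.
apply(b$draws, 2, quantile, probs = c(0.025, 0.975))
```

Tracking attempted and failed iterations separately mirrors the n.boots.attempted and n.boots.failed components of the returned object.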

Formula interface

The formula argument follows a two-part notation separated by |. The first part specifies the structural model. The second part identifies the continuous endogenous regressors using continuous():

y ~ X + P | continuous(P)                          # typical use: with intercept

y ~ X + P1 + P2 | continuous(P1) + continuous(P2)  # two endogenous regressors

y ~ X + P - 1 | continuous(P)                      # no intercept (in simulation settings)

In typical applied settings the model includes an intercept. The no-intercept specification (-1) is primarily used in simulation studies following the design of Haschka (2025, Section 4.1) and is unlikely to be appropriate for real datasets.
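How R itself represents such a two-part formula can be inspected with base tools. The sketch below only illustrates the structure of the formula object and makes no claim about how the package parses it internally. All object names are hypothetical.

```r
# The | operator binds more tightly than ~, so the right-hand side of the
# formula is a single call to `|` with two arguments.
f <- y ~ X + P1 + P2 | continuous(P1) + continuous(P2)
rhs <- f[[3]]
identical(rhs[[1]], as.name("|"))   # TRUE

structural_part <- rhs[[2]]         # X + P1 + P2
endogenous_part <- rhs[[3]]         # continuous(P1) + continuous(P2)

all.vars(structural_part)           # "X" "P1" "P2"
all.vars(endogenous_part)           # "P1" "P2"

# The intercept is governed by the first part only.
f0 <- y ~ X + P - 1 | continuous(P)
main0 <- eval(call("~", f0[[2]], f0[[3]][[2]]))   # y ~ X + P - 1
attr(terms(main0), "intercept")     # 0
```

Every variable named inside continuous() also appears in the structural part. This matches the examples above, where each endogenous regressor is listed on both sides of |.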

References

Haschka, R. E. (2025). Robustness of copula-correction models in causal analysis: Exploiting between-regressor correlation. IMA Journal of Management Mathematics, 36, 161-180. doi:10.1093/imaman/dpae018

Park, S., & Gupta, S. (2012). Handling endogenous regressors by joint estimation using copulas. Marketing Science, 31(4), 567-586. doi:10.1287/mksc.1120.0718

Becker, J.-M., Proksch, D., & Ringle, C. M. (2022). Revisiting Gaussian copulas to handle endogenous regressors. Journal of the Academy of Marketing Science, 50, 46-66. doi:10.1007/s11747-021-00805-y

Liengaard, B. D., Becker, J.-M., Bennedsen, M., Heiler, P., Taylor, L. N., & Ringle, C. M. (2025). Dealing with regression models' endogeneity by means of an adjusted estimator for the Gaussian copula approach. Journal of the Academy of Marketing Science, 53(1), 279-299. doi:10.1007/s11747-024-01055-4

Qian, Y., Koschmann, A., & Xie, H. (2024). A practical guide to endogeneity correction using copulas. NBER Working Paper No. w32231. doi:10.2139/ssrn.4754776

Examples


#------------------------------------------------------------------------
# Example 1: Single endogenous regressor with continuous
# and normal exogenous regressor (Haschka 2025, Section 4.1 Scenario 1)
# True values: alpha = 1 (P), beta = 1 (X), no intercept
#------------------------------------------------------------------------
data(dataCopIMAContExo)
res <- copulaIMA(
  y ~ X + P - 1 | continuous(P),
  data = dataCopIMAContExo,
  cdf = "adj.ecdf",
  num.boots = 1000
)
summary(res)

# \donttest{
#------------------------------------------------------------------------
# Example 2: Two endogenous regressors with intercept and no exogenous regressor
# True values: mu=10, alpha1 = 1 (P1), alpha2 = 1 (P2)
# Extension of the first example
#------------------------------------------------------------------------
data("dataCopIMAMultiEndo")
res2 <- copulaIMA(
  # Alternative: y ~ P1 + P2 | continuous(P1, P2)
  y ~ P1 + P2 | continuous(P1) + continuous(P2),
  data = dataCopIMAMultiEndo,
  cdf = "adj.ecdf",
  num.boots = 1000
)
summary(res2)

#------------------------------------------------------------------------
# Example 3: Single endogenous regressor with binary exogenous regressor
# (Haschka 2025, Section 4.1 Scenario 2)

# This example shows one of the key features of the IMA method: the exogenous
# regressor does not need to be continuous or normally distributed.
# Here the exogenous regressor X is binary (0 or 1).
# True values: alpha = 1 (P), beta = 1 (X), no intercept.
#------------------------------------------------------------------------
data("dataCopIMABinExo")
res3 <- copulaIMA(
  y ~ X + P - 1 | continuous(P),
  data = dataCopIMABinExo,
  cdf = "adj.ecdf",
  num.boots = 1000
)
summary(res3)
# }
