Our goal is to perform inference for the linear parameter in partially linear models with confounding variables. The standard double machine learning (DML) estimator of the linear parameter has a two-stage least squares interpretation, which can lead to a large variance and overwide confidence intervals. We apply regularization to reduce the variance of the estimator, which produces narrower confidence intervals that remain approximately valid.
The function regsdml estimates the linear parameter \(\beta_0\)
in the partially linear model
$$Y = X^T\beta_0 + g(W) + h(H) + \epsilon_Y$$
of the continuous response \(Y\)
with linear covariates
\(X\), nonlinear covariates \(W\), unobserved confounding
variables \(H\), and the error term \(\epsilon_Y\). An additional
variable \(A\) is required that is not part of the right-hand side
defining \(Y\). The variable \(A\) acts as an instrument after
\(W\) is regressed out of it.
The linear parameter \(\beta_0\) can be estimated with a two-stage least squares (TSLS) approach ("standard" DML) or with regularized approaches (regDML, regsDML). All approaches use double machine learning. The TSLS approach regresses the residual \(Y - E[Y|W]\) on \(X - E[X|W]\) using the instrument \(A - E[A|W]\). The regularized approaches minimize an objective function that equals \(\gamma\) times the objective function of TSLS plus an objective function that partials out \(A - E[A|W]\) from the residual quantity \(Y - E[Y|W] - (X - E[X|W])^T\beta\). The different regularization approaches choose different regularization parameters \(\gamma\). The conditional expectations act as nuisance parameters and are estimated with machine learning algorithms. All approaches use sample splitting and cross-fitting to estimate \(\beta_0\).
regsdml(
a, w, x, y, data = NULL,
DML = c("DML2", "DML1"),
K = 2L,
gamma = exp(seq(-4, 10, length.out = 100)),
aN = NULL,
do_regsDML = TRUE,
do_safety = FALSE,
do_DML = do_regDML || do_regsDML || do_safety,
do_regDML = FALSE,
do_regDML_all_gamma = FALSE,
safety_factor = 0.7,
cond_method = rep("spline", 3),
params = NULL,
level = 0.95,
S = 100L,
parallel = c("no", "multicore", "snow"),
ncpus = 1L,
cl = NULL
)A vector, matrix, or data frame. It acts as an instrument after
regressing out w of it.
Alternatively, if the data is
provided in the data frame data, a is a character vector
whose entries specify the columns of data acting as
"instrument" \(A\).
A vector, matrix, or data frame. Its columns contain observations
of the nonlinear predictors. Alternatively, if the data is
provided in the data frame data, w is a character vector
whose entries specify the columns of data acting as \(W\).
A vector, matrix, or data frame. This is the linear predictor.
Alternatively, if the data is provided in the data frame
data, x is a character vector whose entries specify
the columns of data acting as \(X\).
A vector, matrix, or data frame. This is the response.
Alternatively, if the data is provided in the data frame
data, y is a character vector whose entries specify
the columns of data acting as \(Y\).
An optional data frame. If it is specified, its column names
need to coincide with the character vectors specified in a,
w, x, and y.
Either "DML2" or "DML1" depending on which DML
method should be used. The default is "DML2".
The number of sample splits used for cross-fitting.
A vector specifying the grid of regularization parameters over which to optimize.
The \(N\)th element of a sequence of non-negative
real numbers diverging to \(+ \infty\) as the sample size
\(N\) tends to \(+ \infty\). By default, it equals
max(log(sqrt(N)), 1), where N denotes the sample size.
A boolean that specifies whether the regsDML estimator
is computed. It is set to TRUE by default.
A boolean that specifies whether a safety device is employed.
The safety device chooses the regularization parameter \(\gamma\)
such that the variance of the regularized estimator
is at least (100 * safety_factor)% of the variance of standard DML.
A boolean that specifies whether the standard DML estimator is
computed. It is set to TRUE by default if at least one of
do_regsDML, do_safety, or do_regDML is set to
TRUE.
A boolean that specifies whether the regularized DML
estimator regDML with the regularization parameter equal to a_N
times the \(\gamma\) leading to the lowest mean
squared error is computed. It is set to FALSE by default.
A boolean that specifies whether the regularized
estimators for all values \(\gamma\) of the grid gamma are
returned. It is set to FALSE by default.
The factor of the safety method. It is set to 0.7
by default.
A character vector of length 3 specifying the estimation
methods used to fit the conditional
expectations \(E[A|W]\), \(E[X|W]\), and \(E[Y|W]\).
Its components are from
from "spline", "forest",
"ols", "lasso", "ridge", and "elasticnet",
or it is a list of length 3 with components from "spline",
"forest",
"ols", "lasso", "ridge", and "elasticnet",
and where some components of the list are functions to estimate
the conditional expectations.
These functions have the input arguments
(yy_fit, ww_fit, ww_predict, params = NULL) and output the
conditional expectation of \(E[Y|W]\) estimated with yy_fit
and ww_fit and predicted with ww_predict.
The argument params is described below. The functions
return a matrix where the columns correspond to the component-wise
estimated conditional expectations.
Here, yy symbolically stands for either a,
x, or y.
Please see below for the default arguments
of the "spline", "forest", "ols", "lasso",
"ridge", and "elasticnet" methods.
An optional list of length 3. All 3 elements of this list are lists themselves. These lists specify additional input arguments for estimating the conditional expectations \(E[A|W]\), \(E[X|W]\), and \(E[Y|W]\), respectively.
Level for computing
confidence intervals for testing the two-sided component-wise
null hypotheses that test if a component equals zero
with the (approximate) asymptotic Gaussian distribution. The default is
0.95.
Number of replications to correct for the random
splitting of the sample. It is set to 100L by default.
One out of "no", "multicore", or "snow"
specifying the parallelization method used to compute the S
replications. The default is "no".
An integer specifying the number of cores used if
parallel is not set to "no".
An optional parallel or snow cluster if parallel = "snow".
The argument ncpus does not have to be specified if the
argument cl
is specified.
A list containing some of the lists
regsDML_statistics,
regDML_safety_statistics,
DML_statistics, regDML_statistics, and
regDML_all_gamma_statistics is returned.
The individual sublists contain the following arguments supplemented
by an additional suffix specifying the method they correspond to.
betaEstimator of the linear coefficient \(\beta_0\).
sdStandard error estimates of the respective entries
of beta.
varVariance-covariance matrix of beta.
pvalp-values for the respective entries of beta.
CITwo-sided confidence intervals
for \(\beta_0\) where the \(j\)th row of CI
corresponds to the two-sided testing of \(H_0: (\beta_0)_j=0\)
at level level. They are computed with the (approximate) asymptotic
Gaussian distribution of the coefficient estimates.
The list regsDML_statistics contains the following additional entries:
message_regsDMLSpecifies if regsDML selects the regularized estimator or DML.
gamma_aNChosen optimal regularization parameter if regsDML equals the regularized estimator. This entry is not present if DML is selected.
If the safety device is applicable, the list regDML_safety_statistics contains the following additional entries:
message_safetySpecifies whether the safety device was applicable.
gamma_safetyChosen regularization parameter of the safety device.
If the safety device is not applicable, the list regDML_safety_statistics contains message_safety as its only entry.
The list regDML_statistics contains the following additional entry:
gamma_optChosen optimal regularization parameter.
The list regDML_all_gamma_statistics is a list of the same length as the grid gamma, where each individual list is of the structure just described.
The estimator of \(\beta_0\) is computed using sample splitting and
cross-fitting.
Irrespective of which methods are performed,
the data is split into K sets that are equally large
if possible. For each such set, the nuisance parameters
(that is, the conditional expectations \(E[A|W]\), \(E[X|W]\),
and \(E[Y|W]\)) are estimated on its complement and evaluated on the
set itself. If DML = "DML1", then K individual
estimators are computed for each
of the K data sets and are then averaged. If DML = "DML2",
the nuisance parameter matrices are first assembled before the estimator
of \(\beta_0\) is computed. This enhances stability of the coefficient
estimator compared to "DML1". If K = 1, no sample splitting
is performed. In this case, the nuisance parameters are estimated and
predicted on the full sample.
The whole estimation procedure can be repeated S times to
account for the randomness introduced by the random sample splits.
The S repetitions can be run in parallel by specifying the
arguments parallel and ncpus.
The S estimators of \(\beta_0\) are aggregated by taking the
median of them. The S variance-covariance matrices are aggregated
by first adding a correction term to them that accounts for the random
splitting and by afterwards taking the median of the corrected
variance-covariance matrices. If \(d > 1\), it can happen that this
final matrix is not positive definite anymore, in which case the mean
is considered instead.
If the design in at least 0.5 * S of the S repetitions is
singular, an error message is displayed.
If the designs in some but less than 0.5 * S of the S
repetitions are singular, another S repetitions are performed.
If, in total, at least S repetitions result in a nonsingular design,
the results are returned together with a warning message.
The regularized estimators and their associated mean squared errors
(MSEs) are computed for the regularization parameters \(\gamma\) of
the grid gamma. These estimators are returned if the argument
do_regDML_all_gamma is set to TRUE.
The \(\gamma\)-value whose corresponding regularized estimator
from the do_regDML_all_gamma method achieves
the smallest MSE
is multiplied by aN, leading to \(\gamma'\).
The do_regDML_all_gamma estimator with regularization parameter
\(\gamma'\) is called regDML.
The regsDML estimator equals regDML or DML depending on whose variance
is smaller.
If \(\beta_0\) is of larger dimension than 1, the MSE computations and
the variance comparison step are performed with the sum of the diagonal
entries of the respective variance-covariance matrices.
If do_safety = TRUE, a \(\gamma\) value is chosen such that the
regularized estimator among do_regDML_all_gamma
with this value of \(\gamma\) has a variance
that is just not smaller than safety_factor times the variance of
DML.
If \(\beta_0\) is of larger dimension than 1, the sum of the diagonal
entries of the respective variance-covariance matrices is taken as
a measure of variance.
If the regularization scheme leads to considerable variance
reductions, it is possible that this safety device cannot be applied.
In this case, a respective message is returned.
The default options of the "spline", "forest",
"ols", "lasso", "ridge", and "elasticnet"
methods are as follows. With the "spline" method,
the function bs from the package splines is employed
with degree = 3 and df = ceiling(N ^ (1 / 5)) + 2 if
N satisfies (df + 1) * v + 1 > N, where v denotes
the number of columns of w and N denotes the sample size.
Otherwise, df is consecutively
reduced by 1 until this condition is satisfied.
The splines are fitted and predicted on different data sets.
If they are extrapolated, a warning message is displayed.
With the "forest" method, the function randomForest from
the package randomForest is employed with nodesize = 5,
ntree = 500, na.action = na.omit, and replace = TRUE.
With the "ols" method, the default arguments are used and no
additional arguments are specified.
With the "lasso" and "ridge" methods,
the function cv.glmnet from the package glmnet performs
10-fold cross validation by default (argument nfolds)
to find the one-standard-error-rule \(\lambda\)-parameter.
With the "elasticnet" method, the function cv.glmnet from
the package glmnet performs 10-fold cross validation
(argument nfolds) with alpha = 0.5 by default
to find the one-standard-error-rule \(\lambda\)-parameter.
All default values of the mentioned parameters can be adapted by
specifying the argument params.
There are three possibilities to set the argument parallel, namely
"no" for serial evaluation (default),
"multicore" for parallel evaluation using forking,
and "snow" for parallel evaluation using a parallel
socket cluster. It is recommended to select RNGkind
("L'Ecuyer-CMRG") and to set a seed to ensure that the parallel
computing of the package dmlalg is reproducible.
This ensures that each processor receives a different substream of the
pseudo random number generator stream.
Thus, the results reproducible if the arguments remain unchanged.
There is an optional argument cl to specify a custom cluster
if parallel = "snow".
The response y needs to be continuous.
The covariate w may contain factor variables in its columns.
If the variables a and x contain factor variables,
the factors should not be included as factor columns of a or
x.
Instead, dummy encoding should be used for all individual levels of the
factor.
That is, a factor with 4 levels should be encoded with 4 columns where each
column consists of 1 and 0 entries indicating the presence of the
respective level of the factor.
There are summary, confint, coef, vcov,
and print methods available
for objects fitted with regsdml. They are called
summary.regsdml,
confint.regsdml,
coef.regsdml,
vcov.regsdml, and
print.regsdml, respectively.
C. Emmenegger and P. B<U+00FC>hlmann. Regularizing Double Machine Learning in Partially Linear Endogenous Models, 2021. Preprint arXiv:2101.12525.
summary.regsdml,
confint.regsdml,
coef.regsdml,
vcov.regsdml
print.regsdml
# NOT RUN {
## Generate some data:
RNGkind("L'Ecuyer-CMRG")
set.seed(19)
# true linear parameter
beta0 <- 1
n <- 40
# observed confounder
w <- pi * runif(n, -1, 1)
# instrument
a <- 3 * tanh(2 * w) + rnorm(n, 0, 1)
# unobserved confounder
h <- 2 * sin(w) + rnorm(n, 0, 1)
# linear covariate
x <- -1 * abs(a) - h - 2 * tanh(w) + rnorm(n, 0, 1)
# response
y <- beta0 * x - 3 * cos(pi * 0.25 * h) + 0.5 * w ^ 2 + rnorm(n, 0, 1)
## Estimate the linear coefficient from x to y
## (The parameters are chosen small enough to make estimation fast):
## Caveat: A spline estimator is extrapolated, which raises a warning message.
## Extrapolation lies in the nature of our method. To omit the warning message
## resulting from the spline estimator, another estimator may be used.
fit <- regsdml(a, w, x, y,
gamma = exp(seq(-4, 1, length.out = 4)),
S = 3,
do_regDML_all_gamma = TRUE,
cond_method = c("forest", # for E[A|W]
"spline", # for E[X|W]
"spline"), # for E[Y|W]
params = list(list(ntree = 1), NULL, NULL))
## parm = c(2, 3) prints an additional summary for the 2nd and 3rd gamma-values
summary(fit, parm = c(2, 3),
correlation = TRUE,
print_gamma = TRUE)
confint(fit, parm = c(2, 3),
print_gamma = TRUE)
coef(fit) # coefficients
vcov(fit) # variance-covariance matrices
## Alternatively, provide the data in a single data frame
## (see also caveat above):
data <- data.frame(a = a, w = w, x = x, y = y)
fit <- regsdml(a = "a", w = "w", x = "x", y = "y", data = data,
gamma = exp(seq(-4, 1, length.out = 4)),
S = 3)
## With more realistic parameter choices:
if (FALSE) {
fit <- regsdml(a, w, x, y,
cond_method = c("forest", # for E[A|W]
"spline", # for E[X|W]
"spline")) # for E[Y|W]
summary(fit)
confint(fit)
## Alternatively, provide the data in a single data frame:
## (see also caveat above):
data <- data.frame(a = a, w = w, x = x, y = y)
fit <- regsdml(a = "a", w = "w", x = "x", y = "y", data = data)
}
# }
Run the code above in your browser using DataLab