Our goal is to perform inference for the linear parameter in partially linear mixed-effects models (PLMMs) with repeated measurements using double machine learning (DML).
The function mmdml estimates the linear parameter \(\beta_0\)
in the PLMM
$$Y_i = X_i\beta_0 + g(W_i) + Z_ib_i + \epsilon_{Y_i},
(i = 1, \ldots, N)$$
for the continuous response \(Y_i\)
with linear covariates
\(X_i\), nonlinear covariates \(W_i\), unobserved random effects
\(b_i\), and the error term \(\epsilon_{Y_i}\).
For each \(i\), there are \(n_i\) repeated observations available.
That is, \(Y_i\) is an \(n_i\)-dimensional vector.
The matrix \(Z_i\) is fixed. The random effects \(b_i\) and the
error terms \(\epsilon_{Y_i}\) are Gaussian distributed, independent,
and independent of \(b_j\) and \(\epsilon_{Y_j}\), respectively,
for \(i\neq j\). The linear and nonlineare covariates \(X_i\) and
\(W_i\) are random and independent of all random effects and error terms.
The linear parameter \(\beta_0\) can be estimated with a linear mixed-effects modeling approach with maximum likelihood after regressing out \(W_i\) nonparametrically from \(Y_i\) and \(X_i\) using machine learning algorithms. A linear mixed-effects model is estimated on the partialled-out data $$Y_i - E[Y_i|W_i] = (X_i - E[X_i|W_i])\beta_0 + Z_ib_i + \epsilon_{Y_i}.$$ The conditional expectations are estimated with machine learning algorithms and sample splitting, and cross-fitting is used to regain full efficiency of the estimator of \(beta_0\). This estimator is asymptotically Gaussian distributed and efficient.
mmdml(
w, x, y, z, data = NULL,
z_formula = NULL, group = "group",
K = 2L, S = 100L,
cond_method = rep("forest", 2),
params = NULL,
parallel = c("no", "multicore", "snow"),
ncpus = 1L, cl = NULL,
nr_random_eff = if (S > 5) 1L else S,
nr_res = nr_random_eff
)A vector, matrix, or data frame. Its columns contain observations
of the nonlinear predictors. Alternatively, if the data is
provided in the data frame data, w is a character vector
whose entries specify the columns of data acting as \(W\).
A vector, matrix, or data frame. This is the linear predictor.
Alternatively, if the data is provided in the data frame
data, x is a character vector whose entries specify
the columns of data acting as \(X\).
A vector, matrix, or data frame. This is the response.
Alternatively, if the data is provided in the data frame
data, y is a character vector whose entries specify
the columns of data acting as \(Y\).
A vector, matrix, or data frame. It acts as the fixed matrix
assigning the random effects.
Alternatively, if the data is
provided in the data frame data, z is a character vector
whose entries specify the columns of data acting as \(Z\).
A string specifying the structure of the random effect
using the notation as in lmer
from package lme4, e.g.,
(1|id) + (1|cask:id).
A string containing the name of the grouping variable in
zz.
An optional data frame. If it is specified, its column names
need to coincide with the character vectors specified in a,
w, x, and y.
The number of sample splits used for cross-fitting.
Number of replications to correct for the random
splitting of the sample. It is set to 100L by default.
A character vector of length 2 specifying the estimation
methods used to fit the conditional
expectations \(E[X|W]\) and \(E[Y|W]\).
Its components are from
from "spline", "forest",
"ols", "lasso", "ridge", and "elasticnet",
or it is a list of length 2 with components from "spline",
"forest",
"ols", "lasso", "ridge", and "elasticnet",
and where some components of the list are functions to estimate
the conditional expectations.
These functions have the input arguments
(yy_fit, ww_fit, ww_predict, params = NULL) and output the
conditional expectation of \(E[Y|W]\) estimated with yy_fit
and ww_fit and predicted with ww_predict.
The argument params is described below. The functions
return a matrix where the columns correspond to the component-wise
estimated conditional expectations.
Here, yy symbolically stands for either
x or y.
Please see below for the default arguments
of the "spline", "forest", "ols", "lasso",
"ridge", and "elasticnet" methods.
An optional list of length 2. The 2 elements of this list are lists themselves. These lists specify additional input arguments for estimating the conditional expectations \(E[X|W]\) and \(E[Y|W]\), respectively.
One out of "no", "multicore", or "snow"
specifying the parallelization method used to compute the S
replications. The default is "no".
An integer specifying the number of cores used if
parallel is not set to "no".
An optional parallel or snow cluster if parallel = "snow".
The argument ncpus does not have to be specified if the
argument cl
is specified.
An integer specifying the number of unaggregated sets
of random effect estimates among the S repetitions that should be
returned.
An integer specifying the number of unaggregated sets
of residual estimates among the S repetitions that should be
returned.
A list similar to the output of lmer
from package lme4 containing
the following entries.
betaEstimator of the linear coefficient \(\beta_0\).
vcovVariance-covariance matrix of beta.
Also see lmer. The S individual variance-covariance
matrices are aggregated by first adding a correction term to them
correcting for the randomness of the sample splits and by subsequently
taking the median of the corrected variance-covariance matrices.
sigmaPlease see lmer for its meaning.
It is computed by averaging over the K sample splits and by
aggregating the S repetitions using the median.
thetaPlease see lmer for its meaning.
It is computed by averaging over the K sample splits and by
aggregating the S repetitions using the median.
varcorVariance correlation components computed with
theta. Please also see lmer.
random_effConditional estimates of the random effects
similarly to lmer.
The individual sets of S random effects estimates are aggregated
using the mean.
random_eff_allThe first nr_random_eff sets of the
S sets of random effects estimates.
residualsThe first nr_res sets of the
S sets of residuals.
Each set of residuals is computed with parameters and data that is
aggregated over the K sample splits.
The other elements ngrps, nobs, fitMsgs, cnms, nc, nms, useSc, optinfo, and methTitle are as in lmer. The gradient and Hessian information of optinfo is computed by aggregating the respective information over the S repetitions with the median.
The estimator of \(\beta_0\) is computed using sample splitting and
cross-fitting.
The subject-specific data (over \(i = 1, \ldots, N\)) is split
into K sets that are equally large
if possible. For each such set, the nuisance parameters
(that is, the conditional expectations \(E[Y_i|W_i]\) and
\(E[X_i|W_i]\)) are estimated on its complement and evaluated on the
set itself.
Estimators of \(\beta_0\) and the variance parameters are computed
for each
of the K data sets and are then averaged.
If K = 1, no sample splitting
is performed. In this case, the nuisance parameters are estimated and
predicted on the full sample.
The whole estimation procedure is repeated S times to
account for the randomness introduced by the random sample splits.
The S repetitions can be run in parallel by specifying the
arguments parallel and ncpus.
The S estimators of \(\beta_0\) and the variance components
are aggregated by taking the
median of them. The S variance-covariance matrices of the estimator
of \(\beta_0\) are aggregated
by first adding a correction term to them that accounts for the random
splitting and by afterwards taking the median of the corrected
variance-covariance matrices. If \(d > 1\), it can happen that this
final matrix is not positive definite anymore, in which case the mean
is considered instead.
Estimates of the conditional random effects and the residuals are computed
in each of the S repetitions. A total number of nr_random_eff
and nr_res of them, respectively, is returned.
Additionally, the random effects estimates from all S repetitions
are aggregated using the mean and returned.
If the design in at least 0.5 * S of the S repetitions is
singular, an error message is displayed.
If the designs in some but less than 0.5 * S of the S
repetitions are singular, another S repetitions are performed.
If, in total, at least S repetitions result in a nonsingular design,
the results are returned together with a warning message.
The default options of the "spline", "forest",
"ols", "lasso", "ridge", and "elasticnet"
methods are as follows. With the "spline" method,
the function bs from the package splines is employed
with degree = 3 and df = ceiling(N ^ (1 / 5)) + 2 if
N satisfies (df + 1) * v + 1 > N, where v denotes
the number of columns of w and N denotes the sample size.
Otherwise, df is consecutively
reduced by 1 until this condition is satisfied.
The splines are fitted and predicted on different data sets.
If they are extrapolated, a warning message is displayed.
With the "forest" method, the function randomForest from
the package randomForest is employed with nodesize = 5,
ntree = 500, na.action = na.omit, and replace = TRUE.
With the "ols" method, the default arguments are used and no
additional arguments are specified.
With the "lasso" and "ridge" methods,
the function cv.glmnet from the package glmnet performs
10-fold cross validation by default (argument nfolds)
to find the one-standard-error-rule \(\lambda\)-parameter.
With the "elasticnet" method, the function cv.glmnet from
the package glmnet performs 10-fold cross validation
(argument nfolds) with alpha = 0.5 by default
to find the one-standard-error-rule \(\lambda\)-parameter.
All default values of the mentioned parameters can be adapted by
specifying the argument params.
There are three possibilities to set the argument parallel, namely
"no" for serial evaluation (default),
"multicore" for parallel evaluation using forking,
and "snow" for parallel evaluation using a parallel
socket cluster. It is recommended to select RNGkind
("L'Ecuyer-CMRG") and to set a seed to ensure that the parallel
computing of the package dmlalg is reproducible.
This ensures that each processor receives a different substream of the
pseudo random number generator stream.
Thus, the results reproducible if the arguments remain unchanged.
There is an optional argument cl to specify a custom cluster
if parallel = "snow".
The response y needs to be continuous.
The covariate w may contain factor variables in its columns.
If the variable x contains factor variables,
the factors should not be included as factor columns of x.
Instead, dummy encoding should be used for all individual levels of the
factor.
That is, a factor with 4 levels should be encoded with 4 columns where each
column consists of 1 and 0 entries indicating the presence of the
respective level of the factor.
There are confint, fixef, print, ranef,
residuals, sigma, summary, vcov,
and VarCorr methods available
for objects fitted with mmdml. They are called
confint.mmdml,
fixef.mmdml,
print.mmdml,
ranef.mmdml,
residuals.mmdml,
sigma.mmdml,
summary.mmdml,
vcov.mmdml, and
VarCorr.mmdml, respectively.
C. Emmenegger and P. B<U+00FC>hlmann. Double Machine Learning for Partially Linear Mixed-Effects Models with Repeated Measurements. Preprint arXiv:2108.13657.
confint,
fixef,
print,
ranef,
residuals,
sigma,
summary,
vcov,
VarCorr
# NOT RUN {
## generate data
RNGkind("L'Ecuyer-CMRG")
set.seed(19)
data1 <- example_data_mmdml(beta0 = 0.2)
data2 <- example_data_mmdml(beta0 = c(0.2, 0.2))
## fit models
## Caveat: Warning messages are displayed because the small number of
## observations results in a singular random effects model
fit1 <-
mmdml(w = c("w1", "w2", "w3"), x = "x1", y = "resp", z = c("id", "cask"),
data = data1, z_formula = "(1|id) + (1|cask:id)", group = "id", S = 3)
fit2 <-
mmdml(w = c("w1", "w2", "w3"), x = c("x1", "x2"), y = "resp", z = c("id", "cask"),
data = data2, z_formula = "(1|id) + (1|cask:id)", group = "id", S = 3)
## apply methods
confint(fit2)
fixef(fit2)
print(fit2)
ranef(fit2)
residuals(fit2)
sigma(fit2)
summary(fit2)
vcov(fit2)
VarCorr(fit2)
# }
Run the code above in your browser using DataLab