mmdml: Estimating linear coefficients in partially linear mixed-effects models with repeated measurements using double machine learning.

Description

Our goal is to perform inference for the linear parameter in partially linear mixed-effects models (PLMMs) with repeated measurements using double machine learning (DML).

The function mmdml estimates the linear parameter $\beta_0$ in the PLMM $$Y_i = X_i\beta_0 + g(W_i) + Z_ib_i + \epsilon_{Y_i}, (i = 1, \ldots, N)$$ for the continuous response $Y_i$ with linear covariates $X_i$, nonlinear covariates $W_i$, unobserved random effects $b_i$, and the error term $\epsilon_{Y_i}$. For each $i$, there are $n_i$ repeated observations available. That is, $Y_i$ is an $n_i$-dimensional vector. The matrix $Z_i$ is fixed. The random effects $b_i$ and the error terms $\epsilon_{Y_i}$ are Gaussian distributed, independent, and independent of $b_j$ and $\epsilon_{Y_j}$, respectively, for $i\neq j$. The linear and nonlineare covariates $X_i$ and $W_i$ are random and independent of all random effects and error terms.

The linear parameter $\beta_0$ can be estimated with a linear mixed-effects modeling approach with maximum likelihood after regressing out $W_i$ nonparametrically from $Y_i$ and $X_i$ using machine learning algorithms. A linear mixed-effects model is estimated on the partialled-out data $$Y_i - E[Y_i|W_i] = (X_i - E[X_i|W_i])\beta_0 + Z_ib_i + \epsilon_{Y_i}.$$ The conditional expectations are estimated with machine learning algorithms and sample splitting, and cross-fitting is used to regain full efficiency of the estimator of $beta_0$. This estimator is asymptotically Gaussian distributed and efficient.

Usage

mmdml(
  w, x, y, z, data = NULL,
  z_formula = NULL, group = "group",
  K = 2L, S = 100L,
  cond_method = rep("forest", 2),
  params = NULL,
  parallel = c("no", "multicore", "snow"),
  ncpus = 1L, cl = NULL,
  nr_random_eff = if (S > 5) 1L else S,
  nr_res = nr_random_eff
)

Arguments

A vector, matrix, or data frame. Its columns contain observations of the nonlinear predictors. Alternatively, if the data is provided in the data frame data, w is a character vector whose entries specify the columns of data acting as $W$.

A vector, matrix, or data frame. This is the linear predictor. Alternatively, if the data is provided in the data frame data, x is a character vector whose entries specify the columns of data acting as $X$.

A vector, matrix, or data frame. This is the response. Alternatively, if the data is provided in the data frame data, y is a character vector whose entries specify the columns of data acting as $Y$.

A vector, matrix, or data frame. It acts as the fixed matrix assigning the random effects. Alternatively, if the data is provided in the data frame data, z is a character vector whose entries specify the columns of data acting as $Z$.

z_formula

A string specifying the structure of the random effect using the notation as in lmer from package lme4, e.g., (1|id) + (1|cask:id).

group

A string containing the name of the grouping variable in zz.

data

An optional data frame. If it is specified, its column names need to coincide with the character vectors specified in a, w, x, and y.

The number of sample splits used for cross-fitting.

Number of replications to correct for the random splitting of the sample. It is set to 100L by default.

cond_method

A character vector of length 2 specifying the estimation methods used to fit the conditional expectations $E[X|W]$ and $E[Y|W]$. Its components are from from "spline", "forest", "ols", "lasso", "ridge", and "elasticnet", or it is a list of length 2 with components from "spline", "forest", "ols", "lasso", "ridge", and "elasticnet", and where some components of the list are functions to estimate the conditional expectations. These functions have the input arguments (yy_fit, ww_fit, ww_predict, params = NULL) and output the conditional expectation of $E[Y|W]$ estimated with yy_fit and ww_fit and predicted with ww_predict. The argument params is described below. The functions return a matrix where the columns correspond to the component-wise estimated conditional expectations. Here, yy symbolically stands for either x or y. Please see below for the default arguments of the "spline", "forest", "ols", "lasso", "ridge", and "elasticnet" methods.

params

An optional list of length 2. The 2 elements of this list are lists themselves. These lists specify additional input arguments for estimating the conditional expectations $E[X|W]$ and $E[Y|W]$, respectively.

parallel

One out of "no", "multicore", or "snow" specifying the parallelization method used to compute the S replications. The default is "no".

ncpus

An integer specifying the number of cores used if parallel is not set to "no".

An optional parallel or snow cluster if parallel = "snow". The argument ncpus does not have to be specified if the argument cl is specified.

nr_random_eff

An integer specifying the number of unaggregated sets of random effect estimates among the S repetitions that should be returned.

nr_res

An integer specifying the number of unaggregated sets of residual estimates among the S repetitions that should be returned.

Value

A list similar to the output of lmer from package lme4 containing the following entries.

beta

Estimator of the linear coefficient $\beta_0$.

vcov

Variance-covariance matrix of beta. Also see lmer. The S individual variance-covariance matrices are aggregated by first adding a correction term to them correcting for the randomness of the sample splits and by subsequently taking the median of the corrected variance-covariance matrices.

sigma

Please see lmer for its meaning. It is computed by averaging over the K sample splits and by aggregating the S repetitions using the median.

theta

Please see lmer for its meaning. It is computed by averaging over the K sample splits and by aggregating the S repetitions using the median.

varcor

Variance correlation components computed with theta. Please also see lmer.

random_eff

Conditional estimates of the random effects similarly to lmer. The individual sets of S random effects estimates are aggregated using the mean.

random_eff_all

The first nr_random_eff sets of the S sets of random effects estimates.

residuals

The first nr_res sets of the S sets of residuals. Each set of residuals is computed with parameters and data that is aggregated over the K sample splits.

The other elements ngrps, nobs, fitMsgs, cnms, nc, nms, useSc, optinfo, and methTitle are as in lmer. The gradient and Hessian information of optinfo is computed by aggregating the respective information over the S repetitions with the median.

Details

The estimator of $\beta_0$ is computed using sample splitting and cross-fitting. The subject-specific data (over $i = 1, \ldots, N$) is split into K sets that are equally large if possible. For each such set, the nuisance parameters (that is, the conditional expectations $E[Y_i|W_i]$ and $E[X_i|W_i]$) are estimated on its complement and evaluated on the set itself. Estimators of $\beta_0$ and the variance parameters are computed for each of the K data sets and are then averaged. If K = 1, no sample splitting is performed. In this case, the nuisance parameters are estimated and predicted on the full sample.

The whole estimation procedure is repeated S times to account for the randomness introduced by the random sample splits. The S repetitions can be run in parallel by specifying the arguments parallel and ncpus. The S estimators of $\beta_0$ and the variance components are aggregated by taking the median of them. The S variance-covariance matrices of the estimator of $\beta_0$ are aggregated by first adding a correction term to them that accounts for the random splitting and by afterwards taking the median of the corrected variance-covariance matrices. If $d > 1$, it can happen that this final matrix is not positive definite anymore, in which case the mean is considered instead. Estimates of the conditional random effects and the residuals are computed in each of the S repetitions. A total number of nr_random_eff and nr_res of them, respectively, is returned. Additionally, the random effects estimates from all S repetitions are aggregated using the mean and returned.

If the design in at least 0.5 * S of the S repetitions is singular, an error message is displayed. If the designs in some but less than 0.5 * S of the S repetitions are singular, another S repetitions are performed. If, in total, at least S repetitions result in a nonsingular design, the results are returned together with a warning message.

The default options of the "spline", "forest", "ols", "lasso", "ridge", and "elasticnet" methods are as follows. With the "spline" method, the function bs from the package splines is employed with degree = 3 and df = ceiling(N ^ (1 / 5)) + 2 if N satisfies (df + 1) * v + 1 > N, where v denotes the number of columns of w and N denotes the sample size. Otherwise, df is consecutively reduced by 1 until this condition is satisfied. The splines are fitted and predicted on different data sets. If they are extrapolated, a warning message is displayed. With the "forest" method, the function randomForest from the package randomForest is employed with nodesize = 5, ntree = 500, na.action = na.omit, and replace = TRUE. With the "ols" method, the default arguments are used and no additional arguments are specified. With the "lasso" and "ridge" methods, the function cv.glmnet from the package glmnet performs 10-fold cross validation by default (argument nfolds) to find the one-standard-error-rule $\lambda$-parameter. With the "elasticnet" method, the function cv.glmnet from the package glmnet performs 10-fold cross validation (argument nfolds) with alpha = 0.5 by default to find the one-standard-error-rule $\lambda$-parameter. All default values of the mentioned parameters can be adapted by specifying the argument params.

There are three possibilities to set the argument parallel, namely "no" for serial evaluation (default), "multicore" for parallel evaluation using forking, and "snow" for parallel evaluation using a parallel socket cluster. It is recommended to select RNGkind ("L'Ecuyer-CMRG") and to set a seed to ensure that the parallel computing of the package dmlalg is reproducible. This ensures that each processor receives a different substream of the pseudo random number generator stream. Thus, the results reproducible if the arguments remain unchanged. There is an optional argument cl to specify a custom cluster if parallel = "snow".

The response y needs to be continuous. The covariate w may contain factor variables in its columns. If the variable x contains factor variables, the factors should not be included as factor columns of x. Instead, dummy encoding should be used for all individual levels of the factor. That is, a factor with 4 levels should be encoded with 4 columns where each column consists of 1 and 0 entries indicating the presence of the respective level of the factor.

There are confint, fixef, print, ranef, residuals, sigma, summary, vcov, and VarCorr methods available for objects fitted with mmdml. They are called confint.mmdml, fixef.mmdml, print.mmdml, ranef.mmdml, residuals.mmdml, sigma.mmdml, summary.mmdml, vcov.mmdml, and VarCorr.mmdml, respectively.

References

C. Emmenegger and P. B<U+00FC>hlmann. Double Machine Learning for Partially Linear Mixed-Effects Models with Repeated Measurements. Preprint arXiv:2108.13657.

Examples

Run this code

# NOT RUN {
## generate data
RNGkind("L'Ecuyer-CMRG")
set.seed(19)
data1 <- example_data_mmdml(beta0 = 0.2)
data2 <- example_data_mmdml(beta0 = c(0.2, 0.2))

## fit models
## Caveat: Warning messages are displayed because the small number of
## observations results in a singular random effects model
fit1 <-
  mmdml(w = c("w1", "w2", "w3"), x = "x1", y = "resp", z = c("id", "cask"),
        data = data1, z_formula = "(1|id) + (1|cask:id)", group = "id", S = 3)

fit2 <-
  mmdml(w = c("w1", "w2", "w3"), x = c("x1", "x2"), y = "resp", z = c("id", "cask"),
        data = data2, z_formula = "(1|id) + (1|cask:id)", group = "id", S = 3)

## apply methods
confint(fit2)
fixef(fit2)
print(fit2)
ranef(fit2)
residuals(fit2)
sigma(fit2)
summary(fit2)
vcov(fit2)
VarCorr(fit2)
# }

Run the code above in your browser using DataLab