tmle.npvi: Targeted Minimum Loss Estimation of NPVI

Description

Carries out the targeted minimum loss estimation (TMLE) of a non-parametric variable importance measure of a continuous exposure.

Usage

tmle.npvi(obs, f = identity, nMax = 30L, flavor = c("learning",  "superLearning"), lib = list(), nodes = 1L, cvControl = NULL,  family = c("parsimonious", "gaussian"), cleverCovTheta = FALSE,  bound = 1, B = 1e+05, trueGMu = NULL, iter = 5L, stoppingCriteria = list(mic = 0.01,  div = 0.01, psi = 0.1), gmin = 0.05, gmax = 0.95, mumin = quantile(f(obs[obs[,  "X"] != 0, "X"]), type = 1, probs = 0.01), mumax = quantile(f(obs[obs[,  "X"] != 0, "X"]), type = 1, probs = 0.99), verbose = FALSE,  tabulate = TRUE, exact = TRUE, light = TRUE)

Arguments

obs

A n x p matrix or data frame of observations, with $p \ge 3$.

Column "X" corresponds to the continuous exposure variable (e.g. DNA copy number), or "cause" in a causal model, with a reference value $x_0$ equal to 0.
Column "Y" corresponds to the outcome variable (e.g. gene expression level), or "effect" in a causal model.
All other columns are interpreted as baseline covariates $W$ taken into account in the definition of the "effect" of $X$ on $Y$.

A function involved in the definition of the parameter of interest $\psi$, which must satisfy $f(0)=0$ (see Details). Defaults to identity.

nMax

An integer (defaults to 30L; 10L is the smallest authorized value and we recommend a value less than 50L for reasonable computational time) indicating the maximum number of observed values of $X\neq 0$ which are used to create the supports of the conditional distributions of $X$ given $W$ and $X\neq0$ involved in the simulation under $P_n^k$ when family is set to "parsimonious".

flavor

Indicates whether the construction of the relevant features of $P_n^0$ and $P_n^k$, the (non-targeted yet) initial and (targeted) successive updated estimators of the true distribution of $(W,X,Y)$ relies on the Super Learning methodology (option "superLearning") or not (option "learning", default value). In the former case, the SuperLearner package is loaded.

lib

A list providing the function tmle.npvi with the necessary algorithms involved in the estimation of the features of interest. If empty (default) then the default algorithms are used. See Details.

nodes

An integer, which indicates how many nodes should be involved in the computation of the TMLE when it relies on the "superLearning" flavor. Defaults to 1. If larger than 1, then the parallel package is loaded and a cluster with nodes nodes is created and exploited.

cvControl

'NULL' (default value) or an integer indicating how many folds are involved in the Super Learning procedure. If flavor and cvControl are simultaneously set to "superLearning" and NULL then the Super Learning procedure relies on 10-fold cross-validation.

family

Indicates whether the simulation of the conditional distribution of $X$ given $W$ under $P_n^k$ (the initial estimator if $k=0$ or its $k$th update if $k \ge 1$) should be based on a weighted version of the empirical measure (case "parsimonious", default value and faster execution) or on a Gaussian model (case "gaussian").

cleverCovTheta

A logical, indicating whether the one-step (if FALSE, default value) or the two-step (if TRUE) updating procedure should be carried out.

bound

A positive numeric (defaults to 1), upper-bound on the absolute value of the fluctuation parameter.

An integer (defaults to 1e5) indicating the sample size of the data set simulated under each $P_n^k$ to compute an approximated value of $\Psi(P_n^k)$), the parameter of interest at $P_n^k$. The larger B, the more accurate the approximation.

trueGMu

Either NULL (default value) if the true conditional probability $g(W)=P(X=0|W)$ and conditional expectation $muAux(W)=E_P(X|X\neq 0, W)$ are not both known beforehand. Otherwise, a list with tags g and muAux and respective values the two respective functions (see the related outputs of getSample).

iter

An integer (defaults to 5) indicating a maximal number of updates.

stoppingCriteria

A list providing tuning parameters for the stopping criteria. Defaults to list(mic=0.01, div=0.01, psi=0.1), see Details.

gmin

A positive numeric, lower-bound on the range of values of the estimated probabilities $P_n^k(X=0|W)$ that $X$ be equal to its reference value 0 given $W$. Defaults to 5e-2, and must be smaller than gmax.

gmax

A positive numeric smaller than 1, upper-bound on the range of values of the estimated probabilities $P_n^k(X=0|W)$ that $X$ be equal to its reference value 0 given $W$. Defaults to 95e-2, and must be larger than gmin.

mumin

A numeric, lower-bound on the range of values of the estimated conditional expectation $E_{P_n^k}(X|X\neq 0,W)$ of $X$ given $X\neq 0$ and $W$. Defaults to the first percentile of X, and must be smaller than mumax.

mumax

A numeric, upper-bound on the range of values of the estimated conditional expectation $E_{P_n^k}(X|X\neq 0,W)$ of $X$ given $X\neq 0$ and $W$. Defaults to the 99th percentile of $X$, and must be larger than mumin.

verbose

Prescribes the amount of information output by the function. Defaults to FALSE.

tabulate

A logical, kept for compatibility, which should be set to TRUE (its default value). This requires that the joint distribution of $(X,W)$ under $P_n^k$ be a weighted version of the empirical measure (for a faster execution).

exact

A logical, kept for compatibility, which should be set to TRUE (its default value). This requires that the successive updates of the estimators of the relevant features of the true distribution of $(W,X,Y)$ be derived from the current estimators based on exact formulas rather than on their first-order approximations.

light

A logical, kept for compatibility, which should be set to TRUE (its default value). This requires that the result of each fit be reduced in size (for a faster execution). Currently implemented only for flavor learning.

Value

obs: The matrix of observations used to carry out the TMLE procedure. Use the method getObs to retrieve it.
psi: The TMLE of the parameter of interest. Use the method getPsi to retrieve it.
psi.sd: The estimated standard deviation of the TMLE of the parameter of interest. Use the method getPsiSd to retrieve it.

Details

The parameter of interest is defined as $\psi=\Psi(P)$ with $$\Psi(P) = \frac{E_P[f(X) * (\theta(X,W) - \theta(0,W))]}{E_P[f(X)^2]},$$ with $P$ the distribution of the random vector $(W,X,Y)$, $\theta(X,W) = E_P[Y|X,W]$, $0$ the reference value for $X$, and $f$ a user-supplied function such that $f(0)=0$ (e.g., $f=identity$, the default value).

The TMLE procedure stops when the maximal number of iterations, iter, is reached or when at least one of the following criteria is met:

The empirical mean $P_n effIC(P_n^{k+1})$ of the efficient influence curve at $P_n^{k+1}$ scaled by the estimated standard deviation of the efficient influence curve at $P_n^{k+1}$ is smaller, in absolute value, than mic.
The total variation (TV) distance between P_n^k and P_n^k+1 is smaller than div.
The change between the successive values $Psi(P_n^k)$ and $Psi(P_n^{k+1})$ is smaller than psi.

If lib is an empty list (list(), default value) then the default algorithms for the chosen flavor are loaded (learningLib when flavor is set to "learning" or superLearningLib when flavor is set to "superLearning"). A valid lib argument must mimick the structure of either learningLib or superLearningLib, depending on flavor.

The "superLearning" flavor requires the SuperLearner package and, by default, the e1071, gam, glmnet, polspline and randomForest packages.

If family is set to "parsimonious" (recommended) then the packages sgeostat and geometry are required.

References

Chambaz, A., Neuvial, P., & van der Laan, M. J. (2012). Estimation of a non-parametric variable importance measure of a continuous exposure. Electronic journal of statistics, 6, 1059--1099.

Chambaz, A., Neuvial, P. (2015). tmle.npvi: targeted, integrative search of associations between DNA copy number and gene expression, accounting for DNA methylation. To appear in Bioinformatics Applications Notes.

Examples

Run this code

set.seed(12345)
##
## Simulating a data set and computing the true value of the parameter
##

## Parameters for the simulation (case 'f=identity')
O <- cbind(W=c(0.05218652, 0.01113460),
           X=c(2.722713, 9.362432),
           Y=c(-0.4569579, 1.2470822))
O <- rbind(NA, O)
lambda0 <- function(W) {-W}
p <- c(0, 1/2, 1/2)
omega <- c(0, 3, 3)
S <- matrix(c(10, 1, 1, 0.5), 2 ,2)

## Simulating a data set of 200 i.i.d. observations
sim <- getSample(2e2, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S)
obs <- sim$obs

## Adding (dummy) baseline covariates
V <- matrix(runif(3*nrow(obs)), ncol=3)
colnames(V) <- paste("V", 1:3, sep="")
obs <- cbind(V, obs)

## Caution! MAKING '0' THE REFERENCE VALUE FOR 'X'
X0 <- O[2,2]
obsC <- obs
obsC[, "X"] <- obsC[, "X"] - X0
obs <- obsC

## True psi and confidence intervals (case 'f=identity')      
sim <- getSample(1e4, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S)
truePsi <- sim$psi

confInt0 <- truePsi + c(-1, 1)*qnorm(.975)*sqrt(sim$varIC/nrow(sim$obs))
confInt <- truePsi + c(-1, 1)*qnorm(.975)*sqrt(sim$varIC/nrow(obs))
cat("\nCase f=identity:\n")
msg <- paste("\ttrue psi is: ", signif(truePsi, 3), "\n", sep="")
msg <- paste(msg, "\t95%-confidence interval for the approximation is: ",
             signif(confInt0, 3), "\n", sep="")
msg <- paste(msg, "\toptimal 95%-confidence interval is: ",
             signif(confInt, 3), "\n", sep="")
cat(msg)

##
## TMLE procedure
##

## Running the TMLE procedure
npvi <- tmle.npvi(obs, f=identity, flavor="learning", B=5e4, nMax=10)

## Summarizing its results
npvi
setConfLevel(npvi, 0.9)
npvi

history <- getHistory(npvi)
print(round(history, 4))

hp <- history[, "psi"]
hs <- history[, "sic"]
hs[1] <- NA
ics <-  c(-1,1) %*% t(qnorm(0.975)*hs/sqrt(nrow(getObs(npvi))))

pch <- 20
ylim <- range(c(confInt, hp, ics+hp), na.rm=TRUE)

xs <- (1:length(hs))-1
plot(xs, hp, ylim=ylim, pch=pch, xlab="Iteration", ylab=expression(psi[n]),
     xaxp=c(0, length(hs)-1, length(hs)-1))
dummy <- sapply(seq(along=xs), function(x) lines(c(xs[x],xs[x]), hp[x]+ics[, x]))

abline(h=confInt, col=4)
abline(h=confInt0, col=2)

Run the code above in your browser using DataLab