`tmle.npvi(obs, f = identity, nMax = 30L, flavor = c("learning", "superLearning"), lib = list(), nodes = 1L, cvControl = NULL, family = c("parsimonious", "gaussian"), cleverCovTheta = FALSE, bound = 1, B = 1e+05, trueGMu = NULL, iter = 5L, stoppingCriteria = list(mic = 0.01, div = 0.01, psi = 0.1), gmin = 0.05, gmax = 0.95, mumin = quantile(f(obs[obs[, "X"] != 0, "X"]), type = 1, probs = 0.01), mumax = quantile(f(obs[obs[, "X"] != 0, "X"]), type = 1, probs = 0.99), verbose = FALSE, tabulate = TRUE, exact = TRUE, light = TRUE)`

obs

A

`n x p`

`matrix`

or `data frame`

of observations, with
$p \ge 3$. - Column
`"X"`

corresponds to the continuous exposure variable (e.g. DNA copy number), or "cause" in a causal model, with a reference value $x_0$ equal to 0. - Column
`"Y"`

corresponds to the outcome variable (e.g. gene expression level), or "effect" in a causal model. - All other columns are interpreted as baseline covariates $W$ taken into account in the definition of the "effect" of $X$ on $Y$.

f

A

`function`

involved in the definition of the parameter of interest
$\psi$, which must satisfy $f(0)=0$ (see Details). Defaults to
`identity`

.nMax

An

`integer`

(defaults to `30L`

; `10L`

is the
smallest authorized value and we recommend a value less than
`50L`

for reasonable computational time) indicating the maximum number of
observed values of $X\neq 0$ which are used to create the
supports of the conditional distributions of $X$ given $W$
and $X\neq0$ involved in the simulation under $P_n^k$ when
`family`

is set to "parsimonious".flavor

Indicates whether the construction of the relevant features of $P_n^0$
and $P_n^k$, the (non-targeted yet) initial and (targeted) successive
updated estimators of the true distribution of $(W,X,Y)$ relies on the
Super Learning methodology (option "superLearning") or not (option
"learning", default value). In the former case, the

`SuperLearner`

package is loaded.lib

A

`list`

providing the function `tmle.npvi`

with the necessary
algorithms involved in the estimation of the features of interest. If
empty (default) then the default algorithms are used. See `Details`

.nodes

An

`integer`

, which indicates how many nodes should be involved in
the computation of the TMLE when it relies on the "superLearning"
`flavor`

. Defaults to `1`

. If larger than `1`

, then the
`parallel`

package is loaded and a cluster with `nodes`

nodes is
created and exploited.cvControl

'NULL' (default value) or an

`integer`

indicating how many folds are
involved in the Super Learning procedure. If `flavor`

and
`cvControl`

are simultaneously set to "superLearning" and `NULL`

then the Super Learning procedure relies on 10-fold cross-validation.family

Indicates whether the simulation of the conditional distribution of
$X$ given $W$ under $P_n^k$ (the initial estimator if
$k=0$ or its $k$th update if $k \ge 1$) should be based on a
weighted version of the empirical measure (case "parsimonious", default
value and faster execution) or on a Gaussian model (case "gaussian").

cleverCovTheta

A

`logical`

, indicating whether the one-step (if `FALSE`

,
default value) or the two-step (if `TRUE`

) updating procedure should
be carried out.bound

A positive

`numeric`

(defaults to `1`

), upper-bound on the
absolute value of the fluctuation parameter.B

An

`integer`

(defaults to `1e5`

) indicating the sample size of
the data set simulated under each $P_n^k$ to compute an approximated
value of $\Psi(P_n^k)$), the parameter of interest at $P_n^k$. The
larger `B`

, the more accurate the approximation.trueGMu

Either **true** conditional
probability $g(W)=P(X=0|W)$ and conditional expectation
$muAux(W)=E_P(X|X\neq 0, W)$ are not both known beforehand.
Otherwise, a

`NULL`

(default value) if the `list`

with tags `g`

and `muAux`

and
respective values the two respective `function`

s (see the
related outputs of `getSample`

).iter

An

`integer`

(defaults to `5`

) indicating a maximal number of
updates.stoppingCriteria

A

`list`

providing tuning parameters for the stopping
criteria. Defaults to `list(mic=0.01, div=0.01, psi=0.1)`

, see
`Details`

.gmin

A positive

`numeric`

, lower-bound on the range of values of the
estimated probabilities $P_n^k(X=0|W)$ that $X$ be equal to its
reference value `0`

given $W$. Defaults to `5e-2`

, and must
be smaller than `gmax`

.gmax

A positive

`numeric`

smaller than `1`

, upper-bound on the range
of values of the estimated probabilities $P_n^k(X=0|W)$ that $X$
be equal to its reference value `0`

given $W$. Defaults to
`95e-2`

, and must be larger than `gmin`

.mumin

A

`numeric`

, lower-bound on the range of values of the estimated
conditional expectation $E_{P_n^k}(X|X\neq 0,W)$ of $X$ given
$X\neq 0$ and $W$. Defaults to the first percentile of `X`

,
and must be smaller than `mumax`

.mumax

A

`numeric`

, upper-bound on the range of values of the estimated
conditional expectation $E_{P_n^k}(X|X\neq 0,W)$ of $X$ given
$X\neq 0$ and $W$. Defaults to the 99th percentile of $X$,
and must be larger than `mumin`

.verbose

Prescribes the amount of information output by the function. Defaults to

`FALSE`

.tabulate

A

`logical`

, kept for compatibility, which should be set to
`TRUE`

(its default value). This requires that the joint distribution
of $(X,W)$ under $P_n^k$ be a weighted version of the empirical
measure (for a faster execution).exact

A

`logical`

, kept for compatibility, which should be set to
`TRUE`

(its default value). This requires that the successive
updates of the estimators of the relevant features of the true
distribution of $(W,X,Y)$ be derived from the current estimators based
on exact formulas rather than on their first-order approximations.light

A

`logical`

, kept for compatibility, which should be set to
`TRUE`

(its default value). This requires that the result of each fit
be reduced in size (for a faster execution). Currently implemented only
for flavor `learning`

.-
of the TMLE procedure. The
- obs
- The
`matrix`

of observations used to carry out the TMLE procedure. Use the`method`

`getObs`

to retrieve it. - psi
- The TMLE of the parameter of interest. Use the
`method`

`getPsi`

to retrieve it. - psi.sd
- The estimated standard deviation of the TMLE of the
parameter of interest. Use the
`method`

`getPsiSd`

to retrieve it.

`method`

`getHistory`

outputs the
"history" of the procedure (see `getHistory`

). The object notably
includes the following information:
The TMLE procedure stops when the maximal number of
iterations, `iter`

, is reached or when at least one of
the following criteria is met:

- The empirical mean $P_n effIC(P_n^{k+1})$ of the
efficient influence curve at $P_n^{k+1}$ scaled by the
estimated standard deviation of the efficient influence curve
at $P_n^{k+1}$ is smaller, in absolute value, than
`mic`

. - The total variation (TV) distance between P_n^k and
P_n^k+1 is smaller than
`div`

. - The change between the successive values
$Psi(P_n^k)$ and $Psi(P_n^{k+1})$ is smaller than
`psi`

.

If `lib`

is an empty list (`list()`

, default value) then the
default algorithms for the chosen `flavor`

are loaded
(`learningLib`

when `flavor`

is set to "learning" or
`superLearningLib`

when `flavor`

is set to "superLearning").
A valid `lib`

argument must mimick the structure of either
`learningLib`

or `superLearningLib`

, depending on
`flavor`

.

The "superLearning" `flavor`

requires the `SuperLearner`

package and, by default, the `e1071`

, `gam`

,
`glmnet`

, `polspline`

and `randomForest`

packages.

If `family`

is set to "parsimonious" (recommended) then the
packages `sgeostat`

and `geometry`

are required.

Chambaz, A., Neuvial, P. (2015). tmle.npvi: targeted, integrative search of associations between DNA copy number and gene expression, accounting for DNA methylation. To appear in Bioinformatics Applications Notes.

set.seed(12345) ## ## Simulating a data set and computing the true value of the parameter ## ## Parameters for the simulation (case 'f=identity') O <- cbind(W=c(0.05218652, 0.01113460), X=c(2.722713, 9.362432), Y=c(-0.4569579, 1.2470822)) O <- rbind(NA, O) lambda0 <- function(W) {-W} p <- c(0, 1/2, 1/2) omega <- c(0, 3, 3) S <- matrix(c(10, 1, 1, 0.5), 2 ,2) ## Simulating a data set of 200 i.i.d. observations sim <- getSample(2e2, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S) obs <- sim$obs ## Adding (dummy) baseline covariates V <- matrix(runif(3*nrow(obs)), ncol=3) colnames(V) <- paste("V", 1:3, sep="") obs <- cbind(V, obs) ## Caution! MAKING '0' THE REFERENCE VALUE FOR 'X' X0 <- O[2,2] obsC <- obs obsC[, "X"] <- obsC[, "X"] - X0 obs <- obsC ## True psi and confidence intervals (case 'f=identity') sim <- getSample(1e4, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S) truePsi <- sim$psi confInt0 <- truePsi + c(-1, 1)*qnorm(.975)*sqrt(sim$varIC/nrow(sim$obs)) confInt <- truePsi + c(-1, 1)*qnorm(.975)*sqrt(sim$varIC/nrow(obs)) cat("\nCase f=identity:\n") msg <- paste("\ttrue psi is: ", signif(truePsi, 3), "\n", sep="") msg <- paste(msg, "\t95%-confidence interval for the approximation is: ", signif(confInt0, 3), "\n", sep="") msg <- paste(msg, "\toptimal 95%-confidence interval is: ", signif(confInt, 3), "\n", sep="") cat(msg) ## ## TMLE procedure ## ## Running the TMLE procedure npvi <- tmle.npvi(obs, f=identity, flavor="learning", B=5e4, nMax=10) ## Summarizing its results npvi setConfLevel(npvi, 0.9) npvi history <- getHistory(npvi) print(round(history, 4)) hp <- history[, "psi"] hs <- history[, "sic"] hs[1] <- NA ics <- c(-1,1) %*% t(qnorm(0.975)*hs/sqrt(nrow(getObs(npvi)))) pch <- 20 ylim <- range(c(confInt, hp, ics+hp), na.rm=TRUE) xs <- (1:length(hs))-1 plot(xs, hp, ylim=ylim, pch=pch, xlab="Iteration", ylab=expression(psi[n]), xaxp=c(0, length(hs)-1, length(hs)-1)) dummy <- sapply(seq(along=xs), function(x) lines(c(xs[x],xs[x]), hp[x]+ics[, x])) abline(h=confInt, col=4) abline(h=confInt0, col=2)