tmle.npvi(obs, f = identity, nMax = 30L, flavor = c("learning", "superLearning"), lib = list(), nodes = 1L, cvControl = NULL, family = c("parsimonious", "gaussian"), cleverCovTheta = FALSE, bound = 1, B = 1e+05, trueGMu = NULL, iter = 5L, stoppingCriteria = list(mic = 0.01, div = 0.01, psi = 0.1), gmin = 0.05, gmax = 0.95, mumin = quantile(f(obs[obs[, "X"] != 0, "X"]), type = 1, probs = 0.01), mumax = quantile(f(obs[obs[, "X"] != 0, "X"]), type = 1, probs = 0.99), verbose = FALSE, tabulate = TRUE, exact = TRUE, light = TRUE)
n x p
matrix
or data frame
of observations, with
$p \ge 3$. "X"
corresponds to the
continuous exposure variable (e.g. DNA copy number), or "cause" in a
causal model, with a reference value $x_0$ equal to 0. "Y"
corresponds to the outcome variable (e.g. gene expression
level), or "effect" in a causal model. function
involved in the definition of the parameter of interest
$\psi$, which must satisfy $f(0)=0$ (see Details). Defaults to
identity
.integer
(defaults to 30L
; 10L
is the
smallest authorized value and we recommend a value less than
50L
for reasonable computational time) indicating the maximum number of
observed values of $X\neq 0$ which are used to create the
supports of the conditional distributions of $X$ given $W$
and $X\neq0$ involved in the simulation under $P_n^k$ when
family
is set to "parsimonious".SuperLearner
package is loaded.list
providing the function tmle.npvi
with the necessary
algorithms involved in the estimation of the features of interest. If
empty (default) then the default algorithms are used. See Details
.integer
, which indicates how many nodes should be involved in
the computation of the TMLE when it relies on the "superLearning"
flavor
. Defaults to 1
. If larger than 1
, then the
parallel
package is loaded and a cluster with nodes
nodes is
created and exploited.integer
indicating how many folds are
involved in the Super Learning procedure. If flavor
and
cvControl
are simultaneously set to "superLearning" and NULL
then the Super Learning procedure relies on 10-fold cross-validation.logical
, indicating whether the one-step (if FALSE
,
default value) or the two-step (if TRUE
) updating procedure should
be carried out.numeric
(defaults to 1
), upper-bound on the
absolute value of the fluctuation parameter.integer
(defaults to 1e5
) indicating the sample size of
the data set simulated under each $P_n^k$ to compute an approximated
value of $\Psi(P_n^k)$), the parameter of interest at $P_n^k$. The
larger B
, the more accurate the approximation.NULL
(default value) if the true conditional
probability $g(W)=P(X=0|W)$ and conditional expectation
$muAux(W)=E_P(X|X\neq 0, W)$ are not both known beforehand.
Otherwise, a list
with tags g
and muAux
and
respective values the two respective function
s (see the
related outputs of getSample
).integer
(defaults to 5
) indicating a maximal number of
updates.list
providing tuning parameters for the stopping
criteria. Defaults to list(mic=0.01, div=0.01, psi=0.1)
, see
Details
.numeric
, lower-bound on the range of values of the
estimated probabilities $P_n^k(X=0|W)$ that $X$ be equal to its
reference value 0
given $W$. Defaults to 5e-2
, and must
be smaller than gmax
.numeric
smaller than 1
, upper-bound on the range
of values of the estimated probabilities $P_n^k(X=0|W)$ that $X$
be equal to its reference value 0
given $W$. Defaults to
95e-2
, and must be larger than gmin
.numeric
, lower-bound on the range of values of the estimated
conditional expectation $E_{P_n^k}(X|X\neq 0,W)$ of $X$ given
$X\neq 0$ and $W$. Defaults to the first percentile of X
,
and must be smaller than mumax
.numeric
, upper-bound on the range of values of the estimated
conditional expectation $E_{P_n^k}(X|X\neq 0,W)$ of $X$ given
$X\neq 0$ and $W$. Defaults to the 99th percentile of $X$,
and must be larger than mumin
.FALSE
.logical
, kept for compatibility, which should be set to
TRUE
(its default value). This requires that the joint distribution
of $(X,W)$ under $P_n^k$ be a weighted version of the empirical
measure (for a faster execution).logical
, kept for compatibility, which should be set to
TRUE
(its default value). This requires that the successive
updates of the estimators of the relevant features of the true
distribution of $(W,X,Y)$ be derived from the current estimators based
on exact formulas rather than on their first-order approximations.logical
, kept for compatibility, which should be set to
TRUE
(its default value). This requires that the result of each fit
be reduced in size (for a faster execution). Currently implemented only
for flavor learning
.method
getHistory
outputs the
"history" of the procedure (see getHistory
). The object notably
includes the following information:
matrix
of observations used to carry out the TMLE
procedure. Use the method
getObs
to retrieve it.method
getPsi
to retrieve it.method
getPsiSd
to retrieve
it.The TMLE procedure stops when the maximal number of
iterations, iter
, is reached or when at least one of
the following criteria is met:
mic
.
div
.
psi
.
If lib
is an empty list (list()
, default value) then the
default algorithms for the chosen flavor
are loaded
(learningLib
when flavor
is set to "learning" or
superLearningLib
when flavor
is set to "superLearning").
A valid lib
argument must mimick the structure of either
learningLib
or superLearningLib
, depending on
flavor
.
The "superLearning" flavor
requires the SuperLearner
package and, by default, the e1071
, gam
,
glmnet
, polspline
and randomForest
packages.
If family
is set to "parsimonious" (recommended) then the
packages sgeostat
and geometry
are required.
Chambaz, A., Neuvial, P. (2015). tmle.npvi: targeted, integrative search of associations between DNA copy number and gene expression, accounting for DNA methylation. To appear in Bioinformatics Applications Notes.
set.seed(12345)
##
## Simulating a data set and computing the true value of the parameter
##
## Parameters for the simulation (case 'f=identity')
O <- cbind(W=c(0.05218652, 0.01113460),
X=c(2.722713, 9.362432),
Y=c(-0.4569579, 1.2470822))
O <- rbind(NA, O)
lambda0 <- function(W) {-W}
p <- c(0, 1/2, 1/2)
omega <- c(0, 3, 3)
S <- matrix(c(10, 1, 1, 0.5), 2 ,2)
## Simulating a data set of 200 i.i.d. observations
sim <- getSample(2e2, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S)
obs <- sim$obs
## Adding (dummy) baseline covariates
V <- matrix(runif(3*nrow(obs)), ncol=3)
colnames(V) <- paste("V", 1:3, sep="")
obs <- cbind(V, obs)
## Caution! MAKING '0' THE REFERENCE VALUE FOR 'X'
X0 <- O[2,2]
obsC <- obs
obsC[, "X"] <- obsC[, "X"] - X0
obs <- obsC
## True psi and confidence intervals (case 'f=identity')
sim <- getSample(1e4, O, lambda0, p=p, omega=omega, sigma2=1, Sigma3=S)
truePsi <- sim$psi
confInt0 <- truePsi + c(-1, 1)*qnorm(.975)*sqrt(sim$varIC/nrow(sim$obs))
confInt <- truePsi + c(-1, 1)*qnorm(.975)*sqrt(sim$varIC/nrow(obs))
cat("\nCase f=identity:\n")
msg <- paste("\ttrue psi is: ", signif(truePsi, 3), "\n", sep="")
msg <- paste(msg, "\t95%-confidence interval for the approximation is: ",
signif(confInt0, 3), "\n", sep="")
msg <- paste(msg, "\toptimal 95%-confidence interval is: ",
signif(confInt, 3), "\n", sep="")
cat(msg)
##
## TMLE procedure
##
## Running the TMLE procedure
npvi <- tmle.npvi(obs, f=identity, flavor="learning", B=5e4, nMax=10)
## Summarizing its results
npvi
setConfLevel(npvi, 0.9)
npvi
history <- getHistory(npvi)
print(round(history, 4))
hp <- history[, "psi"]
hs <- history[, "sic"]
hs[1] <- NA
ics <- c(-1,1) %*% t(qnorm(0.975)*hs/sqrt(nrow(getObs(npvi))))
pch <- 20
ylim <- range(c(confInt, hp, ics+hp), na.rm=TRUE)
xs <- (1:length(hs))-1
plot(xs, hp, ylim=ylim, pch=pch, xlab="Iteration", ylab=expression(psi[n]),
xaxp=c(0, length(hs)-1, length(hs)-1))
dummy <- sapply(seq(along=xs), function(x) lines(c(xs[x],xs[x]), hp[x]+ics[, x]))
abline(h=confInt, col=4)
abline(h=confInt0, col=2)
Run the code above in your browser using DataLab