Learn R Programming

gaga (version 2.18.0)

forwsimDiffExpr: Forward simulation for differential expression.

Description

Forward simulation allows to evaluate the expected utility for sequential designs. Here the utility is the expected number of true discoveries minus a sampling cost. The routine simulates future data either from the prior predictive or using a set of pilot data and a GaGa or normal-normal model fit. At each future time point, it computes a summary statistic that will be used to determine when to stop the experiment.

Usage

forwsimDiffExpr(fit, x, groups, ngenes, maxBatch, batchSize, fdrmax = 0.05, genelimit, v0thre = 1, B = 100, Bsummary = 100, trace = TRUE, randomSeed, mc.cores=1)

Arguments

fit
Either GaGa or MiGaGa fit (object of type gagafit, as returned by fitGG) or Normal-Normal fit (type nnfit returned by fitNN).
x
ExpressionSet, exprSet, data frame or matrix containing the gene expression measurements used to fit the model.
groups
If x is of type ExpressionSet or exprSet, groups should be the name of the column in pData(x) with the groups that one wishes to compare. If x is a matrix or a data frame, groups should be a vector indicating to which group each column in x corresponds to.
ngenes
Number of genes to simulate data for. If x is specified this argument is set to nrow(x) and data is simulated from the posterior predictive conditional on x. If x not specified simulation is from the prior predictive.
maxBatch
Maximum number of batches, i.e. the routine simulates batchSize*maxBatch samples per group.
batchSize
Batch size, i.e. number of observations per group to simulate at each time point. Defaults to ncol(x)/length(unique(groups)).
fdrmax
Upper bound on FDR.
genelimit
Only the genelimit genes with the lowest probability of being equally expressed across all groups will be simulated. Setting this limit can significantly increase the computational speed.
v0thre
Only genes with posterior probability of being equally expressed < v0thre will be simulated. Setting this limit can significantly increase the computational speed.
B
Number of forward simulations.
Bsummary
Number of simulations for estimating the summary statistic.
trace
For trace==TRUE iteration progress is displayed.
randomSeed
Integer value used to set random number generator seed. Defaults to as.numeric(Sys.time()) modulus 10^6.
mc.cores
If multicore package is available, mc.cores indicates the number of cores to use for parallel computing. Currently only used when fit is of class nnfit.

Value

A data.frame with the following columns:
simid
Simulation number.
j
Time (sample size).
u
Expected number of true positives if we were to stop experimentation at this time.
fdr
Expected FDR if we were to stop experimentation at this time.
fnr
Expected FNR if we were to stop experimentation at this time.
power
Expected power (as estimated by E(TP)/E(positives)) if we were to stop experimentation at this time.
summary
Summary statistic: increase in expected true positives if we were to obtain one more data batch.

Details

To improve computational speed hyper-parameters are not re-estimated as new data is simulated.

References

Rossell D., Mueller P. Sequential sample sizes for high-throughput hypothesis testing experiments. http://sites.google.com/site/rosselldavid/home. Rossell D. GaGa: a simple and flexible hierarchical model for microarray data analysis. Annals of Applied Statistics, 2009, 3, 1035-1051.

See Also

plotForwSim to plot the simulated trajectories, fitGG for fitting a GaGa model, fitNN for fitting a normal-normal model, seqBoundariesGrid for finding the optimal design based on the forwards simulation output. powfindgenes for fixed sample size calculations.

Examples

Run this code
#Simulate data and fit GaGa model
set.seed(1)
x <- simGG(n=20,m=2,p.de=.5,a0=3,nu=.5,balpha=.5,nualpha=25)
gg1 <- fitGG(x,groups=1:2,method='EM')
gg1 <- parest(gg1,x=x,groups=1:2)

#Run forward simulation
fs1 <- forwsimDiffExpr(gg1, x=x, groups=1:2,
maxBatch=2,batchSize=1,fdrmax=0.05, B=100, Bsummary=100, randomSeed=1)

#Expected number of true positives for each sample size
tapply(fs1$u,fs1$time,'mean')

#Expected utility for each sample size
samplingCost <- 0.01
tapply(fs1$u,fs1$time,'mean') - samplingCost*(0:2)

#Optimal sequential design
b0seq <- seq(0,20,length=200); b1seq <- seq(0,40,length=200)
bopt <-seqBoundariesGrid(b0=b0seq,b1=b1seq,forwsim=fs1,samplingCost=samplingCost,powmin=0)
bopt <- bopt$opt

plot(fs1$time,fs1$summary,xlab='Additional batches',ylab='E(newly discovered DE genes)')
abline(bopt['b0'],bopt['b1'])
text(.2,bopt['b0'],'Continue',pos=3)
text(.2,bopt['b0'],'Stop',pos=1)

Run the code above in your browser using DataLab