powfindgenes: Power computations for differential expression

Description

powfindgenes evaluates the posterior expected number of true positives (e.g. true gene discoveries) if one were to obtain an additional batch of data. It uses either a GaGa or a normal-normal model fit on a pilot data set.

Usage

powfindgenes(fit, x, groups, batchSize = 1, fdrmax = 0.05, genelimit,
v0thre = 1, B = 1000, mc.cores=1)

Arguments

fit

GaGa/MiGaGa or normal-normal model fit using pilot data x. It must be an object of class gagafit (see fitGG) or nnfit (see fitNN).

x

ExpressionSet, exprSet, data frame or matrix containing the gene expression measurements used to fit the model.

groups

If x is of type ExpressionSet or exprSet, groups should be the name of the column in pData(x) with the groups that one wishes to compare. If x is a matrix or a data frame, groups should be a vector indicating the group to which each column in x corresponds.

batchSize

Number of additional samples to obtain per group.

fdrmax

Upper bound on FDR.

genelimit

Only the genelimit genes with the lowest probability of being equally expressed across all groups will be simulated. Setting this limit can significantly increase the computational speed.

v0thre

Only genes with posterior probability of being equally expressed < v0thre will be simulated. Setting this limit can significantly increase the computational speed.

B

Number of simulations from the GaGa predictive distribution to be used to estimate the posterior expected number of true positives.

mc.cores

If the multicore package is available, mc.cores indicates the number of cores to use for parallel computing. Currently only used when fit is of class nnfit.
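The genelimit and v0thre arguments described above both restrict the set of genes that get simulated. The following is a minimal sketch of that kind of filtering in base R. The vector v of posterior probabilities of equal expression is made up, and the way the two filters are combined here is an assumption for illustration, not necessarily how powfindgenes applies them internally.

```r
set.seed(1)
v <- runif(10)        # hypothetical posterior probabilities of equal expression
v0thre <- 0.5         # simulate only genes with P(equally expressed) < v0thre
genelimit <- 3        # simulate at most this many genes

keep <- which(v < v0thre)        # apply the v0thre filter
keep <- keep[order(v[keep])]     # lowest P(equally expressed) first
keep <- head(keep, genelimit)    # apply the genelimit filter

keep                             # indices of the genes that would be simulated
```

Genes that are almost certainly equally expressed contribute very little to the expected number of true positives, which is why leaving them out of the simulation can save time at little cost in accuracy.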

Value

m: Posterior expected number of true positives (as estimated by the sample mean of B simulations)
s: Standard error of the estimate, i.e. the SD of the simulations divided by sqrt(B)
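As an illustration of how m and s relate to the B simulations, the sketch below computes both from a vector of true-positive counts. The counts in tp are made-up Poisson draws used only for illustration; they are not output of powfindgenes.

```r
set.seed(1)
B <- 1000
tp <- rpois(B, lambda = 12)   # hypothetical true-positive count in each simulation

m <- mean(tp)                 # estimate of the posterior expected number of true positives
s <- sd(tp) / sqrt(B)         # standard error of that estimate

c(m = m, s = s)
```

Because s shrinks at rate 1/sqrt(B), quadrupling B roughly halves the standard error.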

Details

The routine simulates data from the posterior predictive distribution of a GaGa or normal-normal model. That is, it first simulates parameter values (differential expression status, mean expression levels etc.) from the posterior distribution. It then simulates data using the parameter values drawn from the posterior. Finally, the simulated data are used to determine the differential expression status of each gene, controlling the Bayesian FDR at the fdrmax level, as implemented in findgenes. Since the simulated differential expression status of each gene is known, one can evaluate the number of true discoveries in the reported gene list.

In order to improve speed, hyper-parameters are not re-estimated when computing posterior probabilities for the posterior predictive simulated data.
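The steps above can be sketched in base R with a deliberately simplified model. This is not the GaGa or normal-normal implementation in the package: it assumes two groups, a known error variance sigma2, and a per-gene mean difference that is exactly 0 for equally expressed genes and N(0, tau2) otherwise. All names (probEE, callDE, pde, tau2 and so on) are hypothetical. As in powfindgenes, hyper-parameters are held fixed when the posterior probabilities are recomputed.

```r
set.seed(1)
G <- 200; n <- 5; k <- 1             # genes, pilot samples per group, batch size
sigma2 <- 1; tau2 <- 4; pde <- 0.2   # hyper-parameters (assumed known and fixed)
fdrmax <- 0.05; B <- 200

# Posterior probability of equal expression, given an observed mean
# difference d between groups with sampling variance s2
probEE <- function(d, s2) {
  f0 <- (1 - pde) * dnorm(d, 0, sqrt(s2))
  f1 <- pde * dnorm(d, 0, sqrt(tau2 + s2))
  f0 / (f0 + f1)
}

# Bayesian FDR rule: largest gene list whose average P(equal expression)
# does not exceed fdrmax
callDE <- function(v, fdrmax) {
  o <- order(v)
  nsel <- sum(cumsum(v[o]) / seq_along(v) <= fdrmax)
  sel <- rep(FALSE, length(v))
  sel[o[seq_len(nsel)]] <- TRUE
  sel
}

# Toy pilot data, summarised by the observed mean difference per gene
s2p <- 2 * sigma2 / n
deTrue <- runif(G) < pde
dbar <- rnorm(G, ifelse(deTrue, rnorm(G, 0, sqrt(tau2)), 0), sqrt(s2p))
v <- probEE(dbar, s2p)               # posterior P(equal expression) from the pilot
shrink <- tau2 / (tau2 + s2p)

tp <- replicate(B, {
  de <- runif(G) > v                                   # draw DE status from the posterior
  delta <- ifelse(de, rnorm(G, shrink * dbar, sqrt(shrink * s2p)), 0)  # draw mean differences
  dnew <- rnorm(G, delta, sqrt(2 * sigma2 / k))        # simulate the additional batch
  dall <- (n * dbar + k * dnew) / (n + k)              # pool pilot and new data
  vnew <- probEE(dall, 2 * sigma2 / (n + k))           # recompute posterior probabilities
  sum(callDE(vnew, fdrmax) & de)                       # count true positives
})

c(m = mean(tp), s = sd(tp) / sqrt(B))
```

The true positives can be counted only because the differential expression status de was itself simulated, which is what makes the posterior predictive approach usable for power calculations.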

References

Rossell D. GaGa: a simple and flexible hierarchical model for microarray data analysis. http://rosselldavid.googlepages.com.

Examples

#Simulate data and fit GaGa model
set.seed(1)
x <- simGG(n=20,m=2,p.de=.5,a0=3,nu=.5,balpha=.5,nualpha=25)
gg1 <- fitGG(x,groups=1:2,method='EM')
gg1 <- parest(gg1,x=x,groups=1:2)

#Expected nb of TP for 1 more sample per group
powfindgenes(gg1,x=x,groups=1:2,batchSize=1,fdrmax=.05)$m

#Expected nb of TP for 10 more samples per group
powfindgenes(gg1,x=x,groups=1:2,batchSize=10,fdrmax=.05)$m
