choisir.seuil: Cut-off selection by simulations

Description

Obtaining the optimal p-value cut-off for individual tests to achieve a given Type I error level of obtaining disjoint components of the graph

Usage

choisir.seuil( n.genes,
               taille.groupes = c( 10, 10 ),
               alpha.cible = 0.05,
               seuil.p = (5:30)/100,
               B = 3000, conf.level = 0.95,
               f.p = student.fpc, frm = R ~ Groupe,
               normaliser = FALSE, en.log = TRUE,
               n.quantifies = n.genes, masque,
               n.coeurs = 1,
               ... )

Value

choisir.seuil returns a data.frame with four columns, corresponding to the candidate cut-offs, the corresponding estimated type-I error and its lower and upper confidence bounds, and attributes giving the estimated optimal cut-off, its confidence interval and details on simulation condition. This data.frame has the additional class SARPcompo.H0, allowing specific print and

plot methods to be used.

Arguments

n.genes

Number of genes to be quantified simultaneously

taille.groupes

An integer vector containing the sample size for each group. The number of groups is determined by the length of this vector. Unused if masque is provided.

alpha.cible

The target type I error level of obtaining disjoint subnetworks under the null hypothesis that gene expressions are the same in all groups. Should be between 0 and 1.

seuil.p

A numeric vector of candidate cutoffs. Values outside the [0,1] interval are automatically removed. The default (from 0.05 to 0.30) is suited for a target type I error of 0.05 and less than 30 genes, roughly.

B

How many simulations to do.

conf.level

The confidence level of the interval given as a result (see Details).

f.p

The function to use for individual tests of each ratio. See creer.Mp for details.

frm

The formula to use. The default is suited for the structure of the simulated data, with R the ratio and Groupe the variable with group membership.

normaliser

Should the simulated data by normalised, that is should their sum be equal to 1? Since ratio are insensitive to the normalisation (by contrast with individual quantities), it is a useless step for usual designs, hence the default.

en.log

If TRUE, generated data are seen as log of quantities, hence the normalisation step is performed after exponentiation of the data and data are converted back in log.

The option is also used in the call of creer.Mp.

n.quantifies

The number of quantified genes amongst the n.genes simulated. Must be at most equal to n.genes, which is the default.

masque

A data.frame containing the values of needed covariates for all replicates. If missing, a one-column data.frame generated using taille.groupes, the column (named ‘Groupe’) containing values ‘G1’ repeated taille.groupes[ 1 ] times, ‘G2’ repeated taille.groupes[ 2 ] times and so on.

n.coeurs

The number of CPU cores to use in computation, with parallelization using forks (does not work on Windows) with the help of the parallel package.

...

additional arguments, to be used by the analysis function f.p

Warning

The simulated ratios are stored in a column called R, appended to the simulated data.frame. For this reason, do not use any column of this name in the provided masque: it would be overwritten during the simulation process.

Author

Emmanuel Curis (emmanuel.curis@parisdescartes.fr)

Details

The choisir.seuil function simulates B datasets of n.genes “quantities” measured several times, under the null hypothesis that there is only random variations between samples. For each of these B datasets, creer.Mp is called with the provided test function, then converted to a graph using in turn all cut-offs given in seuil.p and the number of components of the graph is determined. Having more than one is a type I error.

For each cut-off in seuil.p, the proportion of false-positive is then determined, along with its confidence interval (using the exact, binomial formula). The optimal cut-off to achieve the target type I error is then found by linear interpolation.

Simulation is done assuming a log-normal distribution, with a reduced, centered Gaussian on the log scale. Since under the null hypothesis nothing changes between the groups, the only needed informations is the total number of values for a given gene, which is determined from the number of rows of masque. All columns of masque are transfered to the analysis function, so simulation under virtually any experimental design should be possible, as far as a complete null hypothesis is wanted (not any effect of any covariate).

Examples

Run this code

   # What would be the optimal cut-off for 10 genes quantified in two
   #  groups of 5 replicates?
   # For speed reason, only 50 simulations are done here,
   #  but obviously much more are needed to have a good estimate f the cut-off.

   seuil <- choisir.seuil( 10, c( 5, 5 ), B = 50 )
   seuil

   # Get the cut-off and its confidence interval
   attr( seuil, "seuil" )

   # Plot the results
   plot( seuil )

Run the code above in your browser using DataLab