generateSyntheticData(dataset, n.vars, samples.per.cond, n.diffexp,
  repl.id = 1, seqdepth = 1e+07, minfact = 0.7, maxfact = 1.4,
  relmeans = "auto", dispersions = "auto", fraction.upregulated = 1,
  between.group.diffdisp = FALSE, filter.threshold.total = 1,
  filter.threshold.mediancpm = 0, fraction.non.overdispersed = 0,
  random.outlier.high.prob = 0, random.outlier.low.prob = 0,
  single.outlier.high.prob = 0, single.outlier.low.prob = 0,
  effect.size = 1.5, output.file = NULL)
filter.threshold.total and filter.threshold.mediancpm), the number of genes in the final data set may be lower than this number.
minfact and maxfact for each sample to generate data with different actual sequencing depths.
seqdepth to generate individual sequencing depths for the simulated samples.
"auto". Note that these values may be scaled in order to comply with the given sequencing depth. With the default value ("auto"), the mean values are sampled from values estimated from the Pickrell and Cheung data sets. If relmeans is a vector, the provided values will be used as mean values in the simulation for the samples in the first condition. The mean values for the samples in the second condition are generated by combining the relmeans and effect.size arguments.
"auto". With the default value ("auto"), the dispersion values are sampled from values estimated from the Pickrell and Cheung data sets. If both relmeans and dispersions are set to "auto", the means and dispersion values are sampled in pairs from the values in these data sets. If dispersions is a single vector, the provided dispersions will be used for simulating data from both conditions. If it is a matrix with two columns, the values in the first column are used for condition 1, and the values in the second column are used for condition 2.
dispersions is "auto".
effect.size. For genes that are upregulated in the second condition, the mean in the first condition is multiplied by the effect size. For genes that are downregulated in the second condition, the mean in the first condition is divided by the effect size. It is also possible to provide a vector of effect sizes (one for each gene), which will be used as provided. In this case, the fraction.upregulated and n.diffexp arguments will be ignored and the values will be derived from the effect.size vector.
NULL, the path to the file where the data object should be saved. The extension should be .rds, if not it will be changed.
compData object. If output.file is not NULL, the object is saved in the given output.file (which should have an .rds extension).
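The multiply/divide rule given for effect.size can be illustrated with a minimal base R sketch. It assumes one fixed effect size shared by all differentially expressed genes; the names mean.S1, direction and mean.S2 are hypothetical and are not part of compcodeR, and the sketch does not reproduce how the package itself derives per-gene effect sizes.

```r
# Hypothetical condition-1 means for five genes (illustrative values only)
mean.S1 <- c(100, 100, 100, 100, 100)

# Assumed fixed effect size, matching the default shown in the usage
effect.size <- 1.5

# +1 = upregulated in condition 2, -1 = downregulated, 0 = unchanged
direction <- c(1, 1, -1, 0, 0)

# Upregulated genes are multiplied by the effect size, downregulated
# genes are divided by it, and unchanged genes keep their mean
mean.S2 <- mean.S1 * effect.size^direction
```

Raising the effect size to the power of the direction keeps the three cases in one vectorized expression, since an exponent of 0 leaves the mean unchanged.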
dataset parameter will be compared. Hence, it is important to give the same value of this parameter e.g. to different replicates generated with the same simulation settings.
For more detailed information regarding the different types of outliers, see Soneson and Delorenzi (2013).
Mean and dispersion parameters (if relmeans and/or dispersions is set to "auto") are sampled from values estimated from the data sets by Pickrell et al (2010) and Cheung et al (2010). The data sets were downloaded from the ReCount web page (Frazee et al (2011)) and processed as detailed by Soneson and Delorenzi (2013).
To get the actual mean value for the Negative Binomial distribution used for the simulation of counts for a given sample, take the column truemeans.S1 (or truemeans.S2, if the sample is in condition S2) of the variable.annotations slot, divide by the sum of the same column, and multiply by the base sequencing depth (provided in the info.parameters list) and the depth factor for the sample (given in the sample.annotations data frame). Thus, if you have a vector of mean values that you want to provide as the relmeans argument and use 'as-is' in the simulation (for condition S1), set the seqdepth argument to the sum of the values in the relmeans vector, and set minfact and maxfact equal to 1.
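The scaling described above can be sketched in base R. The vectors below are hypothetical stand-ins for the truemeans.S1 column, the base sequencing depth and the per-sample depth factors; they are illustrative values, not output from compcodeR.

```r
# Hypothetical relative means for four genes (stand-in for truemeans.S1)
truemeans.S1 <- c(10, 50, 200, 740)

# Assumed base sequencing depth and per-sample depth factors
seqdepth <- 1e7
depth.factor <- c(sample1 = 0.8, sample2 = 1.2)

# Actual NB mean = truemeans / sum(truemeans) * seqdepth * depth factor
# (rows are genes, columns are samples)
nb.means <- outer(truemeans.S1 / sum(truemeans.S1), seqdepth * depth.factor)

# Using the means 'as-is': seqdepth equal to the sum of the means and a
# depth factor of 1 (minfact = maxfact = 1) cancel the scaling
as.is <- truemeans.S1 / sum(truemeans.S1) * sum(truemeans.S1) * 1
```

In this sketch each column of nb.means sums to the actual sequencing depth of that sample, and as.is reproduces the supplied means up to floating-point rounding.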
Cheung VG, Nayak RR, Wang IX, Elwyn S, Cousins SM, Morley M and Spielman RS (2010): Polymorphic cis- and trans-regulation of human gene expression. PLoS Biology 8(9):e1000480
Frazee AC, Langmead B and Leek JT (2011): ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics 12:449
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y and Pritchard JK (2010): Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464, 768-772
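Robles JA, Qureshi SE, Stephen SJ, Wilson SR, Burden CJ and Taylor JM (2012): Efficient experimental design and analysis strategies for the detection of differential expression using RNA-sequencing. BMC Genomics 13:484
Soneson C and Delorenzi M (2013): A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 14:91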
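# Load the package that provides generateSyntheticData()
library(compcodeR)
# Simulate 1000 genes, 5 samples per condition, 100 differentially expressed
mydata.obj <- generateSyntheticData(dataset = "mydata", n.vars = 1000,
                                    samples.per.cond = 5, n.diffexp = 100)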