The function simulates microarray data for a two-group comparison with user-supplied parameters such as the number of biomarkers (genes or proteins), sample size, biological and experimental (technical) variation, replication, differential expression, and correlation between biomarkers.
simData(nTrain=100,
nGr1=floor(nTrain/2),
nBiom=50,nRep=3,
sdW=1.0,
sdB=1.0,
rhoMax=NULL, rhoMin=NULL, nBlock=NULL,bsMin=3, bSizes=NULL, gamma=NULL,
sigma=0.1,diffExpr=TRUE,
foldMin=2,
orderBiom=TRUE,
baseExpr=NULL)
Training set size, i.e., the total number of biological samples in group 1 (nGr1) and group 2.
Size of group 1. Defaults to floor(nTrain/2).
Number of biomarkers (genes, probes or proteins).
Number of technical replications.
Experimental (technical) variation (\(\sigma_e\)) of data in log (base 2) scale.
Biological variation (\(\sigma_b\)) of data in log (base 2) scale.
Maximum Pearson's correlation coefficient between biomarkers. To ensure positive definiteness, allowed values are restricted between 0 and 0.95 inclusive. If NULL, set to runif(1,min=0.6,max=0.8).
Minimum Pearson's correlation coefficient between biomarkers. To ensure positive definiteness, allowed values are restricted between 0 and 0.95 inclusive. If NULL, set to runif(1,min=0.2,max=0.4).
Number of blocks in the block diagonal (Hub-Toeplitz) correlation matrix. If NULL, set to 1 for nBiom<5 and randomly selected from c(1:floor(nBiom/bsMin)) for nBiom>=5.
Minimum block size. bsMin=3 by default.
A vector of length nBlock representing the block sizes (should sum to nBiom). If NULL, set to c(bs+mod,rep(bs,nBlock-1)), where bs is the integer part of nBiom/nBlock and mod is the remainder after integer division.
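The default block sizes described above can be reproduced in a few lines of base R. This is a minimal sketch of the stated rule, not the package's own source; the helper name default_block_sizes is hypothetical.

```r
# Sketch of the documented default for bSizes (hypothetical helper,
# not taken from the package source).
default_block_sizes <- function(nBiom, nBlock) {
  bs  <- nBiom %/% nBlock   # integer part of nBiom / nBlock
  mod <- nBiom %% nBlock    # remainder after integer division
  # The first block absorbs the remainder; the rest have size bs.
  c(bs + mod, rep(bs, nBlock - 1))
}

bSizes <- default_block_sizes(nBiom = 50, nBlock = 4)
print(bSizes)
# The block sizes always add up to the number of biomarkers.
stopifnot(sum(bSizes) == 50)
```

Because bs*nBlock + mod equals nBiom by construction, the resulting vector always sums to nBiom.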
Specifies a correlation structure. If NULL, assumes independence. gamma=0 indicates a single block exchangeable correlation matrix with constant correlation rho=0.5*(rhoMin+rhoMax). A value greater than zero indicates a block diagonal (Hub-Toeplitz) correlation matrix with decline rate determined by the value of gamma. Decline rate is linear for gamma=1.
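To make the two structures concrete, the sketch below builds the exchangeable matrix for gamma=0 exactly as described, and then illustrates how correlations with the hub could decline inside one block for gamma>0. The second part is an assumption: it uses a commonly cited hub-Toeplitz form in which the correlation falls from rhoMax to rhoMin at a rate set by gamma, and it has not been checked against the package source. Both helper names are hypothetical.

```r
# gamma = 0: single exchangeable block with constant correlation
# rho = 0.5 * (rhoMin + rhoMax), as stated in the argument description.
exchangeable_corr <- function(nBiom, rhoMin, rhoMax) {
  rho <- 0.5 * (rhoMin + rhoMax)
  R <- matrix(rho, nrow = nBiom, ncol = nBiom)
  diag(R) <- 1
  R
}

# ASSUMPTION (not confirmed from the package): correlation between the
# hub (first biomarker of a block) and biomarker i = 2, ..., blockSize,
# declining from rhoMax to rhoMin. Requires blockSize >= 3.
hub_correlations <- function(blockSize, rhoMin, rhoMax, gamma) {
  i <- 2:blockSize
  rhoMax - ((i - 2) / (blockSize - 2))^gamma * (rhoMax - rhoMin)
}

R0 <- exchangeable_corr(nBiom = 4, rhoMin = 0.3, rhoMax = 0.7)
print(R0)

# With gamma = 1 the decline is linear in the biomarker index.
hub <- hub_correlations(blockSize = 5, rhoMin = 0.3, rhoMax = 0.7, gamma = 1)
print(hub)
```

Under this assumed form, gamma=1 gives equally spaced correlations between rhoMax and rhoMin, which matches the statement that the decline is linear in that case.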
Standard deviation of the normal distribution (before truncation) from which fold changes are generated. See details.
Logical. Should systematic difference be introduced between the data of the two groups?
Minimum value of fold changes. See details.
Logical. Should columns (biomarkers) be arranged in order of differential expression?
A vector of length nBiom to be used as base expressions \(\mu\). See realBiomarker for details.
A data frame of dimension nTrain by nBiom+1. The first column is a factor (class) representing the group memberships of the samples.
Differential expressions are introduced by adding \(z\delta\) to the data of group 2, where the \(\delta\) values are generated from a truncated normal distribution and \(z\) is randomly selected from \(\{-1, 1\}\) to characterise up- or down-regulation.
Assuming that \(Y \sim N(\mu, \sigma^2)\) and that \(A=[a_1,a_2]\) is a subset of \(-\infty < y < \infty\), the conditional distribution of \(Y\) given \(Y \in A\) is called the truncated normal distribution:
$$f(y, \mu, \sigma) = \frac{(1/\sigma)\,\phi((y-\mu)/\sigma)}{\Phi((a_2-\mu)/\sigma) - \Phi((a_1-\mu)/\sigma)}$$
for \(a_1 \le y \le a_2\), and 0 otherwise, where \(\mu\) is the mean of the original normal distribution before truncation, \(\sigma\) is the corresponding standard deviation, \(a_2\) is the upper truncation point, \(a_1\) is the lower truncation point, \(\phi(x)\) is the density of the standard normal distribution, and \(\Phi(x)\) is the distribution function of the standard normal distribution. For the simData function, we consider \(a_1=\log_2(\mathrm{foldMin})\) and \(a_2=\infty\). This ensures that the biomarkers are differentially expressed by a fold change of foldMin or more.
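The lower-truncated normal described above can be sampled with the inverse-CDF method in base R. The sketch below is illustrative only and is not the package's implementation: the helper name r_trunc_norm_lower is hypothetical, and the choice of mu equal to the truncation point is an assumption, since the mean used by simData is not stated here.

```r
set.seed(1)

# Inverse-CDF sampler for a normal distribution truncated below at a1
# (a2 = Inf). Hypothetical helper, not part of the package.
r_trunc_norm_lower <- function(n, mu, sigma, a1) {
  pLow <- pnorm(a1, mean = mu, sd = sigma)  # probability mass below a1
  u <- runif(n, min = pLow, max = 1)        # uniform on the kept region
  qnorm(u, mean = mu, sd = sigma)
}

foldMin <- 2
a1 <- log2(foldMin)   # lower truncation point on the log2 scale

# ASSUMPTION: mu = a1 is chosen only for illustration.
delta <- r_trunc_norm_lower(1000, mu = a1, sigma = 0.1, a1 = a1)

# z in {-1, 1} decides up- or down-regulation for each biomarker.
z <- sample(c(-1, 1), size = 1000, replace = TRUE)
shift <- z * delta

# Every shift corresponds to a fold change of at least foldMin.
stopifnot(all(abs(shift) >= a1 - 1e-12))
```

The inverse-CDF approach loses precision when a1 lies far in the upper tail of the untruncated distribution, so a dedicated truncated-normal sampler may be preferable in that situation.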
Khondoker, M. R., Bachmann, T. T., Mewissen, M., Dickinson, P. et al. (2010). Multi-factorial analysis of class prediction error: estimating optimal number of biomarkers for various classification rules. Journal of Bioinformatics and Computational Biology, 8, 945-965.
# NOT RUN {
simData(nTrain=10,nBiom=3)
# }