optBiomarker (version 1.0-22)

simData: Simulation of microarray data

Description

The function simulates microarray data for two-group comparison with user supplied parameters such as number of biomarkers (genes or proteins), sample size, biological and experimental (technical) variation, replication, differential expression, and correlation between biomarkers.

Usage

simData(nTrain=100,
        nGr1=floor(nTrain/2),
        nBiom=50,nRep=3,
        sdW=1.0,
        sdB=1.0,rho=0,
        sigma=0.1,diffExpr=TRUE,
        foldMin=2,
        orderBiom=TRUE,
        baseExpr=NULL)

Arguments

nTrain
Training set size, i.e., the total number of biological samples in group 1 (nGr1) and group 2 combined.
nGr1
Size of group 1. Defaults to floor(nTrain/2).
nBiom
Number of biomarkers (genes, probes or proteins).
nRep
Number of technical replications.
sdW
Experimental (technical) variation ($\sigma_e$) of data in log (base 2) scale.
sdB
Biological variation ($\sigma_b$) of data in log (base 2) scale.
rho
Common Pearson correlation between biomarkers. To ensure positive definiteness, allowed values of rho are restricted between 0 and 0.95 inclusive.
sigma
Standard deviation of the normal distribution (before truncation) from which fold changes are generated. See Details.
diffExpr
Logical. Should systematic difference be introduced between the data of the two groups?
foldMin
Minimum value of fold changes. See details.
orderBiom
Logical. Should columns (biomarkers) be arranged in order of differential expression?
baseExpr
A vector of length nBiom to be used as base expressions $\mu$. See realBiomarker for details.
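The restriction on rho can be illustrated with a short sketch in base R. It builds a common-correlation matrix (1 on the diagonal, rho elsewhere), checks positive definiteness through its eigenvalues, and draws correlated normal rows with a Cholesky factor. The values of nBiom, rho and n are hypothetical, and this is not necessarily how simData generates correlated data internally.

```r
# Hypothetical values, chosen only for illustration
nBiom <- 5
rho   <- 0.95   # upper end of the range allowed by simData

# Common-correlation matrix: 1 on the diagonal, rho everywhere else
R <- matrix(rho, nrow = nBiom, ncol = nBiom)
diag(R) <- 1

# A symmetric matrix is positive definite when all eigenvalues are > 0
ev <- eigen(R, symmetric = TRUE, only.values = TRUE)$values
print(ev)

# Closed form: one eigenvalue 1 + (nBiom - 1) * rho, the others 1 - rho
ev_closed <- c(1 + (nBiom - 1) * rho, rep(1 - rho, nBiom - 1))

# Draw correlated standard-normal rows using the Cholesky factor of R
set.seed(1)
n <- 2000
Z <- matrix(rnorm(n * nBiom), nrow = n)
X <- Z %*% chol(R)            # rows of X have correlation matrix R
print(round(cor(X)[1, 2], 2))
```

The smallest eigenvalue is 1 - rho, so the matrix approaches singularity as rho approaches 1, which is one plausible reason for capping rho below 1.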

Value

  • A data frame of dimension nTrain by nBiom+1. The first column is a factor (class) representing the group membership of each sample.

Details

Differential expression is introduced by adding $z\delta$ to the data of group 2, where the $\delta$ values are generated from a truncated normal distribution and $z$ is randomly selected from $\{-1, 1\}$ to characterise down- or up-regulation.

Assuming that $Y \sim N(\mu, \sigma^2)$, and $A=[a_1,a_2]$, a subset of $-\infty < y < \infty$, the conditional distribution of $Y$ given $A$ is called the truncated normal distribution, with density

$$f(y, \mu, \sigma)= (1/\sigma) \phi((y-\mu)/\sigma) / (\Phi((a_2-\mu)/\sigma) - \Phi((a_1-\mu)/\sigma))$$

for $a_1 \le y \le a_2$, and 0 otherwise,

where $\mu$ is the mean of the original normal distribution before truncation, $\sigma$ is the corresponding standard deviation, $a_2$ is the upper truncation point, $a_1$ is the lower truncation point, $\phi(x)$ is the density of the standard normal distribution, and $\Phi(x)$ is the distribution function of the standard normal distribution. For the simData function, we consider $a_1=\log_2(\text{foldMin})$ and $a_2=\infty$. This ensures that the biomarkers are differentially expressed by a fold change of foldMin or more.
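The truncation described above can be sketched in base R with inverse-CDF sampling. The helper r_trunc_norm is hypothetical and not part of optBiomarker, and placing the untruncated mean at the lower truncation point is an assumption made only for this illustration. The sketch shows that every simulated shift corresponds to a fold change of at least foldMin.

```r
# Hypothetical helper, not part of optBiomarker: inverse-CDF sampler for a
# normal distribution truncated to the interval [a1, a2]
r_trunc_norm <- function(n, mu, sigma, a1 = -Inf, a2 = Inf) {
  p_lo <- pnorm(a1, mean = mu, sd = sigma)   # Phi((a1 - mu) / sigma)
  p_hi <- pnorm(a2, mean = mu, sd = sigma)   # Phi((a2 - mu) / sigma)
  u <- runif(n, min = p_lo, max = p_hi)      # uniform on the retained mass
  qnorm(u, mean = mu, sd = sigma)            # map back to the y scale
}

set.seed(42)
nBiom   <- 6
foldMin <- 2
sigma   <- 0.1
a1      <- log2(foldMin)                     # lower truncation point

# Assumption: the untruncated mean sits at the lower truncation point
delta <- r_trunc_norm(nBiom, mu = a1, sigma = sigma, a1 = a1, a2 = Inf)

# z picks the direction: -1 for down-regulation, +1 for up-regulation
z <- sample(c(-1, 1), size = nBiom, replace = TRUE)
shift <- z * delta                           # added to group 2 (log2 scale)

# On the raw scale, each fold change is at least foldMin
fold <- 2^abs(shift)
print(round(fold, 3))
```

Because $a_2=\infty$, the upper probability bound is 1 and only the lower tail of the normal distribution is discarded.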

See Also

classificationError

Examples

simData(nTrain=10,nBiom=3)
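A slightly longer variant of the call above might inspect the returned object. The sketch assumes the optBiomarker package is installed and skips the call otherwise. The expected shape (nTrain rows, nBiom+1 columns, group factor in the first column) is taken from the Value section rather than verified here.

```r
# Run only when optBiomarker is available (it may not be installed)
if (requireNamespace("optBiomarker", quietly = TRUE)) {
  set.seed(123)                         # make the simulation reproducible
  dat <- optBiomarker::simData(nTrain = 10, nBiom = 3)

  print(dim(dat))          # per the Value section: nTrain by nBiom + 1
  print(table(dat[[1]]))   # group membership of the samples (first column)
  print(head(dat))
}
```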