fit.mixtNB: Fitting mixtures of Negative Binomials in RNA-Seq data

Description

A mixture of K Negative Binomial distributions is fitted on the data with the aim to cluster the genes according to their dispersions. The number of groups must be known in advance.

Usage

fit.mixtNB(y, cr, K, it = 200, eps = 1e-05, init = NULL, seme = 1,  filter = TRUE, quiet = FALSE)

Arguments

matrix that contains the (normalized) dataset where the rows contain the set of replicates in the counts in two conditions. Given p the number of genes and n the total number of replicates, the matrix must have dimensions p x n.

a vector with n elements that contains the numerical labels of the conditions

the number of mixture components

maximum number of iterations for the EM algorithm

eps

a tolerance level for checking the convergence of the EM algorithm

init

a list that may contain the initial values for the EM algorithm. It may contain: a K-vector 'a' that are the sizes or dispersions of the Negative Binomials, 'w' is the vector with the K initial values for the weights of the components; 'lambda' is the matrix of dimension p x 2 with the initial values of the lambda parameters. If init is NULL, a random initialization will be used.

seme

A numerical value to be used in the set.seed function

filter

Logical to indicate the genes with very small counts should be removed

quiet

Logical to indicate if information about the fitting should be provided

Value

y: The filtered data
K: The number of components
cr: Labels denoting the condition for the replicates
cl: Posterior classification of the genes to the components
likelihood: Log-likelihood at each iteration
AIC: Akaike information criterion
BIC: Bayesian information criterion
a: Estimated dispersions
lambda: Estimated means
f.z.y: Estimated posterior probabilities
time.sec: Computational time (in seconds)
variances: Estimated variances
gname: Positions of the filtered genes

Details

A mixture of K components is fitted with the aim of clustering the genes according to their dispersions. Genes with too small number of reads across experiments are filtered out. The default is to filter out genes with no more than 5 reads totally across all experiments, AND with no more than 0.5 reads averagely across all experiments. The EM algorithm stops when the maximum number of iterations are reached or the relative increment of log-likelihood is smaller than eps.

References

E. Bonafede, F. Picard, S. Robin and C. Viroli (2015), Modelling overdispersion heterogeneity in differential expression analysis using mixtures, under revision.

Examples

Run this code

# create a toy data set with 1000 genes, and 5 samples in each of the two conditions. 
# The first 100 genes are DE expressed. The other 900 genes are null. 
 
lambda.de<-matrix(runif(100,0,250),100)
lambda.de=cbind(lambda.de,lambda.de/exp(rnorm(100,0.5,0.125)))
lambda<-rbind(lambda.de,matrix(runif(900,0,250),900,2))
a<-runif(1000,0.5,600)
cr<-rep(1:2,each=5)
y<-matrix(0,1000,10)
for (i in 1:1000) for (l in 1:10) y[i,l]<-rnbinom(1,mu=lambda[i,cr[l]],size=a[i])
fit=fit.mixtNB(y,cr,K=3)

Run the code above in your browser using DataLab