
HTSCluster (version 1.1)

Init: Small EM parameter initialization

Description

These functions implement the Small EM initialization strategy described in Rau et al. (2011) to obtain initial values for the parameters of a Poisson mixture model.

Usage

emInit(y, g, conds, lib.size, lib.type = "TC", alg.type = "EM", 
    starts = 5, verbose = FALSE)

kmeanInit(y, g, conds, lib.size, lib.type = "TC")

Arguments

y
(n x q) matrix of observed counts for n observations and q variables
g
Number of clusters
conds
Vector of length q defining the condition (treatment group) for each variable (column) in y
lib.size
If FALSE, the library size parameter is not included in the model (i.e., the PMM-I model). If TRUE, the library size parameter is included in the Poisson mixture model (i.e., the PMM-II model)
lib.type
If lib.size = TRUE, the type of estimator to be used for the library size parameter (TC for total count, Q for quantile, and MedRatio for the median ratio of Anders and Huber, 2010); see the illustrative call after the argument list below
alg.type
Algorithm to be used for parameter estimation (EM or CEM for the EM or CEM algorithms, respectively)
starts
The number of independent runs of the Small-EM algorithm (default value of 5)
verbose
If TRUE, include verbose output
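
For illustration, a non-default library size estimator and a larger number of Small-EM starts may be requested as follows. This is a sketch only, assuming a count matrix y and a condition vector conds are already available (for example, from PoisMixSim as in the Examples below):

## Illustrative call (not run): PMM-II model with the median ratio
## library size estimator and 10 independent Small-EM starts
init.mr <- emInit(y, g = 4, conds, lib.size = TRUE,
    lib.type = "MedRatio", alg.type = "EM", starts = 10)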

Value

  • pi.init: Vector of length g containing the estimate of $\hat{\boldsymbol{\pi}}$ corresponding to the highest log likelihood (or completed log likelihood) from the Small-EM initialization strategy.
  • lambda.init: (d x g) matrix containing the estimate of $\hat{\boldsymbol{\lambda}}$ corresponding to the highest log likelihood (or completed log likelihood) from the Small-EM initialization strategy, where d is the number of conditions and g is the number of clusters.

Details

To initialize parameter values for the EM and CEM algorithms, the emInit function uses a so-called Small-EM strategy (Biernacki et al., 2003). The following procedure is run starts independent times (five by default): first, a K-means algorithm (MacQueen, 1967) is run to partition the data into g clusters ($\hat{\boldsymbol{z}}^{(0)}$). Second, initial parameter values $\boldsymbol{\pi}^{(0)}$ and $\boldsymbol{\lambda}^{(0)}$ are calculated from this partition (see Rau et al. (2011) for details). Third, five iterations of an EM algorithm are run, using $\boldsymbol{\pi}^{(0)}$ and $\boldsymbol{\lambda}^{(0)}$ as initial values. Finally, among the starts sets of parameter values, the $\hat{\boldsymbol{\pi}}$ and $\hat{\boldsymbol{\lambda}}$ corresponding to the highest log likelihood or completed log likelihood are used to initialize the subsequent full EM or CEM algorithm, respectively.
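
To make the strategy concrete, the sketch below illustrates the generic Small-EM idea for a simple univariate Poisson mixture. It is an illustrative sketch only, not the PMM-I/PMM-II implementation used by emInit; the helper names short_em and small_em_sketch are hypothetical.

## Illustrative sketch of the generic Small-EM strategy for a simple
## univariate Poisson mixture (NOT the PMM-I/PMM-II models of emInit)

short_em <- function(x, g, lambda, prop, n.iter = 5) {
    ## Run a few EM iterations of a univariate Poisson mixture
    for (iter in seq_len(n.iter)) {
        ## E-step: posterior cluster probabilities (n x g)
        dens <- sapply(seq_len(g), function(k) prop[k] * dpois(x, lambda[k]))
        tau <- dens / rowSums(dens)
        ## M-step: update mixing proportions and Poisson means
        prop <- colMeans(tau)
        lambda <- colSums(tau * x) / colSums(tau)
    }
    ll <- sum(log(rowSums(sapply(seq_len(g), function(k)
        prop[k] * dpois(x, lambda[k])))))
    list(prop = prop, lambda = lambda, loglik = ll)
}

small_em_sketch <- function(x, g, starts = 5) {
    best <- NULL
    for (s in seq_len(starts)) {
        ## Step 1: K-means partition of the data into g clusters
        ## (jitter breaks ties among the integer counts)
        km <- kmeans(jitter(x), centers = g, nstart = 5)
        ## Step 2: initial parameter values from the partition
        prop0 <- as.numeric(table(km$cluster)) / length(x)
        lambda0 <- as.numeric(tapply(x, km$cluster, mean))
        ## Step 3: a few EM iterations from these starting values
        fit <- short_em(x, g, lambda0, prop0, n.iter = 5)
        ## Step 4: keep the run with the highest log likelihood
        if (is.null(best) || fit$loglik > best$loglik) best <- fit
    }
    best
}

## Example on simulated univariate counts with two clusters
x <- c(rpois(100, 5), rpois(100, 20))
small_em_sketch(x, g = 2)

In emInit, the same four steps are applied to the Poisson mixture models of Rau et al. (2011), with the E- and M-steps replaced by the corresponding PMM updates and with the completed log likelihood used for selection when alg.type = "CEM".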

References

Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biology, 11(R106), 1-28.

Biernacki, C., Celeux, G., Govaert, G. (2003) Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics and Data Analysis, 41(1), 561-575.

MacQueen, J. B. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, number 1, pages 281-297. Berkeley, University of California Press.

Rau, A., Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C. (2011) Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at http://hal.inria.fr/inria-00638082.

See Also

PoisMixClus for Poisson mixture model estimation and model selection

Examples

set.seed(12345)

## Simulate data as shown in Rau et al. (2011)
## Library size setting "A", high cluster separation
## n = 500 observations

simulate <- PoisMixSim(n = 500, libsize = "A", separation = "high")
y <- simulate$y
conds <- simulate$conditions

## Calculate initial values for lambda and pi using the Small-EM
## initialization (4 classes, PMM-II model with "TC" library size)

init.values <- emInit(y, g = 4, conds, lib.size = TRUE, 
    lib.type = "TC", alg.type = "EM")
pi.init <- init.values$pi.init
lambda.init <- init.values$lambda.init
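
## The structure of the returned values can be checked directly, and a
## K-means-based initialization can be obtained in the same way with
## kmeanInit. This is a sketch continuing from the simulated data above;
## the kmeanInit call follows the Usage section and its return value is
## simply inspected with str.

## Check the structure of the Small-EM initialization
length(pi.init)        ## vector of length g = 4
dim(lambda.init)       ## (d x g) matrix: d conditions, g clusters

## Alternative: K-means-based initialization (no Small-EM runs)
init.kmeans <- kmeanInit(y, g = 4, conds, lib.size = TRUE, lib.type = "TC")
str(init.kmeans)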
