emInit) described in Rau et al. (2011), a K-means initialization strategy (kmeanInit) that is itself used to initialize the small-EM strategy, the splitting small-EM initialization strategy (splitEMInit) described in Papastamoulis et al. (2012), and a function to initialize a small-EM strategy using the posterior probabilities (probaPostInit) obtained from a previous run with one fewer cluster, following the splitting strategy.

Usage

emInit(y, g, conds, lib.size, lib.type, alg.type = "EM",
init.runs, init.iter, fixed.lambda, equal.proportions, verbose,
s=NA)
kmeanInit(y, g, conds, lib.size, lib.type, fixed.lambda,
equal.proportions, s=NA)
splitEMInit(y, g, conds, lib.size, lib.type, alg.type, fixed.lambda,
equal.proportions, prev.labels, prev.probaPost, init.runs,
init.iter, verbose, s=NA)
probaPostInit(y, g, conds, lib.size, lib.type, alg.type = "EM",
fixed.lambda, equal.proportions, probaPost.init, init.iter,
verbose, s=NA)

Arguments

y: Matrix of observed counts.
g: Number of clusters (see fixed.lambda).
conds: Vector of condition labels for the columns of y.
lib.size: If FALSE, the library size parameter is not included in the model (i.e., the PMM-I model). If TRUE, the library size parameter is included in the Poisson mixture model (i.e., the PMM-II model).
lib.type: If lib.size = TRUE, the estimator to be used for the library size parameter ("TC", "UQ", or "Med").
alg.type: Algorithm to be used: "EM" or "CEM".
init.runs: Number of independent small-EM runs.
init.iter: Number of iterations to run within each small-EM run.
fixed.lambda: Contains a list of lambda values to be fixed; g then corresponds to the number of clusters in addition to those that are fixed.
equal.proportions: If TRUE, the cluster proportions are set to be equal for all clusters. Default is FALSE (unequal cluster proportions).
prev.labels, prev.probaPost: Cluster labels and posterior probabilities from the previous run (with one fewer cluster), used by the splitEMInit function.
probaPost.init: Posterior probabilities obtained from a previous run, used to initialize the small-EM run.
verbose: If TRUE, include verbose output.

Value

pi.init: Vector of length g containing the estimate for $\hat{\boldsymbol{\pi}}$ corresponding to the highest log likelihood (or completed log likelihood) from the chosen initialization strategy.
lambda.init: (d x g) matrix containing the estimate of $\hat{\boldsymbol{\lambda}}$ corresponding to the highest log likelihood (or completed log likelihood) from the chosen initialization strategy, where d is the number of conditions and g is the number of clusters.

For the splitting strategy, a (d x g) matrix containing the estimate of $\hat{\boldsymbol{\lambda}}$ and a vector of length g containing the estimate for $\hat{\boldsymbol{\pi}}$ arising from the splitting initialization and small-EM run for a single split are also returned, where d is the number of conditions and g is the number of clusters.

Details

These initialization functions are called by the PoisMixClus function (via init.type) or by the
PoisMixClusWrapper function for a sequence of cluster numbers (via gmin.init.type and split.init).
To initialize parameter values for the EM and CEM algorithms using the small-EM strategy (Biernacki et al., 2003), we use the emInit function as follows. For a given number of independent runs (init.runs), the following procedure is used to obtain parameter values: first, a K-means algorithm (MacQueen, 1967) is run to partition the data into g clusters ($\hat{\boldsymbol{z}}^{(0)}$). Second, initial parameter values $\boldsymbol{\pi}^{(0)}$ and $\boldsymbol{\lambda}^{(0)}$ are calculated from this partition (see Rau et al. (2011) for details). Third, a given number of iterations of an EM algorithm (init.iter) are run, using $\boldsymbol{\pi}^{(0)}$ and $\boldsymbol{\lambda}^{(0)}$ as initial values. Finally, among the init.runs sets of parameter values, we retain $\hat{\boldsymbol{\lambda}}$ and $\hat{\boldsymbol{\pi}}$ corresponding to the highest log likelihood or completed log likelihood to initialize the subsequent full EM or CEM algorithm, respectively.
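The four-step procedure above can be sketched in base R. This is a simplified illustration on a univariate Poisson mixture (no library sizes or conditions), not the HTSCluster implementation; small.em.sketch and its arguments are hypothetical names.

```r
## Sketch of the small-EM strategy: several short EM chains from K-means
## starts; keep the run with the highest log likelihood.
small.em.sketch <- function(y, g, init.runs = 5, init.iter = 10) {
  best <- list(loglik = -Inf)
  loglik.fn <- function(pi.hat, lambda.hat) {
    dens <- sapply(seq_len(g), function(k) pi.hat[k] * dpois(y, lambda.hat[k]))
    sum(log(pmax(rowSums(dens), 1e-300)))
  }
  for (run in seq_len(init.runs)) {
    ## Step 1: K-means partition (jittered to avoid duplicate integer centers)
    km <- kmeans(jitter(as.numeric(y)), centers = g)
    ## Step 2: initial proportions and means from the partition
    pi.hat <- tabulate(km$cluster, g) / length(y)
    lambda.hat <- pmax(as.numeric(tapply(y, km$cluster, mean)), 1e-8)
    ## Step 3: a few EM iterations from these starting values
    for (iter in seq_len(init.iter)) {
      dens <- sapply(seq_len(g), function(k) pi.hat[k] * dpois(y, lambda.hat[k]))
      tau <- dens / pmax(rowSums(dens), 1e-300)   # E-step: posterior probabilities
      pi.hat <- colMeans(tau)                     # M-step: proportions
      lambda.hat <- colSums(tau * y) / pmax(colSums(tau), 1e-300)  # M-step: means
    }
    ## Step 4: retain the best run by log likelihood
    ll <- loglik.fn(pi.hat, lambda.hat)
    if (ll > best$loglik)
      best <- list(loglik = ll, pi.init = pi.hat, lambda.init = lambda.hat)
  }
  best
}
```

For the CEM variant, the selection criterion in step 4 would be the completed log likelihood rather than the observed-data log likelihood.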
For the splitting small-EM initialization strategy, we implement an approach similar to that described in Papastamoulis et al. (2012), where the cluster from the previous run (with g-1 clusters) having the largest entropy is chosen to be split into two new clusters, followed by a small-EM run as described above.

See Also

PoisMixClus for Poisson mixture model estimation for a given number of clusters,
PoisMixClusWrapper for Poisson mixture model estimation and model selection for a sequence of cluster numbers.

Examples

set.seed(12345)
## Simulate data as shown in Rau et al. (2011)
## Library size setting "A", high cluster separation
## n = 500 observations
simulate <- PoisMixSim(n = 500, libsize = "A", separation = "high")
y <- simulate$y
conds <- simulate$conditions
## Calculate initial values for lambda and pi using the Small-EM
## initialization (4 classes, PMM-II model with "TC" library size)
##
## init.values <- emInit(y, g = 4, conds, lib.size = TRUE,
## lib.type = "TC", alg.type = "EM",
## init.runs = 50, init.iter = 10, fixed.lambda = NA,
## equal.proportions = FALSE, verbose = FALSE)
## pi.init <- init.values$pi.init
## lambda.init <- init.values$lambda.init
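The entropy criterion used by the splitting strategy in Details can be sketched as follows. This is an assumed form based on the description above; choose.split is a hypothetical helper, not part of the package.

```r
## Hypothetical helper: given the matrix of posterior probabilities from the
## previous run (observations in rows, the g-1 clusters in columns), return
## the index of the cluster with the largest entropy, i.e. the one to split.
choose.split <- function(probaPost) {
  p <- pmax(probaPost, 1e-12)               # guard against log(0)
  entropy <- -colSums(probaPost * log(p))   # per-cluster entropy contribution
  which.max(entropy)
}
```

The selected cluster would then be split in two and refined with a small-EM run, as described in Details.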