
HTSCluster (version 1.1)

PoisMixClus: Poisson mixture model estimation and model selection

Description

This function implements the EM and CEM algorithms for parameter estimation in a Poisson mixture model for clustering high throughput sequencing observations (e.g., genes). Parameters are initialized using a Small-EM strategy as described in Rau et al. (2011), and model selection may be performed using the BIC and ICL criteria. Note that this function implements the PMM-I and PMM-II models described in Rau et al. (2011).

Usage

PoisMixClus(y, gmin, gmax, lib.size = TRUE, lib.type = "TC", 
    conds, init.type = "small-em", alg.type = "EM", 
    cutoff = 1e-05, iter = 1000, mean.filter = FALSE, verbose = FALSE)

Arguments

y
(n x q) matrix of observed counts for n observations and q variables
gmin
Minimum number of clusters to be fit (must be less than or equal to gmax)
gmax
Maximum number of clusters to be fit (must be greater than or equal to gmin)
lib.size
If FALSE, the library size parameter is not included in the model (i.e., the PMM-I model). If TRUE, the library size parameter is included in the Poisson mixture model (i.e., the PMM-II model)
lib.type
If lib.size = TRUE, the estimator to be used for the library size parameter (TC for total count, Q for quantile, and MedRatio for the median ratio of Anders and Huber (2010))
conds
Vector of length q defining the condition (treatment group) for each variable (column) in y (a minimal sketch follows this argument list)
init.type
Type of initialization strategy to be used (small-em for the Small-EM strategy described in Rau et al. (2011), and kmeans for a simple K-means initialization)
alg.type
Algorithm to be used for parameter estimation (EM or CEM)
cutoff
Cutoff to declare algorithm convergence (in terms of differences in log likelihoods from one iteration to the next)
iter
Maximum number of iterations to be run for the chosen algorithm
mean.filter
Optional threshold value for filtering genes whose mean count across all samples is less than mean.filter
verbose
If TRUE, include verbose output
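
As a minimal sketch of how y and conds fit together (using made-up counts rather than real sequencing data), the conds vector has one entry per column of y:

## Hypothetical toy input: 10 observations (rows) in q = 4 samples (columns),
## the first two from condition "A" and the last two from condition "B"
set.seed(1)
y <- matrix(rpois(40, lambda = 50), nrow = 10, ncol = 4)
conds <- c("A", "A", "B", "B")   # length q = ncol(y)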

Value

  • lambda: List of length (gmax-gmin+1) containing the estimates $\hat{\boldsymbol{\lambda}}$ for each of the models (g = gmin, ..., gmax). For a given model g, lambda[[g]] is a matrix of dimension (d x g), where d is the number of conditions (treatment groups) and g is the number of clusters
  • pi: List of length (gmax-gmin+1) containing the estimates $\hat{\boldsymbol{\pi}}$ for each of the models (g = gmin, ..., gmax). For a given model g, pi[[g]] is a vector of length g, where g is the number of clusters
  • labels: Matrix of dimension (n x (gmax-gmin+1)) containing the label assignments for each of the n observations in each of the models (g = gmin, ..., gmax)
  • probaPost: List of length (gmax-gmin+1) containing the conditional probabilities of belonging to each cluster for all observations in each of the models (g = gmin, ..., gmax). For a given model g, probaPost[[g]] is an (n x g) matrix containing the conditional probabilities of belonging to each of the g clusters for all observations
  • BIC.all: BIC values for each of the models (g = gmin, ..., gmax)
  • ICL.all: ICL values for each of the models (g = gmin, ..., gmax)
  • alg.type: Algorithm used for parameter estimation (matches the argument alg.type above)
  • BIC: Maximum BIC value across all models considered (g = gmin, ..., gmax)
  • ICL: Maximum ICL value across all models considered (g = gmin, ..., gmax)
  • g.BIC: Number of clusters corresponding to the maximum BIC value
  • g.ICL: Number of clusters corresponding to the maximum ICL value
  • labels.BIC: Vector of length n containing the cluster assignments of the n observations in the model selected via the BIC (with number of clusters g.BIC)
  • labels.ICL: Vector of length n containing the cluster assignments of the n observations in the model selected via the ICL (with number of clusters g.ICL)
  • lambda.BIC: (d x g.BIC) matrix containing the estimate of $\hat{\boldsymbol{\lambda}}$ for the model selected via the BIC (with number of clusters g.BIC)
  • pi.BIC: Vector of length g.BIC containing the estimate of $\hat{\boldsymbol{\pi}}$ for the model selected via the BIC (with number of clusters g.BIC)
  • lambda.ICL: (d x g.ICL) matrix containing the estimate of $\hat{\boldsymbol{\lambda}}$ for the model selected via the ICL (with number of clusters g.ICL)
  • pi.ICL: Vector of length g.ICL containing the estimate of $\hat{\boldsymbol{\pi}}$ for the model selected via the ICL (with number of clusters g.ICL)
  • probaPost.BIC: (n x g.BIC) matrix containing the conditional probabilities of belonging to each cluster for all observations in the model selected via the BIC (with number of clusters g.BIC)
  • probaPost.ICL: (n x g.ICL) matrix containing the conditional probabilities of belonging to each cluster for all observations in the model selected via the ICL (with number of clusters g.ICL)
  • lib.size: TRUE if the library size is included in the model (matches the argument lib.size above)
  • lib.type: Type of library size normalization used (if lib.size = TRUE; matches the argument lib.type above)
  • s: Library size normalization factors used (if lib.size = TRUE)
  • y: Data (matches the argument y above, unless observations were removed using the optional mean.filter)
  • ind.remove: Indices of observations removed using the optional mean.filter
  • mean.filter: Matches the argument mean.filter above
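
For example, assuming run is an object returned by PoisMixClus (as in the Examples below), these components are accessed as list elements:

## Sketch of accessing the returned components (assumes a fitted object named run)
run$BIC.all              # BIC values for g = gmin, ..., gmax
run$g.ICL                # number of clusters selected by the ICL
head(run$labels.ICL)     # cluster assignments under the ICL-selected model
dim(run$probaPost.ICL)   # (n x g.ICL) conditional probability matrix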

Details

Output is an S3 object of class HTSCluster. In a Poisson mixture model, the data $\mathbf{y}$ are assumed to come from g distinct subpopulations (clusters), each of which is modeled separately; the overall population is thus a mixture of these subpopulations. In the case of a Poisson mixture model with g components, the model may be written as $$f(\mathbf{y};g,\boldsymbol{\Psi}_g) = \prod_{i=1}^n \sum_{k=1}^g \pi_k \prod_{j=1}^{d}\prod_{l=1}^{r_j} P(y_{ijl} ; \boldsymbol{\theta}_k)$$ for $i = 1, \ldots, n$ observations in $l = 1, \ldots, r_j$ replicates of $j = 1, \ldots, d$ conditions (treatment groups), where $P(\cdot)$ is the standard Poisson density, $\boldsymbol{\Psi}_g = (\pi_1,\ldots,\pi_{g-1}, \boldsymbol{\theta}^\prime)$, $\boldsymbol{\theta}^\prime$ contains all of the parameters in $\boldsymbol{\theta}_1,\ldots,\boldsymbol{\theta}_g$ assumed to be distinct, and $\boldsymbol{\pi} = (\pi_1,\ldots,\pi_g)^\prime$ are the mixing proportions such that $\pi_k$ is in (0,1) for all k and $\sum_k \pi_k = 1$.

We consider two possible parameterizations for the mean $\boldsymbol{\theta}_k = (\mu_{ijlk})$. In the first, called the PMM-I, we consider $$\mu_{ijlk} = w_i \lambda_{jk}$$ where $w_i$ corresponds to the expression level of observation i and $\boldsymbol{\lambda}_k = (\lambda_{1k},\ldots,\lambda_{dk})$ corresponds to the clustering parameters that define the profiles of the genes in cluster k across all variables. In the second parameterization, called the PMM-II, we consider $$\mu_{ijlk} = w_i s_{jl} \lambda_{jk}$$ where $w_i$ and $\boldsymbol{\lambda}_k$ are as before and $s_{jl}$ is the normalized library size (a fixed constant) for replicate l of condition j. See Rau et al. (2011) for more details on the PMM-I and PMM-II.

There are two approaches to estimating the parameters of a finite mixture model and obtaining a clustering of the data: the estimation approach (via the EM algorithm) and the clustering approach (via the CEM algorithm). Parameter initialization is done using a Small-EM strategy as described in Rau et al. (2011) via the emInit function. Model selection may be performed using the BIC or ICL criteria.
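
As a purely numerical illustration of the PMM-II parameterization (made-up values, not a call to the package), the conditional mean for observation i in replicate l of condition j under cluster k is the product of the three factors above:

## Illustrative PMM-II conditional mean with hypothetical values
w_i       <- 120    # overall expression level of observation i
s_jl      <- 0.9    # normalized library size for replicate l of condition j
lambda_jk <- 0.4    # clustering parameter for condition j in cluster k
mu_ijlk   <- w_i * s_jl * lambda_jk   # Poisson mean = 43.2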

References

Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biology, 11(R106), 1-28.

Rau, A., Celeux, G., Martin-Magniette, M.-L., and Maugis-Rabusseau, C. (2011) Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at http://hal.inria.fr/inria-00638082.

See Also

probaPost for the calculation of the conditional probability of belonging to a cluster; PoisMixMean for the calculation of the per-cluster conditional mean of each observation; logLikePoisMix for the calculation of the log likelihood of a Poisson mixture model; emInit and kmeanInit for the Small-EM parameter initialization strategy

Examples

set.seed(12345)

## Simulate data as shown in Rau et al. (2011)
## Library size setting "A", high cluster separation
## n = 200 observations

simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high")
y <- simulate$y
conds <- simulate$conditions

## Run the PMM-II model for g = {3, 4, 5}
## "TC" library size estimate, EM algorithm
## Model selection via the ICL

run <- PoisMixClus(y, gmin = 3, gmax = 5, lib.size = TRUE, lib.type = "TC",
    conds = conds, init.type = "small-em") 

## Estimates of pi and lambda for the selected model
pi.est <- run$pi.ICL
lambda.est <- run$lambda.ICL
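
## A possible follow-up (not part of the original example): compare the
## number of clusters selected by the BIC and the ICL, and tabulate the
## cluster assignments under the ICL-selected model
c(BIC = run$g.BIC, ICL = run$g.ICL)
table(run$labels.ICL)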
