
HTSCluster (version 1.1)

PoisMixClus: Poisson mixture model estimation and model selection

Description

This function implements the EM and CEM algorithms for parameter estimation in a Poisson mixture model for clustering high throughput sequencing observations (e.g., genes). Parameters are initialized using a Small-EM strategy as described in Rau et al. (2011), and model selection may be performed using the BIC and ICL criteria. Note that this function implements the PMM-I and PMM-II models described in Rau et al. (2011).

Usage

PoisMixClus(y, gmin, gmax, lib.size = TRUE, lib.type = "TC", 
    conds, init.type = "small-em", alg.type = "EM", 
    cutoff = 1e-05, iter = 1000, mean.filter = FALSE, verbose = FALSE)

Arguments

y
(n x q) matrix of observed counts for n observations and q variables
gmin
Minimum number of clusters to be fit (must be less than or equal to gmax)
gmax
Maximum number of clusters to be fit (must be greater than or equal to gmin)
lib.size
If FALSE, the library size parameter is not included in the model (i.e., the PMM-I model). If TRUE, the library size parameter is included in the Poisson mixture model (i.e., the PMM-II model)
lib.type
If lib.size = TRUE, the estimator to be used for the library size parameter (TC for total count, Q for quantile, and MedRatio for the median ratio of Anders and Huber (2010))
conds
Vector of length q defining the condition (treatment group) for each variable (column) in y (a minimal sketch follows this argument list)
init.type
Type of initialization strategy to be used (small-em for the Small-EM strategy described in Rau et al. (2011), and kmeans for a simple K-means initialization)
alg.type
Algorithm to be used for parameter estimation (EM or CEM)
cutoff
Cutoff to declare algorithm convergence (in terms of differences in log likelihoods from one iteration to the next)
iter
Maximum number of iterations to be run for the chosen algorithm
mean.filter
Optional threshold value for filtering genes whose mean count across all samples is less than mean.filter
verbose
If TRUE, include verbose output
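
As a minimal sketch of how y and conds fit together (using made-up counts rather than real sequencing data), the conds vector has one entry per column of y:

## Hypothetical toy input: 10 observations (rows) in q = 4 samples (columns),
## the first two from condition "A" and the last two from condition "B"
set.seed(1)
y <- matrix(rpois(40, lambda = 50), nrow = 10, ncol = 4)
conds <- c("A", "A", "B", "B")   # length q = ncol(y)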

Value

  • lambda: List of length (gmax-gmin+1) containing the estimates $\hat{\boldsymbol{\lambda}}$ for each of the models (g = gmin, ..., gmax). For a given model g, lambda[[g]] is a matrix of dimension (d x g), where d is the number of conditions (treatment groups) and g is the number of clusters
  • pi: List of length (gmax-gmin+1) containing the estimates $\hat{\boldsymbol{\pi}}$ for each of the models (g = gmin, ..., gmax). For a given model g, pi[[g]] is a vector of length g, where g is the number of clusters
  • labels: Matrix of dimension (n x (gmax-gmin+1)) containing the label assignments for each of the n observations in each of the models (g = gmin, ..., gmax)
  • probaPost: List of length (gmax-gmin+1) containing the conditional probabilities of belonging to each cluster for all observations in each of the models (g = gmin, ..., gmax). For a given model g, probaPost[[g]] is an (n x g) matrix containing the conditional probabilities of belonging to each of the g clusters for all observations
  • BIC.all: BIC values for each of the models (g = gmin, ..., gmax)
  • ICL.all: ICL values for each of the models (g = gmin, ..., gmax)
  • alg.type: Algorithm used for parameter estimation (matches the argument alg.type above)
  • BIC: Maximum BIC value across all models considered (g = gmin, ..., gmax)
  • ICL: Maximum ICL value across all models considered (g = gmin, ..., gmax)
  • g.BIC: Number of clusters corresponding to the maximum BIC value
  • g.ICL: Number of clusters corresponding to the maximum ICL value
  • labels.BIC: Vector of length n containing the cluster assignments of the n observations in the model selected via the BIC (with number of clusters g.BIC)
  • labels.ICL: Vector of length n containing the cluster assignments of the n observations in the model selected via the ICL (with number of clusters g.ICL)
  • lambda.BIC: (d x g.BIC) matrix containing the estimate of $\hat{\boldsymbol{\lambda}}$ for the model selected via the BIC (with number of clusters g.BIC)
  • pi.BIC: Vector of length g.BIC containing the estimate of $\hat{\boldsymbol{\pi}}$ for the model selected via the BIC (with number of clusters g.BIC)
  • lambda.ICL: (d x g.ICL) matrix containing the estimate of $\hat{\boldsymbol{\lambda}}$ for the model selected via the ICL (with number of clusters g.ICL)
  • pi.ICL: Vector of length g.ICL containing the estimate of $\hat{\boldsymbol{\pi}}$ for the model selected via the ICL (with number of clusters g.ICL)
  • probaPost.BIC: (n x g.BIC) matrix containing the conditional probabilities of belonging to each cluster for all observations in the model selected via the BIC (with number of clusters g.BIC)
  • probaPost.ICL: (n x g.ICL) matrix containing the conditional probabilities of belonging to each cluster for all observations in the model selected via the ICL (with number of clusters g.ICL)
  • lib.size: TRUE if the library size is included in the model (matches the argument lib.size above)
  • lib.type: Type of library size normalization used (if lib.size = TRUE; matches the argument lib.type above)
  • s: Library size normalization factors used (if lib.size = TRUE)
  • y: Data (matches the argument y above, unless observations were removed using the optional mean.filter)
  • ind.remove: Indices of observations removed using the optional mean.filter
  • mean.filter: Matches the argument mean.filter above
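
For example, assuming run is an object returned by PoisMixClus (as in the Examples below), these components are accessed as list elements:

## Sketch of accessing the returned components (assumes a fitted object named run)
run$BIC.all              # BIC values for g = gmin, ..., gmax
run$g.ICL                # number of clusters selected by the ICL
head(run$labels.ICL)     # cluster assignments under the ICL-selected model
dim(run$probaPost.ICL)   # (n x g.ICL) conditional probability matrix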

Details

Output is an S3 object of class HTSCluster. In a Poisson mixture model, the data $\mathbf{y}$ are assumed to come from g distinct subpopulations (clusters), each of which is modeled separately; the overall population is thus a mixture of these subpopulations. In the case of a Poisson mixture model with g components, the model may be written as $$f(\mathbf{y};g,\boldsymbol{\Psi}_g) = \prod_{i=1}^n \sum_{k=1}^g \pi_k \prod_{j=1}^{d}\prod_{l=1}^{r_j} P(y_{ijl} ; \boldsymbol{\theta}_k)$$ for $i = 1, \ldots, n$ observations in $l = 1, \ldots, r_j$ replicates of $j = 1, \ldots, d$ conditions (treatment groups), where $P(\cdot)$ is the standard Poisson density, $\boldsymbol{\Psi}_g = (\pi_1,\ldots,\pi_{g-1}, \boldsymbol{\theta}^\prime)$, $\boldsymbol{\theta}^\prime$ contains all of the parameters in $\boldsymbol{\theta}_1,\ldots,\boldsymbol{\theta}_g$ assumed to be distinct, and $\boldsymbol{\pi} = (\pi_1,\ldots,\pi_g)^\prime$ are the mixing proportions such that $\pi_k$ is in (0,1) for all k and $\sum_k \pi_k = 1$.

We consider two possible parameterizations for the mean $\boldsymbol{\theta}_k = (\mu_{ijlk})$. In the first, called the PMM-I, we consider $$\mu_{ijlk} = w_i \lambda_{jk}$$ where $w_i$ corresponds to the expression level of observation i and $\boldsymbol{\lambda}_k = (\lambda_{1k},\ldots,\lambda_{dk})$ corresponds to the clustering parameters that define the profiles of the genes in cluster k across all variables. In the second parameterization, called the PMM-II, we consider $$\mu_{ijlk} = w_i s_{jl} \lambda_{jk}$$ where $w_i$ and $\boldsymbol{\lambda}_k$ are as before and $s_{jl}$ is the normalized library size (a fixed constant) for replicate l of condition j. See Rau et al. (2011) for more details on the PMM-I and PMM-II.

There are two approaches to estimating the parameters of a finite mixture model and obtaining a clustering of the data: the estimation approach (via the EM algorithm) and the clustering approach (via the CEM algorithm). Parameter initialization is done using a Small-EM strategy as described in Rau et al. (2011) via the emInit function. Model selection may be performed using the BIC or ICL criteria.
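
As a purely numerical illustration of the PMM-II parameterization (made-up values, not a call to the package), the conditional mean for observation i in replicate l of condition j under cluster k is the product of the three factors above:

## Illustrative PMM-II conditional mean with hypothetical values
w_i       <- 120    # overall expression level of observation i
s_jl      <- 0.9    # normalized library size for replicate l of condition j
lambda_jk <- 0.4    # clustering parameter for condition j in cluster k
mu_ijlk   <- w_i * s_jl * lambda_jk   # Poisson mean = 43.2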

References

Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biology, 11(R106), 1-28.

Rau, A., Celeux, G., Martin-Magniette, M.-L., and Maugis-Rabusseau, C. (2011) Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at http://hal.inria.fr/inria-00638082.

See Also

probaPost for the calculation of the conditional probability of belonging to a cluster; PoisMixMean for the calculation of the per-cluster conditional mean of each observation; logLikePoisMix for the calculation of the log likelihood of a Poisson mixture model; emInit and kmeanInit for the Small-EM parameter initialization strategy

Examples

set.seed(12345)

## Simulate data as shown in Rau et al. (2011)
## Library size setting "A", high cluster separation
## n = 200 observations

simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high")
y <- simulate$y
conds <- simulate$conditions

## Run the PMM-II model for g = {3, 4, 5}
## "TC" library size estimate, EM algorithm
## Model selection via the ICL

run <- PoisMixClus(y, gmin = 3, gmax = 5, lib.size = TRUE, lib.type = "TC",
    conds = conds, init.type = "small-em") 

## Estimates of pi and lambda for the selected model
pi.est <- run$pi.ICL
lambda.est <- run$lambda.ICL
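
## A possible follow-up (not part of the original example): compare the
## number of clusters selected by the BIC and the ICL, and tabulate the
## cluster assignments under the ICL-selected model
c(BIC = run$g.BIC, ICL = run$g.ICL)
table(run$labels.ICL)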
