Learn R Programming

HTSCluster (version 2.0.4)

PoisMixClus: Poisson mixture model estimation and model selection

Description

These functions implement the EM and CEM algorithms for parameter estimation in a Poisson mixture model for clustering high throughput sequencing observations (e.g., genes) for a single number of clusters (PoisMixClus) or a sequence of cluster numbers (PoisMixClusWrapper). Parameters are initialized using a Small-EM strategy as described in Rau et al. (2011) or the splitting small-EM strategy described in Papastamoulis et al. (2012), and model selection is performed using the ICL criteria. Note that these functions implement the PMM-I and PMM-II models described in Rau et al. (2011).

Usage

PoisMixClus(y, g, conds, lib.size = TRUE, lib.type = "TMM", 
    init.type = "small-em", init.runs = 1, init.iter = 10,
    alg.type = "EM", cutoff = 10e-6, iter = 1000, fixed.lambda = NA,
    equal.proportions = FALSE, prev.labels = NA, 
    prev.probaPost = NA, verbose = FALSE, interpretation = "sum",
	EM.verbose = FALSE, wrapper = FALSE, s = NA)

PoisMixClusWrapper(y, gmin = 1, gmax, conds, lib.size = TRUE,
    lib.type = "TMM", gmin.init.type = "small-em",
    init.runs = 1, init.iter = 10, split.init = TRUE, alg.type = "EM", 
    cutoff = 10e-6, iter = 1000, fixed.lambda = NA, 
    equal.proportions = FALSE, verbose = FALSE, interpretation = "sum",
	EM.verbose = FALSE)

Arguments

y
(n x q) matrix of observed counts for n observations and q variables
g
Number of clusters (a single value). If fixed.lambda contains a list of lambda values to be fixed, g corresponds to the number of clusters in addition to those fixed.
gmin
The minimum number of clusters in a sequence to be tested, where gmin corresponds to the minimum number of clusters in addition to those fixed.
gmax
The maximum number of clusters in a sequence to be tested, where gmax corresponds to the maximum number of clusters in addition to those fixed.
conds
Vector of length q defining the condition (treatment group) for each variable (column) in y
lib.size
If FALSE, the library size parameter is not included in the model (i.e., the PMM-I model). If TRUE, the library size parameter is included in the Poisson mixture model (i.e., the PMM-II model)
lib.type
If lib.size = TRUE, the estimator to be used for the library size parameter: (TC for total count, UQ for upper quantile, Med for median,
init.type
Type of initialization strategy to be used (small-em for the Small-EM strategy described in Rau et al. (2011), and kmeans for a simple K-means initialization)
gmin.init.type
Type of initialization strategy to be used for the minimum number of clusters in a sequence (gmin): (small-em for the Small-EM strategy described in Rau et al. (2011), and kmeans f
init.runs
Number of iterations to be used for the Small-EM strategy described in Rau et al. (2011), with a default value of 1
init.iter
Number of iterations to be used within each run for the Small-EM strategry, with a default value of 10
split.init
If TRUE, the splitting initialization strategy of Papastamoulis et al. (2012) will be used for cluster sizes (gmin+1, ..., gmax). If FALSE, the initialization strategy specified in gmin.init.type<
alg.type
Algorithm to be used for parameter estimation (EM or CEM)
cutoff
Cutoff to declare algorithm convergence (in terms of differences in log likelihoods from one iteration to the next)
iter
Maximum number of iterations to be run for the chosen algorithm
fixed.lambda
If one (or more) clusters with fixed values of lambda is desired, a list containing vectors of length d (the number of conditions). Note that the values of lambda chosen must satisfy the constraint noted in the technical report.
equal.proportions
If TRUE, the cluster proportions are set to be equal for all clusters. Default is FALSE (unequal cluster proportions).
prev.labels
A vector of length n of cluster labels obtained from the previous run (g-1 clusters) to be used with the splitting small-EM strategy described in described in Papastamoulis et al. (2012). For other initialization strategies, this parameter tak
prev.probaPost
An n x (g-1) matrix of the conditional probabilities of each observation belonging to each of the g-1 clusters from the previous run, to be used with the splitting small-EM strategy of described in Papastamoulis et al. (2012
verbose
If TRUE, include verbose output
interpretation
If "sum", cluster behavior is interpreted with respect to overall gene expression level (sums per gene), otherwise for "mean", cluster behavior is interpreted with respect to mean gene expression (means per gene).
EM.verbose
If TRUE, more informative output is printed about the EM algorithm, including the number of iterations run and the difference between log-likelihoods at the last and penultimate iterations.
wrapper
TRUE if the PoisMixClus function is run from within the PoisMixClusWrapper main function, and FALSE otherwise. This mainly helps to avoid recalculating parameters several times that are used throughout
s
Estimate of normalized per-variable library size. If NA, these are re-estimated within the PoisMixClus function. When called as part of the PoisMixClusWrapper main function, these are estimated a single time and passed to the

Value

  • lambda(d x g) matrix containing the estimate of $\hat{\ensuremath\boldsymbol{\lambda}}$
  • piVector of length g containing the estimate of $\hat{\ensuremath\boldsymbol{\pi}}$
  • labelsVector of length n containing the cluster assignments of the n observations
  • probaPostMatrix containing the conditional probabilities of belonging to each cluster for all observations
  • log.likeValue of log likelihood
  • BICValue of BIC criterion
  • ICLValue of ICL criterion
  • alg.typeEstimation algorithm used; matches the argument alg.type above)
  • lib.sizeTRUE if library size included in the model (matches the argument alg.type above)
  • lib.typeType of library size normalization used (if lib.size = TRUE; matches the argument lib.type above)
  • sLibrary size normalization factors used (if lib.size = TRUE)
  • condsConditions specified by user
  • iterationsNumber of iterations run
  • logLikeDiffDifference in log-likelihood between the last and penultimate iterations of the algorithm
  • loglike.allLog likelihoods calculated for each of the fitted models for cluster sizes gmin, ..., gmax
  • capusheResults of capushe model selection, an object of class "Capushe"
  • ICL.allICL values calculated for each of the fitted models for cluster sizes gmin, ..., gmax
  • ICL.resultsObject of class HTSCluster giving the results from the model chosen via the ICL criterion
  • BIC.resultsObject of class HTSCluster giving the results from the model chosen via the BIC
  • DDSE.resultsObject of class HTSCluster giving the results from the model chosen via the DDSE slope heuristics criterion
  • Djump.resultsObject of class HTSCluster giving the results from the model chosen via the Djump slope heuristics criterion
  • all.resultsList of objects of class HTSCluster giving the results for all models for cluster sizes gmin, ..., gmax
  • model.selectionType of criteria used for model selection, equal to NA for direct calls to PoisMixClus or "DDSE", "Djump", "BIC", or "ICL" for the respective selected models for calls to PoisMixClusWrapper

Details

Output of PoisMixClus is an S3 object of class HTSCluster, and output of PoisMixClusWrapper is an S3 object of class HTSClusterWrapper. In a Poisson mixture model, the data $\mathbf{y}$ are assumed to come from g distinct subpopulations (clusters), each of which is modeled separately; the overall population is thus a mixture of these subpopulations. In the case of a Poisson mixture model with g components, the model may be written as $$f(\mathbf{y};g,\ensuremath\boldsymbol{\Psi}_g) = \prod_{i=1}^n \sum_{k=1}^g \pi_k \prod_{j=1}^{d}\prod_{l=1}^{r_j} P(y_{ijl} ; \ensuremath\boldsymbol{\theta}_k)$$ for $i = 1, \ldots, n$ observations in $l = 1, \ldots, r_j$ replicates of $j = 1, \ldots, d$ conditions (treatment groups), where $P(\cdot)$ is the standard Poisson density, $\ensuremath\boldsymbol{\Psi}_g = (\pi_1,\ldots,\pi_{g-1}, \ensuremath\boldsymbol{\theta}^\prime)$, $\ensuremath\boldsymbol{\theta}^\prime$ contains all of the parameters in $\ensuremath\boldsymbol{\theta}_1,\ldots,\ensuremath\boldsymbol{\theta}_g$ assumed to be distinct, and $\ensuremath\boldsymbol{\pi} = (\pi_1,\ldots,\pi_g)^\prime$ are the mixing proportions such that $\pi_k$ is in (0,1) for all k and $\sum_k \pi_k = 1$. We consider two possible parameterizations for the mean $\ensuremath\boldsymbol{\theta}_k = (\mu_{ijlk})$. In the first, called the PMM-I, we consider $$\mu_{ijlk} = w_i \lambda_{jk}$$ where $w_i$ corresponds to the expression level of observation i and $\ensuremath\boldsymbol{\lambda}_k = (\lambda_{1k},\ldots,\lambda_{dk})$ corresponds to the clustering parameters that define the profiles of the genes in cluster k across all variables. In the second parameterization, called the PMM-II, we consider $$\mu_{ijlk} = w_i s_{jl} \lambda_{jk}$$ where $w_i$ and $\ensuremath\boldsymbol{\lambda}_k$ are as before and $s_{jl}$ is the normalized library size (a fixed constant) for replicate l of condition j. See Rau et al. (2011) for more details on the PMM-I and PMM-II. There are two approaches to estimating the parameters of a finite mixture model and obtaining a clustering of the data: the estimation approach (via the EM algorithm) and the clustering approach (via the CEM algorithm). Parameter initialization is done using a Small-EM strategy as described in Rau et al. (2011) via the emInit function. Model selection may be performed using the BIC or ICL criteria.

References

Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biology, 11(R106), 1-28. Papastamoulis, P., Martin-Magniette, M.-L., and Maugis-Rabusseau, C. (2012). Efficient estimation of high dimensional mixtures of Poisson generalized linear models via the EM algorithm (submitted). Rau, A., Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C (2011). Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at http://hal.inria.fr/inria-00638082.

See Also

probaPost for the calculation of the conditional probability of belonging to a cluster; PoisMixMean for the calculation of the per-cluster conditional mean of each observation; logLikePoisMixDiff for the calculation of the log likelihood of a Poisson mixture model; emInit and kmeanInit for the Small-EM parameter initialization strategy

Examples

Run this code
set.seed(12345)

## Simulate data as shown in Rau et al. (2011)
## Library size setting "A", high cluster separation
## n = 200 observations

simulate <- PoisMixSim(n = 200, libsize = "A", separation = "high")
y <- simulate$y
conds <- simulate$conditions

## Run the PMM model for g = 3
## "TC" library size estimate, EM algorithm

run <- PoisMixClus(y, g = 3, conds = conds, lib.type = "TC") 

## Estimates of pi and lambda for the selected model

pi.est <- run$pi
lambda.est <- run$lambda


## Not run: PMM for 4 total clusters, with one fixed class
## "TC" library size estimate, EM algorithm
##
## run <- PoisMixClus(y, g = 3, lib.type = "TC", conds = conds, 
##    fixed.lambda = list(c(1,1,1))) 
##
##
## Not run: PMM model for 4 clusters, with equal proportions
## "TC" library size estimate, EM algorithm
##
## run <- PoisMixClus(y, g = 4, lib.type = "TC", conds = conds, 
##     equal.proportions = TRUE) 
##
##
## Not run: PMM model for g = 1, ..., 10 clusters, Split Small-EM init
##
## run1.10 <- PoisMixClusWrapper(y, gmin = 1, gmax = 10, conds = conds, 
##	lib.type = "TC")
##
##
## Not run: PMM model for g = 1, ..., 10 clusters, Small-EM init
##
## run1.10bis <-  <- PoisMixClusWrapper(y, gmin = 1, gmax = 10, conds = conds, 
##	lib.type = "TC", split.init = FALSE)
##
##
## Not run: previous model equivalent to the following
##
## for(K in 1:10) {
##	run <- PoisMixClus(y, g = K, conds = conds, lib.type = "TC")
## }

Run the code above in your browser using DataLab