EM Algorithm for Block diagonal Gaussian Locally Linear Mapping
bllim(tapp, yapp, in_K, in_r = NULL, ninit = 20, maxiter = 100, verb = 0, in_theta = NULL, plot = TRUE)
Returns a list with the following elements:
Final log-likelihood
Log-likelihood value at each iteration of the EM algorithm
A vector of length K of mixture weights, i.e. prior probabilities for each component
An (L x K) matrix of means of responses (Y)
An (L x L x K) array of K matrices of covariances of responses (Y)
A (D x L x K) array of the K linear transformation matrices
A (D x K) matrix in which the affine transformation vectors are stored in columns
A (D x D x K) array of the K covariance matrices of \(X\)
An (N x K) matrix of posterior probabilities
The number of parameters estimated in the model
An L x N matrix of training responses with variables in rows and subjects in columns
A D x N matrix of training covariates with variables in rows and subjects in columns
Initial number of components or number of clusters
Initial assignments (default NULL). If NULL, the model is initialized with the best initialization among 20, computed by fitting a joint Gaussian mixture model on both responses and covariates.
Number of random initializations. Not used if in_r is specified. Default is 20, and the random initialization that maximizes the likelihood is retained.
Maximum number of iterations (default 100). The algorithm stops if the number of iterations exceeds maxiter or if the difference in log-likelihood between two successive iterations is smaller than a threshold fixed to \(0.001 (\max(LL)-\min(LL))\), where \(LL\) is the vector of log-likelihoods at the successive iterations (a small numerical illustration of this rule is given after the argument descriptions).
Verbosity: print out the progression of the algorithm. If verb=0, nothing is printed; if verb=1, the progression is printed out. Default is 0.
Initial parameters (default NULL), same structure as the output of this function. The EM algorithm can be initialized either with initial assignments or initial parameters values.
Displays plots to allow the user to check that the slope heuristics can be applied confidently to select the conditional block structure of predictors, as in the capushe-package. Default is TRUE.
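As a small numerical illustration (not package code) of the stopping rule described for maxiter, using a hypothetical vector of log-likelihoods across EM iterations:
LL <- c(-1520.3, -1405.8, -1398.2, -1398.15)  # hypothetical log-likelihoods over iterations
threshold <- 0.001 * (max(LL) - min(LL))      # threshold fixed by the heuristic above
delta <- LL[length(LL)] - LL[length(LL) - 1]  # log-likelihood gain at the last iteration
delta < threshold                             # the EM algorithm stops when this is TRUE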
Emeline Perthame (emeline.perthame@pasteur.fr), Emilie Devijver (emilie.devijver@kuleuven.be), Melina Gallopin (melina.gallopin@u-psud.fr)
The BLLiM model implemented in this function addresses the following non-linear mapping problem:
$$ E(Y | X=x) = g(x),$$
where \(Y\) is an L-vector of multivariate responses and \(X\) is a large D-vector of covariate profiles such that \(D \gg L\). Like gllim and sllim, the bllim function aims at estimating the non-linear regression function \(g\).
First, the methods of this package are based on an inverse regression strategy. The inverse conditional relation \(p(X | Y)\) is specified in such a way that the forward relation of interest \(p(Y | X)\) can be deduced in closed form. Under some hypotheses on the covariance structures, the large number \(D\) of covariates is handled by this inverse regression trick, which acts as a dimension reduction technique. The number of parameters to estimate is therefore drastically reduced. Second, we propose to approximate the non-linear regression function \(g\) by a piecewise affine function. Therefore, a hidden discrete variable \(Z\) is introduced in order to divide the space into \(K\) regions such that an affine model holds between responses Y and variables X in each region \(k\): $$X = \sum_{k=1}^K I_{Z=k} (A_k Y + b_k + E_k)$$ where \(A_k\) is a \(D \times L\) matrix of coefficients for regression \(k\), \(b_k\) is a D-vector of intercepts and \(E_k\) is a Gaussian noise with covariance matrix \(\Sigma_k\).
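As a minimal illustration (not part of the package), one observation from this piecewise affine model can be simulated as follows, with arbitrary dimensions D = 4, L = 2, K = 2 and a spherical noise for simplicity:
set.seed(1)
D <- 4; L <- 2; K <- 2
A <- array(rnorm(D * L * K), dim = c(D, L, K))  # one D x L matrix A_k per component
b <- matrix(rnorm(D * K), nrow = D)             # one D-vector b_k per component
prop <- c(0.4, 0.6)                             # mixture proportions of Z
y <- rnorm(L)                                   # a response vector Y
z <- sample(1:K, 1, prob = prop)                # hidden affine component Z
e <- rnorm(D, sd = 0.1)                         # Gaussian noise E_k (spherical here)
x <- A[, , z] %*% y + b[, z] + e                # X = A_Z Y + b_Z + E_Z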
BLLiM is defined as the following hierarchical Gaussian mixture model for the inverse conditional density \((X | Y)\): $$p(X | Y=y,Z=k;\theta) = N(X; A_k y+b_k,\Sigma_k)$$ $$p(Y | Z=k; \theta) = N(Y; c_k,\Gamma_k)$$ $$p(Z=k)=\pi_k$$ where \(\Sigma_k\) is a \(D \times D\) block-diagonal covariance structure automatically learnt from data. \(\theta\) is the set of parameters \(\theta=(\pi_k,c_k,\Gamma_k,A_k,b_k,\Sigma_k)_{k=1}^K\). The forward conditional density of interest \(p(Y | X)\) is deduced from these equations and is also a Gaussian mixture of regressions.
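For reference, these forward parameters follow from standard Gaussian conditioning (the starred notation below is introduced here only for exposition): $$p(Y | X=x,Z=k;\theta) = N(Y; A^*_k x + b^*_k, \Sigma^*_k)$$ with $$\Sigma^*_k = (\Gamma_k^{-1} + A_k^T \Sigma_k^{-1} A_k)^{-1}, \quad A^*_k = \Sigma^*_k A_k^T \Sigma_k^{-1}, \quad b^*_k = \Sigma^*_k (\Gamma_k^{-1} c_k - A_k^T \Sigma_k^{-1} b_k),$$ and component weights $$p(Z=k | X=x;\theta) \propto \pi_k N(x; A_k c_k + b_k, \Sigma_k + A_k \Gamma_k A_k^T).$$ Prediction is then typically based on the conditional expectation $$E(Y | X=x) = \sum_{k=1}^K p(Z=k | X=x;\theta)(A^*_k x + b^*_k).$$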
For a given number of affine components (or clusters) K and a given block structure, the number of parameters to estimate is: $$(K-1)+ K(DL+D+L+ nbpar_{\Sigma}+L(L+1)/2)$$ where \(L\) is the dimension of the response, \(D\) is the dimension of the covariates and \(nbpar_{\Sigma}\) is the total number of parameters in the large covariance matrix \(\Sigma_k\) in each cluster. This number of parameters depends on the number and size of the blocks in each matrix.
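As an illustration, this count can be evaluated with a small helper (hypothetical, not exported by the package), assuming for simplicity the same block structure in every cluster and taking \(nbpar_{\Sigma}\) as the number of free parameters of one block-diagonal matrix \(\Sigma_k\), i.e. the sum of \(d_b(d_b+1)/2\) over its blocks of sizes \(d_b\):
nb_param_bllim <- function(K, D, L, block_sizes) {
  nbpar_sigma <- sum(block_sizes * (block_sizes + 1) / 2)  # free parameters of one block-diagonal Sigma_k
  (K - 1) + K * (D * L + D + L + nbpar_sigma + L * (L + 1) / 2)
}
## Example: K = 5 clusters, D = 50 covariates, L = 2 responses,
## Sigma_k made of 10 blocks of size 5
nb_param_bllim(K = 5, D = 50, L = 2, block_sizes = rep(5, 10))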
Two hyperparameters must be estimated to run BLLiM:
Number of mixture components (or clusters) \(K\): we propose to use the BIC criterion or slope heuristics as implemented in capushe-package (a BIC-based sketch is given after this list)
For a given number of clusters K, the block structure of the large covariance matrices specific to each cluster: the size and the number of blocks of each \(\Sigma_k\) matrix are automatically learnt from data, using an extension of the shock procedure (see shock-package). This procedure is based on successive thresholding of the sample conditional covariance matrices within clusters, building a collection of block structure candidates. The final block structure is retained using slope heuristics.
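A possible BIC-based sketch for the choice of \(K\) is given below. It assumes the fitted list exposes the final log-likelihood and the number of parameters under the names LLf and nbpar (as in the gllim output; this naming is an assumption here), and it refers to the responses and covariates objects built in the Examples section; the calls are commented out because they may be long to run.
# N <- ncol(covariates)
# Ks <- 2:6
# bics <- sapply(Ks, function(k) {
#   mod_k <- bllim(responses, covariates, in_K = k, verb = 0, plot = FALSE)
#   -2 * mod_k$LLf + mod_k$nbpar * log(N)   # BIC of the model with k components
# })
# K_best <- Ks[which.min(bics)]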
BLLiM is not only a prediction model but also an interpretable tool. For example, it is useful for the analysis of transcriptomic data: if the covariates are genes and the response is a phenotype, the model provides clusters of individuals based on the relation between gene expression and the phenotype, and also leads to inferring a gene regulatory network specific to each cluster of individuals.
[1] E. Devijver, M. Gallopin, E. Perthame. Nonlinear network-based quantitative trait prediction from transcriptomic data. Submitted, 2017, available at https://arxiv.org/abs/1701.07899.
xLLiM-package, emgm, gllim_inverse_map, capushe-package, shock-package
data(data.xllim)
## Setting 5 components in the model
K = 5
## the model can be initialized by running an EM algorithm for Gaussian Mixtures (EMGM)
r = emgm(data.xllim, init=K)
## and then the bllim model is estimated
responses = data.xllim[1:2,] # 2 responses in rows and 100 observations in columns
covariates = data.xllim[3:52,] # 50 covariates in rows and 100 observations in columns
## if initialization is not specified, the model is automatically initialized by EMGM
# mod = bllim(responses,covariates,in_K=K)
## Prediction can be performed using prediction function gllim_inverse_map
# pred = gllim_inverse_map(covariates,mod)$x_exp