EM Algorithm for Block diagonal Gaussian Locally Linear Mapping
bllim(tapp, yapp, in_K, in_r = NULL, ninit = 20, maxiter = 100, verb = 0, in_theta = NULL, plot = TRUE)
Returns a list with the following elements:
Final log-likelihood
Log-likelihood value at each iteration of the EM algorithm
A vector of length K of mixture weights, i.e. prior probabilities for each component
An (L x K) matrix of means of responses (Y)
An (L x L x K) array of K matrices of covariances of responses (Y)
A (D x L x K) array of the K linear transformation matrices
A (D x K) matrix in which the affine transformation vectors are stored in columns
A (D x D x K) array of the K covariance matrices of \(X\)
An (N x K) matrix of posterior probabilities
The number of parameters estimated in the model
An L x N matrix of training responses with variables in rows and subjects in columns
A D x N matrix of training covariates with variables in rows and subjects in columns
Initial number of components or number of clusters
Initial assignments (default NULL). If NULL, the model is initialized with the best initialization among 20, computed by fitting a joint Gaussian mixture model on both responses and covariates.
Number of random initializations. Not used if in_r is specified. Default is 20, and the random initialization that maximizes the likelihood is retained.
Maximum number of iterations (default 100). The algorithm stops if the number of iterations exceeds maxiter or if the difference in log-likelihood between two successive iterations is smaller than a threshold fixed to \(0.001 (\max(LL)-\min(LL))\), where \(LL\) is the vector of log-likelihoods at the successive iterations (a small numerical illustration of this rule is given after the argument descriptions).
Verbosity: print out the progression of the algorithm. If verb=0, nothing is printed; if verb=1, the progression is printed out. Default is 0.
Initial parameters (default NULL), same structure as the output of this function. The EM algorithm can be initialized either with initial assignments or initial parameters values.
Displays plots to allow the user to check that the slope heuristics can be applied confidently to select the conditional block structure of predictors, as in the capushe-package. Default is TRUE.
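As a small numerical illustration (not package code) of the stopping rule described for maxiter, using a hypothetical vector of log-likelihoods across EM iterations:
LL <- c(-1520.3, -1405.8, -1398.2, -1398.15)  # hypothetical log-likelihoods over iterations
threshold <- 0.001 * (max(LL) - min(LL))      # threshold fixed by the heuristic above
delta <- LL[length(LL)] - LL[length(LL) - 1]  # log-likelihood gain at the last iteration
delta < threshold                             # the EM algorithm stops when this is TRUE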
Emeline Perthame (emeline.perthame@pasteur.fr), Emilie Devijver (emilie.devijver@kuleuven.be), Melina Gallopin (melina.gallopin@u-psud.fr)
The BLLiM model implemented in this function addresses the following non-linear mapping problem:
$$ E(Y | X=x) = g(x),$$
where \(Y\) is an L-vector of multivariate responses and \(X\) is a large D-vector of covariate profiles such that \(D \gg L\). Like gllim and sllim, the bllim function aims at estimating the non-linear regression function \(g\).
First, the methods of this package are based on an inverse regression strategy. The inverse conditional relation \(p(X | Y)\) is specified in such a way that the forward relation of interest \(p(Y | X)\) can be deduced in closed form. Under some hypotheses on the covariance structures, the large number \(D\) of covariates is handled by this inverse regression trick, which acts as a dimension reduction technique. The number of parameters to estimate is therefore drastically reduced. Second, we propose to approximate the non-linear regression function \(g\) by a piecewise affine function. Therefore, a hidden discrete variable \(Z\) is introduced in order to divide the space into \(K\) regions such that an affine model holds between responses Y and variables X in each region \(k\): $$X = \sum_{k=1}^K I_{Z=k} (A_k Y + b_k + E_k)$$ where \(A_k\) is a \(D \times L\) matrix of coefficients for regression \(k\), \(b_k\) is a D-vector of intercepts and \(E_k\) is a Gaussian noise with covariance matrix \(\Sigma_k\).
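As a minimal illustration (not part of the package), one observation from this piecewise affine model can be simulated as follows, with arbitrary dimensions D = 4, L = 2, K = 2 and a spherical noise for simplicity:
set.seed(1)
D <- 4; L <- 2; K <- 2
A <- array(rnorm(D * L * K), dim = c(D, L, K))  # one D x L matrix A_k per component
b <- matrix(rnorm(D * K), nrow = D)             # one D-vector b_k per component
prop <- c(0.4, 0.6)                             # mixture proportions of Z
y <- rnorm(L)                                   # a response vector Y
z <- sample(1:K, 1, prob = prop)                # hidden affine component Z
e <- rnorm(D, sd = 0.1)                         # Gaussian noise E_k (spherical here)
x <- A[, , z] %*% y + b[, z] + e                # X = A_Z Y + b_Z + E_Z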
BLLiM is defined as the following hierarchical Gaussian mixture model for the inverse conditional density \((X | Y)\): $$p(X | Y=y,Z=k;\theta) = N(X; A_k y+b_k,\Sigma_k)$$ $$p(Y | Z=k; \theta) = N(Y; c_k,\Gamma_k)$$ $$p(Z=k)=\pi_k$$ where \(\Sigma_k\) is a \(D \times D\) block-diagonal covariance structure automatically learnt from data. \(\theta\) is the set of parameters \(\theta=(\pi_k,c_k,\Gamma_k,A_k,b_k,\Sigma_k)_{k=1}^K\). The forward conditional density of interest \(p(Y | X)\) is deduced from these equations and is also a Gaussian mixture of regressions.
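For reference, these forward parameters follow from standard Gaussian conditioning (the starred notation below is introduced here only for exposition): $$p(Y | X=x,Z=k;\theta) = N(Y; A^*_k x + b^*_k, \Sigma^*_k)$$ with $$\Sigma^*_k = (\Gamma_k^{-1} + A_k^T \Sigma_k^{-1} A_k)^{-1}, \quad A^*_k = \Sigma^*_k A_k^T \Sigma_k^{-1}, \quad b^*_k = \Sigma^*_k (\Gamma_k^{-1} c_k - A_k^T \Sigma_k^{-1} b_k),$$ and component weights $$p(Z=k | X=x;\theta) \propto \pi_k N(x; A_k c_k + b_k, \Sigma_k + A_k \Gamma_k A_k^T).$$ Prediction is then typically based on the conditional expectation $$E(Y | X=x) = \sum_{k=1}^K p(Z=k | X=x;\theta)(A^*_k x + b^*_k).$$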
For a given number of affine components (or clusters) K and a given block structure, the number of parameters to estimate is: $$(K-1)+ K(DL+D+L+ nbpar_{\Sigma}+L(L+1)/2)$$ where \(L\) is the dimension of the response, \(D\) is the dimension of the covariates and \(nbpar_{\Sigma}\) is the total number of parameters in the large covariance matrix \(\Sigma_k\) in each cluster. This number of parameters depends on the number and size of the blocks in each matrix.
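As an illustration, this count can be evaluated with a small helper (hypothetical, not exported by the package), assuming for simplicity the same block structure in every cluster and taking \(nbpar_{\Sigma}\) as the number of free parameters of one block-diagonal matrix \(\Sigma_k\), i.e. the sum of \(d_b(d_b+1)/2\) over its blocks of sizes \(d_b\):
nb_param_bllim <- function(K, D, L, block_sizes) {
  nbpar_sigma <- sum(block_sizes * (block_sizes + 1) / 2)  # free parameters of one block-diagonal Sigma_k
  (K - 1) + K * (D * L + D + L + nbpar_sigma + L * (L + 1) / 2)
}
## Example: K = 5 clusters, D = 50 covariates, L = 2 responses,
## Sigma_k made of 10 blocks of size 5
nb_param_bllim(K = 5, D = 50, L = 2, block_sizes = rep(5, 10))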
Two hyperparameters must be estimated to run BLLiM:
Number of mixture components (or clusters) \(K\): we propose to use the BIC criterion or slope heuristics as implemented in capushe-package (a BIC-based sketch is given after this list)
For a given number of clusters K, the block structure of the large covariance matrices specific to each cluster: the size and the number of blocks of each \(\Sigma_k\) matrix are automatically learnt from data, using an extension of the shock procedure (see shock-package). This procedure is based on successive thresholding of the sample conditional covariance matrices within clusters, building a collection of block structure candidates. The final block structure is retained using slope heuristics.
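A possible BIC-based sketch for the choice of \(K\) is given below. It assumes the fitted list exposes the final log-likelihood and the number of parameters under the names LLf and nbpar (as in the gllim output; this naming is an assumption here), and it refers to the responses and covariates objects built in the Examples section; the calls are commented out because they may be long to run.
# N <- ncol(covariates)
# Ks <- 2:6
# bics <- sapply(Ks, function(k) {
#   mod_k <- bllim(responses, covariates, in_K = k, verb = 0, plot = FALSE)
#   -2 * mod_k$LLf + mod_k$nbpar * log(N)   # BIC of the model with k components
# })
# K_best <- Ks[which.min(bics)]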
BLLiM is not only a prediction model but also an interpretable tool. For example, it is useful for the analysis of transcriptomic data: if the covariates are genes and the response is a phenotype, the model provides clusters of individuals based on the relation between gene expression and the phenotype, and also leads to inferring a gene regulatory network specific to each cluster of individuals.
[1] E. Devijver, M. Gallopin, E. Perthame. Nonlinear network-based quantitative trait prediction from transcriptomic data. Submitted, 2017, available at https://arxiv.org/abs/1701.07899.
xLLiM-package, emgm, gllim_inverse_map, capushe-package, shock-package
data(data.xllim)
## Setting 5 components in the model
K = 5
## the model can be initialized by running an EM algorithm for Gaussian Mixtures (EMGM)
r = emgm(data.xllim, init=K)
## and then the bllim model is estimated
responses = data.xllim[1:2,] # 2 responses in rows and 100 observations in columns
covariates = data.xllim[3:52,] # 50 covariates in rows and 100 observations in columns
## if initialization is not specified, the model is automatically initialized by EMGM
# mod = bllim(responses,covariates,in_K=K)
## Prediction can be performed using prediction function gllim_inverse_map
# pred = gllim_inverse_map(covariates,mod)$x_exp