The GLLiM model implemented in this function addresses the following non-linear mapping problem:
$$ E(Y | X=x) = g(x),$$
where \(Y\) is an L-vector of multivariate responses and \(X\) is a large D-vector of covariate profiles such that \(D \gg L\). The methods implemented in this package aim at estimating the non-linear regression function \(g\).
First, the methods of this package are based on an inverse regression strategy. The inverse conditional relation \(p(X | Y)\) is specified in such a way that the forward relation of interest \(p(Y | X)\) can be deduced in closed form. Under some hypotheses on the covariance structures, the large number \(D\) of covariates is handled by this inverse regression trick, which acts as a dimension reduction technique, so that the number of parameters to estimate is drastically reduced. Second, the non-linear regression function \(g\) is approximated by a piecewise affine function. To this end, a hidden discrete variable \(Z\) is introduced, dividing the space into \(K\) regions such that an affine model holds between the responses \(Y\) and the covariates \(X\) in each region \(k\):
$$X = \sum_{k=1}^K I_{Z=k} (A_k Y + b_k + E_k)$$
where \(A_k\) is a \(D \times L\) matrix of coefficients for regression \(k\), \(b_k\) is a D-vector of intercepts and \(E_k\) is a Gaussian noise vector with covariance matrix \(\Sigma_k\).
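As an illustration, data can be simulated from this hierarchical piecewise affine model with NumPy; all dimensions and parameter values below are hypothetical toy choices, not package defaults:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, L = 3, 50, 2          # clusters, covariate dim D, response dim L (toy sizes)
N = 200                     # number of observations

# Hypothetical parameters of the inverse model X = A_k Y + b_k + E_k
pi = np.full(K, 1.0 / K)                    # mixture weights pi_k
c = rng.normal(size=(K, L))                 # cluster means c_k of Y
A = rng.normal(size=(K, D, L))              # regression matrices A_k (D x L)
b = rng.normal(size=(K, D))                 # intercepts b_k
sigma = 0.1                                 # isotropic noise, Sigma_k = sigma^2 I

Z = rng.choice(K, size=N, p=pi)             # hidden region assignments Z
Y = c[Z] + rng.normal(size=(N, L))          # responses (Gamma_k = I here)
X = np.einsum('ndl,nl->nd', A[Z], Y) + b[Z] + sigma * rng.normal(size=(N, D))
print(X.shape, Y.shape)                     # (200, 50) (200, 2)
```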
GLLiM is defined as the following hierarchical Gaussian mixture model for the inverse conditional density \((X | Y)\):
$$p(X | Y=y,Z=k;\theta) = N(X; A_k y+b_k,\Sigma_k)$$
$$p(Y | Z=k; \theta) = N(Y; c_k,\Gamma_k)$$
$$p(Z=k)=\pi_k$$
where \(\theta\) is the set of parameters \(\theta=(\pi_k,c_k,\Gamma_k,A_k,b_k,\Sigma_k)_{k=1}^K\).
The forward conditional density of interest \(p(Y | X)\) is deduced from these equations and is also a Gaussian mixture of regression model.
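For reference, standard Gaussian conditioning gives this forward mixture in closed form; the expressions below are a sketch derived from the inverse parameterization above, with starred quantities denoting the forward parameters:
$$p(Y | X=x;\theta) = \sum_{k=1}^K \frac{\pi_k N(x; c^*_k,\Gamma^*_k)}{\sum_{j=1}^K \pi_j N(x; c^*_j,\Gamma^*_j)} N(Y; A^*_k x + b^*_k,\Sigma^*_k)$$
$$c^*_k = A_k c_k + b_k, \quad \Gamma^*_k = \Sigma_k + A_k \Gamma_k A_k^T$$
$$\Sigma^*_k = (\Gamma_k^{-1} + A_k^T \Sigma_k^{-1} A_k)^{-1}, \quad A^*_k = \Sigma^*_k A_k^T \Sigma_k^{-1}, \quad b^*_k = \Sigma^*_k(\Gamma_k^{-1} c_k - A_k^T \Sigma_k^{-1} b_k)$$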
gllim allows the addition of \(L_w\) latent variables, in order to account for correlation among covariates or when the responses are assumed to be only partially observed. Adding latent factors is known to improve prediction accuracy, provided \(L_w\) is not too large relative to the number of covariates. When latent factors are added, the dimension of the response is \(L=L_t+L_w\); otherwise \(L=L_t\).
For GLLiM, the number of parameters to estimate is:
$$(K-1)+ K(DL+D+L_t+ nbpar_{\Sigma}+nbpar_{\Gamma})$$
where \(L=L_w+L_t\) and \(nbpar_{\Sigma}\) (resp. \(nbpar_{\Gamma}\)) is the number of parameters in each of the large (resp. small) covariance matrix \(\Sigma_k\) (resp. \(\Gamma_k\)). For example,
if the constraint on \(\Sigma\) is cstr$Sigma="i", then \(nbpar_{\Sigma}=1\), which is the default constraint in the gllim function,
if the constraint on \(\Sigma\) is cstr$Sigma="d", then \(nbpar_{\Sigma}=D\),
if the constraint on \(\Sigma\) is cstr$Sigma="", then \(nbpar_{\Sigma}=D(D+1)/2\),
if the constraint on \(\Sigma\) is cstr$Sigma="*", then \(nbpar_{\Sigma}=D(D+1)/(2K)\).
The rule to compute the number of parameters of \(\Gamma\) is the same as for \(\Sigma\), replacing \(D\) by \(L_t\). Currently the \(\Gamma_k\) matrices are not constrained and \(nbpar_{\Gamma}=L_t(L_t+1)/2\), because for identifiability reasons the \(L_w\) part is set to the identity matrix.
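As a quick check, the parameter count above can be computed directly. Here gllim_n_params is a hypothetical helper written for illustration, not a function of the package:

```python
# Hypothetical helper (not part of the package) counting GLLiM parameters
# under the formula (K-1) + K*(D*L + D + L_t + nbpar_Sigma + nbpar_Gamma).
def gllim_n_params(K, D, Lt, Lw=0, cstr_sigma="i"):
    L = Lt + Lw
    nbpar_sigma = {
        "i": 1,                      # isotropic: one variance per cluster
        "d": D,                      # diagonal
        "": D * (D + 1) / 2,         # full, unconstrained
        "*": D * (D + 1) / (2 * K),  # full, shared across the K clusters
    }[cstr_sigma]
    nbpar_gamma = Lt * (Lt + 1) / 2  # Gamma_k full on the L_t block only
    return (K - 1) + K * (D * L + D + Lt + nbpar_sigma + nbpar_gamma)

print(gllim_n_params(K=5, D=100, Lt=2, cstr_sigma="i"))  # 1534.0
```

This makes the dimension reduction explicit: with \(K=5\), \(D=100\) and \(L_t=2\), the isotropic constraint yields 1534 parameters, against 26779 for unconstrained \(\Sigma_k\).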
The user must choose the number of mixture components \(K\) and, if needed, the number of latent factors \(L_w\). For small datasets (fewer than 100 observations), it is suggested to select both \((K,L_w)\) by minimizing the BIC criterion. For larger datasets, to save computational time, it is suggested to set \(L_w\) using BIC while setting \(K\) to an arbitrary value large enough to capture the non-linear relations between responses and covariates and small enough to keep several observations (at least 10) in each cluster. Indeed, for large datasets, the number of clusters should not have a strong impact on the results as long as it is sufficiently large.
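The BIC-based selection of \(K\) can be sketched as follows; the log-likelihoods and parameter counts below are purely illustrative numbers standing in for fitted gllim models:

```python
import math

def bic(log_lik, n_params, n_obs):
    # Bayesian Information Criterion: lower is better.
    return -2.0 * log_lik + n_params * math.log(n_obs)

# Hypothetical (maximized log-likelihood, parameter count) per candidate K,
# as would be obtained by fitting the model for each value on a small grid.
candidates = {2: (-1450.0, 615), 3: (-1380.0, 922), 5: (-1360.0, 1536)}
best_K = min(candidates, key=lambda K: bic(*candidates[K], n_obs=100))
print(best_K)  # 2
```

With only 100 observations, the penalty term dominates and the smallest model wins here, consistent with the advice above that BIC selection matters most for small datasets.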