The goal of preprocess_data() is to obtain relevant clusters for initializing GLLiM, SLLiM, or BLLiM, coupled with feature selection for high-dimensional datasets. This function is an alternative to the default initialization implemented in gllim(), sllim() and bllim().
In this function, clusters are initialized with k-means, and variable selection is performed with a LASSO (glmnet) within each cluster. The selected features are then merged to obtain a subset of variables before running any prediction method of xLLiM.
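The three steps described above (k-means clustering, a LASSO within each cluster, and a merge of the selected variables) can be illustrated with a minimal sketch. This is not the source code of preprocess_data(): the simulated data, the single response (L = 1), the choice to cluster on the stacked responses and covariates, and the fixed penalty lambda = 0.2 are all assumptions made for illustration. Only kmeans() from stats and glmnet() and coef() from the glmnet package are used.

```r
library(glmnet)

set.seed(1)
N <- 200   # number of subjects
D <- 50    # number of covariates

# Simulated covariates: D variables in rows, N subjects in columns
yapp <- matrix(rnorm(D * N), nrow = D, ncol = N)
# Simulated response (L = 1) that depends on covariates 1 and 2 only
tapp <- matrix(2 * yapp[1, ] - 3 * yapp[2, ] + rnorm(N, sd = 0.1), nrow = 1)
in_K <- 2

# Step 1: k-means on the subjects. kmeans() expects observations in rows,
# hence the transpose. Clustering on rbind(tapp, yapp) is an assumption.
km <- kmeans(t(rbind(tapp, yapp)), centers = in_K, nstart = 5)

# Step 2: LASSO within each cluster, keeping the nonzero coefficients.
# The fixed lambda is an illustrative choice, not a package default.
selected_by_cluster <- lapply(seq_len(in_K), function(k) {
  idx <- which(km$cluster == k)
  fit <- glmnet(x = t(yapp[, idx, drop = FALSE]), y = tapp[1, idx],
                lambda = 0.2)
  beta <- as.matrix(coef(fit))[-1, 1]  # drop the intercept
  unname(which(beta != 0))
})

# Step 3: merge the per-cluster selections into one set of indices
selected <- sort(unique(unlist(selected_by_cluster)))
print(selected)
```

In this sketch the merged indices can then be used to subset the rows of yapp before fitting a model. In practice the penalty would usually be tuned, for example by cross-validation, rather than fixed.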
preprocess_data(tapp,yapp,in_K,...)
Vector of the indices of the selected variables. Selection is made within each cluster and the results are then merged.
Initial cluster assignments obtained with k-means
tapp: An L x N matrix of training responses with variables in rows and subjects in columns
yapp: A D x N matrix of training covariates with variables in rows and subjects in columns
in_K: Initial number of components or number of clusters
...: Other arguments passed on to glmnet
Emeline Perthame (emeline.perthame@pasteur.fr), Emilie Devijver (emilie.devijver@kuleuven.be), Melina Gallopin (melina.gallopin@u-psud.fr)
[1] E. Devijver, M. Gallopin, E. Perthame. Nonlinear network-based quantitative trait prediction from transcriptomic data. Submitted, 2017, available at https://arxiv.org/abs/1701.07899.
xLLiM-package, glmnet-package, kmeans