Performs unsupervised feature selection followed by principal component analysis (PCA) under a row-sparse model using the ReFACTor algorithm. For example, in the context of tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), refactor can capture the variation in cell-type composition, which was shown to be a dominant sparse signal in methylation data.
refactor(X, k, sparsity = 500, C = NULL, C.remove = FALSE,
sd_threshold = 0.02, num_comp = NULL, rand_svd = FALSE,
log_file = "TCA.log", debug = FALSE)
X: An m by n matrix of measurements of m features for n observations. Each column in X is assumed to be a mixture of k different sources. Note that X must include row names and column names and that NA values are currently not supported.
k: A numeric value indicating the dimension of the signal in the data (i.e. the number of sources).
sparsity: A numeric value indicating the sparsity of the signal in the data (the number of signal rows).
C: An n by p design matrix of covariates that will be accounted for in the feature selection step. Note that C must include row names and column names and that NA values are currently not supported; set C to NULL if there are no such covariates.
C.remove: A logical value indicating whether the covariates in C should be accounted for not only in the feature selection step but also in the final calculation of the principal components (i.e. if C.remove == TRUE, the selected features will be adjusted for the covariates in C prior to calculating principal components). Setting C.remove to TRUE is recommended when the ReFACTor components are intended for correction in a downstream analysis, whereas setting C.remove to FALSE is recommended when ReFACTor is used merely to capture the sparse signals in the data (i.e. regardless of correction).
sd_threshold: A numeric value indicating a standard deviation threshold to be used for excluding low-variance features in X (i.e. features with standard deviation lower than sd_threshold will be excluded). Set sd_threshold to NULL to turn off this filter. Note that removing features with very low variability tends to improve speed and performance.
num_comp: A numeric value indicating the number of ReFACTor components to return.
rand_svd: A logical value indicating whether to use randomized SVD for estimating the low-rank structure of the data in the first step of the algorithm; randomized SVD can result in a substantial speedup for large data.
log_file: A path to an output log file. Note that if the file log_file already exists, logs will be appended to the end of the file. Set log_file to NULL to prevent output from being saved to a file.
debug: A logical value indicating whether to set the logger to a more detailed debug level; please set debug to TRUE before reporting issues.
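The input requirements above (row and column names, no NA values, optional variance filtering) can be checked before calling refactor. The following is a minimal base-R sketch on a hypothetical toy matrix; the object names and the simulated values are assumptions for illustration, and the filter only mimics what sd_threshold is described as doing rather than reproducing the package's internal code.

```r
# Hypothetical toy input: 200 features (rows) by 50 observations (columns).
set.seed(1)
m <- 200
n <- 50
X <- matrix(runif(m * n), nrow = m, ncol = n)
X[1:10, ] <- 0.5 + rnorm(10 * n, sd = 0.001)  # ten nearly constant features

# X must carry row names (features) and column names (observations).
rownames(X) <- paste0("feature", seq_len(m))
colnames(X) <- paste0("sample", seq_len(n))

# NA values are not supported, so fail early if any are present.
stopifnot(!anyNA(X))

# Mimic sd_threshold = 0.02: features with a standard deviation
# lower than the threshold are excluded.
sd_threshold <- 0.02
feature_sd <- apply(X, 1, sd)
X_filtered <- X[feature_sd >= sd_threshold, , drop = FALSE]

# A covariate matrix C is n by p and also needs row and column names.
# Its row names are assumed here to match the column names of X.
C <- matrix(rnorm(n * 2), nrow = n, ncol = 2,
            dimnames = list(colnames(X), c("cov1", "cov2")))

dim(X_filtered)
```

Filtering by hand like this is only useful for inspecting which features would be dropped; when calling refactor, passing sd_threshold is enough.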
A list with the estimated components of the ReFACTor model.
scores: An n by num_comp matrix of the ReFACTor components (the projection scores).
coeffs: A sparsity by num_comp matrix of the coefficients of the ReFACTor components (the projection loadings).
ranked_list: A vector with the features in the data, ranked by their scores in the feature selection step of the algorithm; the top scoring features (set according to the argument sparsity) are used for calculating the ReFACTor components. Note that features that were excluded according to sd_threshold will not appear in this ranked_list.
ReFACTor is a two-step algorithm for sparse principal component analysis (PCA) under a row-sparse model. The algorithm performs an unsupervised feature selection by ranking the features based on their correlation with their values under a low-rank representation of the data, followed by a calculation of principal components using the top ranking features (ReFACTor components).
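The two steps described above can be illustrated with a short base-R sketch. This is not the package's implementation: the simulated data, the variable names, and the use of a plain Pearson correlation as the feature score are assumptions made for illustration, and the actual scoring in refactor may differ in its details (for instance, in how features are standardized).

```r
set.seed(42)
m <- 300        # number of features
n <- 60         # number of observations
k <- 2          # dimension of the signal
sparsity <- 40  # number of signal rows

# Simulate row-sparse data: only the first `sparsity` features carry a rank-k signal.
W <- matrix(rnorm(n * k), nrow = n)             # n by k latent sources
A <- matrix(0, nrow = m, ncol = k)
A[seq_len(sparsity), ] <- rnorm(sparsity * k)   # loadings on the signal rows only
X <- A %*% t(W) + matrix(rnorm(m * n, sd = 0.5), nrow = m)
rownames(X) <- paste0("f", seq_len(m))
colnames(X) <- paste0("s", seq_len(n))

# Step 1a: center each feature and compute a rank-k approximation via SVD.
Xc <- X - rowMeans(X)
s <- svd(Xc, nu = k, nv = k)
X_low <- s$u %*% diag(s$d[seq_len(k)], nrow = k) %*% t(s$v)

# Step 1b: score each feature by the correlation between its observed
# values and its values under the low-rank representation, then rank.
feature_scores <- vapply(seq_len(m),
                         function(i) cor(Xc[i, ], X_low[i, ]),
                         numeric(1))
names(feature_scores) <- rownames(X)
ranked <- names(sort(feature_scores, decreasing = TRUE))

# Step 2: PCA on the top-ranked features only (observations as rows).
top <- ranked[seq_len(sparsity)]
pca <- prcomp(t(X[top, ]), center = TRUE, scale. = FALSE)
components <- pca$x[, seq_len(k)]   # n by k matrix of component scores

head(top)
dim(components)
```

In this toy setting most of the top-ranked features should be among the simulated signal rows, although a signal row with very small loadings can be outranked by a noise row.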
Note that ReFACTor is tuned towards capturing sparse signals of the dominant sources of variation in the data. Therefore, in the presence of other potentially dominant factors in the data (i.e. beyond the variation of interest), these factors should be accounted for by including them as covariates (see argument C). In cases where the ReFACTor components are intended to be used as covariates in a downstream analysis alongside the covariates in C (e.g., in a standard regression analysis or in a TCA regression), it is advised to set the argument C.remove to TRUE. This will adjust the selected features for the information in C prior to the calculation of the ReFACTor components, which will therefore capture only signals that are not present in C (and as a result may benefit the downstream analysis by potentially capturing more signals beyond the information in C).
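To make the effect of adjusting for covariates concrete, here is a minimal base-R sketch of one way such an adjustment can be done: regress each selected feature on the covariates and run PCA on the residuals. The simulated data and all object names are hypothetical, and residualizing with lm.fit is an assumption for illustration, not a statement about how refactor implements C.remove internally.

```r
set.seed(7)
n <- 80       # number of observations
m_sel <- 30   # number of already-selected features

# Hypothetical covariates (n by p) and selected features (m_sel by n).
C <- matrix(rnorm(n * 2), nrow = n, dimnames = list(NULL, c("cov1", "cov2")))
signal <- rnorm(n)
X_sel <- outer(rnorm(m_sel), signal) +        # variation of interest
  outer(rnorm(m_sel), C[, "cov1"]) +          # covariate-driven variation
  matrix(rnorm(m_sel * n, sd = 0.3), nrow = m_sel)

# Regress every selected feature on the covariates (with an intercept)
# and keep the residuals as the adjusted features.
design <- cbind(intercept = 1, C)
fit <- lm.fit(x = design, y = t(X_sel))   # y is n by m_sel: one column per feature
X_adj <- t(fit$residuals)

# First principal component with and without the adjustment.
pc_adj <- prcomp(t(X_adj))$x[, 1]
pc_raw <- prcomp(t(X_sel))$x[, 1]

# The adjusted component is uncorrelated with the covariates
# (up to numerical error); the unadjusted one need not be.
max(abs(cor(pc_adj, C)))
abs(cor(pc_raw, C[, "cov1"]))
```

Because the adjusted component carries no information that is linear in C, it does not compete with C when both enter a downstream regression.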
Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nature Methods 2016.
Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Correcting for cell-type heterogeneity in DNA methylation: a comprehensive evaluation. Nature Methods 2017.
# NOT RUN {
data <- test_data(100, 200, 3, 0, 0, 0.01)
ref <- refactor(data$X, k = 3, sparsity = 50)
# }