refactor: Sparse principal component analysis using ReFACTor

Description

Performs unsupervised feature selection followed by principal component analysis (PCA) under a row-sparse model using the ReFACTor algorithm. For example, in the context of tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), refactor captures the variation in cell-type composition, which was shown to be a dominant sparse signal in methylation data.

Usage

refactor(X, k, sparsity = 500, C = NULL, C.remove = FALSE,
  sd_threshold = 0.02, num_comp = NULL, rand_svd = FALSE,
  log_file = "TCA.log", debug = FALSE)

Arguments

X

An m by n matrix of measurements of m features for n observations. Each column in X is assumed to be a mixture of k different sources. Note that X must include row names and column names and that NA values are currently not supported.

k

A numeric value indicating the dimension of the signal in the data (i.e. the number of sources).

sparsity

A numeric value indicating the sparsity of the signal in the data (the number of signal rows).

C

An n by p design matrix of covariates that will be accounted for in the feature selection step. Note that C must include row names and column names and that NA values are currently not supported; set C to be NULL if there are no such covariates.

C.remove

A logical value indicating whether the covariates in C should be accounted for not only in the feature selection step, but also in the final calculation of the principal components (i.e. if C.remove == TRUE then the selected features will be adjusted for the covariates in C prior to calculating principal components). Setting C.remove to TRUE is recommended when the ReFACTor components are intended for correction in a downstream analysis, whereas setting C.remove to FALSE is recommended when ReFACTor is used only for capturing the sparse signals in the data (i.e. regardless of correction).

sd_threshold

A numeric value indicating a standard deviation threshold to be used for excluding low-variance features in X (i.e. features with standard deviation lower than sd_threshold will be excluded). Set sd_threshold to be NULL for turning off this filter. Note that removing features with very low variability tends to improve speed and performance.

num_comp

A numeric value indicating the number of ReFACTor components to return.

rand_svd

A logical value indicating whether to use randomized SVD for estimating the low-rank structure of the data in the first step of the algorithm; randomized SVD can result in a substantial speedup for large data.

log_file

A path to an output log file. Note that if the file log_file already exists then logs will be appended to the end of the file. Set log_file to NULL to prevent output from being saved into a file.

debug

A logical value indicating whether to set the logger to a more detailed debug level; please set debug to TRUE before reporting issues.
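As a rough illustration of the sd_threshold argument, the following base R sketch drops low-variance rows from a small made-up matrix. It is not the code used inside refactor; the toy matrix, the feature names, and the variables row_sd, keep, and X_filtered are assumptions introduced only for this example.

```r
# Illustrative sketch only: base R, not the TCA package internals.
set.seed(1)
m <- 6; n <- 20

# Toy m by n matrix with row and column names (features by observations).
X <- matrix(rnorm(m * n, sd = 0.1), nrow = m, ncol = n,
            dimnames = list(paste0("f", 1:m), paste0("s", 1:n)))
X[2, ] <- 0.5                          # constant feature: sd is exactly 0
X[5, ] <- 0.5 + rnorm(n, sd = 0.001)   # nearly constant feature

# Exclude features whose standard deviation is lower than the threshold.
sd_threshold <- 0.02
row_sd <- apply(X, 1, sd)
keep <- row_sd >= sd_threshold
X_filtered <- X[keep, , drop = FALSE]

rownames(X_filtered)   # f2 and f5 are no longer present
```

Excluded features would also be absent from any ranking computed afterwards, which matches the note on ranked_list in the Value section.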

Value

A list with the estimated components of the ReFACTor model.

scores

An n by num_comp matrix of the ReFACTor components (the projection scores).

coeffs

A sparsity by num_comp matrix of the coefficients of the ReFACTor components (the projection loadings).

ranked_list

A vector with the features in the data, ranked by their scores in the feature selection step of the algorithm; the top scoring features (set according to the argument sparsity) are used for calculating the ReFACTor components. Note that features that were excluded according to sd_threshold will not appear in this ranked_list.

Details

ReFACTor is a two-step algorithm for sparse principal component analysis (PCA) under a row-sparse model. The algorithm performs an unsupervised feature selection by ranking the features based on their correlation with their values under a low-rank representation of the data, followed by a calculation of principal components using the top ranking features (ReFACTor components).
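The two steps described above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the implementation used by refactor: the simulated data, the use of a plain rank-k SVD for the low-rank representation, and the exact scoring rule (Pearson correlation per feature) are all choices made for this example.

```r
# Illustrative sketch of the two steps, in base R. NOT the TCA implementation;
# the simulation, variable names, and scoring rule are assumptions.
set.seed(42)
m <- 200; n <- 60; k <- 2; sparsity <- 20

# Simulate a row-sparse signal: only the first `sparsity` features depend on
# k hidden sources; the remaining features are pure noise.
W <- matrix(rnorm(n * k), n, k)                 # hidden sources (n by k)
B <- matrix(0, m, k)
B[1:sparsity, ] <- rnorm(sparsity * k, sd = 2)  # loadings of the signal rows
X <- B %*% t(W) + matrix(rnorm(m * n), m, n)
rownames(X) <- paste0("feature", 1:m)
colnames(X) <- paste0("obs", 1:n)

# Step 1: rank-k approximation of the row-centered data, then score each
# feature by the correlation between its observed and low-rank values.
Xc <- X - rowMeans(X)
s <- svd(Xc, nu = k, nv = k)
X_lowrank <- s$u %*% diag(s$d[1:k], k, k) %*% t(s$v)
scores <- sapply(1:m, function(j) cor(Xc[j, ], X_lowrank[j, ]))
ranked_list <- rownames(X)[order(scores, decreasing = TRUE)]

# Step 2: PCA on the top-ranked features only (observations as rows).
top <- ranked_list[1:sparsity]
pca <- prcomp(t(X[top, , drop = FALSE]), center = TRUE, scale. = FALSE)
components <- pca$x[, 1:k, drop = FALSE]        # n by k projection scores
loadings <- pca$rotation[, 1:k, drop = FALSE]   # sparsity by k loadings
```

In this sketch, components and loadings play the roles of the scores and coeffs entries described in the Value section, and ranked_list holds all features ordered by their selection score.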

Note that ReFACTor is tuned towards capturing sparse signals of the dominant sources of variation in the data. Therefore, in the presence of other potentially dominant factors in the data (i.e. beyond the variation of interest), these factors should be accounted for by including them as covariates (see argument C). In cases where the ReFACTor components are intended to be used as covariates in a downstream analysis alongside the covariates in C (e.g., in a standard regression analysis or in a TCA regression), it is advised to set the argument C.remove to TRUE. This will adjust the selected features for the information in C prior to the calculation of the ReFACTor components, which will therefore capture only signals that are not present in C (and as a result may benefit the downstream analysis by potentially capturing more signals beyond the information in C).
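The effect of adjusting the selected features for C before computing components can be illustrated with a small base R sketch. It assumes that the adjustment is an ordinary least-squares regression of each selected feature on C with an intercept; that is an assumption of this example, not a statement about the internals of refactor, and the covariate names age and batch are hypothetical.

```r
# Illustrative sketch only (base R); not the TCA implementation.
set.seed(7)
n <- 50; m_sel <- 10

# Hypothetical n by p covariate matrix and a set of "selected" features,
# all of which are partly driven by the age covariate.
C <- cbind(age = rnorm(n), batch = rbinom(n, 1, 0.5))
X_sel <- matrix(rnorm(m_sel * n), m_sel, n) +
  outer(rnorm(m_sel), C[, "age"])

# Regress each selected feature on C (with intercept) and keep the residuals.
fit <- lm.fit(x = cbind(1, C), y = t(X_sel))
X_adj <- t(fit$residuals)                 # m_sel by n, adjusted for C

# Principal components before and after the adjustment.
pcs_raw <- prcomp(t(X_sel))$x[, 1:2]
pcs_adj <- prcomp(t(X_adj))$x[, 1:2]

max(abs(cor(pcs_raw, C)))   # components of the raw features track C
max(abs(cor(pcs_adj, C)))   # numerically zero after the adjustment
```

Because the adjusted components are uncorrelated with C, including both in a downstream regression does not spend components on variation that C already explains.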

References

Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nature Methods 2016.

Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Correcting for cell-type heterogeneity in DNA methylation: a comprehensive evaluation. Nature Methods 2017.

Examples

# NOT RUN {
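# Simulate a small data set (arguments as in the original call) and run
# ReFACTor with k = 3 sources, using the 50 top-ranked features.
data <- test_data(100, 200, 3, 0, 0, 0.01)
ref <- refactor(data$X, k = 3, sparsity = 50)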
# }
