Fits the TCA model for an input matrix of observations coming from a mixture of k sources, under the assumption that each observation is a mixture of unique source-specific values (in each feature in the data). For example, in the context of tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), tca allows modeling the methylation of each individual as a mixture of cell-type-specific methylation levels that are unique to the individual.
tca(X, W, C1 = NULL, C2 = NULL, refit_W = FALSE,
refit_W.features = NULL, refit_W.sparsity = 500,
refit_W.sd_threshold = 0.02, parallel = FALSE, num_cores = NULL,
max_iters = 10, log_file = "TCA.log", debug = FALSE)
X: An m by n matrix of measurements of m features for n observations. Each column in X is assumed to be a mixture of k different sources. Note that X must include row names and column names and that NA values are currently not supported.
W: An n by k matrix of weights: the weights of k sources for each of the n mixtures (observations). All the weights must be positive and each row, corresponding to the weights of a single observation, must sum up to 1. Note that W must include row names and column names and that NA values are currently not supported. In cases where only initial estimates of W are available, tca can be set to re-estimate W (see refit_W).
C1: An n by p1 design matrix of covariates that may affect the hidden source-specific values (possibly a different effect on each source). Note that C1 must include row names and column names and should not include an intercept term. NA values are currently not supported.
C2: An n by p2 design matrix of covariates that may affect the mixture (i.e. rather than directly the sources of the mixture; for example, variables that capture biases in the collection of the measurements). Note that C2 must include row names and column names and should not include an intercept term. NA values are currently not supported.
refit_W: A logical value indicating whether to re-estimate the input W under the TCA model.
refit_W.features: A vector with the names of the features in X to consider when re-estimating W (i.e. when refit_W == TRUE). This is useful since often only a subset of the features in X is informative for estimating W. If refit_W.features == NULL then the ReFACTor algorithm will be used for performing feature selection (see also refit_W.sparsity and refit_W.sd_threshold).
refit_W.sparsity: A numeric value indicating the number of features to select using the ReFACTor algorithm when re-estimating W (activated only if refit_W == TRUE and refit_W.features == NULL). Note that refit_W.sparsity must be lower than or equal to the number of features in X. For more information, see the argument sparsity in refactor.
refit_W.sd_threshold: A numeric value indicating a standard deviation threshold to be used for excluding low-variance features in X (activated only if refit_W == TRUE and refit_W.features == NULL). For more information, see the argument sd_threshold in refactor.
parallel: A logical value indicating whether to use parallel computing (possible when using a multi-core machine).
num_cores: A numeric value indicating the number of cores to use (activated only if parallel == TRUE). If num_cores == NULL then all available cores except for one will be used.
max_iters: A numeric value indicating the maximal number of iterations to use in the optimization of the TCA model (all max_iters iterations will be used unless the optimization converges earlier).
log_file: A path to an output log file. Note that if the file log_file already exists then logs will be appended to the end of the file. Set log_file to NULL to prevent output from being saved into a file.
debug: A logical value indicating whether to set the logger to a more detailed debug level; please set debug to TRUE before reporting issues.
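The input requirements listed above (named rows and columns, no NA values, positive weights summing to 1 per observation, no intercept in the covariates) can be checked before calling tca. The following is a minimal sketch in base R; check_tca_inputs is a hypothetical helper that is not part of the TCA package, and the tolerance of 1e-8 and the rule that a constant column counts as an intercept are assumptions made for illustration.

```r
# Hypothetical helper (not part of the TCA package): stops with a message on
# the first violated input requirement and returns TRUE invisibly otherwise.
check_tca_inputs <- function(X, W, C1 = NULL, C2 = NULL, tol = 1e-8) {
  check_matrix <- function(M, name, n_rows = NULL) {
    if (!is.matrix(M)) stop(name, " must be a matrix")
    if (is.null(rownames(M)) || is.null(colnames(M)))
      stop(name, " must include row names and column names")
    if (anyNA(M)) stop(name, " must not contain NA values")
    if (!is.null(n_rows) && nrow(M) != n_rows)
      stop(name, " must have ", n_rows, " rows (one per observation)")
  }
  check_covariates <- function(C, name, n_rows) {
    check_matrix(C, name, n_rows)
    # Assumption: a constant column is treated as an intercept term.
    is_constant <- apply(C, 2, function(v) length(unique(v)) == 1)
    if (any(is_constant)) stop(name, " should not include an intercept term")
  }

  check_matrix(X, "X")
  n <- ncol(X)  # observations are the columns of X
  check_matrix(W, "W", n)
  if (any(W <= 0)) stop("all the weights in W must be positive")
  if (any(abs(rowSums(W) - 1) > tol)) stop("each row of W must sum up to 1")
  if (!is.null(C1)) check_covariates(C1, "C1", n)
  if (!is.null(C2)) check_covariates(C2, "C2", n)
  invisible(TRUE)
}

# Toy inputs: 4 features, 3 observations, 2 sources.
set.seed(1)
X <- matrix(runif(12), nrow = 4,
            dimnames = list(paste0("site", 1:4), paste0("obs", 1:3)))
W <- matrix(c(0.5, 0.5, 0.2, 0.8, 0.9, 0.1), nrow = 3, byrow = TRUE,
            dimnames = list(paste0("obs", 1:3), c("src1", "src2")))
ok <- check_tca_inputs(X, W)
```

Running such a check first gives a specific message for each violated requirement; how tca itself reports invalid inputs is not described here.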
A list with the estimated parameters of the model. This list can then be used as the input to other functions such as tcareg.
An n by k matrix of weights. If refit_W == TRUE then this is the re-estimated W; otherwise this is the input W.
An m by k matrix of estimates for the mean of each source in each feature.
An m by k matrix of estimates for the standard deviation of each source in each feature.
An estimate of the standard deviation of the i.i.d. component of variation in X.
An m by k*p1 matrix of the estimated effects of the p1 factors in C1 on each of the m features in X, where the first p1 columns are the source-specific effects of the p1 factors on the first source, the following p1 columns are the source-specific effects on the second source and so on.
An m by p2 matrix of the estimated effects of the p2 factors in C2 on the mixture values of each of the m features in X.
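The column layout of the m by k*p1 matrix of C1 effects described above can be indexed one source at a time. The sketch below uses a made-up matrix rather than real tca output; the names gammas and source_effects are hypothetical and chosen only for illustration.

```r
# Toy dimensions: m features, k sources, p1 covariates in C1.
m <- 4; k <- 3; p1 <- 2

# Stand-in for the m by k*p1 matrix of estimated C1 effects (values are made up).
gammas <- matrix(seq_len(m * k * p1), nrow = m, ncol = k * p1)

# Effects on source h occupy the h-th consecutive block of p1 columns.
source_effects <- function(gammas, h, p1) {
  cols <- ((h - 1) * p1 + 1):(h * p1)
  gammas[, cols, drop = FALSE]
}

# Effects of the p1 covariates on the second source: columns 3 and 4.
effects_source2 <- source_effects(gammas, h = 2, p1 = p1)
```

Using drop = FALSE keeps the result a matrix even when p1 is 1.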
The TCA model assumes that the hidden source-specific values are random variables. Formally, denote by \(Z_{hj}^i\) the source-specific value of observation \(i\) in feature \(j\) and source \(h\). The TCA model assumes: $$Z_{hj}^i \sim N(\mu_{hj},\sigma_{hj}^2)$$ where \(\mu_{hj}\) and \(\sigma_{hj}\) are the mean and standard deviation specific to feature \(j\) and source \(h\). The model further assumes that the observed value of observation \(i\) in feature \(j\) is a mixture of \(k\) different sources: $$X_{ji} = \sum_{h=1}^k W_{ih}Z_{hj}^i + \epsilon_{ji}$$ where \(W_{ih}\) is the non-negative proportion of source \(h\) in the mixture of observation \(i\) such that \(\sum_{h=1}^kW_{ih} = 1\), and \(\epsilon_{ji} \sim N(0,\tau^2)\) is an i.i.d. component of variation that models measurement noise. Note that the mixture proportions in \(W\) are, in general, unique to each individual; therefore, each entry in the data matrix \(X\) comes from a unique distribution (i.e. a different mean and a different variance).
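As a worked consequence of the two equations above, the mean and variance of each entry of \(X\) can be written out. This assumes, in addition to what is stated above, that \(W\) is treated as fixed and that the source-specific values \(Z_{1j}^i,\ldots,Z_{kj}^i\) are independent of one another and of \(\epsilon_{ji}\): $$E[X_{ji}] = \sum_{h=1}^k W_{ih}\mu_{hj}, \qquad Var(X_{ji}) = \sum_{h=1}^k W_{ih}^2\sigma_{hj}^2 + \tau^2$$ Both expressions depend on the weights of observation \(i\), so two observations with different mixture proportions have different means and variances in the same feature.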
In cases where the true W is unknown, tca can be provided with initial estimates of W and then re-estimate W as part of the optimization procedure (see argument refit_W). These initial estimates should not be random but rather capture the information in W to some extent. When the argument refit_W is used, it is typically the case that only a subset of the features should be used for re-estimating W. Therefore, when re-estimating W, tca performs feature selection using the ReFACTor algorithm; alternatively, it can also be provided with a user-specified list of features to be used in the re-estimation (see argument refit_W.features).
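The re-estimation described above might be invoked as in the following sketch, which runs only when the TCA package is installed. It reuses the test_data call from the example at the end of this page. Taking the first 10 feature names as the informative subset and passing the simulated W as the initial estimate are arbitrary choices made for illustration, not recommendations.

```r
# Sketch only: the model is fit when the TCA package is available.
if (requireNamespace("TCA", quietly = TRUE)) {
  data <- TCA::test_data(100, 20, 3, 1, 1, 0.01)

  # Assumption for illustration: treat the first 10 features of X as the
  # informative subset instead of letting ReFACTor select features.
  informative <- rownames(data$X)[1:10]

  # data$W stands in for initial estimates of W here; log_file = NULL
  # prevents the logs of this call from being saved into a file.
  tca.mdl <- TCA::tca(data$X, data$W,
                      refit_W = TRUE,
                      refit_W.features = informative,
                      log_file = NULL)
}
```

Leaving refit_W.features as NULL would instead trigger feature selection, in which case refit_W.sparsity must not exceed the number of features in X.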
Factors that systematically affect the source-specific values \(Z_{hj}^i\) can be further considered (see argument C1). In that case, we assume: $$Z_{hj}^i \sim N(\mu_{hj}+c^{(1)}_i \gamma_j^h,\sigma_{hj}^2)$$ where \(c^{(1)}_i\) is a row vector from C1, corresponding to the values of the \(p_1\) factors for observation \(i\), and \(\gamma_j^h\) is a vector of \(p_1\) corresponding effect sizes.
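To make the shifted mean concrete, the sketch below computes \(\mu_{hj}+c^{(1)}_i \gamma_j^h\) for a single feature \(j\), for every observation and source at once. All values are made up, and the variable names (mu_j, gamma_j, mean_Z_j) are hypothetical.

```r
set.seed(1)
n <- 5; k <- 3; p1 <- 2

C1      <- matrix(rnorm(n * p1), nrow = n)    # n by p1 covariates, row i is c_i^(1)
mu_j    <- c(0.2, 0.5, 0.8)                   # mu_hj for sources h = 1..k
gamma_j <- matrix(rnorm(p1 * k), nrow = p1)   # column h holds gamma_j^h

# n by k matrix whose (i, h) entry is mu_hj + c_i^(1) %*% gamma_j^h,
# i.e. the mean of Z_hj^i under the model with C1.
mean_Z_j <- sweep(C1 %*% gamma_j, 2, mu_j, "+")
```

Each row of mean_Z_j differs across observations only through C1, which is how the covariates induce observation-specific source means.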
Factors that systematically affect the mixture values \(X_{ji}\), such as variables that capture biases in the collection of the measurements, can also be considered (see argument C2). In that case, we assume: $$X_{ji} = \sum_{h=1}^k W_{ih}Z_{hj}^i + c^{(2)}_i \delta_j + \epsilon_{ji}$$ where \(c^{(2)}_i\) is a row vector from C2, corresponding to the values of the \(p_2\) factors for observation \(i\), and \(\delta_j\) is a vector of \(p_2\) corresponding effect sizes.
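The full generative model, with both C1 and C2, can be simulated directly in base R. This is a sketch under stated assumptions rather than the package's own simulator: the source-specific values are drawn independently across sources, and all parameter values and distributions used to generate W, C1, C2 and the effect sizes are arbitrary.

```r
set.seed(42)
n <- 200; m <- 3; k <- 2; p1 <- 1; p2 <- 1; tau <- 0.01

# Mixture weights: positive, each row (observation) sums to 1.
W <- matrix(runif(n * k), n, k)
W <- W / rowSums(W)
C1 <- matrix(rnorm(n * p1), n, p1)   # covariates acting on the sources
C2 <- matrix(rnorm(n * p2), n, p2)   # covariates acting on the mixture

mu    <- matrix(runif(m * k), m, k)                        # mu_hj, stored m by k
sigma <- matrix(0.05, m, k)                                # sigma_hj, stored m by k
gamma <- array(rnorm(m * p1 * k, sd = 0.1), c(m, p1, k))   # gamma[j, , h] is gamma_j^h
delta <- matrix(rnorm(m * p2, sd = 0.1), m, p2)            # row j is delta_j

X <- matrix(NA_real_, m, n)
for (j in seq_len(m)) {
  # Mean of Z_hj^i: mu_hj + c_i^(1) gamma_j^h, as an n by k matrix.
  mean_Z <- sweep(C1 %*% matrix(gamma[j, , ], p1, k), 2, mu[j, ], "+")
  # Source-specific values, drawn independently for each observation and source.
  Z <- mean_Z + matrix(rnorm(n * k), n, k) * matrix(sigma[j, ], n, k, byrow = TRUE)
  # Mixture plus C2 effects plus i.i.d. measurement noise.
  X[j, ] <- rowSums(W * Z) + as.vector(C2 %*% delta[j, ]) + rnorm(n, sd = tau)
}

# tca expects row and column names on its inputs.
dimnames(X) <- list(paste0("feature", seq_len(m)), paste0("obs", seq_len(n)))
```

Data generated this way gives a known ground truth against which estimates of the means, standard deviations and effect sizes could be compared.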
Rahmani E, Schweiger R, Rhead B, Criswell LA, Barcellos LF, Eskin E, Rosset S, Sankararaman S, Halperin E. Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology. Nature Communications 2018.
Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nature Methods 2016.
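# NOT RUN {
library(TCA)
# Generate a small test dataset, then fit the TCA model using the
# weights W and the covariates C1 and C2
data <- test_data(100, 20, 3, 1, 1, 0.01)
tca.mdl <- tca(data$X, data$W, data$C1, data$C2)
# }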