tca: Fitting the TCA model

Description

Fits the TCA model for an input matrix of observations coming from a mixture of k sources, under the assumption that each observation is a mixture of unique source-specific values (in each feature in the data). For example, in the context of tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), tca models the methylation of each individual as a mixture of cell-type-specific methylation levels that are unique to the individual.

Usage

tca(X, W, C1 = NULL, C2 = NULL, refit_W = FALSE,
  refit_W.features = NULL, refit_W.sparsity = 500,
  refit_W.sd_threshold = 0.02, parallel = FALSE, num_cores = NULL,
  max_iters = 10, log_file = "TCA.log", debug = FALSE)

Arguments

X

An m by n matrix of measurements of m features for n observations. Each column in X is assumed to be a mixture of k different sources. Note that X must include row names and column names and that NA values are currently not supported.

W

An n by k matrix of weights: the weights of k sources for each of the n mixtures (observations). All the weights must be positive and each row, corresponding to the weights of a single observation, must sum up to 1. Note that W must include row names and column names and that NA values are currently not supported. In cases where only initial estimates of W are available, tca can be set to re-estimate W (see refit_W).

C1

An n by p1 design matrix of covariates that may affect the hidden source-specific values (possibly a different effect on each source). Note that C1 must include row names and column names and should not include an intercept term. NA values are currently not supported.

C2

An n by p2 design matrix of covariates that may affect the mixture (i.e. rather than directly the sources of the mixture; for example, variables that capture biases in the collection of the measurements). Note that C2 must include row names and column names and should not include an intercept term. NA values are currently not supported.

refit_W

A logical value indicating whether to re-estimate the input W under the TCA model.

refit_W.features

A vector with the names of the features in X to consider when re-estimating W (i.e. when refit_W == TRUE). This is useful since oftentimes just a subset of the features in X will be informative for estimating W. If refit_W.features == NULL then the ReFACTor algorithm will be used for performing feature selection (see also refit_W.sparsity, refit_W.sd_threshold).

refit_W.sparsity

A numeric value indicating the number of features to select using the ReFACTor algorithm when re-estimating W (activated only if refit_W == TRUE and refit_W.features == NULL). Note that refit_W.sparsity must be lower than or equal to the number of features in X. For more information, see the argument sparsity in refactor.

refit_W.sd_threshold

A numeric value indicating a standard deviation threshold to be used for excluding low-variance features in X (activated only if refit_W == TRUE and refit_W.features == NULL). For more information, see the argument sd_threshold in refactor.

parallel

A logical value indicating whether to use parallel computing (possible when using a multi-core machine).

num_cores

A numeric value indicating the number of cores to use (activated only if parallel == TRUE). If num_cores == NULL then all available cores except for one will be used.

max_iters

A numeric value indicating the maximal number of iterations to use in the optimization of the TCA model (max_iters iterations will be used as long as the optimization does not converge in earlier iterations).

log_file

A path to an output log file. Note that if the file log_file already exists then logs will be appended to the end of the file. Set log_file to NULL to prevent output from being saved into a file.

debug

A logical value indicating whether to set the logger to a more detailed debug level; please set debug to TRUE before reporting issues.
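The requirements on X and W listed above can be checked before calling tca. The following is a minimal sketch in base R; check_tca_inputs and its tol argument are hypothetical names introduced here for illustration and are not part of the TCA package.

```r
# Hypothetical helper (not part of the TCA package): checks the documented
# requirements on X and W. The tolerance `tol` is an assumption.
check_tca_inputs <- function(X, W, tol = 1e-8) {
  stopifnot(is.matrix(X), is.matrix(W))
  # Both matrices must carry row names and column names.
  stopifnot(!is.null(rownames(X)), !is.null(colnames(X)),
            !is.null(rownames(W)), !is.null(colnames(W)))
  # NA values are not supported.
  stopifnot(!anyNA(X), !anyNA(W))
  # X is m by n and W is n by k, so columns of X correspond to rows of W.
  stopifnot(ncol(X) == nrow(W))
  # Weights must be positive and each row must sum up to 1.
  stopifnot(all(W > 0), all(abs(rowSums(W) - 1) < tol))
  invisible(TRUE)
}

# Toy inputs: n = 5 observations, m = 4 features, k = 3 sources.
set.seed(1)
n <- 5; m <- 4; k <- 3
W <- matrix(runif(n * k), n, k)
W <- W / rowSums(W)
dimnames(W) <- list(paste0("obs", 1:n), paste0("src", 1:k))
X <- matrix(rnorm(m * n), m, n,
            dimnames = list(paste0("feat", 1:m), rownames(W)))
ok <- check_tca_inputs(X, W)
```

The same pattern could be extended to C1 and C2, which are also required to have row and column names and no NA values.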

Value

A list with the estimated parameters of the model. This list can then be used as the input to other functions such as tcareg.

W

An n by k matrix of weights. If refit_W == TRUE then this is the re-estimated W; otherwise this is the input W.

mus_hat

An m by k matrix of estimates for the mean of each source in each feature.

sigmas_hat

An m by k matrix of estimates for the standard deviation of each source in each feature.

tau_hat

An estimate of the standard deviation of the i.i.d. component of variation in X.

gammas_hat

An m by k*p1 matrix of the estimated effects of the p1 factors in C1 on each of the m features in X, where the first p1 columns are the source-specific effects of the p1 factors on the first source, the following p1 columns are the source-specific effects on the second source and so on.

deltas_hat

An m by p2 matrix of the estimated effects of the p2 factors in C2 on the mixture values of each of the m features in X.
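The column layout of gammas_hat described above (p1 columns per source, sources in order) can be unpacked with simple index arithmetic. The sketch below uses a toy matrix in place of a fitted result; gammas_for_source is a hypothetical helper, not a function of the TCA package.

```r
# Toy stand-in for a fitted gammas_hat (arbitrary values, illustration only).
m <- 4; k <- 3; p1 <- 2
gammas_hat <- matrix(seq_len(m * k * p1), nrow = m, ncol = k * p1)

# Hypothetical helper: the m by p1 block holding the effects of the
# p1 covariates in C1 on source h.
gammas_for_source <- function(gammas_hat, h, p1) {
  cols <- ((h - 1) * p1 + 1):(h * p1)
  gammas_hat[, cols, drop = FALSE]
}

# Effects on the second source: columns 3 and 4 of gammas_hat.
g2 <- gammas_for_source(gammas_hat, h = 2, p1 = p1)
```

Using drop = FALSE keeps the result a matrix even when p1 == 1.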

Details

The TCA model assumes that the hidden source-specific values are random variables. Formally, denoting by $Z_{hj}^i$ the source-specific value of observation $i$ in feature $j$ and source $h$, the TCA model assumes: $$Z_{hj}^i \sim N(\mu_{hj},\sigma_{hj}^2)$$ where $\mu_{hj},\sigma_{hj}$ represent the mean and standard deviation that are specific to feature $j$ and source $h$. The model further assumes that the observed value of observation $i$ in feature $j$ is a mixture of $k$ different sources: $$X_{ji} = \sum_{h=1}^k W_{ih}Z_{hj}^i + \epsilon_{ji}$$ where $W_{ih}$ is the non-negative proportion of source $h$ in the mixture of observation $i$ such that $\sum_{h=1}^kW_{ih} = 1$, and $\epsilon_{ji} \sim N(0,\tau^2)$ is an i.i.d. component of variation that models measurement noise. Note that the mixture proportions in $W$ are, in general, unique for each individual; therefore, each entry in the data matrix $X$ comes from a unique distribution (i.e. a different mean and a different variance).
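To make the generative model above concrete, the following base R sketch simulates X from it and computes the mean and variance of each entry. It assumes, in addition to what is stated above, that the source-specific values are independent across sources; under that assumption the mean of $X_{ji}$ is $\sum_h W_{ih}\mu_{hj}$ and its variance is $\sum_h W_{ih}^2\sigma_{hj}^2 + \tau^2$. All variable names and parameter values are illustrative and do not come from the TCA package.

```r
set.seed(42)
n <- 200; m <- 3; k <- 2; tau <- 0.05

# Mixture proportions: positive, each row sums to 1 (n by k).
W <- matrix(runif(n * k), n, k)
W <- W / rowSums(W)

# Source-specific means and standard deviations (m by k).
mus    <- matrix(runif(m * k), m, k)
sigmas <- matrix(runif(m * k, 0.05, 0.2), m, k)

# Z[h, j, i]: hidden value of source h in feature j for observation i.
# Assumption: sources are drawn independently of each other.
Z <- array(0, dim = c(k, m, n))
for (h in 1:k) {
  for (j in 1:m) {
    Z[h, j, ] <- rnorm(n, mean = mus[j, h], sd = sigmas[j, h])
  }
}

# Observed mixtures: X[j, i] = sum_h W[i, h] * Z[h, j, i] + noise.
X <- matrix(0, m, n)
for (i in 1:n) {
  for (j in 1:m) {
    X[j, i] <- sum(W[i, ] * Z[, j, i]) + rnorm(1, mean = 0, sd = tau)
  }
}

# Entry-specific mean and variance implied by the model (both m by n).
mean_X <- mus %*% t(W)
var_X  <- sigmas^2 %*% t(W^2) + tau^2
```

Because W differs between observations, mean_X and var_X differ from column to column, which is the sense in which each entry of X has its own distribution.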

In cases where the true W is unknown, tca can be provided with initial estimates of W and then re-estimate W as part of the optimization procedure (see argument refit_W). These initial estimates should not be random but rather capture the information in W to some extent. When the argument refit_W is used, it is typically the case that only a subset of the features should be used for re-estimating W. Therefore, when re-estimating W, tca performs feature selection using the ReFACTor algorithm; alternatively, it can also be provided with a user-specified list of features to be used in the re-estimation (see argument refit_W.features).

Factors that systematically affect the source-specific values $Z_{hj}^i$ can be further considered (see argument C1). In that case, we assume: $$Z_{hj}^i \sim N(\mu_{hj}+c^{(1)}_i \gamma_j^h,\sigma_{hj}^2)$$ where $c^{(1)}_i$ is a row vector from C1, corresponding to the values of the $p_1$ factors for observation $i$, and $\gamma_j^h$ is a vector of $p_1$ corresponding effect sizes.

Factors that systematically affect the mixture values $X_{ji}$, such as variables that capture biases in the collection of the measurements, can also be considered (see argument C2). In that case, we assume: $$X_{ji} = \sum_{h=1}^k W_{ih}Z_{hj}^i + c^{(2)}_i \delta_j + \epsilon_{ji}$$ where $c^{(2)}_i$ is a row vector from C2, corresponding to the values of the $p_2$ factors for observation $i$, and $\delta_j$ is a vector of $p_2$ corresponding effect sizes.
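The two covariate extensions can be combined in a single simulation. The base R sketch below draws one entry of X at a time, shifting each source mean by the C1 effects and adding the C2 effects to the mixture. The parameter values are arbitrary, the names gammas and deltas are illustrative, and gammas is laid out like gammas_hat in the Value section (p1 columns per source).

```r
set.seed(7)
n <- 50; m <- 4; k <- 3; p1 <- 2; p2 <- 1; tau <- 0.01

W <- matrix(runif(n * k), n, k)
W <- W / rowSums(W)                                  # n by k, rows sum to 1
C1 <- matrix(rnorm(n * p1), n, p1)                   # affects the sources
C2 <- matrix(rnorm(n * p2), n, p2)                   # affects the mixture

mus    <- matrix(runif(m * k), m, k)
sigmas <- matrix(runif(m * k, 0.05, 0.2), m, k)
gammas <- matrix(rnorm(m * k * p1, sd = 0.1), m, k * p1)
deltas <- matrix(rnorm(m * p2, sd = 0.1), m, p2)

X <- matrix(0, m, n)
for (i in 1:n) {
  for (j in 1:m) {
    z <- numeric(k)
    for (h in 1:k) {
      # Effects of the p1 covariates on source h in feature j.
      gamma_jh <- gammas[j, ((h - 1) * p1 + 1):(h * p1)]
      z[h] <- rnorm(1, mean = mus[j, h] + sum(C1[i, ] * gamma_jh),
                    sd = sigmas[j, h])
    }
    # Mixture of the sources, plus C2 effects, plus measurement noise.
    X[j, i] <- sum(W[i, ] * z) + sum(C2[i, ] * deltas[j, ]) +
      rnorm(1, mean = 0, sd = tau)
  }
}
```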

References

Rahmani E, Schweiger R, Rhead B, Criswell LA, Barcellos LF, Eskin E, Rosset S, Sankararaman S, Halperin E. Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology. Nature Communications 2018.

Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nature Methods 2016.

Examples

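# NOT RUN {
# Simulate data under the TCA model: n = 100 observations, m = 20 features,
# k = 3 sources, p1 = 1 and p2 = 1 covariates, tau = 0.01.
data <- test_data(100, 20, 3, 1, 1, 0.01)
# Fit the TCA model using the simulated weights and covariates.
tca.mdl <- tca(data$X, data$W, data$C1, data$C2)

# }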