tcareg: Fitting a TCA regression model

Description

TCA regression allows testing for several types of statistical relations between source-specific values and an outcome of interest. For example, in the context of tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), tcareg allows testing for cell-type-specific effects of methylation on an outcome of interest.

Usage

tcareg(X, tca.mdl, y, C3 = NULL, test = "marginal",
  null_model = NULL, alternative_model = NULL, save_results = TRUE,
  output = "TCA", sort_results = TRUE, parallel = FALSE,
  num_cores = NULL, log_file = "TCA.log", features_metadata = NULL,
  debug = FALSE)

Arguments

X

An m by n matrix of measurements of m features for n observations. Each column in X is assumed to be a mixture of k different sources. Note that X must include row names and column names and that NA values are currently not supported.

tca.mdl

The value returned by applying the function tca to X.

y

An n by 1 matrix of an outcome of interest for each of the n observations in X. Note that y must include row names and column names and that NA values are currently not supported.

C3

An n by p3 design matrix of covariates that may affect y. Note that C3 must include row names and column names and should not include an intercept term. NA values are currently not supported.

test

A character vector with the type of test to perform on each of the features in X; one of the following options: 'marginal', 'marginal_conditional', 'joint', 'single_effect', or 'custom'. Setting 'marginal' or 'marginal_conditional' corresponds to testing each feature in X for a statistical relation between y and each of the k sources separately; for any particular source under test, the 'marginal_conditional' option further accounts for possible effects of the remaining k-1 sources ('marginal' will therefore tend to be more powerful in discovering truly related features, but at the same time more prone to tagging the wrong sources as related if sources are highly correlated). Setting 'joint' or 'single_effect' corresponds to testing each feature for an overall statistical relation with y, while modeling source-specific effects; the latter option further assumes that the source-specific effects are the same within each feature ('single_effect' uses only one degree of freedom and will therefore be more powerful when the assumption of a single effect within a feature holds). Finally, 'custom' corresponds to testing each feature in X for a statistical relation with y under a user-specified alternative model (alternative_model) with respect to a user-specified null model (null_model); for example, to test for the combined (potentially different) effects of sources 1 and 2 while accounting for the (potentially different) effects of sources 3 and 4, set the null model to be sources 3, 4 and the alternative model to be sources 1, 2, 3, 4. To indicate that the null model assumes no effect for any of the sources, set null_model to NULL.

null_model

A vector with a subset of the names of the sources in tca.mdl$W to be used as a null model (activated only if test == 'custom'). Note that the null model must be nested within the alternative model; set null_model to be NULL for indicating no effect for any of the sources under the null model.

alternative_model

A vector with a subset (or all) of the names of the sources in tca.mdl$W to be used as an alternative model (activated only if test == 'custom').

save_results

A logical value indicating whether to save the returned results in a file. If TRUE and test == 'marginal' or test == 'marginal_conditional' then k files will be saved (one for the results of each source).

output

Prefix for output files (activated only if save_results == TRUE).

sort_results

A logical value indicating whether to sort the results by their p-value (i.e. features with lower p-value will appear first in the results).

parallel

A logical value indicating whether to use parallel computing (possible when using a multi-core machine).

num_cores

A numeric value indicating the number of cores to use (activated only if parallel == TRUE). If num_cores == NULL then all available cores except for one will be used.

log_file

A path to an output log file. Note that if the file log_file already exists then logs will be appended to the end of the file. Set log_file to NULL to prevent output from being saved into a file.

features_metadata

A path to a csv file containing metadata about the features in X that will be added to the output files (activated only if save_results == TRUE). Each row in the metadata file should correspond to one feature (with the row name being the feature identifier, as it appears in the rows of X) and each column should correspond to one metadata descriptor (with an appropriate column name). Features that do not exist in X will be ignored and features in X with missing metadata information will show missing values.

debug

A logical value indicating whether to set the logger to a more detailed debug level; please set debug to TRUE before reporting issues.
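The input requirements listed above (row and column names, no NA values, no intercept column in C3, a null model nested in the alternative model) can be checked before calling tcareg. The following is a minimal sketch in base R; `check_tcareg_inputs` is a hypothetical helper that is not part of the TCA package, and the checks tcareg performs internally may differ.

```r
# Hypothetical helper (NOT part of the TCA package): checks the documented
# input requirements before calling tcareg.
check_tcareg_inputs <- function(X, y, C3 = NULL,
                                null_model = NULL, alternative_model = NULL) {
  stopifnot(is.matrix(X), is.matrix(y))
  # X and y must carry both row names and column names
  has_names <- function(M) !is.null(rownames(M)) && !is.null(colnames(M))
  stopifnot(has_names(X), has_names(y))
  # NA values are not supported
  stopifnot(!anyNA(X), !anyNA(y))
  # y is n by 1, with one row per observation (column) of X
  stopifnot(ncol(y) == 1, nrow(y) == ncol(X))
  if (!is.null(C3)) {
    stopifnot(is.matrix(C3), has_names(C3), !anyNA(C3), nrow(C3) == ncol(X))
    # C3 should not include an intercept term (approximated here as a constant column)
    is_constant <- apply(C3, 2, function(col) length(unique(col)) == 1)
    stopifnot(!any(is_constant))
  }
  # For test = "custom": the null model must be nested in the alternative model
  if (!is.null(alternative_model)) {
    stopifnot(all(null_model %in% alternative_model))
  }
  invisible(TRUE)
}

# Toy inputs (made up for illustration): m = 4 features, n = 6 observations
set.seed(1)
X <- matrix(runif(24), nrow = 4,
            dimnames = list(paste0("f", 1:4), paste0("s", 1:6)))
y <- matrix(rnorm(6), ncol = 1, dimnames = list(colnames(X), "y"))
ok <- check_tcareg_inputs(X, y, null_model = "3",
                          alternative_model = c("1", "2", "3"))
```

Failing early with `stopifnot` keeps the error close to the offending input, which is usually easier to diagnose than an error raised deep inside model fitting.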

Value

A list with the results of applying the TCA regression model to each of the features in the data. If test == 'marginal' or test == 'marginal_conditional' then a list of k such lists of results is returned, one for the results of each source.

phi

An estimate of the standard deviation of the i.i.d. component of variation in the TCA regression model.

beta

A matrix of effect size estimates for the source-specific effects, such that each row corresponds to the estimated effect sizes of one feature. The number of columns corresponds to the number of estimated effects (e.g., if test is set to marginal then beta will include a single column, if test is set to joint then beta will include k columns and so on).

intercept

An m by 1 matrix of estimates for the intercept of each feature.

alpha

An m by p3 matrix of effect size estimates for the p3 covariates in C3, such that each row corresponds to the estimated effect sizes of one feature.

null_ll

An m by 1 matrix of the log-likelihood of the model under the null hypothesis.

alternative_ll

An m by 1 matrix of the log-likelihood of the model under the alternative hypothesis.

stats

An m by 1 matrix of the LRT statistic for each feature in the data.

df

The degrees of freedom for deriving p-values using LRT.

pvals

An m by 1 matrix of the p-value for each feature in the data.

qvals

An m by 1 matrix of the q-value (FDR-adjusted p-values) for each feature in the data.
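The fields null_ll, alternative_ll, stats, pvals and qvals relate to one another through the likelihood ratio test. The sketch below reproduces that chain in base R on made-up log-likelihoods. It assumes the textbook LRT statistic (twice the gain in log-likelihood) and a Benjamini-Hochberg adjustment for the q-values; the exact conventions used inside tcareg are not stated here and may differ.

```r
# Toy log-likelihoods for m = 3 features (made-up numbers, for illustration only)
null_ll        <- matrix(c(-120.4, -98.7, -101.2), ncol = 1)
alternative_ll <- matrix(c(-115.1, -98.5, -94.0), ncol = 1)
df <- 3  # e.g. a joint test with k = 3 sources

# Textbook LRT statistic: twice the gain in log-likelihood
stats <- 2 * (alternative_ll - null_ll)
# p-values from the upper tail of a chi-squared distribution with df degrees of freedom
pvals <- pchisq(stats, df = df, lower.tail = FALSE)
# FDR adjustment; the Benjamini-Hochberg method is an assumption here
qvals <- matrix(p.adjust(pvals, method = "BH"), ncol = 1)
```

Because the null model is nested in the alternative model, the statistic is non-negative whenever both models are fitted to their maximum likelihood.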

Details

TCA models $Z_{hj}^i$ as the source-specific value of observation $i$ in feature $j$ coming from source $h$ (see tca for more details). A TCA regression model tests an outcome $Y$ for a linear statistical relation with the source-specific values of a feature $j$ by assuming: $$Y_i = \sum_{h=1}^k \beta_{hj} Z_{hj}^i + e_i$$ where $e_i \sim N(0,\phi^2)$. In practice, tcareg fits this model using the conditional distribution $Y|X$, which effectively integrates over the latent $Z_{hj}^i$ values. Statistical significance is then calculated using a likelihood ratio test (LRT). Note that the null and alternative models are set automatically, except when test == 'custom', in which case they are set according to the user-specified null and alternative models.
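To make the regression equation concrete, the sketch below simulates an outcome from it for a single feature, using base R only. All numbers (n, k, the effect sizes in `beta_j`, and `phi`) are hypothetical, and the source-specific values are drawn as standard normal purely for illustration. Unlike tcareg, which never observes $Z_{hj}^i$ and works from $Y|X$, this toy example treats Z as known so that ordinary least squares can recover the effects.

```r
set.seed(42)
n <- 200  # observations
k <- 3    # sources

# Source-specific values of one feature j: Z[h, i] stands for Z_{hj}^i
# (i.i.d. standard normal here, an assumption made only for this illustration)
Z <- matrix(rnorm(k * n), nrow = k)

beta_j <- c(0.5, 0, -1)  # hypothetical effect sizes beta_{1j}, ..., beta_{kj}
phi <- 0.3               # standard deviation of the i.i.d. noise

# Y_i = sum_h beta_{hj} * Z_{hj}^i + e_i, with e_i ~ N(0, phi^2)
Y <- as.vector(crossprod(Z, beta_j)) + rnorm(n, sd = phi)

# With Z observed, least squares without an intercept estimates beta_j directly
fit <- lm(Y ~ 0 + t(Z))
beta_hat <- unname(coef(fit))
```

In real bulk data only the mixture X is available, which is why the model is fitted through the conditional distribution rather than by a direct regression like this one.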

Under the TCA regression model, several statistical tests can be performed by setting the argument test according to one of the following options.

1. If test == 'marginal', tcareg will perform the following for each source $l$. For each feature $j$, $\beta_{lj}$ will be estimated and tested for a non-zero effect, while assuming $\beta_{hj}=0$ for all other sources $h\neq l$.

2. If test == 'marginal_conditional', tcareg will perform the following for each source $l$. For each feature $j$, $\beta_{lj}$ will be estimated and tested for a non-zero effect, while also estimating the effect sizes $\beta_{hj}$ for all other sources $h\neq l$.

3. If test == 'joint', tcareg will estimate for each feature $j$ the effect sizes of all $k$ sources $\beta_{1j},\ldots,\beta_{kj}$ and then test the set of $k$ estimates of each feature $j$ for a joint effect.

4. If test == 'single_effect', tcareg will estimate for each feature $j$ the effect sizes of all $k$ sources $\beta_{1j},\ldots,\beta_{kj}$, under the assumption that $\beta_{1j} = \ldots = \beta_{kj}$, and then test the set of $k$ estimates of each feature $j$ for a joint effect.

5. If test == 'custom', tcareg will estimate for each feature $j$ the effect sizes of a predefined set of sources (defined by a user-specified alternative model) and then test their estimates for a joint effect, while accounting for a nested predefined set of sources (defined by a user-specified null model).
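The five options differ in how many effect parameters the alternative model frees relative to the null model, which determines the degrees of freedom of the LRT. The sketch below tabulates this by standard parameter counting; `lrt_df` is a hypothetical helper, not a TCA function, and the df value reported by tcareg should be treated as authoritative.

```r
# Hypothetical helper (NOT a TCA function): degrees of freedom of the LRT,
# counted as the number of effect parameters that the alternative model
# adds on top of the null model.
lrt_df <- function(test, k, null_model = NULL, alternative_model = NULL) {
  switch(test,
    marginal             = 1,  # one source tested, all others fixed at zero
    marginal_conditional = 1,  # one source tested, the other k-1 are in the null
    joint                = k,  # all k effects tested together
    single_effect        = 1,  # one shared effect across the k sources
    custom               = length(alternative_model) - length(null_model),
    stop("unknown test: ", test)
  )
}

# Sources 1 and 2 tested while accounting for sources 3 and 4
df_custom <- lrt_df("custom", k = 4,
                    null_model = c("3", "4"),
                    alternative_model = c("1", "2", "3", "4"))
df_joint <- lrt_df("joint", k = 3)
```

Fewer degrees of freedom generally mean more power when the restricted model is correct, which is the trade-off between 'single_effect' and 'joint'.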

References

Rahmani E, Schweiger R, Rhead B, Criswell LA, Barcellos LF, Eskin E, Rosset S, Sankararaman S, Halperin E. Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology. Nature Communications 2018.

Examples


# NOT RUN {
library(TCA)

n <- 50   # number of observations
m <- 10   # number of features
k <- 3    # number of sources
p1 <- 1
p2 <- 1
data <- test_data(n, m, k, p1, p2, 0.01)
tca.mdl <- tca(data$X, data$W, data$C1, data$C2)
# simulate an outcome; y must include row names and column names
y <- matrix(rexp(n, rate = 0.1), ncol = 1,
            dimnames = list(colnames(data$X), "y"))
# joint test:
res1 <- tcareg(data$X, tca.mdl, y, test = "joint", save_results = FALSE)
# custom test, testing for a joint effect of sources 1,2 while accounting for source 3
res2 <- tcareg(data$X, tca.mdl, y, test = "custom", null_model = c("3"),
               alternative_model = c("1", "2", "3"), save_results = FALSE)
# custom test, testing for a joint effect of sources 1,2 assuming no effects under the null
res3 <- tcareg(data$X, tca.mdl, y, test = "custom", null_model = NULL,
               alternative_model = c("1", "2"), save_results = FALSE)

# }
