TCA regression tests for several types of statistical relations between source-specific values and an outcome of interest. For example, in the context of tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), tcareg tests for cell-type-specific effects of methylation on an outcome of interest.
tcareg(X, tca.mdl, y, C3 = NULL, test = "marginal",
  null_model = NULL, alternative_model = NULL, save_results = TRUE,
  output = "TCA", sort_results = TRUE, parallel = FALSE,
  num_cores = NULL, log_file = "TCA.log", features_metadata = NULL,
  debug = FALSE)
X: An m by n matrix of measurements of m features for n observations. Each column in X is assumed to be a mixture of k different sources. Note that X must include row names and column names and that NA values are currently not supported.
tca.mdl: The value returned by applying the function tca to X.
y: An n by 1 matrix of an outcome of interest for each of the n observations in X. Note that y must include row names and column names and that NA values are currently not supported.
C3: An n by p3 design matrix of covariates that may affect y. Note that C3 must include row names and column names and should not include an intercept term. NA values are currently not supported.
test: A character vector with the type of test to perform on each of the features in X; one of the following options: 'marginal', 'marginal_conditional', 'joint', 'single_effect', or 'custom'. Setting 'marginal' or 'marginal_conditional' corresponds to testing each feature in X for a statistical relation between y and each of the k sources separately; for any particular source under test, the 'marginal_conditional' option further accounts for possible effects of the remaining k-1 sources ('marginal' will therefore tend to be more powerful in discovering truly related features, but at the same time more prone to attributing the relation to the wrong sources if sources are highly correlated). Setting 'joint' or 'single_effect' corresponds to testing each feature for an overall statistical relation with y, while modeling source-specific effects; the latter option further assumes that the source-specific effects are the same within each feature ('single_effect' means only one degree of freedom and will therefore be more powerful when the assumption of a single effect within a feature holds). Finally, 'custom' corresponds to testing each feature in X for a statistical relation with y under a user-specified model (alternative model) with respect to a null model (null model); for example, to test for a relation of the combined (potentially different) effects of sources 1 and 2 while accounting for the (potentially different) effects of sources 3 and 4, set the null model to be sources 3, 4 and the alternative model to be sources 1, 2, 3, 4. To indicate that the null model assumes no effect for any of the sources, set null_model to NULL.
null_model: A vector with a subset of the names of the sources in tca.mdl$W to be used as a null model (activated only if test == 'custom'). Note that the null model must be nested within the alternative model; set null_model to NULL to indicate no effect for any of the sources under the null model.
alternative_model: A vector with a subset (or all) of the names of the sources in tca.mdl$W to be used as an alternative model (activated only if test == 'custom').
save_results: A logical value indicating whether to save the returned results in a file. If TRUE and test == 'marginal' or test == 'marginal_conditional' then k files will be saved (one for the results of each source).
output: Prefix for output files (activated only if save_results == TRUE).
sort_results: A logical value indicating whether to sort the results by their p-value (i.e. features with lower p-value will appear first in the results).
parallel: A logical value indicating whether to use parallel computing (possible when using a multi-core machine).
num_cores: A numeric value indicating the number of cores to use (activated only if parallel == TRUE). If num_cores == NULL then all available cores except for one will be used.
log_file: A path to an output log file. Note that if the file log_file already exists then logs will be appended to the end of the file. Set log_file to NULL to prevent output from being saved into a file.
features_metadata: A path to a csv file containing metadata about the features in X that will be added to the output files (activated only if save_results == TRUE). Each row in the metadata file should correspond to one feature (with the row name being the feature identifier, as it appears in the rows of X) and each column should correspond to one metadata descriptor (with an appropriate column name). Features that do not exist in X will be ignored and features in X with missing metadata information will show missing values.
debug: A logical value indicating whether to set the logger to a more detailed debug level; please set debug to TRUE before reporting issues.
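The input requirements listed above (row and column names, no NA values, one outcome per observation, no intercept column in the covariates) can be checked before calling tcareg. The following is a minimal sketch in base R; check_tcareg_inputs is a hypothetical helper, not part of the TCA package, and the constant-column check is only a heuristic for detecting an intercept term.

```r
# Hypothetical helper (not part of the TCA package): checks the documented
# requirements on X, y and C3 before they are passed to tcareg.
check_tcareg_inputs <- function(X, y, C3 = NULL) {
  stopifnot(is.matrix(X), is.matrix(y))
  # X and y must include row names and column names
  stopifnot(!is.null(rownames(X)), !is.null(colnames(X)))
  stopifnot(!is.null(rownames(y)), !is.null(colnames(y)))
  # NA values are not supported
  stopifnot(!anyNA(X), !anyNA(y))
  # y holds one outcome value per observation, i.e. per column of X
  stopifnot(nrow(y) == ncol(X), ncol(y) == 1)
  if (!is.null(C3)) {
    stopifnot(is.matrix(C3), nrow(C3) == ncol(X))
    stopifnot(!is.null(rownames(C3)), !is.null(colnames(C3)))
    stopifnot(!anyNA(C3))
    # Heuristic: a constant column in C3 would act as an intercept term
    is_constant <- apply(C3, 2, function(col) length(unique(col)) == 1)
    if (any(is_constant)) stop("C3 appears to contain an intercept (constant) column")
  }
  invisible(TRUE)
}

# Toy data: 4 features by 5 observations, with a matching outcome matrix
set.seed(1)
X <- matrix(runif(20), nrow = 4,
            dimnames = list(paste0("f", 1:4), paste0("s", 1:5)))
y <- matrix(rnorm(5), ncol = 1, dimnames = list(colnames(X), "y"))
ok <- check_tcareg_inputs(X, y)
```

The helper stops with an error on the first violated requirement and returns TRUE invisibly otherwise.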
A list with the results of applying the TCA regression model to each of the features in the data. If test == 'marginal' or test == 'marginal_conditional' then a list of k such lists of results is returned, one for the results of each source.
An estimate of the standard deviation of the i.i.d. component of variation in the TCA regression model.
A matrix of effect size estimates for the source-specific effects, such that each row corresponds to the estimated effect sizes of one feature. The number of columns corresponds to the number of estimated effects (e.g., if test is set to marginal then beta will include a single column, if test is set to joint then beta will include k columns and so on).
An m by 1 matrix of estimates for the intercept of each feature.
An m by p3 matrix of effect size estimates for the p3 covariates in C3, such that each row corresponds to the estimated effect sizes of one feature.
An m by 1 matrix of the log-likelihood of the model under the null hypothesis.
An m by 1 matrix of the log-likelihood of the model under the alternative hypothesis.
An m by 1 matrix of the LRT statistic for each feature in the data.
The degrees of freedom for deriving p-values using LRT.
An m by 1 matrix of the p-value for each feature in the data.
An m by 1 matrix of the q-value (FDR-adjusted p-values) for each feature in the data.
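As an illustration of how per-feature p-values relate to FDR-adjusted q-values and to sorting by p-value, here is a minimal base R sketch. The p-values are made up for the example, the variable names are illustrative, and the Benjamini-Hochberg method is an assumption: the text above does not state which FDR procedure tcareg applies.

```r
# Made-up p-values for 4 features, stored as an m by 1 matrix
pvals <- matrix(c(0.04, 0.001, 0.20, 0.03), ncol = 1,
                dimnames = list(paste0("feature", 1:4), "pval"))

# FDR adjustment; Benjamini-Hochberg is assumed here for illustration
qvals <- matrix(p.adjust(pvals[, 1], method = "BH"), ncol = 1,
                dimnames = list(rownames(pvals), "qval"))

# Sort features so that lower p-values appear first
ord <- order(pvals[, 1])
sorted <- cbind(pvals, qvals)[ord, ]
print(sorted)
```

Each q-value is at least as large as its p-value, and sorting by p-value keeps the q-values in non-decreasing order.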
TCA models \(Z_{hj}^i\) as the source-specific value of observation \(i\) in feature \(j\) coming from source \(h\) (see tca for more details). A TCA regression model tests an outcome \(Y\) for a linear statistical relation with the source-specific values of a feature \(j\) by assuming: $$Y_i = \sum_{h=1}^k \beta_{hj} Z_{hj}^i + e_i$$ where \(e_i \sim N(0,\phi^2)\). In practice, tcareg fits this model using the conditional distribution \(Y|X\), which effectively integrates over the latent \(Z_{hj}^i\) parameters. Statistical significance is then calculated using a likelihood ratio test (LRT). Note that the null and alternative models will be set automatically, except when test == 'custom', in which case they will be set according to the user-specified null and alternative hypotheses.
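The mechanics of a likelihood ratio test between nested linear models can be sketched in base R. This is a simplified illustration, not the likelihood that tcareg fits: the source-specific values Z are simulated and treated as observed, whereas TCA treats them as latent and works with the conditional distribution of Y given X. The degrees of freedom are taken as the number of additional effects in the alternative model.

```r
set.seed(42)
n <- 200
k <- 3

# Stand-in for the source-specific values of one feature
# (observed here for illustration; latent in TCA)
Z <- matrix(rnorm(n * k), nrow = n, dimnames = list(NULL, paste0("Z", 1:k)))
beta <- c(0.8, 0, 0)                      # only source 1 has an effect
y <- as.vector(Z %*% beta + rnorm(n, sd = 1))
dat <- data.frame(y = y, Z)

# Null model: no source effects; alternative model: all k source effects
fit_null <- lm(y ~ 1, data = dat)
fit_alt  <- lm(y ~ Z1 + Z2 + Z3, data = dat)

# LRT statistic: twice the difference in log-likelihoods
ll_null  <- as.numeric(logLik(fit_null))
ll_alt   <- as.numeric(logLik(fit_alt))
lrt_stat <- 2 * (ll_alt - ll_null)

# The alternative model has k more effects than the null model
df <- k
p_value <- pchisq(lrt_stat, df = df, lower.tail = FALSE)
```

Because the null model is nested within the alternative model, the LRT statistic is non-negative, and it is compared against a chi-squared distribution with df degrees of freedom.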
Under the TCA regression model, several statistical tests can be performed by setting the argument test according to one of the following options.
1. If test == 'marginal', tcareg will perform the following for each source \(l\). For each feature \(j\), \(\beta_{lj}\) will be estimated and tested for a non-zero effect, while assuming \(\beta_{hj}=0\) for all other sources \(h\neq l\).
2. If test == 'marginal_conditional', tcareg will perform the following for each source \(l\). For each feature \(j\), \(\beta_{lj}\) will be estimated and tested for a non-zero effect, while also estimating the effect sizes \(\beta_{hj}\) for all other sources \(h\neq l\).
3. If test == 'joint', tcareg will estimate for each feature \(j\) the effect sizes of all \(k\) sources \(\beta_{1j},\ldots,\beta_{kj}\) and then test the set of \(k\) estimates of each feature \(j\) for a joint effect.
4. If test == 'single_effect', tcareg will estimate for each feature \(j\) the effect sizes of all \(k\) sources \(\beta_{1j},\ldots,\beta_{kj}\), under the assumption that \(\beta_{1j} = \ldots = \beta_{kj}\), and then test the set of \(k\) estimates of each feature \(j\) for a joint effect.
5. If test == 'custom', tcareg will estimate for each feature \(j\) the effect sizes of a predefined set of sources (defined by a user-specified alternative model) and then test their estimates for a joint effect, while accounting for a nested predefined set of sources (defined by a user-specified null model).
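The five options above can be summarized by which sources enter the null and alternative models. The sketch below encodes that reading in base R; test_models is a hypothetical helper written for illustration, not a function of the TCA package, and the mapping is an interpretation of the descriptions above rather than the package's internal logic.

```r
# Hypothetical helper (not part of the TCA package): returns the sets of
# sources whose effects are estimated under the null and alternative models.
# `target` is the source under test for the two marginal options.
test_models <- function(test, sources, target = NULL,
                        null_model = NULL, alternative_model = NULL) {
  switch(test,
    # only the tested source has an effect; all others are assumed zero
    marginal = list(null = character(0), alternative = target),
    # the remaining sources are estimated under both models
    marginal_conditional = list(null = setdiff(sources, target),
                                alternative = sources),
    # all k effects are tested jointly against no effects
    joint = list(null = character(0), alternative = sources),
    # as 'joint', but with one effect shared by all sources
    single_effect = list(null = character(0), alternative = sources,
                         shared_effect = TRUE),
    custom = {
      # the null model must be nested within the alternative model
      stopifnot(all(null_model %in% alternative_model))
      list(null = if (is.null(null_model)) character(0) else null_model,
           alternative = alternative_model)
    },
    stop("unknown test: ", test)
  )
}

sources <- c("1", "2", "3")
mc <- test_models("marginal_conditional", sources, target = "2")
cu <- test_models("custom", sources, null_model = "3",
                  alternative_model = c("1", "2", "3"))
```

Here mc describes testing source 2 while accounting for sources 1 and 3, and cu describes testing sources 1 and 2 jointly while accounting for source 3.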
Rahmani E, Schweiger R, Rhead B, Criswell LA, Barcellos LF, Eskin E, Rosset S, Sankararaman S, Halperin E. Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology. Nature Communications 2018.
n <- 50
m <- 10
k <- 3
p1 <- 1
p2 <- 1
data <- test_data(n, m, k, p1, p2, 0.01)
tca.mdl <- tca(data$X, data$W, data$C1, data$C2)
y <- matrix(rexp(n, rate=.1), ncol=1)
# y must include row names and column names; the row names are assumed
# here to match the observations (columns) of X
rownames(y) <- colnames(data$X)
colnames(y) <- "y"
# joint test:
res1 <- tcareg(data$X, tca.mdl, y, test = "joint", save_results = FALSE)
# custom test, testing for a joint effect of sources 1,2 while accounting for source 3
res2 <- tcareg(data$X, tca.mdl, y, test = "custom", null_model = c("3"),
               alternative_model = c("1","2","3"), save_results = FALSE)
# custom test, testing for a joint effect of sources 1,2 assuming no effects under the null
res3 <- tcareg(data$X, tca.mdl, y, test = "custom", null_model = NULL,
               alternative_model = c("1","2"), save_results = FALSE)