diffTest: Generic Differential Test

Description

diffTest is a generic function used to perform a differential test (or multiple differential tests) between two R objects, usually of the same type. Described in this page is the method designed for comparing two bioCond objects. This method is typically used to call genomic intervals with differentially represented ChIP-seq signals between two biological conditions.

Usage

diffTest(x, y, ...)
# S3 method for bioCond
diffTest(x, y, min.var = 0, df.prior = NULL, ...)

Value

This method returns an object of class

c("diffBioCond", "data.frame"), recording the test results for each genomic interval by each row. The data frame consists of the following variables:

x.mean, y.mean: Mean signal intensities of the two conditions, respectively. "x" and "y" in the variable names are replaced by the corresponding actual condition names.
Mval: Difference in mean signal intensity between the two conditions. An Mval of 1 indicates a twofold change in normalized read count.
Mval.se: Standard error associated with the Mval.
Mval.t: The ratio of Mval to Mval.se.
pval: Two sided p-value for the statistical significance of this signal difference.
padj: P-value adjusted for multiple testing with the "BH" method (see p.adjust), which controls false discovery rate.

Row names of the returned data frame inherit from those of

x$norm.signal. Besides, an attribute named "Mval.se.df" is added to the returned object, which is a positive numeric giving the total number of degrees of freedom associated with the standard errors.

Arguments

x, y: x is any R object for which a diffTest method has been defined. For the method for class "bioCond", x and y are two bioCond objects to be compared. They must be associated with the same mean-variance curve (i.e., they must have the same "mvcID"; see also fitMeanVarCurve).
...: Arguments passed to specific methods or from other methods.
min.var: Lower bound of variances read from the mean-variance curve. Any variance read from the curve less than min.var will be adjusted to this value. It's primarily used for safely getting the prior variances and taking into account the practical significance of a signal difference.
df.prior: Number of prior degrees of freedom associated with the mean-variance curve. Must be non-negative. Can be set to Inf (see "Details"). By default, diffTest checks if x and y have the same "df.prior" component, and uses it as the number of prior degrees of freedom if yes (an error is raised otherwise).

Details

This method for calling differential genomic intervals between two bioCond objects adopts the modeling strategy implemented in limma (see "References"), except that each interval has its own prior variance, which is read from the mean-variance curve associated with the bioConds. Technically, the final estimate of variance for an individual interval is a weighted average between its prior and observed variances, with the weights being proportional to their respective numbers of degrees of freedom.

Two extreme values can be specified for the argument df.prior (number of degrees of freedom associated with the prior variances), representing two distinct cases: when it is set to 0, the final variance estimate for an individual interval is simply deduced from the signal intensities observed in it, and the statistical test reduces to the ordinary two-sample t-test; when it is set to Inf, the final variance estimate is simply read from the mean-variance curve. Other values of df.prior represent intermediate cases. To be noted, the number of prior degrees of freedom is automatically estimated for each mean-variance curve by a specifically designed statistical method (see also fitMeanVarCurve and setMeanVarCurve) and, by default, diffTest uses the estimation result to perform the differential tests. It's highly not recommended to specify df.prior explicitly when calling diffTest, unless you know what you are really doing. Besides, diffTest won't adjust variance ratio factors of the two bioConds being compared based on the specified number of prior degrees of freedom (see estimatePriorDf for a description of variance ratio factor).

Note also that, if df.prior is set to 0, of the two bioCond objects being compared there must be at least one that contains two or more samples. Otherwise, there is no way to measure the variance associated with each interval, and an error is raised.

Considering the practical significance of differential ChIP-seq signals, those genomic intervals not occupied by either of the conditions may be filtered out before selecting differential ones. Thus, the statistical power for detecting differential intervals could potentially be increased by re-adjusting p-values of the remaining intervals (see "Examples" below).

References

Smyth, G.K., Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol, 2004. 3: p. Article3.

Tu, S., et al., MAnorm2 for quantitatively comparing groups of ChIP-seq samples. Genome Res, 2021. 31(1): p. 131-145.

Examples

Run this code

data(H3K27Ac, package = "MAnorm2")
attr(H3K27Ac, "metaInfo")
# \donttest{
## Make a comparison between GM12891 and GM12892 cell lines.

# Perform MA normalization and construct bioConds to represent the two cell
# lines.
norm <- normalize(H3K27Ac, 5:6, 10:11)
norm <- normalize(norm, 7:8, 12:13)
conds <- list(GM12891 = bioCond(norm[5:6], norm[10:11], name = "GM12891"),
              GM12892 = bioCond(norm[7:8], norm[12:13], name = "GM12892"))
autosome <- !(H3K27Ac$chrom %in% c("chrX", "chrY"))
conds <- normBioCond(conds, common.peak.regions = autosome)

# Variations in ChIP-seq signals across biological replicates of a cell line
# are generally of a low level, and their relationship with the mean signal
# intensities is expected to be well modeled by the presumed parametric
# form.
conds <- fitMeanVarCurve(conds, method = "parametric", occupy.only = TRUE)
summary(conds[[1]])
plotMeanVarCurve(conds, subset = "occupied")

# Perform differential tests between the two cell lines.
res1 <- diffTest(conds[[1]], conds[[2]])
head(res1)
MAplot(res1, padj = 0.001)
abline(h = 0, lwd = 2, lty = 5, col = "green3")

## Make a comparison between GM12891 and GM12892 cell lines using only their
## first replicates.

# Perform MA normalization and construct bioConds to represent the two cell
# lines.
autosome <- !(H3K27Ac$chrom %in% c("chrX", "chrY"))
norm <- normalize(H3K27Ac, c(5, 7), c(10, 12),
                  common.peak.regions = autosome)
conds <- list(GM12891 = bioCond(norm[5], norm[10], name = "GM12891"),
              GM12892 = bioCond(norm[7], norm[12], name = "GM12892"))

# Construct a "blind" bioCond that treats the two samples as replicates and
# fit a mean-variance curve accordingly. Only common peak regions of the two
# samples are considered to be occupied by the "blind" bioCond, and only
# these regions are used for fitting the mean-variance curve. This setting
# is for capturing underlying non-differential intervals as accurately as
# possible and avoiding over-estimation of prior variances (i.e., variances
# read from a mean-variance curve).
conds$blind <- bioCond(norm[c(5, 7)], norm[c(10, 12)], occupy.num = 2,
                       name = "blind")
conds <- fitMeanVarCurve(conds, method = "parametric",
                         occupy.only = TRUE, init.coef = c(0.1, 10))
summary(conds[[1]])
summary(conds[[2]])
summary(conds[[3]])
plotMeanVarCurve(conds, subset = "occupied")

# Perform differential tests between the two cell lines.
res2 <- diffTest(conds[[1]], conds[[2]])
head(res2)
MAplot(res2, pval = 0.01)
abline(h = 0, lwd = 2, lty = 5, col = "green3")

# Inspect only the test results of the genomic intervals that are occupied
# by at least one of the two bioConds having been compared. Note the
# globally increased statistical power.
res3 <- res2[conds[[1]]$occupancy | conds[[2]]$occupancy, ]
res3$padj <- p.adjust(res3$pval, method = "BH")
boxplot(list(all = res2$padj, occupied = res3$padj), ylab = "Adj. p-value")

## Examine the consistency of results between the two differential analyses.

# Theoretically, t-statistics resulting from the two differential analyses
# are not directly comparable to each other, since they have different
# numbers of degrees of freedom. Here we map these t-statistics to the
# standard normal distribution in such a manner that the resulting
# z-statistics correspond to the same p-values as do the original
# t-statistics.
z1 <- qnorm(res1$pval / 2)
z1[res1$Mval > 0] <- -z1[res1$Mval > 0]
z2 <- qnorm(res2$pval / 2)
z2[res2$Mval > 0] <- -z2[res2$Mval > 0]

# Check the correlation between z-statistics from the two differential
# analyses.
cor(z1, z2)
cor(z1, z2, method = "sp")
# }
## Make a comparison between the male and female genders by treating each
## individual (i.e., cell line) as a replicate.
# \donttest{
# First perform the MA normalization and construct bioConds to represent
# individuals.
norm <- normalize(H3K27Ac, 4, 9)
norm <- normalize(norm, 5:6, 10:11)
norm <- normalize(norm, 7:8, 12:13)
conds <- list(GM12890 = bioCond(norm[4], norm[9], name = "GM12890"),
              GM12891 = bioCond(norm[5:6], norm[10:11], name = "GM12891"),
              GM12892 = bioCond(norm[7:8], norm[12:13], name = "GM12892"))
autosome <- !(H3K27Ac$chrom %in% c("chrX", "chrY"))
conds <- normBioCond(conds, common.peak.regions = autosome)

# Group individuals into bioConds based on their genders.
female <- cmbBioCond(conds[c(1, 3)], name = "female")
male <- cmbBioCond(conds[2], name = "male")

# The dependence of variance of ChIP-seq signal intensity across individuals
# on the mean signal intensity is not as regular as in the case for modeling
# biological replicates of cell lines. Better use the local regression to
# adaptively capture the mean-variance trend.
genders <- list(female = female, male = male)
genders <- fitMeanVarCurve(genders, method = "local", occupy.only = FALSE)
genders <- estimatePriorDf(genders, occupy.only = TRUE)
summary(genders$female)
summary(genders$male)
plotMeanVarCurve(genders, subset = "all")

# Perform differential tests between the two genders.
res <- diffTest(genders[[1]], genders[[2]])
head(res)
MAplot(res, pval = 0.01)
abline(h = 0, lwd = 2, lty = 5, col = "green3")

# Examine the distribution of p-values in Y chromosome.
hist(res$pval[H3K27Ac$chrom == "chrY"], col = "red",
     main = "P-values in Y chromosome")
# }

Run the code above in your browser using DataLab