enrichedPairs: Compute local enrichment for bin pairs

Description

Calculate the log-fold increase in abundance for each bin pair against its local neighborhood.

Usage

enrichedPairs(data, flank=5, exclude=0, prior.count=2, abundances=NULL)

Arguments

data

an InteractionSet object containing bin pair counts, generated by squareCounts

flank

an integer scalar, specifying the number of bins to consider as the local neighborhood

exclude

an integer scalar, specifying the number of bins to exclude from the neighborhood

prior.count

a numeric scalar indicating the prior count to use in computing the log-fold increase

abundances

a numeric vector of abundances for each bin pair

Value

A numeric vector containing the log-fold increase (i.e., enrichment value) for each bin pair in data.

Definition of the neighborhoods

Consider the coordinates of the interaction space in terms of bins, and focus on any particular bin pair (referred to here as the target bin pair). This target bin pair is characterized by four neighborhood regions, A to D. Region A is a square with side lengths equal to flank*2+1, with the target bin pair at its center. Region B is a square with side lengths equal to flank, positioned such that the target bin pair lies at the corner furthest from the diagonal (only used for intra-chromosomal targets). Region C is a horizontal rectangle with dimensions (1, flank*2+1), containing the target bin pair at its center. Region D is the vertical counterpart to C. The target bin pair itself is excluded from each neighborhood.

If exclude is positive, the bin pairs closest to the target are also excluded. For example, a region A* is constructed in the same manner as A but with exclude instead of flank, and the resulting area is removed from region A (and likewise for all other regions). This avoids problems where a diffuse interaction is imperfectly captured by the target bin pair, such that the genuine interaction spills over into the neighborhood. Spill-over is undesirable as it inflates the neighborhood abundance for genuine interactions. Setting a larger exclude ensures that this does not occur.

The size of flank requires consideration, as it defines the size of each neighborhood region. If flank is too large, other peaks may be included in the background, such that the neighborhood abundance is inflated. If flank is too small, there will not be enough neighborhood bin pairs to dilute the increase in abundance from spill-over. Both scenarios decrease the enrichment values and reduce the power to detect punctate events. The default value of 5 seems to work well, though users may wish to test several values for themselves.
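The region definitions above can be illustrated with a small base R sketch that lists the bin offsets belonging to each neighborhood. This is not code from the package: the helper name region_offsets is hypothetical, offsets are expressed relative to the target at (0, 0), and the orientation of region B (towards decreasing first-anchor and increasing second-anchor indices) is an assumption based on the description. Truncation at the edges of the interaction space or at the diagonal is ignored.

```r
# Offsets (d1, d2) are relative to the target bin pair at (0, 0);
# d1 indexes the first anchor bin and d2 the second anchor bin.
region_offsets <- function(flank = 5, exclude = 0) {
    square <- function(k) expand.grid(d1 = -k:k, d2 = -k:k)
    corner <- function(k) {
        # k-by-k square extending from the target towards the diagonal
        # (assumed orientation: decreasing d1, increasing d2).
        if (k < 1) return(data.frame(d1 = integer(0), d2 = integer(0)))
        expand.grid(d1 = -(seq_len(k) - 1L), d2 = seq_len(k) - 1L)
    }
    horiz <- function(k) data.frame(d1 = 0L, d2 = -k:k)
    vert <- function(k) data.frame(d1 = -k:k, d2 = 0L)

    # Remove the target itself plus the inner region built with 'exclude'.
    trim <- function(region, inner) {
        key <- function(x) paste(x$d1, x$d2)
        remove <- key(region) %in% c(key(inner), "0 0")
        region[!remove, , drop = FALSE]
    }
    list(A = trim(square(flank), square(exclude)),
         B = trim(corner(flank), corner(exclude)),
         C = trim(horiz(flank), horiz(exclude)),
         D = trim(vert(flank), vert(exclude)))
}

# Number of neighborhood bin pairs in each region for the default settings.
sizes <- sapply(region_offsets(flank = 5, exclude = 0), nrow)
sizes
```

Under these assumptions, the default settings give 120, 24, 10 and 10 bin pairs for regions A to D respectively. Increasing exclude shrinks each region from the inside, which is why flank should remain comfortably larger than exclude.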

Computing the enrichment values

For a target bin pair in data, the enrichedPairs function computes the mean abundance for each of its surrounding neighborhoods. This is defined as the mean of the counts for all constituent bin pairs in that neighborhood (average counts are used for multiple libraries). The local background for the target bin pair is defined as the maximum of the mean abundances for all neighborhoods. The enrichment value is then defined as the difference between the target bin pair's abundance and its local background. The idea is that bin pairs with high enrichments are likely to represent punctate interactions between clearly defined loci. Selecting for high enrichments can then select for these peak-like features in the interaction space.

The maximizing strategy is designed to mitigate the effects of structural features. Region B will capture the high interaction intensity within genomic domains like TADs, while regions C and D will capture any bands in the interaction space. The abundance will be high for any neighborhood that captures a high-intensity feature, as the average counts will be large for all bin pairs within the feature. This neighborhood will then be chosen as the maximum during calculation of enrichment values. If only region A were used, the background abundance would be decreased by low-intensity bin pairs outside of the features, resulting in spuriously high enrichment values for target bin pairs on the feature boundaries.

By default, nothing is done to adjust for the effect of distance on abundance for intra-chromosomal bin pairs. This is because the counts are generally too low to routinely fit a reliable trend. That said, users can still supply distance-adjusted abundances via the abundances argument. Such values can be defined as the residuals of the fit from filterTrended. No such adjustment is required for inter-chromosomal bin pairs.
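The effect of the maximizing strategy can be seen in a small base R sketch on a toy count matrix. This is a simplified illustration, not the package's implementation: the function local_enrichment is hypothetical, it uses a single library, it treats the target as inter-chromosomal (so region B is skipped and exclude is taken as 0), and it assumes that abundances are plain log2 values of the count plus the prior count, ignoring any library size scaling.

```r
# Toy counts: rows index the first anchor bin, columns the second.
# Uniform background of 5, a horizontal band of 12, and a peak of 40.
counts <- matrix(5, nrow = 21, ncol = 21)
counts[11, ] <- 12
counts[11, 11] <- 40

local_enrichment <- function(counts, i, j, flank = 5, prior.count = 2) {
    rows <- max(1, i - flank):min(nrow(counts), i + flank)
    cols <- max(1, j - flank):min(ncol(counts), j + flank)
    target <- counts[i, j]

    # Mean count in each neighborhood, leaving out the target itself.
    mean_without_target <- function(block) {
        (sum(block) - target) / (length(block) - 1)
    }
    means <- c(A = mean_without_target(counts[rows, cols]),
               C = mean_without_target(counts[i, cols]),
               D = mean_without_target(counts[rows, j]))

    # The local background is the largest neighborhood mean; the enrichment
    # is the log2-ratio of target to background after adding the prior count.
    background <- max(means)
    log2(target + prior.count) - log2(background + prior.count)
}

enr <- local_enrichment(counts, 11, 11)
enr
```

Here the band along row 11 makes region C the local background (mean count of 12), so the enrichment is log2(42/14) = log2(3). Using region A alone would give a background mean of about 5.6 and a noticeably larger enrichment, which is the boundary inflation that the maximum is meant to avoid.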

References

Rao S et al. (2014). A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 159, 1665-1680.

Examples

# Load the package that provides enrichedPairs and filterTrended.
library(diffHic)

# Setting up the object.
a <- 10
b <- 20
regions <- GRanges(rep(c("chrA", "chrB"), c(a, b)), IRanges(c(1:a, 1:b), c(1:a, 1:b)))

set.seed(23943)
all.anchor1 <- sample(length(regions), 50, replace=TRUE)
all.anchor2 <- as.integer(runif(50, 1, all.anchor1+1))
data <- InteractionSet(matrix(rnbinom(200, mu=10, size=10), 50, 4), 
    GInteractions(anchor1=all.anchor1, anchor2=all.anchor2, 
        regions=regions, mode="reverse"), 
    colData=DataFrame(lib.size=1:4*1000), metadata=List(width=1))
data$totals <- colSums(assay(data))

# Getting peaks.
head(enrichedPairs(data))
head(enrichedPairs(data, flank=3))
head(enrichedPairs(data, flank=1))
head(enrichedPairs(data, exclude=1))

# Accounting for distance.
filtered <- suppressWarnings(filterTrended(data, prior.count=0))
adj.ab <- filtered$abundances - filtered$threshold

head(enrichedPairs(data, abundances=adj.ab))
