normalizeToMatrix: Normalize associations between genomic signals and target regions into a matrix

Description

Normalize associations between genomic signals and target regions into a matrix

Usage

normalizeToMatrix(signal, target, extend = 5000, w = max(extend)/50,
    value_column = NULL, mapping_column = NULL, empty_value = ifelse(smooth, NA, 0),
    mean_mode = c("absolute", "weighted", "w0", "coverage"), include_target = any(width(target) > 1),
    target_ratio = ifelse(all(extend == 0), 1, 0.1), k = min(c(20, min(width(target)))),
    smooth = FALSE, smooth_fun = default_smooth_fun, trim = 0)

Arguments

signal

a GRanges object.

target

a GRanges object.

extend

number of base pairs to extend upstream and downstream of target. It can be a vector of length one or two. If it has length one, the upstream and downstream extensions are the same.

w

window size for splitting upstream and downstream.

value_column

column index in signal that will be mapped to colors. If it is NULL, an internal column in which every value is 1 will be used.

mapping_column

mapping column to restrict overlapping between signal and target. By default it tries to look for all regions in signal that overlap with every target.

empty_value

values for small windows that don't overlap with signal.

mean_mode

how to summarize values in a window that does not perfectly overlap with signal. See the 'Details' section for a detailed explanation.

include_target

whether to include target in the heatmap. If the width of all regions in target is 1, include_target is enforced to FALSE.

target_ratio

the ratio of target in the full heatmap. If the value is 1, extend will be reset to 0.

k

number of windows. Only used when target_ratio = 1 or extend == 0, otherwise ignored.

smooth

whether to apply smoothing on rows in the matrix.

smooth_fun

the smoothing function that is applied to each row in the matrix. This user-defined function accepts a numeric vector (which may contain NA values) and returns a vector of the same length. If smoothing fails, the function should call stop to throw an error so that normalizeToMatrix can count how many rows failed in smoothing. See the default default_smooth_fun for an example.

trim

percent of extreme values to remove. If it is a vector of length 2, it corresponds to the lower quantile and the higher quantile, e.g. c(0.01, 0.01) means to trim outliers below the 1st percentile and above the 99th percentile.
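
As an illustration of the contract described for smooth_fun above, here is a minimal sketch of a user-defined smoothing function in base R. The name moving_average_smooth and its half_width parameter are hypothetical and not part of the package; the function only follows the documented requirements (numeric vector in, same-length vector out, stop on failure) and is not the package's default_smooth_fun.

```r
# Hypothetical smoothing function (not part of EnrichedHeatmap):
# a centered moving average that skips NA values.
moving_average_smooth <- function(x, half_width = 1) {
  if (!is.numeric(x)) stop("x must be a numeric vector")
  n <- length(x)
  vapply(seq_len(n), function(i) {
    # indices of the neighbourhood around position i, clipped to the vector
    idx <- max(1, i - half_width):min(n, i + half_width)
    vals <- x[idx]
    # signal failure with stop(), as the smooth_fun contract asks
    if (all(is.na(vals))) stop("neighbourhood contains only NA values")
    mean(vals, na.rm = TRUE)
  }, numeric(1))
}

smoothed <- moving_average_smooth(c(1, NA, 3, 5, NA, NA, 2))
```

Such a function would presumably be supplied as smooth_fun = moving_average_smooth together with smooth = TRUE; per the usage above, empty_value then defaults to NA, so the function has to cope with NA input.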

Value

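A matrix with the following additional attributes:
upstream_index: column index corresponding to upstream of target
target_index: column index corresponding to target
downstream_index: column index corresponding to downstream of target
extend: extension on upstream and downstream
smooth: whether smoothing was applied on the matrix
failed_rows: index of rows for which smoothing failed
The matrix is wrapped into a simple normalizeToMatrix class.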

Details

In order to visualize associations between signal and target, the data is transformed into a matrix and visualized as a heatmap by EnrichedHeatmap afterwards.

The upstream and downstream regions, together with the target body, are split into a list of small windows which are then overlapped with signal. Since regions in signal and the small windows do not always overlap completely, there are four different averaging modes.

The following illustrates the different settings for mean_mode (note that one signal region overlaps with another signal region):

         40       50    20     values in signal
       ++++++     +++  +++++   signal
            30                 values in signal
          ++++++               signal
         =================     window (17bp), there are 4bp not overlapping to any signal region.
          4  6     3    3      overlap

    absolute: (40 + 30 + 50 + 20)/4
    weighted: (40*4 + 30*6 + 50*3 + 20*3)/(4 + 6 + 3 + 3)
    w0:       (40*4 + 30*6 + 50*3 + 20*3)/(4 + 6 + 3 + 3 + 4)
    coverage: (40*4 + 30*6 + 50*3 + 20*3)/17
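
The four formulas above can be checked with plain arithmetic. The following is a small base R sketch that only reproduces the numbers from the illustration; the variable names are chosen here for readability, and it does not call normalizeToMatrix or show how the package computes the means internally.

```r
# Values of the four signal regions and the width (in bp) of their
# overlap with the 17bp window, taken from the illustration.
value        <- c(40, 30, 50, 20)
overlap      <- c(4, 6, 3, 3)
window_width <- 17
uncovered    <- 4   # bp in the window not covered by any signal region

weighted_sum <- sum(value * overlap)

mean_absolute <- mean(value)                               # ignores overlap widths
mean_weighted <- weighted_sum / sum(overlap)               # weighted by overlap width
mean_w0       <- weighted_sum / (sum(overlap) + uncovered) # uncovered bases count as 0
mean_coverage <- weighted_sum / window_width               # divided by window width

modes <- c(absolute = mean_absolute, weighted = mean_weighted,
           w0 = mean_w0, coverage = mean_coverage)
print(modes)
```

Note that the w0 denominator (20) is larger than the window width (17) in this example because two of the signal regions overlap each other, so the overlapping bases are counted twice.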

To explain it more clearly, let's consider three scenarios:

First, we want to calculate the mean methylation from 3 CpG sites in a 20bp window. Since methylation is only measured at the CpG site level, the mean value should be calculated from the 3 CpG sites only and not from the non-CpG sites. In this case, absolute mode should be used.

Second, we want to calculate the mean coverage in a 20bp window. Let's assume coverage is 5 in 1bp ~ 5bp, 10 in 11bp ~ 15bp and 20 in 16bp ~ 20bp. Since coverage is an attribute of every base, all 20 bp should be taken into account. Thus, w0 mode should be used here, which also takes into account the 0 coverage in 6bp ~ 10bp. The mean coverage will be calculated as (5*5 + 10*5 + 20*5)/(5+5+5+5).

Third, genes have multiple transcripts and we want to calculate how many transcripts exist at a certain position in the gene body. In this case, the values associated with each transcript are binary (either 1 or 0) and coverage mean mode should be used.
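
The second and third scenarios can be worked through in base R as a sanity check. This is only a sketch of the arithmetic: the coverage numbers come from the text above, while the transcript overlaps in the third scenario (three transcripts covering 10, 6 and 2 bp of a 10bp window) are hypothetical values chosen for illustration.

```r
# Scenario 2: per-base coverage in a 20bp window, with 0 coverage in 6bp ~ 10bp.
per_base_coverage <- c(rep(5, 5), rep(0, 5), rep(10, 5), rep(20, 5))

# The w0 formula from the text ...
w0_formula <- (5*5 + 10*5 + 20*5) / (5 + 5 + 5 + 5)
# ... equals the plain mean over all 20 bases, zeros included.
w0_per_base <- mean(per_base_coverage)

# Scenario 3 (hypothetical numbers): three transcripts, each with value 1,
# overlapping a 10bp window by 10, 6 and 2 bp.
transcript_value   <- c(1, 1, 1)
transcript_overlap <- c(10, 6, 2)
window_width       <- 10

# coverage mode: average number of transcripts per base in the window
coverage_mean <- sum(transcript_value * transcript_overlap) / window_width

print(c(w0_formula = w0_formula, w0_per_base = w0_per_base,
        coverage_mean = coverage_mean))
```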

Examples

library(GenomicRanges)    # GRanges(); also attaches IRanges
library(EnrichedHeatmap)  # normalizeToMatrix()

# nine 2bp signal regions on chr1, each with a score
signal = GRanges(seqnames = "chr1",
    ranges = IRanges(start = c(1, 4, 7, 11, 14, 17, 21, 24, 27),
                     end = c(2, 5, 8, 12, 15, 18, 22, 25, 28)),
    score = c(1, 2, 3, 1, 2, 3, 1, 2, 3))
# a single 11bp target region
target = GRanges(seqnames = "chr1", ranges = IRanges(start = 10, end = 20))
normalizeToMatrix(signal, target, extend = 10, w = 2)
normalizeToMatrix(signal, target, extend = 10, w = 2, include_target = TRUE)
normalizeToMatrix(signal, target, extend = 10, w = 2, value_column = "score")