# DepLogo -- plotting dependency logos of sequence data

## Plotting dependency logos from data

In the following, we show how to plot dependency logos from data. As an example data set, we consider binding sites of the human transcription factor c-Jun predicted in ChIP-seq positive peaks.

First, we load the data, which are stored in a tabular file containing the sequences in the first column and the peak statistics of the corresponding ChIP-seq peaks in the second column:

seqs <- read.table(system.file("extdata", "cjun.txt", package = "DepLogo"), stringsAsFactors = FALSE) head(seqs)

Before plotting dependency logos, we need to construct a DLData object from our data. A DLData object is a wrapper for the sequences (internally represented as a data.frame with sequences already split into their symbols), optional weights (here, the peak statistics), and the annotation of the underlying alphabet and corresponding colors for visualization. The default alphabet is the DNA alphabet. We may call the DLData constructor just with sequences and weights, in this case. The default delimiter between symbols is the empty string (splitting sequences into individual symbols), but may be changed to other values.

data <- DLData(sequences = seqs[, 1],weights = log1p(seqs[, 2]) )

The implementation of summary for DLData objects returns a list containing the number of sequences, the consensus of all sequences, and the number of sequences that are identical to the consensus sequence.

We then use the resulting DLData object to plot a first dependency logo:

 plotDeplogo(data = data)

By default, dependencies between sequence positions are represented by shaded arcs. The intensity of an arc is chosen based on the mutual information between the two connected positions. Partitions of the input sequence set are represented by position-wise, colored rectangles. This is basically the same kind of visualization that we also used in the introductory part of this vignette.

Instead of mutual information values, we may also use the corresponding p-values (see above):

plotDeplogo(data = data, show.dependency.pvals = TRUE)

The default arc representations applies a cutoff on the mutual information values prior to visualization.

If we want to illustrate all pairwise dependency values, we may use a matrix of dependency values, instead:

plotDeplogo(data = data, dep.fun = plotDepmatrix, show.dependency.pvals = TRUE)

In general, all major parts of the dependency logo (visualization of dependency strength at the top, visualization of the dependency structure in the middle, and summary of all the sequences at the bottom) may be customized. For this purpose, you can either use one of the pre-defined functions (see package documentation) or you can implement your own plotting functions as described below.

In the first of the following examples, we explicitly assign the default values to each of the plotting parameters, while in the second example, we visualize dependency structures by color charts, instead.

 plotDeplogo(data = data, dep.fun = plotDepmatrix, block.fun = deprects, summary.fun = logo) plotDeplogo(data = data, dep.fun = plotDepmatrix, block.fun = colorchart, summary.fun = logo)

In some cases, it might also be reasonable to plot a dependency logo of the reverse complementary sequences. For this purpose, we provide a function revcom accepting a DLData object and returning a DLData object containing the reverse complementary sequences. The revcom function works only on DLData objects defined on DNA or RNA alphabets, which may include gaps. Providing DLData objects defined on other alphabets, the function call will result in an error.

data.rc <- revcom(data) plotDeplogo(data) plotDeplogo(data.rc)

### Tuning parameters

The partitioning of the data and, consequently, the visualization by dependency logos, is influenced by different parameters of the plotDeplogo call. Specifically,

• threshold defines the threshold on the mutual information values. Positions with (average) mutual information values above this value are considered for partitioning the data by the symbols at these positions.
• minPercent defines the minimum size of a partition as the fraction of the full set of sequences assigned to its own partition.
• maxNum defines the maximum number of recursive splits.

While the default parameters of plotDeplogo will yield an acceptable result on most cases, specific data sets and applications may require some tuning to achieve a visually pleasing result.

The gradient of resulting dependency logos may span from cluttered to overly aggregated as illustrated in the examples below.

plotDeplogo(data = data, threshold = 0.01, minPercent = 0.01, maxNum = 8)
plotDeplogo(data = data, threshold = 0.04)
plotDeplogo(data = data)
plotDeplogo(data = data, threshold = 1.0)

In the first dependency logo, partitioning yields very small, highly specific partitions, which also include spurious dependencies. The second and third dependency logos may be acceptable, depending on the application. The fourth dependency logo does not partition the data at all and, hence, does not add any information compared with a traditional sequence logo.

### Alphabets beyond DNA

While dependency logos originally have been designed for visualizing dependencies in DNA sequences (i.e., transcription factor binding sites), their implementation in the DepLogo package also allows for using other alphabets, for instance, RNA nucleotides or amino acids in protein sequences, and handles gap symbols (-). However, only upper-case letters are supported.

Below, we show an example of a dependency logo of protein sequences, specifically sequences of the glycosyl transferase family as downloaded from PFAM (Accession PF02885.16). For proteins, we first define the amino acid alphabet and assign colors to each of the symbols (amino acids).

 prot <- c("I", "L", "V", "F", "M", "C", "A", "G", "P", "T", "S", "Y", "W", "Q", "N", "H", "E", "D", "K", "R", "B", "Z", "X","-") colors <- rainbow(length(prot))

A pre-defined alphabet is also available in the variable alphabet.protein.gap:

alphabet.protein.gap

We then load protein sequences from the file obtained from PFAM

dat <- read.table(system.file("extdata", "Glycos_trans_3N.ann", package = "DepLogo"), stringsAsFactors = FALSE, fill = TRUE) dat <- dat[-nrow(dat), 2] dat <- toupper(gsub(pattern = "\\.", "-", dat))

which are slightly processed (all symbols converted to upper case, dots replaced by dashes) to be suitable for the DLData constructor.

We then create a DLData object from those sequences, specifying the custom alphabet and the corresponding colors:

data <- DLData(sequences = dat, symbols = prot, colors = colors)

The alignment loaded from PFAM contains a large number of columns (positions)

dim(data)

which are too many to be visualized by dependency logos. However, many of these positions mostly contain gap symbols, which might be less interesting for a visualization.

For this purpose, the DepLogo package contains a function filterColumns that removes all columns based on a user-supplied filter function. Here, we filter positions with a user-defined fraction of gap symbols from the alignment. This function results in a filtered DLData object. The resulting data have an acceptable number of columns and may be processed by plotDeplogo:

fun <- filter.by.gaps(0.1) dat2 <- filterColumns(data = data, filter.fun = fun) dim(dat2)

In addition, the DepLogo package contains another function suggestColors that proposes distinguishable colors for each of the symbols. The idea is to chose different colors for symbols i) often occurring at the same position, especially ii) at those positions with the largest dependencies to other positions. The suggested colors may then be set for a DLData object using the replaceColors function:

data2 <- replaceColors(dat2, suggestColors(dat2))

We may then plot a dependency logo for the modified DLData object:

plotDeplogo(data2, threshold = 0.6)

Although colors have been chosen to be distinguishable, it may still be hard to recognize the individual symbols in the mixed colors of the dependency logo for large alphabets. In such cases, it may be useful to replace the standard, compact visualization of partitions by traditional, partition-wise sequence logos as in the following example.

plotDeplogo(data2, threshold = 0.6, block.fun = logo)

## Adding plots of sequence weights/values

For many biological sequences, we may also obtain values associated with individual sequences in the data set. For instance, this may be peak statistics or binding affinities in case of transcription factor binding sites, or enzyme efficiencies in case of protein sequences.

Such values may be visualized for the partitions shown in a dependency logo in different ways. In the following example, we again consider the c-Jun binding sites together with the corresponding ChIP-seq peak statistics:

seqs <- read.table(system.file("extdata", "cjun.txt", package = "DepLogo"), stringsAsFactors = FALSE) data <- DLData(sequences = seqs[, 1], weights = log1p(seqs[, 2]) )

Here, the weights in each of the partitions generated by plotDeplogo are represented by boxplots:

plotDeplogo(data = data, weight.fun = subBoxes)

In this case, the different partitions show similar distributions of the associated weights and, hence, may be considered as similarly functional binding sites.

In the same manner, we may consider binding sites of Nrsf/REST:

seqs <- read.table(system.file("extdata", "nrsf.txt", package = "DepLogo"), stringsAsFactors = FALSE) data <- DLData(sequences = seqs[,1], weights = log1p(seqs[,2]) ) plotDeplogo(data = data, threshold = 0.03, weight.fun = subBoxes)

Alternatively, weights may represented by line plots indicating the median value by a red vertical line:

plotDeplogo(data = data, threshold = 0.03, weight.fun = subLines)

Here, we find that binding sites showing a larger number of conserved positions (as shown at the top of the central part of the dependency logo) show substantially larger weights (peak statistics) than the remaining binding sites.

In some cases, it might even be beneficial to first partition the data by sequence weights (e.g., high-scoring, mid-scoring, low-scoring) and plot individual representations of the symbol-based sub-partitions. In this example, we create individual plots for the top 1000 binding sites, the following 2000 binding sites, and all remaining binding sites. We also assign different heights to each of these sub-plots.

plotDeplogo(data = data, weight.fun = subBoxes, chunks = c(1000, 2000, length(data) - 3000), chunk.height = c(200, 200, 700))

The resulting dependency logo makes the relationship between sequence conservation and peak statistics even more obvious. However, the picture is also slightly cluttered, which may be adjusted for by tuning the partitioning parameters as described above.

## Components of dependency logos and customization

The building blocks of dependency logos may also be used separately, for instance in other, custom plots. In the following, we show some examples of these building blocks and comment on the possibility of custom visualizations that may be plugged into plotDeplogo.

### Visualizing pairwise dependencies

Using the functions plotDeparcs and plotDepmatrix, only the pairwise dependencies (without the actual dependency structure on the sequence level) may be visualized. These functions also work on DLData objects and their use is demonstrated in the following examples:

seqs <- read.table(system.file("extdata", "cjun.txt", package = "DepLogo"), stringsAsFactors = FALSE) data <- DLData(sequences = seqs[, 1] ) plotDeparcs(data) plotDepmatrix(data)

Again, these functions have further parameters that allow for some level of customization. For instance, the axis may be shown at the top of the plot (mirroring the corresponding plot at the x-axis), and p-values instead of mutual information values may be shown:

plotDepmatrix(data, axis.at.bottom = FALSE, show.pvals = TRUE)

Implementations of custom functions for visualizing dependency values may also be plugged into the plotDeplogo function. The only requirement is to have the same signature as the existing implementations plotDeparcs and plotDepmatrix.

### Visualizing partitions of sequence data

We may also use the visualization of the partitioning of sequences independently of the plotDeplogo method.

For this purpose, we may first partition the data by the partition function, which is also used internally by plotDeplogo and, again, has several parameters influencing the partitioning result (see documentation):

 seqs <- read.table(system.file("extdata", "cjun.txt", package = "DepLogo"), stringsAsFactors = FALSE) data <- DLData(sequences = seqs[, 1], weights = log1p( seqs[, 2] ) ) partitions <- partition(data, threshold = 0.3, minElements = 100)

Calling summary on this list of partitions returns a data.frame with columns for the total number of sequence, the consensus sequence, and the number of sequences that are identical to the consensus sequence, and with one row per partition.

The resulting partitioning may then be visualized by the default method of dependency logos, by color charts, or by separate sequence logos:

plotBlocks(data = partitions) plotBlocks(data = partitions, block.fun = colorchart) plotBlocks(data = partitions, block.fun = logo)

The user may also chose to switch off the scaling by "information content",

plotBlocks(data = partitions, ic.scale = FALSE)

which results in brown-ish colors at less conserved positions in the dependency logo visualization.

Again, the low-level plotting function deprects, colorchart, and logo may be replaced by custom implementation in both, plotBlocks and plotDeplogo. Here, the functions need to have the same signature as the current implementations and, in addition, return the vertical (y) offset after plotting the current partition. Further, the total height of visualizations of all partitions must have a height that is equal to the total number of sequences in the union of all these partitions because of the corresponding ylim specification in the high-level plot created by plotBlocks.

### Visualizing summaries of sequence data

For the summary visualization of the complete data set (parameter summary.fun), plotDeplogo internally also uses the plotBlocks function but with a single DLData object. Hence, the summary visualization may be customized in the same manner as the visualization of partitions explained above.