summarizeByAnnotation: Summarize data based on genome annotation.

Description

This function summarizes columns of the data using the specified SQLite functions, applying these summarization functions to the regions defined in an annotation data frame.

Usage

summarizeByAnnotation(expData, annoData, what = getColnames(expData, all = FALSE), fxs = c("TOTAL"), groupBy = NULL, splitBy = NULL, ignoreStrand = FALSE, bindAnno = FALSE, preserveColnames = TRUE, verbose = getOption("verbose"))

Arguments

expData

An object of class ExpData.

annoData

A data frame which must contain the columns chr, start, end and strand, which specify the annotation regions of interest.

what

Vector of names of data columns to be summarized.

fxs

Vector of strings giving the names of SQLite functions to call on the data column(s).

groupBy

Character vector referring to a column in annoData. Regions will be aggregated over distinct values of this column. Setting this argument will set bindAnno to TRUE. If splitBy is set, meta.id will override.

splitBy

String indicating column of annoData object on which to split results.

ignoreStrand

Logical indicating whether strand should be ignored in the aggregation. If TRUE, strand will be ignored.

bindAnno

Logical indicating whether annotation information should be included in the output.

preserveColnames

Logical indicating whether column names should be preserved. Only possible when a single function is being applied.

verbose

Logical indicating whether details should be printed.

Value

If splitBy is not specified, returns a data frame containing results of aggregation functions performed on each region defined in annoData. If splitBy is specified, returns a list of data frames with one entry for each unique value of the column which was split on.

Details

Most of the computation is done using SQLite. Depending on the use case, this approach may be significantly faster and use much less memory than the alternative, which is to use splitByAnnotation to retrieve a list with all the data and then use R to summarize over each element of the list. The SQLite approach is (naturally) constrained to operations expressible in (SQLite) SQL.

If meta.id is set to a column in annoData, all regions with the same value of the meta.id will be joined together; a standard use case is labelling exons of a gene.
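
As a rough illustration of what summarizing by annotation means, the following base R sketch computes a per-region total and then joins regions that share a label. It does not use Genominator or SQLite and is not how the package performs the computation; the toy data, the helper regionTotal, and the gene column are assumptions made up for this example.

```r
# Toy data: one row per genomic position with a count column
# (hypothetical values, not taken from the package's sample database).
reads <- data.frame(
  chr      = c(1, 1, 1, 1, 2, 2),
  location = c(10, 20, 35, 50, 5, 15),
  strand   = c(1, 1, 1, 1, -1, -1),
  counts   = c(2, 3, 1, 6, 5, 6)
)

# Toy annotation with the required columns chr, start, end and strand,
# plus a hypothetical 'gene' column labelling exons of the same gene.
anno <- data.frame(
  chr    = c(1, 1, 2),
  start  = c(1, 30, 1),
  end    = c(25, 60, 20),
  strand = c(1, 1, -1),
  gene   = c("geneA", "geneA", "geneB"),
  row.names = c("exon1", "exon2", "exon3")
)

# Hypothetical helper: sum one data column over the positions that fall
# inside a single region (same chromosome and strand, start <= location <= end).
regionTotal <- function(region, data, column) {
  inside <- data$chr == region$chr &
    data$strand == region$strand &
    data$location >= region$start &
    data$location <= region$end
  sum(data[[column]][inside])
}

# One total per annotation region.
perRegion <- vapply(
  seq_len(nrow(anno)),
  function(i) regionTotal(anno[i, ], reads, "counts"),
  numeric(1)
)
names(perRegion) <- rownames(anno)

# Join regions that share a label, e.g. all exons of a gene.
perGene <- tapply(perRegion, anno$gene, sum)
```

Here perRegion holds one total per row of the annotation and perGene holds one total per distinct label. For large data sets this in-memory approach is the slower alternative described above, which is why the function delegates the aggregation to SQLite.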

References

The SQLite website http://www.sqlite.org/lang_aggfunc.html has details on which aggregate functions are implemented.

Examples

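# Open the sample database shipped with the package as an ExpData object.
ed <- ExpData(system.file(package = "Genominator", "sample.db"),
              tablename = "raw")
# Load the example annotation and summarize over its first 50 regions.
data("yeastAnno")
summarizeByAnnotation(ed, yeastAnno[1:50,])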