UCSCTableQuery-class: Querying UCSC Tables

Description

The UCSC genome browser is backed by a large database, which is exposed by the Table Browser web interface. Tracks are stored as tables, so this is also the mechanism for retrieving tracks. The UCSCTableQuery class represents a query against the Table Browser. Storing the query fields in a formal class facilitates incremental construction and adjustment of a query.

Arguments

Constructor

: ucscTableQuery(x, track, range = genome(x), table = NULL, names = NULL): Creates a UCSCTableQuery with the UCSCSession given as x and the track name given by the single string track. range should be a genome string identifier, a GRanges instance or RangesList instance, and it effectively defaults to genome(x). If the genome is missing, it is taken from the session. The table name is given by table, which may be a single string or NULL. Feature names, such as gene identifiers, may be passed via names as a character vector.

Executing Queries

Below, object is a UCSCTableQuery instance.

: track(object): Retrieves the indicated table as a track, i.e. a GRanges object. Note that not all tables are available as tracks.
: getTable(object): Retrieves the indicated table as a data.frame. Note that not all tables are output in parseable form, and that UCSC will truncate responses if they exceed certain limits (usually around 100,000 records). The safest (and most efficient) bet for large queries is to download the file via FTP and query it locally.
: tableNames(object): Gets the names of the tables available for the session, track and range specified by the query.
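
Because UCSC may silently truncate large responses, it can help to check the size of what getTable returns. The following is a minimal sketch, not part of rtracklayer: get_table_checked is a hypothetical helper, and its limit default is an assumption taken from the "around 100,000 records" figure above, not a documented constant.

```r
library(rtracklayer)

# Hypothetical helper (not part of rtracklayer): fetch a table and warn
# when the row count reaches the assumed truncation threshold.
get_table_checked <- function(query, limit = 100000L) {
  df <- getTable(query)
  if (nrow(df) >= limit) {
    warning("Result has ", nrow(df), " rows; UCSC may have truncated it. ",
            "Consider narrowing range(query) or downloading the table file.")
  }
  df
}
```

Given a UCSCTableQuery named query, get_table_checked(query) returns the same data.frame as getTable(query). The warning is only a heuristic, since the actual limit can vary.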

Accessor methods

In the code snippets below, x/object is a UCSCTableQuery object.

: browserSession(object), browserSession(object) <- value: Get or set the UCSCSession to query.
: trackName(x), trackName(x) <- value: Get or set the single string indicating the track containing the table of interest.
: trackNames(x): List the names of the available tracks.
: tableName(x), tableName(x) <- value: Get or set the single string indicating the name of the table to retrieve. May be NULL, in which case the table is automatically determined.
: range(x), range(x) <- value: Get or set the GRanges indicating the portion of the table to retrieve in genomic coordinates. Any missing information, such as the genome identifier, is filled in using range(browserSession(x)). It is also possible to set the genome identifier string or a RangesList.
: names(x), names(x) <- value: Get or set the names of the features to retrieve. If NULL, this filter is disabled.
: ucscSchema(x): Get the UCSCSchema object describing the selected table.

Details

There are five supported fields for a table query:

session: The UCSCSession instance from which the tables are retrieved. Although all sessions are based on the same database, the set of user-uploaded tracks, which are represented as tables, generally differs between sessions.
trackName: The name of a track from which to retrieve a table. Each track can have multiple tables. Many times there is a primary table that is used to display the track, while the other tables are supplemental. Sometimes, tracks are displayed by aggregating multiple tables.
tableName: The name of the specific table to retrieve. May be NULL, in which case the behavior depends on how the query is executed, see below.
range: A genome identifier, a GRanges or a RangesList indicating the portion of the table to retrieve, in genome coordinates. Simply specifying the genome string is the easiest way to download data for the entire genome, and GRangesForUCSCGenome facilitates downloading data for e.g. an entire chromosome.
names: Names/accessions of the desired features.
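
The three ways of specifying the range field can be sketched as follows. This assumes the rtracklayer version documented here and a live connection to UCSC. The genome, track and coordinates are borrowed from the Examples section purely for illustration.

```r
library(rtracklayer)

session <- browserSession()
genome(session) <- "mm9"

# With no range given, the query covers genome(session).
query <- ucscTableQuery(session, "Conservation")

# Whole genome: set the genome identifier string.
range(query) <- "mm9"

# Whole chromosome: omit the ranges argument.
range(query) <- GRangesForUCSCGenome("mm9", "chr12")

# A specific interval on that chromosome.
range(query) <- GRangesForUCSCGenome("mm9", "chr12",
                                     IRanges(57795963, 57815592))
```

Narrower ranges are less likely to run into the response-size limits that apply to getTable.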

A common workflow for querying the UCSC database is to create an instance of UCSCTableQuery using the ucscTableQuery constructor, invoke tableNames to list the available tables for a track, and finally to retrieve the desired table either as a data.frame via getTable or as a track via track. See the examples.

The reason for a formal query class is to facilitate multiple queries that differ only slightly. For example, one might want to query multiple tables within the same track and/or genomic region, or query the same table for multiple regions. The UCSCTableQuery instance can be incrementally adjusted for each new query. Some caching is also performed, which improves performance.
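
The incremental adjustment described above might look like the following sketch. It assumes the rtracklayer version documented here and a live connection to UCSC. The second interval is a hypothetical region chosen only to show the pattern.

```r
library(rtracklayer)

session <- browserSession()
genome(session) <- "mm9"

region <- GRangesForUCSCGenome("mm9", "chr12",
                               IRanges(57795963, 57815592))
query <- ucscTableQuery(session, "Conservation", region)

# Same track and region, two different tables.
tableName(query) <- "phastCons30way"
scores <- track(query)            # GRanges

tableName(query) <- "multiz30waySummary"
alignment <- getTable(query)      # data.frame

# Same table, different region (hypothetical coordinates).
range(query) <- GRangesForUCSCGenome("mm9", "chr12",
                                     IRanges(57815593, 57835000))
alignment_next <- getTable(query)
```

Only the field that changes is reassigned between calls. The session, track and any remaining fields carry over from the previous query.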

Examples

## Not run: 
# session <- browserSession()
# genome(session) <- "mm9"
# trackNames(session) ## list the track names
# ## choose the Conservation track for a portion of mm9 chr12
# query <- ucscTableQuery(session, "Conservation",
#                         GRangesForUCSCGenome("mm9", "chr12",
#                                              IRanges(57795963, 57815592)))
# ## list the table names
# tableNames(query)
# ## get the phastCons30way track
# tableName(query) <- "phastCons30way"
# ## retrieve the track data
# track(query)  # a GRanges object
# ## get a data.frame summarizing the multiple alignment
# tableName(query) <- "multiz30waySummary"
# getTable(query)
# 
# genome(session) <- "hg18"
# query <- ucscTableQuery(session, "snp129",
#                         names = c("rs10003974", "rs10087355", "rs10075230"))
# ucscSchema(query)
# getTable(query)
# ## End(Not run)
