io: Import and export

Description

The functions import and export load and save objects from and to particular file formats. The rtracklayer package implements support for a number of annotation and sequence formats.

Usage

export(object, con, format, ...)
import(con, format, text, ...)

Arguments

object

The object to export.

con

The connection from which data is loaded or to which data is saved. If this is a character vector, it is assumed to be a filename and a corresponding file connection is created and then closed after exporting the object. If it is an RTLFile derivative, the data is loaded from or saved to the underlying resource. If missing, the function will return the output as a character vector, rather than writing to a connection.

format

The format of the input or output. If missing and con is a filename, the format is derived from the file extension. This argument is unnecessary when con is a derivative of RTLFile.
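
As a rough illustration of how the format is resolved, the sketch below writes a small BED file under a .txt extension, so the format cannot be derived from the file name, and then imports it two ways: by naming the format explicitly, and by wrapping the path in BEDFile, one of the RTLFile subclasses. The file contents and the temporary path are made up for the example.

```r
library(rtracklayer)

# Hypothetical two-feature BED file saved with an uninformative extension.
path <- tempfile(fileext = ".txt")
writeLines(c("chr1\t10\t20\tfeatA", "chr1\t100\t120\tfeatB"), path)

# Name the format explicitly, since ".txt" does not identify it.
by_format <- import(path, format = "bed")

# Or wrap the path in an RTLFile subclass; no format argument is needed.
by_class <- import(BEDFile(path))

# BED starts are 0-based, so they appear shifted by one in the result.
start(by_format)
```

Wrapping the path in a file class is convenient when the same file is passed around between functions, because the format travels with the object.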

text

If con is missing, this can be a character vector directly providing the string data to import.

...

Parameters to pass to the format-specific method.

Value

For export, if con is missing, a character vector containing the string output; otherwise, nothing is returned. For import, the loaded object, whose class depends on the format.
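
The string-based mode can be sketched as a round trip: export without con to obtain the formatted lines, then hand those lines back to import through the text argument. The intervals, names, and scores here are invented for illustration.

```r
library(rtracklayer)

# Made-up intervals with a name and a score column.
gr <- GRanges("chr1", IRanges(start = c(11, 101), end = c(20, 120)),
              name = c("a", "b"), score = c(1, 2))

# With con missing, export() returns the formatted lines instead of writing.
bed_lines <- export(gr, format = "bed")

# The same lines can be parsed back through the text argument.
roundtrip <- import(text = bed_lines, format = "bed")
roundtrip
```

Because no filename is involved, the format argument has to be given explicitly in both calls.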

Details

The rtracklayer package supports a number of file formats for representing annotated genomic intervals. These are each represented as a subclass of RTLFile. Below, we list the major supported formats, with some advice for when a particular file format is appropriate:

GFF: The General Feature Format is meant to represent any set of genomic features, with application-specific columns represented as “attributes”. There are three principal versions (1, 2, and 3). This is a good format for interoperating with other genomic tools and is the most flexible format, in that a feature may have any number of attributes (in version 2 and above). Version 3 (GFF3) is the preferred version. Its specification lays out conventions for representing various types of data, including gene models, for which it is the format of choice. For variants, rtracklayer has rudimentary support for an extension of GFF3 called GVF. UCSC supports GFF1, but it needs to be encapsulated in the UCSC metaformat, i.e. export.ucsc(subformat = "gff1"). The BED format is typically preferred over GFF for interaction with UCSC. GFF files can be indexed with the tabix utility for fast range-based queries via rtracklayer and Rsamtools.
BED: The Browser Extensible Data format is for displaying qualitative tracks in a genome browser, in particular UCSC. It finds a good balance between simplicity and expressiveness. It is much simpler than GFF and yet can still represent multi-exon gene structures. It is somewhat limited by its lack of the attribute support of GFF. To circumvent this, many tools and organizations have extended BED with additional columns. These are not officially valid BED files, and as such rtracklayer does not yet support them (this will be addressed soon). The rtracklayer package does support two official extensions of BED, Bed15 and bedGraph, and the unofficial BEDPE format (see below). BED files can be indexed with the tabix utility for fast range-based queries via rtracklayer and Rsamtools.
Bed15: An extension of BED with 15 columns, Bed15 is meant to represent data from microarray experiments. Multiple samples/columns are supported, and the data is displayed in UCSC as a compact heatmap. Few other tools support this format. With 15 columns per feature, this format is probably too verbose for e.g. ChIP-seq coverage (use multiple BigWig tracks instead).
bedGraph: A variant of BED that represents a score column more compactly than BED and especially Bed15, although only one sample is supported. The data is displayed in UCSC as a bar or line graph. For large data (the typical case), BigWig is preferred.
BEDPE: A variant of BED that represents pairs of genomic regions, such as interaction data or chromosomal rearrangements. The data cannot be displayed in UCSC directly but can be represented using the BED12 format.
WIG: The Wiggle format is meant for storing dense numerical data, such as window-based GC and conservation scores. The data is displayed in UCSC as a bar or line graph. The WIG format only works for intervals with a uniform width. For non-uniform widths, consider bedGraph. For large data, consider BigWig.
BigWig: The BigWig format is a binary version of both bedGraph and WIG (which are now somewhat obsolete). A BigWig file contains a spatial index for fast range-based queries and also embeds summary statistics of the scores at several zoom levels. Thus, it is ideal for visualization of and parallel computing on genome-scale vectors, like the coverage from a high-throughput sequencing experiment.
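
The tabix-based range queries mentioned for GFF and BED can be sketched as follows. This assumes that the BED export method accepts index = TRUE to compress and index its output through Rsamtools, and that the compressed file is written next to the requested path with a .bgz suffix; both points should be checked against the installed version of the package. The intervals are invented.

```r
library(rtracklayer)

# Made-up features on two sequences.
gr <- GRanges(c("chr1", "chr1", "chr2"),
              IRanges(start = c(11, 101, 1), end = c(20, 120, 10)),
              name = c("a", "b", "c"))

bed_path <- tempfile(fileext = ".bed")

# Assumption: index = TRUE bgzips the output and builds a tabix index.
export(gr, bed_path, index = TRUE)
bgz_path <- paste0(bed_path, ".bgz")  # assumed name of the compressed file

# Load only the features overlapping the query range.
query <- GRanges("chr1", IRanges(1, 50))
hits <- import(bgz_path, which = query)
hits
```

Indexing sorts the features by sequence name and start, so the order in the indexed file may differ from the order in the exported object.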

In summary, for the typical use case of combining gene models with experimental data, GFF is preferred for gene models and BigWig is preferred for quantitative score vectors. Note that the Rsamtools package provides support for the BAM file format (for representing read alignments), among others. Based on this, the rtracklayer package provides an export method for writing GAlignments and GappedReads objects as BAM. For variants, consider VCF, supported by the VariantAnnotation package.
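
For the quantitative side of that recommendation, a minimal BigWig sketch follows. It assumes BigWig support is available on the current platform, and it uses invented scores on a toy sequence named chr1. The object must carry a score column and known sequence lengths, and its ranges must not overlap.

```r
library(rtracklayer)

# Made-up coverage-like scores over disjoint intervals.
cov_gr <- GRanges("chr1", IRanges(start = c(1, 11, 31), end = c(10, 30, 50)),
                  score = c(1, 2.5, 0))

# BigWig export needs the length of every sequence in the object.
seqlengths(cov_gr) <- c(chr1 = 50)

bw_path <- tempfile(fileext = ".bw")
export(cov_gr, bw_path)

# The embedded index allows loading just the region of interest.
region <- GRanges("chr1", IRanges(1, 25))
hits <- import(bw_path, which = region)
score(hits)
```

Restricting the import with which avoids reading the whole file, which is the main practical advantage over WIG and bedGraph for genome-scale data.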

There is also support for reading and writing biological sequences, including the UCSC TwoBit format for compactly storing a genome sequence along with a mask. The files are binary, so they are efficiently queried for particular ranges. A similar format is FA, supported by Rsamtools.
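
A small sketch of the sequence support, using a made-up ten-base sequence named chrA: the sequence is written to a TwoBit file and a sub-range is then read back. The Biostrings package is loaded only to construct the input.

```r
library(rtracklayer)
library(Biostrings)

# A tiny invented "genome" with one named sequence.
genome <- DNAStringSet(c(chrA = "ACGTACGTAC"))

twobit_path <- tempfile(fileext = ".2bit")
export(genome, twobit_path)

# Query positions 3 to 6 of chrA without loading the whole file.
region <- GRanges("chrA", IRanges(3, 6))
piece <- import(TwoBitFile(twobit_path), which = region)
as.character(piece)
```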

Examples


  library(rtracklayer)
  track <- import(system.file("tests", "v1.gff", package = "rtracklayer"))
  ## Not run: export(track, "my.gff", version = "3")
  ## equivalently,
  ## Not run: export(track, "my.gff3")
  ## or
  ## Not run:
  ##   con <- file("my.gff3")
  ##   export(track, con, "gff3")
  ##   close(con)
  ## End(Not run)
  ## or as a string
  export(track, format = "gff3")
