import(con, format, text, version = c("", "1", "2", "3"), genome = NA, colnames = NULL, which = NULL, feature.type = NULL, sequenceRegionsAsSeqinfo = FALSE)
import.gff(con, ...)
import.gff1(con, ...)
import.gff2(con, ...)
import.gff3(con, ...)
export(object, con, format, ...)
export(object, con, format, version = c("1", "2", "3"), source = "rtracklayer", append = FALSE, index = FALSE)
export.gff(object, con, ...)
export.gff1(object, con, ...)
export.gff2(object, con, ...)
export.gff3(object, con, ...)
con: A path, URL, connection or GFFFile object. For the functions
ending in .gff, .gff1, etc., the file format is indicated by the
function name. For the base export and import functions, the format
must be indicated another way. If con is a path, URL or connection,
either the file extension or the format argument needs to be one of
“gff”, “gff1”, “gff2”, “gff3”, “gvf”, or “gtf”. Compressed files
(“gz”, “bz2” and “xz”) are handled transparently.
object: The object to export, which should be a GRanges or something
coercible to a GRanges. If the object has a method for asGFF, it is
called prior to coercion. This makes it possible to export a
GRangesList or TxDb in a way that preserves the hierarchical
structure. For exporting multiple tracks in the UCSC track line
metaformat, pass a GenomicRangesList, or something coercible to one.
version: The GFF version, one of “” (for import only; taken from the
gff-version directive in the file, or “1” if none), “1”, “2” or “3”.
text: If con is missing, a character vector to use as the input.
genome: The identifier of a genome, or NA if unknown. Typically, this
is a UCSC identifier like “hg19”. An attempt will be made to derive
the seqinfo on the return value using either an installed BSgenome
package or UCSC, if network access is available.
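colnames: A character vector naming the columns to parse. These should
name either fixed fields, like source or type, or, for GFF2 and GFF3,
any attribute.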
which: A GRanges or other range-based object supported by
findOverlaps. Only the intervals in the file overlapping the given
ranges are returned. This is much more efficient when the file is
indexed with the tabix utility.
feature.type: NULL (the default) or a character vector of valid
feature types. If not NULL, then only the features of the specified
type(s) are imported.
sequenceRegionsAsSeqinfo: If TRUE, attempt to infer the Seqinfo
(seqlevels and seqlengths) from the “##sequence-region” directives as
specified by GFF3.
index: If TRUE, automatically compress and index the output file with
bgzf and tabix. Note that tabix indexing will sort the data by
chromosome and start. Tabix supports a single track in a file.
append: If TRUE, and con points to a file path, the data is appended
to the file. If con is a connection, the data is always appended.
...: Arguments to pass down to methods. For import, the flow
eventually reaches the GFFFile method on import. When trackLine is
TRUE or the target format is BED15, the arguments are passed through
export.ucsc, so track line parameters are supported.
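The way the format is resolved from the file extension or the format argument can be sketched as follows. This is a minimal illustration, assuming rtracklayer is installed; the records and the file names (demo.gff3, demo.txt, demo.gff3.gz) are made up for the example and do not ship with the package.

```r
suppressPackageStartupMessages(library(rtracklayer))

# Hypothetical two-record GFF3 file, built by hand for illustration.
gff_lines <- c(
  "##gff-version 3",
  paste("chr1", "demo", "gene", "100", "200", ".", "+", ".",
        "ID=gene1", sep = "\t"),
  paste("chr1", "demo", "exon", "100", "150", ".", "+", ".",
        "ID=exon1;Parent=gene1", sep = "\t")
)

# Case 1: the ".gff3" extension identifies the format.
by_ext <- file.path(tempdir(), "demo.gff3")
writeLines(gff_lines, by_ext)
gr_ext <- import(by_ext)

# Case 2: the extension is uninformative, so 'format' is given explicitly.
by_arg <- file.path(tempdir(), "demo.txt")
writeLines(gff_lines, by_arg)
gr_arg <- import(by_arg, format = "gff3")

# Case 3: gzip-compressed input is read without any extra argument.
by_gz <- file.path(tempdir(), "demo.gff3.gz")
gz_con <- gzfile(by_gz, "w")
writeLines(gff_lines, gz_con)
close(gz_con)
gr_gz <- import(by_gz)

gr_ext
```

All three calls should yield the same two ranges; only the way the format is communicated differs.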
A GRanges with the metadata columns described in the details.
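To illustrate how feature.type and colnames shape the returned GRanges, here is a small sketch. It assumes rtracklayer is installed; the file filter_demo.gff3 and its records are hypothetical and exist only for this example.

```r
suppressPackageStartupMessages(library(rtracklayer))

# Hypothetical GFF3 records: one gene with two exons.
gff_lines <- c(
  "##gff-version 3",
  paste("chr1", "demo", "gene", "100", "200", ".", "+", ".",
        "ID=gene1", sep = "\t"),
  paste("chr1", "demo", "exon", "100", "150", ".", "+", ".",
        "ID=exon1;Parent=gene1", sep = "\t"),
  paste("chr1", "demo", "exon", "160", "200", ".", "+", ".",
        "ID=exon2;Parent=gene1", sep = "\t")
)
demo_path <- file.path(tempdir(), "filter_demo.gff3")
writeLines(gff_lines, demo_path)

# Keep only exon records, and parse only the 'type' field
# and the 'ID' attribute into metadata columns.
exons <- import(demo_path, feature.type = "exon",
                colnames = c("type", "ID"))

exons
colnames(mcols(exons))
```

Restricting colnames keeps the result small when a file carries many attributes that are not needed downstream.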
The GFFFile class extends RTLFile and is a formal representation of a
resource in the GFF format. To cast a path, URL or connection to a
GFFFile, pass it to the GFFFile constructor. The GFF1File, GFF2File,
GFF3File, GVFFile and GTFFile classes all extend GFFFile and indicate
a particular version of the format. GFFFile has the following utility
methods:
genome: Gets the genome identifier from the “genome-build” header
directive.
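The relationship between the constructors can be sketched as below, assuming rtracklayer is installed. The file class_demo.gff3 is a hypothetical one-record file written just for this example.

```r
suppressPackageStartupMessages(library(rtracklayer))

# Hypothetical one-record GFF3 file.
demo_path <- file.path(tempdir(), "class_demo.gff3")
writeLines(c(
  "##gff-version 3",
  paste("chr1", "demo", "gene", "100", "200", ".", "+", ".",
        "ID=gene1", sep = "\t")
), demo_path)

# Generic wrapper: no version is fixed when the object is created.
f_generic <- GFFFile(demo_path)

# Version-specific wrapper: the class itself records the GFF version.
f_v3 <- GFF3File(demo_path)

# Both are GFFFile objects, so import() dispatches to the same method.
is(f_generic, "GFFFile")
is(f_v3, "GFFFile")

gr_generic <- import(f_generic)
gr_v3 <- import(f_v3)
gr_v3
```

Wrapping a path in a version-specific class is a way to state the format once, rather than repeating a format or version argument at every call.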
GFF is distinguished from the simpler BED format by its flexible
attribute support and its hierarchical structure, as specified by the
group column in GFF1 (only one level of grouping) and the Parent
attribute in GFF3. GFF2 does not specify a convention for representing
hierarchies, although its GTF extension provides this for gene
structures. The combination of support for hierarchical data and
arbitrary descriptive attributes makes GFF(3) the preferred format for
representing gene models.
Although GFF features a score column, large quantitative data belong
in a format like BigWig and alignments from high-throughput
experiments belong in BAM. For variants, the VCF format (supported by
the VariantAnnotation package) seems to be more widely adopted than
the GVF extension.
A note on the UCSC track line metaformat: track lines are a means for
passing hints to visualization tools like the UCSC Genome Browser and
the Integrated Genome Browser (IGB), and they allow multiple tracks to
be concatenated in the same file. Since GFF is not a UCSC format, it
is not common to annotate GFF data with track lines, but rtracklayer
still supports it. To export or import GFF data in the track line
format, call export.ucsc or import.ucsc.
The following is the mapping of GFF elements to a GRanges object.
NA values are allowed only where indicated. These appear as a “.” in
the file. GFF requires that all columns are included, so export
generates defaults for missing columns.
seqid, start, end: the ranges component.
source: the source column; defaults to “rtracklayer” on export.
type: the type column; defaults to “sequence_feature” in the output,
i.e., SO:0000110.
score: the score column, accessible via the score accessor; defaults
to NA upon export.
strand: the strand column, accessible via the strand accessor;
defaults to NA upon export.
phase: defaults to NA upon export.
group (GFF1): defaults to the seqid (e.g., chromosome) on export.
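The export defaults listed above can be inspected directly by writing a GRanges that has no metadata columns and reading the raw text back. This is a sketch assuming rtracklayer is installed; the coordinates and the file name defaults.gff3 are hypothetical.

```r
suppressPackageStartupMessages(library(rtracklayer))

# Ranges with no source, type, score or phase information (made-up data).
gr <- GRanges("chr1", IRanges(start = c(100, 300), end = c(200, 400)))

out_path <- file.path(tempdir(), "defaults.gff3")
export(gr, out_path)

# Drop directive and comment lines, then split each record on tabs.
all_lines <- readLines(out_path)
records <- all_lines[!startsWith(all_lines, "#")]
fields <- do.call(rbind, strsplit(records, "\t", fixed = TRUE))

# Look at the eight fixed GFF columns only; attributes are ignored here.
fields <- fields[, 1:8, drop = FALSE]
colnames(fields) <- c("seqid", "source", "type", "start", "end",
                      "score", "strand", "phase")
fields
```

The source and type columns should carry the documented defaults, and columns without a value should be written as “.”.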
In GFF versions 2 and 3, attributes map to arbitrary columns in the
result. In GFF3, some attributes (Parent, Alias, Note, DBxref and
Ontology_term) can have multiple, comma-separated values; these
columns are thus always CharacterList objects.
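A short sketch of how multi-valued attributes look after import, assuming rtracklayer is installed. The records, the custom biotype attribute and the file name parents.gff3 are hypothetical and serve only to illustrate the point.

```r
suppressPackageStartupMessages(library(rtracklayer))

# Hypothetical gene model: one gene, two transcripts, one shared exon.
gff_lines <- c(
  "##gff-version 3",
  paste("chr1", "demo", "gene", "100", "500", ".", "+", ".",
        "ID=gene1;biotype=protein_coding", sep = "\t"),
  paste("chr1", "demo", "mRNA", "100", "500", ".", "+", ".",
        "ID=tx1;Parent=gene1", sep = "\t"),
  paste("chr1", "demo", "mRNA", "100", "400", ".", "+", ".",
        "ID=tx2;Parent=gene1", sep = "\t"),
  paste("chr1", "demo", "exon", "100", "200", ".", "+", ".",
        "ID=exon1;Parent=tx1,tx2", sep = "\t")
)
demo_path <- file.path(tempdir(), "parents.gff3")
writeLines(gff_lines, demo_path)

gr <- import(demo_path)

# Parent is list-like: one character vector per feature.
gr$Parent
n_parents <- lengths(gr$Parent)

# Features that belong to more than one parent (here, the shared exon).
shared <- gr[n_parents > 1]
shared$ID

# The custom attribute becomes an ordinary metadata column.
gr$biotype
```

Because Parent is a list column, a feature with several parents stays a single range instead of being duplicated once per parent.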
test_path <- system.file("tests", package = "rtracklayer")
test_gff3 <- file.path(test_path, "genes.gff3")
## basic import
test <- import(test_gff3)
test
## import.gff functions
import.gff(test_gff3)
import.gff3(test_gff3)
## GFFFile derivatives
test_gff_file <- GFF3File(test_gff3)
import(test_gff_file)
test_gff_file <- GFFFile(test_gff3)
import(test_gff_file)
test_gff_file <- GFFFile(test_gff3, version = "3")
import(test_gff_file)
## from connection
test_gff_con <- file(test_gff3)
test <- import(test_gff_con, format = "gff")
close(test_gff_con)
## various arguments
import(test_gff3, genome = "hg19")
import(test_gff3, colnames = character())
import(test_gff3, colnames = c("type", "geneName"))
## 'which'
which <- GRanges("chr10:90000-93000")
import(test_gff3, which = which)
## Not run:
# ## 'append'
# test_gff3_out <- file.path(tempdir(), "genes.gff3")
#
# export(test[seqnames(test) == "chr10"], test_gff3_out)
# export(test[seqnames(test) == "chr12"], test_gff3_out, append = TRUE)
# import(test_gff3_out)
#
# ## 'index'
# export(test, test_gff3_out, index = TRUE)
# test_gff3_gz <- paste0(test_gff3_out, ".gz")
# import(test_gff3_gz, which = which)
# ## End(Not run)