readGenBank: Read a GenBank File

Description

Read a GenBank file from a local file, or retrieve and read one based on an accession number. See Details for exact behavior.

Usage

readGenBank(file, text = readLines(file), partial = NA, ret.seq = TRUE, verbose = FALSE)

Arguments

file

character or GBAccession. The path to the file, or a GBAccession object containing Nuccore versioned accession numbers. Ignored if text is specified.

text

character. The text of the file. Defaults to the text within file.

partial

logical. If TRUE, features with non-exact boundaries will be included. Otherwise, non-exact features are excluded, with a warning if partial is NA (the default).

ret.seq

logical. Should an object containing the raw ORIGIN sequence be created and returned? Defaults to TRUE. If FALSE, the sequence slot is set to NULL. See NOTE.

verbose

logical. Should informative messages be printed to the console as the file is processed. Defaults to FALSE.
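
The three settings of partial described above can be illustrated with a small base R sketch. It is not genbankr's implementation: filter_partial is a hypothetical helper, and it assumes that "non-exact boundaries" corresponds to the '<' and '>' markers of GenBank location syntax.

```r
# Hypothetical helper (not part of genbankr): mimics the documented
# meaning of 'partial' on a character vector of location strings.
filter_partial <- function(locations, partial = NA) {
  # Assumption: '<' and '>' in a location mark a non-exact boundary.
  inexact <- grepl("[<>]", locations)
  # partial = TRUE keeps every feature, exact or not.
  if (isTRUE(partial)) return(locations)
  # partial = NA (the default) excludes non-exact features with a warning.
  if (is.na(partial) && any(inexact)) {
    warning(sum(inexact), " feature(s) with non-exact boundaries excluded")
  }
  # partial = FALSE excludes them silently.
  locations[!inexact]
}

locs <- c("100..200", "<1..50", "300..>450")
kept_all   <- filter_partial(locs, partial = TRUE)
kept_exact <- filter_partial(locs, partial = FALSE)
```

Calling filter_partial(locs) with the default NA returns the same result as partial = FALSE, but also signals a warning.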

Value

A GenBankRecord object containing most (see Details) of the information within file/text, or a list of GenBankRecord objects when a GBAccession vector with more than one ID is passed to file.

Details

If a GBAccession object is passed to file, the rentrez package is used to attempt to fetch full GenBank records for all IDs in the object.

GenBank files often do not contain exhaustive annotations. For example, files including CDS annotations often do not have separate transcript features. Furthermore, chromosomes are not always named, particularly in organisms that have only one. The details of how genbankr handles such cases are as follows:

In files where CDSs are annotated but individual exons are not, 'approximate exons' are defined as the individual contiguous elements within each CDS. Currently, no mixing of approximate and explicitly annotated exons is performed, even in cases where, e.g., exons are not annotated for some genes with CDS annotations.
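
As a rough illustration of what "individual contiguous elements within each CDS" means, the base R sketch below splits a CDS location string into its segments. approx_exons is a hypothetical helper, not genbankr's internal code, and it assumes every segment has the exact form start..end.

```r
# Hypothetical helper (not part of genbankr): split a GenBank CDS
# location string into its contiguous segments ("approximate exons").
# Assumes each segment is written as start..end with exact boundaries.
approx_exons <- function(location) {
  # Record the strand, then strip the complement()/join() wrappers.
  strand <- if (grepl("complement\\(", location)) "-" else "+"
  inner <- gsub("complement\\(|join\\(|\\)", "", location)
  # Each comma-separated piece is one contiguous element of the CDS.
  pieces <- strsplit(inner, ",", fixed = TRUE)[[1]]
  bounds <- strsplit(pieces, "..", fixed = TRUE)
  data.frame(
    start  = as.integer(vapply(bounds, `[`, character(1), 1)),
    end    = as.integer(vapply(bounds, `[`, character(1), 2)),
    strand = strand,
    stringsAsFactors = FALSE
  )
}

exons <- approx_exons("join(100..250,400..520,700..810)")
print(exons)
```

Here a CDS made of three joined segments yields three approximate exons, one row per segment.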

In files where transcripts are not present, 'approximate transcripts' defined by the ranges spanned by groups of exons are used. Currently, we do not support generating approximate transcripts from CDSs in files that contain actual transcript annotations, even if those annotations do not cover all genes with CDS/exon annotations.
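
The idea of a transcript "defined by the range spanned by a group of exons" can be sketched in base R as follows. The exon table and its column names are invented for illustration, and grouping by a gene column is an assumption; genbankr's actual grouping logic may differ.

```r
# Hypothetical exon table; the column names are illustrative only.
exons <- data.frame(
  gene  = c("geneA", "geneA", "geneA", "geneB", "geneB"),
  start = c(100L, 400L, 700L, 1500L, 1900L),
  end   = c(250L, 520L, 810L, 1650L, 2100L),
  stringsAsFactors = FALSE
)

# An approximate transcript runs from the smallest exon start
# to the largest exon end within each group.
tx_start <- tapply(exons$start, exons$gene, min)
tx_end   <- tapply(exons$end,   exons$gene, max)
approx_tx <- data.frame(
  gene  = names(tx_start),
  start = as.integer(tx_start),
  end   = as.integer(tx_end),
  stringsAsFactors = FALSE
)
print(approx_tx)
```

Each group of exons collapses to a single range, so intronic gaps between exons fall inside the approximate transcript.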

Features (gene, CDS, variant, etc.) are assumed to be contained within the most recent previous source feature (chromosome/physical piece of DNA). The chromosome name for source features (seqnames in the resulting GRanges/VRanges) is determined as follows:

the 'chromosome' attribute, as is (e.g., "chr1");
the 'strain' attribute, combined with an auto-generated count (e.g., "VR1814:1");
the 'organism' attribute, combined with an auto-generated count (e.g., "Human herpesvirus 5:1").
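
The naming rules above can be sketched in base R. This is only an illustration under two assumptions not confirmed by this page: that the three attributes are tried in the order listed, and that the auto-generated count is the position of the source feature in the file. source_seqnames is a hypothetical helper, not a genbankr function.

```r
# Hypothetical helper (not part of genbankr): choose a sequence name
# for each source feature. 'sources' is a list of named character
# vectors holding the attributes of each source feature.
source_seqnames <- function(sources) {
  vapply(seq_along(sources), function(i) {
    attrs <- sources[[i]]
    if ("chromosome" %in% names(attrs)) {
      attrs[["chromosome"]]               # used as is
    } else if ("strain" %in% names(attrs)) {
      paste0(attrs[["strain"]], ":", i)   # strain plus a count (assumed: index)
    } else {
      paste0(attrs[["organism"]], ":", i) # organism plus a count (assumed: index)
    }
  }, character(1))
}

sources <- list(
  c(organism = "Homo sapiens", chromosome = "chr1"),
  c(organism = "Human herpesvirus 5", strain = "VR1814"),
  c(organism = "Human herpesvirus 5")
)
seqs <- source_seqnames(sources)
print(seqs)
```

The sketch assumes every source feature carries an 'organism' attribute, so the final branch always has a value to fall back on.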

In files where no origin sequence is present, importing variation features is not currently supported, as there is no easy, self-contained way of determining the reference in those situations and the features themselves list only the alt allele. If variation features are present in a file without an origin sequence, those features are ignored with a warning.

Currently, some information from the header of a GenBank file, primarily reference- and author-based information, is not captured and returned. Please contact the maintainer if you have a direct use case for this type of information.

Examples

library(genbankr)
gb = readGenBank(system.file("sample.gbk", package="genbankr"))