A prokaryotic Open Reading Frame (ORF) is defined as a subsequence starting with a start-codon
(ATG, GTG or TTG), followed by an integer number of triplets (codons), and ending with a stop-codon (TAA,
TGA or TAG, unless an alternative trans.tab is selected, see below). This function will locate all ORFs in a genome.
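
The definition above can be sketched in base R. This is only an illustration, not how this function is implemented: the helper is_orf is hypothetical, it looks at the forward strand only, and it assumes that no stop-codon occurs in-frame before the last codon.

```r
# Start- and stop-codons as listed in the text above
start.codons <- c("ATG", "GTG", "TTG")
stop.codons  <- c("TAA", "TGA", "TAG")

# Hypothetical helper: TRUE if dna is a complete ORF on the forward strand
is_orf <- function(dna, starts = start.codons, stops = stop.codons) {
  dna <- toupper(dna)
  n <- nchar(dna)
  # Need at least a start- and a stop-codon, and a whole number of triplets
  if (n < 6 || n %% 3 != 0) return(FALSE)
  # Split into consecutive codons
  codons <- substring(dna, seq(1, n, by = 3), seq(3, n, by = 3))
  k <- length(codons)
  first.is.start <- codons[1] %in% starts
  last.is.stop   <- codons[k] %in% stops
  # Assumption: an ORF has no in-frame stop before its final codon
  no.inner.stop  <- !any(codons[-k] %in% stops)
  first.is.start && last.is.stop && no.inner.stop
}

is_orf("ATGAAATAA")     # start, one codon, stop
is_orf("ATGTAAAAATAA")  # in-frame stop before the end
```
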
The argument genome will typically have several sequences (chromosomes/plasmids/scaffolds/contigs).
It is vital that the first token (characters before first space) of every genome$Header is
unique, since this will be used to identify these genome sequences in the output.
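
A quick way to verify the uniqueness requirement before searching for ORFs is sketched below in base R. The header texts and the helper first_token are made up for illustration; in practice the vector would be the Header column of the genome argument.

```r
# Hypothetical header lines; the third one re-uses the token "contig_1"
headers <- c("contig_1 length=5000",
             "contig_2 length=3200",
             "contig_1 plasmid")

# First token = all characters before the first space
first_token <- function(x) sub(" .*$", "", x)

tokens <- first_token(headers)
# Tokens that occur more than once
dup <- unique(tokens[duplicated(tokens)])
if (length(dup) > 0) {
  message("Non-unique first tokens: ", paste(dup, collapse = ", "))
}
```
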
An alternative translation table may be specified; as of now the only alternative implemented is table 4.
With this table, codon TGA is no longer a stop, but codes for tryptophan. This coding is used by some bacteria
(e.g. Mycoplasma, Mesoplasma).
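
The effect of table 4 on where a reading frame ends can be illustrated with a small base R sketch. The helpers stop_codons and first_stop are hypothetical and are not part of this function; the sketch assumes any table other than 4 uses the three standard stop-codons.

```r
# Hypothetical helper: the stop-codons in use for a given translation table
stop_codons <- function(trans.tab) {
  if (trans.tab == 4) c("TAA", "TAG") else c("TAA", "TGA", "TAG")
}

# Hypothetical helper: index of the first in-frame stop-codon (NA if none)
first_stop <- function(dna, trans.tab) {
  n <- nchar(dna)
  codons <- substring(dna, seq(1, n - 2, by = 3), seq(3, n, by = 3))
  match(TRUE, codons %in% stop_codons(trans.tab))
}

dna <- "ATGTGAAAATAA"   # codons: ATG TGA AAA TAA
first_stop(dna, 11)     # TGA terminates the frame at codon 2
first_stop(dna, 4)      # TGA is read through, the frame ends at codon 4
```
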
Note that for any given stop-codon there are usually multiple start-codons in the same reading
frame. This function will return all, i.e. the same stop position may appear multiple times. If
you want ORFs with the most upstream start-codon only (LORFs), then filter the output from this function
with lorfs.
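
To make the idea of a LORF concrete, here is a base R sketch that keeps, for every stop position, only the ORF with the most upstream start. It is an illustration under assumptions (a made-up table with GFF-style columns Seqid, Strand, Start and End, and a hypothetical helper keep_longest), not the implementation of lorfs, which remains the recommended way to do this.

```r
# Made-up ORF table; rows 1-2 share a stop on the + strand,
# rows 4-5 share a stop on the - strand
orfs <- data.frame(
  Seqid  = c("c1", "c1", "c1", "c1", "c1"),
  Strand = c("+", "+", "+", "-", "-"),
  Start  = c(10, 40, 100, 200, 200),
  End    = c(99, 99, 189, 259, 289),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the longest ORF per stop position
keep_longest <- function(tbl) {
  # The stop-codon sits at End on the + strand and at Start on the - strand
  stop.pos <- ifelse(tbl$Strand == "+", tbl$End, tbl$Start)
  key <- paste(tbl$Seqid, tbl$Strand, stop.pos)
  len <- tbl$End - tbl$Start + 1
  # For a given stop, the longest ORF has the most upstream start
  tbl[len == ave(len, key, FUN = max), ]
}

keep_longest(orfs)
```
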
By default the genome sequences are assumed to be linear, i.e. contigs or other incomplete fragments
of a genome. In such cases there will usually be some truncated ORFs at each end, i.e. ORFs where either
the start- or the stop-codon is lacking. In the gff.table returned by this function this is marked in the
Attributes column. The texts "Truncated=10" or "Truncated=01" indicate truncation at
the Start or End, respectively. If the supplied genome is a completed genome, with
circular chromosome/plasmids, set the flag circular=TRUE and no truncated ORFs will be listed.
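
The truncation marks can be picked out of the Attributes column with plain pattern matching, as in the base R sketch below. The attribute strings here are made up for illustration (including the "Truncated=00" value assumed for complete ORFs); real Attributes texts may contain other fields as well.

```r
# Made-up Attributes texts for three ORFs
attr.txt <- c("Truncated=00", "Truncated=10", "Truncated=01")

# TRUE where the ORF is truncated at the Start or at the End
trunc.start <- grepl("Truncated=10", attr.txt, fixed = TRUE)
trunc.end   <- grepl("Truncated=01", attr.txt, fixed = TRUE)

# Indices of ORFs with no truncation mark
complete <- which(!trunc.start & !trunc.end)
```
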
In cases where an ORF runs across the origin of a circular genome sequence, the Stop coordinate will be
larger than the length of the genome sequence. This is in line with the specifications of the GFF3 format, where
a Start cannot be larger than the corresponding End.
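
If the actual position on the circular sequence is needed, a coordinate beyond the sequence length can be wrapped back with modular arithmetic. The base R sketch below uses made-up numbers and a hypothetical helper wrap_coord.

```r
# Hypothetical helper: map a 1-based coordinate back onto a circular sequence
wrap_coord <- function(pos, seq.length) ((pos - 1) %% seq.length) + 1

seq.length <- 1000   # made-up length of a circular genome sequence
orf.start  <- 950    # made-up ORF running across the origin
orf.end    <- 1039   # larger than seq.length, as described above

wrap_coord(orf.end, seq.length)    # position of the ORF end after wrapping
wrap_coord(orf.start, seq.length)  # coordinates within range are unchanged
```

In this example the ORF covers positions 950 to 1000 and then 1 to 39 of the sequence.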