
microseq (version 1.3)

findOrfs: Finding ORFs in genomes

Description

Finds all ORFs in prokaryotic genome sequences.

Usage

findOrfs(genome, circular = F, trans.tab = 1)

Arguments

genome

A Fasta object with the genome sequence(s).

circular

Logical indicating whether the genome sequences are completed, circular sequences. The default is FALSE, i.e. sequences are treated as linear (see Details).

trans.tab

Translation table to use. The default is 1; the only alternative implemented is 4 (see Details).

Value

This function returns a gff.table, which is simply a data.frame with columns following the GFF3 format, see readGFF. To retrieve the ORF sequences, use gff2fasta.

Details

A prokaryotic Open Reading Frame (ORF) is defined as a subsequence starting with a start-codon (ATG, GTG or TTG), followed by an integer number of triplets (codons), and ending with a stop-codon (TAA, TGA or TAG under the default trans.tab = 1; see below for the alternative). This function locates all ORFs in a genome.
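
The definition above can be illustrated with a minimal base-R sketch. It is not the implementation used by findOrfs: the helper name find_orfs_forward is hypothetical, it scans only the forward strand of a single sequence, and it ignores truncated ORFs and circular sequences.

```r
# Hypothetical helper (not part of microseq): complete forward-strand ORFs
# in one DNA string, reported as 1-based Start/End coordinates.
find_orfs_forward <- function(dna,
                              starts = c("ATG", "GTG", "TTG"),
                              stops  = c("TAA", "TGA", "TAG")) {
  dna <- toupper(dna)
  n <- nchar(dna)
  out <- list()
  for (frame in 1:3) {
    if (n - frame + 1 < 3) next               # no complete codon in this frame
    pos <- seq(frame, n - 2, by = 3)          # first base of each codon
    codons <- substring(dna, pos, pos + 2)
    open.starts <- numeric(0)                 # start-codons seen since the last stop
    for (i in seq_along(codons)) {
      if (codons[i] %in% starts) {
        open.starts <- c(open.starts, pos[i])
      } else if (codons[i] %in% stops) {
        if (length(open.starts) > 0) {
          # every open start paired with this stop gives one ORF
          out[[length(out) + 1]] <- data.frame(Start = open.starts,
                                               End = pos[i] + 2)
        }
        open.starts <- numeric(0)
      }
    }
  }
  if (length(out) == 0) return(data.frame(Start = numeric(0), End = numeric(0)))
  do.call(rbind, out)
}

# Two start-codons (ATG and GTG) share one stop-codon (TAA) in the same frame
find_orfs_forward("CCATGAAAGTGCCCTAAGG")
```

Each start-codon found before a stop in the same reading frame yields its own row, so the same End value can appear more than once.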

The argument genome will typically have several sequences (chromosomes/plasmids/scaffolds/contigs). It is vital that the first token (characters before first space) of every genome$Header is unique, since this will be used to identify these genome sequences in the output.
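
A quick check for this uniqueness requirement could look like the sketch below. The helper names first_tokens and duplicated_tokens are hypothetical, and the example headers are made up for illustration; in practice the input would be genome$Header.

```r
# Hypothetical helpers (not part of microseq)

# First token of each header: everything before the first space
first_tokens <- function(header) {
  sub(" .*$", "", header)
}

# Tokens that occur more than once (empty vector if all are unique)
duplicated_tokens <- function(header) {
  tok <- first_tokens(header)
  unique(tok[duplicated(tok)])
}

# Made-up headers: "contig_1" is used twice
headers <- c("contig_1 length=1200", "contig_2 length=900", "contig_1 plasmid")
dup <- duplicated_tokens(headers)
if (length(dup) > 0) {
  message("Non-unique first tokens: ", paste(dup, collapse = ", "))
}
```

Running such a check before calling findOrfs makes it easier to spot sequences that could not be told apart in the output.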

An alternative translation table may be specified; as of now the only alternative implemented is table 4. Under table 4 the codon TGA is no longer a stop, but codes for tryptophan. This coding is used by some bacteria (e.g. Mycoplasma, Mesoplasma).
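
The effect of the translation table on the stop-codon set can be sketched as follows. The helper stop_codons is hypothetical and only mirrors the description above; raising an error for other table numbers is a choice made for this sketch, not documented behaviour of findOrfs.

```r
# Hypothetical helper (not part of microseq): stop-codons per translation table
stop_codons <- function(trans.tab = 1) {
  if (trans.tab == 1) {
    c("TAA", "TGA", "TAG")
  } else if (trans.tab == 4) {
    c("TAA", "TAG")                 # TGA codes for tryptophan under table 4
  } else {
    stop("only tables 1 and 4 are covered by this sketch")
  }
}

# Split a short sequence into codons and flag the stops under each table
codons <- substring("ATGTGATAA", c(1, 4, 7), c(3, 6, 9))   # "ATG" "TGA" "TAA"
codons %in% stop_codons(1)   # FALSE TRUE  TRUE
codons %in% stop_codons(4)   # FALSE FALSE TRUE
```

Under table 1 the reading frame closes at TGA, while under table 4 it continues to TAA, so the same sequence yields a longer ORF.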

Note that for any given stop-codon there are usually multiple start-codons in the same reading frame. This function returns all of them, i.e. the same stop position may appear multiple times. If you want ORFs with the most upstream start-codon only (LORFs), filter the output from this function with lorfs.
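
The idea behind LORF filtering can be sketched in base R. This is not the implementation of lorfs; the helper keep_lorfs and the example table are hypothetical. It assumes a table with Seqid, Strand, Start and End columns, and that on the negative strand the stop-codon sits at the Start coordinate, as is usual for GFF coordinates.

```r
# Hypothetical helper (not part of microseq): for ORFs sharing a stop-codon,
# keep only the longest one, i.e. the one with the most upstream start-codon.
keep_lorfs <- function(tbl) {
  # Position of the stop-codon: End on "+" strand, Start on "-" strand
  stop.pos <- ifelse(tbl$Strand == "+", tbl$End, tbl$Start)
  len <- tbl$End - tbl$Start + 1
  key <- paste(tbl$Seqid, tbl$Strand, stop.pos)
  keep <- len == ave(len, key, FUN = max)    # longest ORF within each group
  tbl[keep, , drop = FALSE]
}

# Made-up table: two nested ORFs on each strand
orfs <- data.frame(Seqid  = c("c1", "c1", "c1", "c1"),
                   Strand = c("+", "+", "-", "-"),
                   Start  = c(3, 9, 20, 20),
                   End    = c(17, 17, 40, 52))
keep_lorfs(orfs)
```

Grouping by sequence, strand and stop position keeps ORFs from different sequences or strands from being mixed up.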

By default the genome sequences are assumed to be linear, i.e. contigs or other incomplete fragments of a genome. In such cases there will usually be some truncated ORFs at each end, i.e. ORFs where either the start- or the stop-codon is lacking. In the gff.table returned by this function, truncation is marked in the Attributes column: the text "Truncated=10" indicates an ORF truncated at the Start, and "Truncated=01" one truncated at the End.

If the supplied genome is a completed genome with circular chromosome/plasmids, set circular=TRUE and no truncated ORFs will be listed. When an ORF runs across the origin of a circular genome sequence, its End coordinate will be larger than the length of the genome sequence. This is in line with the GFF3 specification, where a Start cannot be larger than the corresponding End.
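
The truncation tags and the circular coordinates described above can be handled with a few lines of base R. Both helpers below are hypothetical and only follow the description given here; the sequence length of 1000 in the usage lines is made up, and how Attributes looks for non-truncated ORFs is not assumed (entries without a recognised tag give NA).

```r
# Hypothetical helpers (not part of microseq)

# Decode "Truncated=10" / "Truncated=01" into two logical columns
truncation_flags <- function(attributes) {
  has.tag <- grepl("Truncated=[01]{2}", attributes)
  flag <- ifelse(has.tag,
                 sub(".*Truncated=([01]{2}).*", "\\1", attributes),
                 NA_character_)
  data.frame(TruncatedStart = substr(flag, 1, 1) == "1",
             TruncatedEnd   = substr(flag, 2, 2) == "1")
}

# Map an End coordinate beyond the sequence length back onto a circular sequence
wrap_end <- function(end, seq.length) {
  ((end - 1) %% seq.length) + 1
}

truncation_flags(c("Truncated=10", "Truncated=01"))
wrap_end(1010, 1000)   # an ORF ending 10 bases past the origin
```

Keeping End larger than the sequence length in the table, and wrapping it only when extracting sequence, preserves the GFF3 rule that Start never exceeds End.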

See Also

readGFF, gff2fasta, lorfs.

Examples

# NOT RUN {
# Load the package (path.package() requires it to be loaded)
library(microseq)

# Using a genome file in this package
xpth <- file.path(path.package("microseq"),"extdata")
genome.file <- file.path(xpth,"small_genome.fasta")

# Reading genome and finding orfs
genome <- readFasta(genome.file)
orf.tbl <- findOrfs(genome)

# Computing ORF-lengths
orf.lengths <- orfLength(orf.tbl)

# Filtering to retrieve the LORFs only
lorf.table <- lorfs(orf.tbl)

# }
