AlignTranslation: Align Sequences By Their Amino Acid Translation

Description

Performs alignment of a set of DNA or RNA sequences by aligning their corresponding amino acid sequences.

Usage

AlignTranslation(myXStringSet, sense = "+", direction = "5' to 3'", readingFrame = NA, asAAStringSet = FALSE, geneticCode = GENETIC_CODE, ...)

Arguments

myXStringSet

A DNAStringSet or RNAStringSet object of unaligned sequences.

sense

Single character specifying the sense of the input sequences, either the positive ("+") coding strand or the negative ("-") non-coding strand.

direction

Direction of the input sequences, either "5' to 3'" or "3' to 5'".

readingFrame

Numeric vector giving a single reading frame for all of the sequences, or an individual reading frame for each sequence in myXStringSet. The readingFrame can be 1, 2, or 3 to begin translating at the first, second, or third nucleotide position, respectively, or NA (the default) to guess the reading frame. (See the Details section below.)

asAAStringSet

Logical determining whether to return the aligned translation as an AAStringSet rather than the input type. Incomplete starting and ending codons will be translated into the mask character ("+").

geneticCode

Named character vector in the same format as GENETIC_CODE (the default), which represents the standard genetic code.

...

Further arguments to be passed directly to AlignSeqs, including gapOpening, gapExtension, gapPower, terminalGap, restrict, anchor, normPower, substitutionMatrix, structureMatrix, guideTree, iterations, refinements, structures, FUN, and levels.

Value

An XStringSet matching the input type, or an AAStringSet of the aligned translation if asAAStringSet is TRUE.

Details

Alignment of proteins is often more accurate than alignment of their coding nucleic acid sequences. This function aligns the input nucleic acid sequences by aligning their translated amino acid sequences. First, the input sequences are translated according to the specified sense, direction, and readingFrame. The resulting amino acid sequences are then aligned using AlignSeqs, and this alignment is reverse translated into the original sequence type, sense, and direction. Besides the gain in accuracy, aligning translations ensures that the reading frame is maintained in the nucleotide sequences, because gaps are only introduced in whole codons.
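
The reverse translation step can be pictured with a small base R sketch. This is an illustration only, not DECIPHER's implementation: threadCodons is a hypothetical helper, and it assumes the nucleotide sequence starts in reading frame 1, has a length that is a multiple of three, and that gaps are written as "-".

```r
# Hypothetical helper (not part of DECIPHER): thread an unaligned nucleotide
# sequence onto its aligned amino acid sequence, writing one codon per
# residue and three gap characters per amino acid gap.
threadCodons <- function(nt, alignedAA, gap = "-") {
	# split the nucleotide sequence into consecutive codons (frame 1 assumed)
	starts <- seq(1, nchar(nt) - 2, by = 3)
	codons <- substring(nt, starts, starts + 2)
	# one element per alignment column
	aa <- strsplit(alignedAA, "")[[1]]
	isGap <- aa == gap
	# every non-gap residue must correspond to exactly one codon
	stopifnot(sum(!isGap) == length(codons))
	out <- rep(strrep(gap, 3), length(aa))
	out[!isGap] <- codons
	paste(out, collapse = "")
}

# the short sequence from the Examples section, aligned as MF-TP*
threadCodons("AUGUUUACCCCAUAA", "MF-TP*") # "AUGUUU---ACCCCAUAA"
```

Because each amino acid gap expands to exactly three nucleotide gaps, the codon boundaries of every sequence stay in register across the alignment.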

If the readingFrame is NA (the default) then an attempt is made to guess the reading frame of each sequence based on the number of stop codons in the translated amino acids. For each sequence, the first reading frame (1, 2, or 3) whose translation contains no stop codons, other than in the last position, is chosen. If the number of stop codons is inconclusive for a sequence then the reading frame defaults to 1. The entire length of each sequence is translated regardless of any stop codons identified. Note that this method is only useful when each input is a substantially long coding sequence with at most a single stop codon, expected in the final position. It is therefore preferable to specify the reading frame of each sequence if it is known.
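
The heuristic described above can be sketched in base R as follows. This is a simplified illustration rather than DECIPHER's own code: guessFrame is a hypothetical helper, it assumes the standard genetic code's stop codons (TAA, TAG, and TGA), and it treats "inconclusive" as meaning that every frame contains an internal stop codon.

```r
# Hypothetical helper (not part of DECIPHER): return the first reading frame
# without stop codons, ignoring a stop codon in the final position.
guessFrame <- function(nt, stops = c("TAA", "TAG", "TGA")) {
	nt <- chartr("U", "T", toupper(nt)) # treat RNA like DNA
	for (frame in 1:3) {
		n <- (nchar(nt) - frame + 1) %/% 3 # number of complete codons
		if (n < 1)
			next
		starts <- frame + 3*(seq_len(n) - 1)
		codons <- substring(nt, starts, starts + 2)
		# a stop codon is only tolerated as the last complete codon
		if (!any(head(codons, -1) %in% stops))
			return(frame)
	}
	1 # inconclusive: fall back to the first frame
}

guessFrame("AUGUUCAUCACCCCCUAA") # 1
guessFrame("CAUGUUUAACCCCUAA") # 2 (frame 1 reads CAU GUU UAA ...)
```

Short sequences often have several frames free of stop codons, in which case the first such frame wins regardless of which one is biologically correct. This is why supplying readingFrame is preferable whenever the frame is known.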

References

ES Wright (2015) "DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment". BMC Bioinformatics, doi:10.1186/s12859-015-0749-z.

Examples


library(DECIPHER)

# first three sequences translate to  MFITP*
# and the last sequence translates as MF-TP*
rna <- RNAStringSet(c("AUGUUCAUCACCCCCUAA", "AUGUUCAUAACUCCUUGA",
	"AUGUUCAUUACACCGUAG", "AUGUUUACCCCAUAA"))
RNA <- AlignSeqs(rna, verbose=FALSE)
RNA

RNA <- AlignTranslation(rna, verbose=FALSE)
RNA

AA <- AlignTranslation(rna, asAAStringSet=TRUE, verbose=FALSE)
AA

# example of aligning many protein coding sequences:
fas <- system.file("extdata", "50S_ribosomal_protein_L2.fas", package="DECIPHER")
dna <- readDNAStringSet(fas)
DNA <- AlignTranslation(dna) # align the translation then reverse translate
DNA
