AlignDB: Align Two Sets of Aligned Sequences in a Sequence Database

Description

Merges two separate sequence alignments in a database. The two sets of aligned sequences must either have distinct identifiers in the same table or be located in different database tables.

Usage

AlignDB(dbFile, tblName = "DNA", identifier = "", type = "DNAStringSet", add2tbl = "DNA", batchSize = 10000, perfectMatch = NULL, misMatch = NULL, gapOpening = NULL, gapExtension = NULL, terminalGap = -1, substitutionMatrix = NULL, processors = NULL, verbose = TRUE)

Arguments

dbFile

A SQLite connection object or a character string specifying the path to the database file.

tblName

Character string specifying the table(s) where the sequences are located. If two tblNames are provided then the sequences in both tables will be aligned.

identifier

Optional character string used to narrow the search results to those matching a specific identifier. If "" then all identifiers are selected. If two identifiers are provided then the set of sequences matching each identifier will be aligned.

type

The type of XStringSet being processed. This should be (an unambiguous abbreviation of) one of "AAStringSet", "DNAStringSet", or "RNAStringSet".

add2tbl

Character string specifying the table name in which to add the aligned sequences.

batchSize

Integer specifying the number of sequences to process at a time.

perfectMatch

Numeric giving the reward for aligning two matching nucleotides in the alignment, or NULL to determine the value based on input type (DNA/RNA/AA).

misMatch

Numeric giving the cost for aligning two mismatched nucleotides in the alignment, or NULL to determine the value based on input type (DNA/RNA/AA).

gapOpening

Numeric giving the cost for opening a gap in the alignment, or NULL to determine the value based on input type (DNA/RNA/AA).

gapExtension

Numeric giving the cost for extending an open gap in the alignment, or NULL to determine the value based on input type (DNA/RNA/AA).

terminalGap

Numeric giving the cost for allowing leading and trailing gaps in the alignment. Either two numbers, the first for leading gaps and the second for trailing gaps, or a single number for both.

substitutionMatrix

Either a substitution matrix representing the substitution scores for an alignment or the name of the amino acid substitution matrix to use in alignment. The latter may be one of the following: "BLOSUM45", "BLOSUM50", "BLOSUM62", "BLOSUM80", "BLOSUM100", "PAM30", "PAM40", "PAM70", "PAM120", "PAM250". The default (NULL) will use the perfectMatch and misMatch penalties for DNA/RNA or "BLOSUM62" for AA. (See examples section below.)

processors

The number of processors to use, or NULL (the default) for all available processors.

verbose

Logical indicating whether to display progress.
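
As an illustration of the substitutionMatrix argument described above, the following sketch builds a custom DNA substitution matrix in base R. The scores (2, -1, -3) and the variable names bases and subMatrix are hypothetical, and the sketch assumes that a 4 x 4 numeric matrix with the four bases as row and column names is an acceptable form for this argument; check that against your installed version of DECIPHER before relying on it.

```r
# Hypothetical scores chosen only for illustration, not DECIPHER defaults.
bases <- c("A", "C", "G", "T")

# Start with a uniform mismatch cost, then set the match reward on the diagonal.
subMatrix <- matrix(-3, nrow = 4, ncol = 4, dimnames = list(bases, bases))
diag(subMatrix) <- 2

# Penalize transitions (A<->G, C<->T) less than transversions.
subMatrix["A", "G"] <- subMatrix["G", "A"] <- -1
subMatrix["C", "T"] <- subMatrix["T", "C"] <- -1

print(subMatrix)

# With DECIPHER loaded and an open connection dbConn (see Examples), the matrix
# would presumably be passed like this (assumption, left commented out):
# AlignDB(dbConn, identifier = c("Seqs1", "Seqs2"), substitutionMatrix = subMatrix)
```

For amino acid sequences, the argument description indicates that a name such as "BLOSUM62" can be given instead of a matrix.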

Value

Returns the number of newly aligned sequences added to the database.

Details

Sometimes it is useful to align two large sets of sequences, where each set of sequences is already aligned but the two sets are not aligned to each other. AlignDB first builds a profile of each sequence set in increments of batchSize so that the entire sequence set is not required to fit in memory. Next, the two profiles are aligned using dynamic programming. Finally, the new alignment is applied to all the sequences as they are incrementally added to add2tbl.
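
The idea of building a profile in increments of batchSize can be sketched in base R as follows. This is only a conceptual illustration, not the implementation used by AlignDB: the function build_profile, its alphabet argument, and the toy sequences are all hypothetical, and the sequences are held in an in-memory character vector rather than fetched from a database.

```r
# Hypothetical helper: accumulate per-column character counts batch by batch,
# so only batchSize sequences are examined at any one time.
build_profile <- function(seqs, batchSize = 2L,
                          alphabet = c("A", "C", "G", "T", "-")) {
  width <- nchar(seqs[1])
  counts <- matrix(0, nrow = length(alphabet), ncol = width,
                   dimnames = list(alphabet, NULL))
  for (s in seq(1L, length(seqs), by = batchSize)) {
    batch <- seqs[s:min(s + batchSize - 1L, length(seqs))]
    # Aligned sequences must all share the same width.
    stopifnot(all(nchar(batch) == width))
    # Rows are alignment columns, columns are the sequences in this batch.
    chars <- do.call(cbind, strsplit(batch, ""))
    for (a in alphabet) {
      counts[a, ] <- counts[a, ] + rowSums(chars == a)
    }
  }
  # Convert accumulated counts to per-column frequencies.
  sweep(counts, 2, colSums(counts), "/")
}

# Toy aligned set (made up for illustration).
set1 <- c("ACG-T", "ACGAT", "AC-AT")
profile1 <- build_profile(set1, batchSize = 2L)
print(round(profile1, 2))
```

Because only the running count matrix is kept between batches, memory use depends on the alignment width and batchSize rather than on the total number of sequences.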

Two identifiers or tblNames must be provided, indicating the two sets of sequences to align. The sequences corresponding to the first identifier or tblName will be aligned to those of the second identifier or tblName. The aligned sequences are added to add2tbl under a new identifier formed from the concatenation of the two identifiers or tblNames. (See examples section below.)

Examples

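library(DECIPHER) # load the package; the examples also rely on RSQLite for dbConnect() and SQLite()

gen <- system.file("extdata", "Bacteria_175seqs.gen", package="DECIPHER")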
fas <- system.file("extdata", "Bacteria_175seqs.fas", package="DECIPHER")

# Align two tables and place result into a third
dbConn <- dbConnect(SQLite(), ":memory:")
Seqs2DB(gen, "GenBank", dbConn, "Seqs1", tblName="Set1")
Seqs2DB(fas, "FASTA", dbConn, "Seqs2", tblName="Set2")
AlignDB(dbConn, tblName=c("Set1", "Set2"), add2tbl="AlignedSets")
l <- IdLengths(dbConn, "AlignedSets", add2tbl=TRUE)
BrowseDB(dbConn, tblName="AlignedSets") # all sequences have the same width
dbDisconnect(dbConn)

# Align two identifiers and place the result in the same table
dbConn <- dbConnect(SQLite(), ":memory:")
Seqs2DB(gen, "GenBank", dbConn, "Seqs1")
Seqs2DB(fas, "FASTA", dbConn, "Seqs2")
AlignDB(dbConn, identifier=c("Seqs1", "Seqs2"))
l <- IdLengths(dbConn, add2tbl=TRUE)
BrowseDB(dbConn) # note the sequences with a new identifier
dbDisconnect(dbConn)