
seqtrie (version 0.3.5)

dist_pairwise: Pairwise distance between two sets of sequences

Description

Compute the pairwise distance between two sets of sequences: element i of query is compared with element i of target.

Usage

dist_pairwise(
  query,
  target,
  mode,
  cost_matrix = NULL,
  gap_cost = NA_integer_,
  gap_open_cost = NA_integer_,
  nthreads = 1,
  show_progress = FALSE
)

Value

The output of this function is a vector of distances, one per query/target pair. If mode == "anchored", the output also includes the attributes "query_size" and "target_size", which are vectors containing the lengths of the query and target sequences that were aligned.

Arguments

query

A character vector of query sequences.

target

A character vector of target sequences. Must be the same length as query.

mode

The distance metric to use. One of "hamming" ("hm"), "global" ("gb") or "anchored" ("an").

cost_matrix

A custom cost matrix for use with the "global" or "anchored" distance metrics. See details.

gap_cost

The cost of a gap for use with the "global" or "anchored" distance metrics. See details.

gap_open_cost

The cost of a gap opening. See details.

nthreads

The number of threads to use for parallel computation.

show_progress

Whether to show a progress bar.

Details

This function calculates pairwise distances based on the Hamming, Levenshtein or Anchored algorithms. query and target must contain the same number of sequences.

Three types of distance metrics are supported, based on the form of alignment performed. These are: Hamming, Global (Levenshtein) and Anchored.

An anchored alignment is a form of semi-global alignment. The alignment is "anchored" (global) at the beginning of both the query and target sequences, but is semi-global in that the end of either the query sequence or the target sequence (but not both) can be left unaligned. This type of alignment is sometimes called an "extension" alignment in the literature.
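The anchored idea can be sketched in base R with a standard edit-distance table. This is only an illustration under unit costs, not the package's implementation: the hypothetical helper anchored_distance() fills the usual Levenshtein table and then takes the minimum over the last row (whole query against any target prefix) and the last column (any query prefix against the whole target).

```r
# Illustrative sketch only; anchored_distance() is a hypothetical helper,
# not part of seqtrie. Unit costs for mismatches and indels are assumed.
anchored_distance <- function(query, target) {
  q  <- strsplit(query, "")[[1]]
  tg <- strsplit(target, "")[[1]]
  n <- length(q)
  m <- length(tg)

  # D[i + 1, j + 1] = edit distance between the first i query characters
  # and the first j target characters (both anchored at the start).
  D <- matrix(0L, nrow = n + 1L, ncol = m + 1L)
  D[, 1] <- 0:n
  D[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      substitution <- D[i, j] + (q[i] != tg[j])
      deletion     <- D[i, j + 1L] + 1L
      insertion    <- D[i + 1L, j] + 1L
      D[i + 1L, j + 1L] <- min(substitution, deletion, insertion)
    }
  }

  # Either the whole query or the whole target must be consumed;
  # the tail of the other sequence may stay unaligned.
  min(D[n + 1L, ], D[, m + 1L])
}

anchored_distance("ACGT", "ACG")   # 0: "ACG" matches the start of "ACGT"
anchored_distance("AAAA", "ACGT")  # 3
```

Under these assumptions, a target that is an exact prefix of the query has distance 0, whereas a global alignment of the same pair would pay for the trailing character.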

In contrast, a global alignment must align the entire query and target sequences. When mismatch and indel costs are all equal to 1, the resulting distance is the Levenshtein distance.
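For comparison, base R's utils::adist() computes the Levenshtein distance with unit costs by default. The sketch below applies it element by element to the same pairs used in the Examples section. It is independent of seqtrie and is shown only to make the unit-cost global distance concrete.

```r
# Base R only: Levenshtein distance for each query/target pair.
query  <- c("ACGT", "AAAA")
target <- c("ACG", "ACGT")

# adist() returns a matrix; for a single pair it is 1 x 1, so drop() it.
lev <- mapply(
  function(q, tg) drop(utils::adist(q, tg)),
  query, target,
  USE.NAMES = FALSE
)
lev  # 1 3
```

"ACGT" versus "ACG" needs one deletion, and "AAAA" versus "ACGT" needs three substitutions.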

By default, if mode == "global" or "anchored", all mismatches and indels are given a cost of 1. However, you can define your own distance metric by setting the substitution cost_matrix and the separate gap parameters. The cost_matrix is a strictly positive square integer matrix of substitution costs and should include all characters in query and target as column and row names. Any rows or columns named "gap" or "gap_open" are ignored. To set the cost of a gap (insertion or deletion), use the gap_cost parameter (a single positive integer). To enable affine gaps, provide the gap_open_cost parameter (a single positive integer) in addition to gap_cost. If affine alignment is used, the total cost of a gap of length gap_length is defined as: TOTAL_GAP_COST = gap_open_cost + (gap_cost * gap_length).
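A small base R sketch of these pieces follows. The cost values, the zero diagonal (no cost for a match) and the helpers covers_alphabet() and affine_gap_cost() are assumptions made for illustration; they are not part of seqtrie, and the package may validate matrices differently.

```r
# Hypothetical substitution costs for a DNA alphabet (illustrative values).
alphabet <- c("A", "C", "G", "T")
cost_matrix <- matrix(
  2L, nrow = 4, ncol = 4,
  dimnames = list(alphabet, alphabet)
)
cost_matrix["A", "G"] <- cost_matrix["G", "A"] <- 1L  # cheaper A <-> G
cost_matrix["C", "T"] <- cost_matrix["T", "C"] <- 1L  # cheaper C <-> T
diag(cost_matrix) <- 0L  # assumption of this sketch: a match costs nothing

# Check that every character in the sequences has a row and a column.
covers_alphabet <- function(seqs, m) {
  chars <- unique(unlist(strsplit(seqs, "")))
  all(chars %in% rownames(m)) && all(chars %in% colnames(m))
}

# Affine gap cost as defined above: one opening cost plus a per-position cost.
affine_gap_cost <- function(gap_length, gap_cost, gap_open_cost) {
  gap_open_cost + gap_cost * gap_length
}

covers_alphabet(c("ACGT", "AAAA"), cost_matrix)        # TRUE
affine_gap_cost(3, gap_cost = 2, gap_open_cost = 5)    # 11
```

With gap_cost = 2 and gap_open_cost = 5, a single gap of length 3 costs 5 + 2 * 3 = 11, whereas three separate gaps of length 1 would cost 3 * (5 + 2) = 21, which is why affine costs favor fewer, longer gaps.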

If mode == "hamming", all alignment parameters are ignored; each mismatch is given a distance of 1 and gaps are not allowed.
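The Hamming rule can be written in a few lines of base R. The helper hamming_distance() below is hypothetical and is not the package's code. Returning NA for pairs of unequal length is a choice made for this sketch, since no gaps are allowed; the package's handling of such pairs is not described here.

```r
# Illustrative sketch only; hamming_distance() is not part of seqtrie.
hamming_distance <- function(query, target) {
  stopifnot(length(query) == length(target))
  mapply(
    function(q, tg) {
      # Without gaps, sequences of different length cannot be compared
      # position by position; this sketch reports NA for them.
      if (nchar(q) != nchar(tg)) return(NA_integer_)
      sum(strsplit(q, "")[[1]] != strsplit(tg, "")[[1]])
    },
    query, target,
    USE.NAMES = FALSE
  )
}

hamming_distance(c("ACGT", "AAAA"), c("ACGA", "AAAA"))  # 1 0
```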

Examples

library(seqtrie)
dist_pairwise(c("ACGT", "AAAA"), c("ACG", "ACGT"), mode = "global")
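As a further sketch, the same pairs can be run in anchored mode to inspect the "query_size" and "target_size" attributes described under Value. The call uses only arguments from the Usage section. The requireNamespace() guard is an addition for this sketch so that the snippet does nothing when seqtrie is not installed.

```r
query  <- c("ACGT", "AAAA")
target <- c("ACG", "ACGT")

if (requireNamespace("seqtrie", quietly = TRUE)) {
  d <- seqtrie::dist_pairwise(query, target, mode = "anchored")
  print(d)
  # Lengths of the query and target portions that were aligned, per pair.
  print(attr(d, "query_size"))
  print(attr(d, "target_size"))
}
```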
