Learn R Programming

comparator (version 0.1.2)

LCS: Longest Common Subsequence (LCS) Comparator

Description

The Longest Common Subsequence (LCS) distance between two strings/sequences \(x\) and \(y\) is the minimum cost of operations (insertions and deletions) required to transform \(x\) into \(y\). The LCS similarity is more commonly used, which can be interpreted as the length of the longest subsequence common to \(x\) and \(y\).

Usage

LCS(
  deletion = 1,
  insertion = 1,
  normalize = FALSE,
  similarity = FALSE,
  ignore_case = FALSE,
  use_bytes = FALSE
)

Value

A LCS instance is returned, which is an S4 class inheriting from StringComparator.

Arguments

deletion

positive cost associated with deletion of a character or sequence element. Defaults to unit cost.

insertion

positive cost associated insertion of a character or sequence element. Defaults to unit cost.

normalize

a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE.

similarity

a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE.

ignore_case

a logical. If TRUE, case is ignored when comparing strings.

use_bytes

a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character.

Details

For simplicity we assume x and y are strings in this section, however the comparator is also implemented for more general sequences.

An LCS similarity is returned if similarity = TRUE, which is defined as $$\mathrm{sim}(x, y) = \frac{w_d |x| + w_i |y| - \mathrm{dist}(x, y)}{2},$$ where \(|x|\), \(|y|\) are the number of characters in \(x\) and \(y\) respectively, \(dist\) is the LCS distance, \(w_d\) is the cost of a deletion and \(w_i\) is the cost of an insertion.

Normalization of the LCS distance/similarity to the unit interval is also supported by setting normalize = TRUE. The normalization approach follows Yujian and Bo (2007), and ensures that the distance remains a metric when the costs of insertion \(w_i\) and deletion \(w_d\) are equal. The normalized distance \(\mathrm{dist}_n\) is defined as $$\mathrm{dist}_n(x, y) = \frac{2 \mathrm{dist}(x, y)}{w_d |x| + w_i |y| + \mathrm{dist}(x, y)},$$ and the normalized similarity \(\mathrm{sim}_n\) is defined as $$\mathrm{sim}_n(x, y) = 1 - \mathrm{dist}_n(x, y) = \frac{\mathrm{sim}(x, y)}{w_d |x| + w_i |y| - \mathrm{sim}(x, y)}.$$

References

Bergroth, L., Hakonen, H., & Raita, T. (2000), "A survey of longest common subsequence algorithms", Proceedings Seventh International Symposium on String Processing and Information Retrieval (SPIRE'00), 39-48.

Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric", IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1091–1095.

See Also

Other edit-based comparators include Hamming, Levenshtein, OSA and DamerauLevenshtein.

Examples

Run this code
## There are no common substrings of size 3 for the following example, 
## however there are two common substrings of size 2: "AC" and "BC". 
## Hence the LCS similarity is 2.
x <- "ABCDA"; y <- "BAC"
LCS(similarity = TRUE)(x, y)

## Levenshtein distance reduces to LCS distance when the cost of 
## substitution is high
x <- "ABC"; y <- "AAA"
LCS()(x, y) == Levenshtein(substitution = 100)(x, y)

Run the code above in your browser using DataLab