stringdist (version 0.8.2)

amatch: Approximate string matching

Description

Approximate string matching equivalents of R's native match and %in%.

Usage

amatch(x, table, nomatch = NA_integer_, matchNA = TRUE, method = c("osa",
  "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"),
  useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = 0.1,
  q = 1, p = 0)

ain(x, table, ...)

Arguments

x
vector: elements to be approximately matched. Will be coerced to character.
table
vector: lookup table for matching. Will be coerced to character.
nomatch
The value to be returned when no match is found. This is coerced to integer. nomatch=0 can be a useful option.
matchNA
Should NA's be matched? Default behaviour mimics the behaviour of base match, meaning that NA matches NA (see also the note on NA handling below).
method
Matching algorithm to use. See stringdist.
useBytes
Perform byte-wise comparison. useBytes=TRUE is faster but may yield different results depending on character encoding. See also stringdist, under encoding issues.
weight
Weight parameters for the matching algorithm. See stringdist.
maxDist
Elements in x will not be matched with elements of table if their distance is larger than maxDist.
q
q-gram size, see stringdist.
p
Winkler's penalty parameter for Jaro-Winkler distance, see stringdist.
...
Parameters to pass to amatch (except nomatch, which ain sets itself).

Value

  • amatch returns the position of the closest match of x in table. When multiple matches share the same smallest distance, the first one is returned.
  • ain returns a logical vector of length length(x) indicating whether an element of x approximately matches an element in table.
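
The "closest match, first one wins" rule can be illustrated with a minimal base-R sketch. This is not the package's implementation: amatch_sketch is a hypothetical helper, and it uses utils::adist (generalized Levenshtein distance) in place of the OSA default, so distances may differ from amatch for inputs involving transpositions.

```r
# Hypothetical helper, NOT part of stringdist: mimics the return rule of
# amatch using base R's Levenshtein distance (utils::adist) instead of OSA.
amatch_sketch <- function(x, table, maxDist = 0.1, nomatch = NA_integer_) {
  x <- as.character(x)
  table <- as.character(table)
  nomatch <- as.integer(nomatch)
  d <- utils::adist(x, table)  # length(x) by length(table) distance matrix
  vapply(seq_along(x), function(i) {
    row <- d[i, ]
    # which.min returns the FIRST index of the minimum and skips NA
    j <- as.integer(which.min(row))
    if (length(j) == 0L || row[j] > maxDist) nomatch else j
  }, integer(1))
}

amatch_sketch("leia", c("uhura", "leela"), maxDist = 5)               # 2
amatch_sketch("leia", c("uhura", "leela"), maxDist = 1)               # NA
amatch_sketch("leia", c("uhura", "leela"), maxDist = 1, nomatch = 0)  # 0
```

Here "leela" lies at Levenshtein distance 2 from "leia" and "uhura" at distance 4, so position 2 is returned whenever maxDist is at least 2.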

Note on NA handling

R's native match function matches NA with NA. This may feel inconsistent with R's usual NA handling, since for example NA==NA yields NA rather than TRUE. In most cases, one may reason about the behaviour under NA along the lines of "if one of the arguments is NA, the result shall be NA", simply because not all information necessary to execute the function is available. One uses special functions such as is.na, is.null etc. to handle special values.

The amatch function mimics the behaviour of match by default: NA is matched with NA and with nothing else. Note that this is inconsistent with the behaviour of stringdist, since stringdist yields NA when at least one of the arguments is NA. The same inconsistency exists between match and adist. In amatch this behaviour can be controlled by setting matchNA=FALSE. In that case, if an element of x is NA, the nomatch value is returned for it, regardless of whether NA is present in table. In match the behaviour can be controlled by setting the incomparables option.
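
The base-R side of this inconsistency can be checked directly. The snippet below uses only base R (match and utils::adist), so it shows the behaviour that amatch and stringdist are described as mirroring rather than calling them; the variable names are illustrative.

```r
# Ordinary comparison propagates NA
eq <- NA == NA                                        # NA

# match() treats NA as matching NA
m_default <- match(NA, c("a", NA))                    # 2

# ... unless NA is declared incomparable
m_incomp <- match(NA, c("a", NA), incomparables = NA) # NA

# adist() yields NA as soon as one argument is NA
d <- utils::adist(NA, "a")                            # 1 x 1 matrix holding NA

print(list(eq = eq, m_default = m_default, m_incomp = m_incomp, d = d))
```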

Details

ain is currently defined as

ain <- function(x,table,...) amatch(x, table, nomatch=0,...) > 0
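
This follows the same pattern base R uses to build %in% on top of match. A minimal sketch of that exact-matching analogue, where in_sketch is a hypothetical name chosen for illustration:

```r
# Exact-matching analogue of ain: match() returns 0 for "no match",
# so comparing with 0 yields a logical vector of length length(x).
in_sketch <- function(x, table) match(x, table, nomatch = 0L) > 0L

res <- in_sketch(c("leia", "spock"), c("uhura", "leia"))
res                                                          # TRUE FALSE
identical(res, c("leia", "spock") %in% c("uhura", "leia"))   # TRUE
```

Fixing nomatch to 0 is what makes the comparison well defined, which is why nomatch cannot be passed to ain through the ... argument.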

Examples

# let's see which sci-fi heroes are stringdistantly nearest
amatch("leia",c("uhura","leela"),maxDist=5)

# we can restrict the search
amatch("leia",c("uhura","leela"),maxDist=1)

# setting nomatch returns a different value when no match is found
amatch("leia",c("uhura","leela"),maxDist=1,nomatch=0)

# this is always true if maxDist is Inf
ain("leia",c("uhura","leela"),maxDist=Inf)

# let's look in a neighbourhood of at most 2 typos (by default, the OSA algorithm is used)
ain("leia",c("uhura","leela"), maxDist=2)