stringdist (version 0.9.6)

seq_amatch: Approximate matching for integer sequences.

Description

For a list of integer vectors x, find the closest matches in a list of integer or numeric vectors in table.

Usage

seq_amatch(
  x,
  table,
  nomatch = NA_integer_,
  matchNA = TRUE,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw"),
  weight = c(d = 1, i = 1, s = 1, t = 1),
  maxDist = 0.1,
  q = 1,
  p = 0,
  bt = 0,
  nthread = getOption("sd_num_thread")
)

seq_ain(x, table, ...)

Arguments

x

(list of) integer or numeric vector(s) to be approximately matched. Will be converted with as.integer.

table

(list of) integer or numeric vector(s) serving as lookup table for matching. Will be converted with as.integer.

nomatch

The value to be returned when no match is found. This is coerced to integer.

matchNA

Should NAs be matched? The default mimics base match, meaning that NA matches NA. Here, NA means a missing entry in the list, represented as NA_integer_. If one of the integer sequences stored in the list contains an NA entry, that entry is treated as just another integer (the representation of NA_integer_).

method

Matching algorithm to use. See stringdist-metrics.

weight

For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored. When method='jw', the weights associated with integers in elements of x, integers in elements of table, and the transposition weight, in that order. Weights must be positive and not exceed 1. weight is ignored completely when method='hamming', 'qgram', 'cosine', 'jaccard', or 'lcs'.

maxDist

Elements in x will not be matched with elements of table if their distance is larger than maxDist. Note that the maximum distance between sequences depends on the method, so maxDist should always be specified explicitly.

q

q-gram size, only when method is 'qgram', 'jaccard', or 'cosine'.

p

Winkler's prefix parameter for Jaro-Winkler distance, with \(0\leq p\leq0.25\). Only used when method is 'jw'.

bt

Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than bt. Applies only to method='jw' and p>0.

nthread

Number of threads used by the underlying C-code. A sensible default is chosen, see stringdist-parallelization.

...

parameters to pass to seq_amatch (except nomatch)
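
As a rough illustration of how maxDist and matchNA interact with the defaults, the sketch below uses small made-up sequences (the variable names are illustrative only). It assumes the stringdist package is installed and that the default method 'osa' with unit weights is used.

```r
library(stringdist)

# One query sequence and a lookup table holding a single, slightly longer sequence.
x      <- list(1:3)
lookup <- list(1:4)

# With the default maxDist = 0.1 and unit weights, only an exact match qualifies,
# because 'osa' distances are then whole numbers.
strict <- seq_amatch(x, lookup)               # expected: NA

# Allowing one edit lets 1:3 match 1:4 (they differ by a single insertion).
loose <- seq_amatch(x, lookup, maxDist = 1)   # expected: 1

# A missing list entry (NA_integer_) is matched against a missing entry in the
# lookup table only when matchNA = TRUE.
na_on  <- seq_amatch(list(NA_integer_), list(NA_integer_), matchNA = TRUE)
na_off <- seq_amatch(list(NA_integer_), list(NA_integer_), matchNA = FALSE)
```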

Value

seq_amatch returns the position of the closest match of x in table. When multiple matches with the same minimal distance exist, the first one is returned. seq_ain returns a logical vector of length length(x) indicating whether an element of x approximately matches an element in table.
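
The difference between the two return values can be sketched as follows. The sequences are invented for illustration, and the example assumes the stringdist package is installed and the default method 'osa' is used.

```r
library(stringdist)

x      <- list(1:3, c(7L, 8L, 9L))
lookup <- list(c(5L, 3L, 1L, 2L), 1:4)

# Positions in lookup: 1:3 is one insertion away from 1:4 (position 2);
# c(7, 8, 9) is farther than maxDist from every entry, so nomatch (NA) is returned.
pos <- seq_amatch(x, lookup, maxDist = 1)   # 2 NA

# The same query expressed as a membership test.
hit <- seq_ain(x, lookup, maxDist = 1)      # TRUE FALSE
```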

Notes

seq_ain is currently defined as

seq_ain <- function(x, table, ...) seq_amatch(x, table, nomatch = 0, ...) > 0

All input vectors are converted with as.integer. This causes truncation for numeric vectors (e.g. pi will be treated as 3L).
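
The truncation can be checked in base R without stringdist. This small sketch only shows what as.integer does to non-integer input; the values are arbitrary.

```r
# as.integer() drops the fractional part (truncation toward zero); it does not round.
vals      <- c(pi, 2.9, -1.7)
truncated <- as.integer(vals)   # 3L, 2L, -1L
```

Sequences that differ only in their fractional parts therefore become identical after conversion, so rounding or rescaling beforehand may be worth considering when that distinction matters.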

See Also

seq_dist, seq_sim, seq_qgrams

Examples

# NOT RUN {
x <- list(1:3,c(3:1),c(1L,3L,4L))
table <- list(
  c(5L,3L,1L,2L)
  ,1:4
)
seq_amatch(x,table,maxDist=2)

# behaviour with missings
seq_amatch(list(c(1L,NA_integer_,3L),NA_integer_), list(1:3),maxDist=1)


# }
# NOT RUN {
# Match sentences based on word order. Note: words must match exactly or they
# are treated as completely different.
#
# For this example you need to have the 'hashr' package installed.
x <- "Mary had a little lamb"
x.words <- strsplit(x,"[[:blank:]]+")
x.int <- hashr::hash(x.words)
table <- c("a little lamb had Mary",
           "had Mary a little lamb")
table.int <- hashr::hash(strsplit(table,"[[:blank:]]+"))
seq_amatch(x.int,table.int,maxDist=3)
# }
