
For a list of integer vectors x, find the closest matches in a list of integer or numeric vectors in table.
seq_amatch(
x,
table,
nomatch = NA_integer_,
matchNA = TRUE,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw"),
weight = c(d = 1, i = 1, s = 1, t = 1),
maxDist = 0.1,
q = 1,
p = 0,
bt = 0,
nthread = getOption("sd_num_thread")
)

seq_ain(x, table, ...)
seq_amatch returns the position of the closest match of x in table. When multiple matches with the same minimal distance metric exist, the first one is returned. seq_ain returns a logical vector of length length(x) indicating whether an element of x approximately matches an element in table.
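The return values described above can be illustrated with a minimal sketch. It assumes the stringdist package is installed and uses the default method ("osa"); the sequences are made up for illustration.

```r
library(stringdist)

# Two query sequences and a lookup table of two sequences (illustrative data).
x <- list(c(1L, 2L, 3L), c(9L, 9L, 9L))
table <- list(c(1L, 2L, 4L), c(1L, 2L, 3L))

# Position in `table` of the closest match within maxDist, or NA if none.
pos <- seq_amatch(x, table, maxDist = 1)

# Logical vector: does each element of `x` have an approximate match?
found <- seq_ain(x, table, maxDist = 1)

print(pos)
print(found)
```

The first query is identical to the second table entry, so that entry is closer than the first one (which differs by one substitution). The second query is further than maxDist from both table entries, so it has no match.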
x: (list of) integer or numeric vector(s) to be approximately matched. Will be converted with as.integer.
table: (list of) integer or numeric vector(s) serving as lookup table for matching. Will be converted with as.integer.
nomatch: The value to be returned when no match is found. This is coerced to integer.
matchNA: Should NA's be matched? Default behaviour mimics the behaviour of base match, meaning that NA matches NA. With NA, we mean a missing entry in the list, represented as NA_integer_. If one of the integer sequences stored in the list has an NA entry, this is just treated as another integer (the representation of NA_integer_).
method: Matching algorithm to use. See stringdist-metrics.
weight: For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored. When method='jw', the weights associated with integers in elements of a, integers in elements of b and the transposition weight, in that order. Weights must be positive and not exceed 1. weight is ignored completely when method='hamming', 'qgram', 'cosine', 'jaccard', or 'lcs'.
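As a rough illustration of how the penalties enter the distance, the sketch below compares the default weights with a cheaper substitution penalty. It assumes the stringdist package is installed and uses seq_dist (listed among the related functions of this topic) to inspect the distance directly; the sequences and the weight values are made up.

```r
library(stringdist)

# Two sequences that differ by exactly one substitution (illustrative data).
a <- list(c(1L, 2L, 3L))
b <- list(c(1L, 2L, 4L))

# Default weights: every edit operation costs 1.
d_default <- seq_dist(a, b, method = "osa")

# Halve the substitution penalty; order is deletion, insertion,
# substitution, transposition.
d_cheap <- seq_dist(a, b, method = "osa",
                    weight = c(d = 1, i = 1, s = 0.5, t = 1))

print(c(default = d_default, cheap_substitution = d_cheap))
```

Lowering a penalty makes sequences that differ by that operation count as closer, which in turn changes which table entries fall within maxDist.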
maxDist: Elements in x will not be matched with elements of table if their distance is larger than maxDist. Note that the maximum distance between strings depends on the method: it should always be specified.
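The cutoff can be sketched as follows, assuming the stringdist package is installed. The sequences are made up and are two substitutions apart under method "lv".

```r
library(stringdist)

x <- list(c(1L, 2L, 3L, 4L))
table <- list(c(1L, 2L, 5L, 6L))

# Inspect the actual distance first.
d <- seq_dist(x, table, method = "lv")

# The distance exceeds maxDist = 1, so no match is reported.
strict <- seq_amatch(x, table, method = "lv", maxDist = 1)

# The distance does not exceed maxDist = 2, so the entry matches.
loose <- seq_amatch(x, table, method = "lv", maxDist = 2)

print(d)
print(strict)
print(loose)
```

Checking the distances with seq_dist before choosing maxDist is one way to pick a threshold that suits the chosen method.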
q: q-gram size, only when method is 'qgram', 'jaccard', or 'cosine'.
p: Winkler's prefix parameter for Jaro-Winkler distance. Only used when method='jw'.
bt: Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than bt. Applies only to method='jw' and p>0.
nthread: Number of threads used by the underlying C-code. A sensible default is chosen, see stringdist-parallelization.
...: Parameters to pass to seq_amatch (except nomatch).
seq_ain is currently defined as
seq_ain <- function(x, table, ...) seq_amatch(x, table, nomatch = 0, ...) > 0
All input vectors are converted with as.integer. This causes truncation for numeric vectors (e.g. pi will be treated as 3L).
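The truncation can be seen with base R alone. The sketch below uses made-up values and shows explicit rounding as one possible way to control the conversion before matching; whether rounding is appropriate depends on the data.

```r
# as.integer() drops the fractional part (truncation toward zero).
v <- c(pi, 2.9, -1.7)
truncated <- as.integer(v)

# Rounding first gives the nearest integer instead.
rounded <- as.integer(round(v))

print(truncated)
print(rounded)
```

Here 2.9 becomes 2L under plain conversion but 3L after rounding, so two numeric sequences that look close can end up at a different integer distance than expected.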
See also: seq_dist, seq_sim, seq_qgrams
library(stringdist)

# sequences to look up, and the lookup table
x <- list(1:3, 3:1, c(1L, 3L, 4L))
table <- list(
  c(5L, 3L, 1L, 2L),
  1:4
)
seq_amatch(x, table, maxDist = 2)

# behaviour with missings
seq_amatch(list(c(1L, NA_integer_, 3L), NA_integer_), list(1:3), maxDist = 1)
if (FALSE) {
  # Match sentences based on word order. Note: words must match exactly or they
  # are treated as completely different.
  #
  # For this example you need to have the 'hashr' package installed.
  x <- "Mary had a little lamb"
  x.words <- strsplit(x, "[[:blank:]]+")
  x.int <- hashr::hash(x.words)
  table <- c("a little lamb had Mary",
             "had Mary a little lamb")
  table.int <- hashr::hash(strsplit(table, "[[:blank:]]+"))
  seq_amatch(x.int, table.int, maxDist = 3)
}