afind slides a window of fixed width over a string x and
computes the distance between each window and the sought-after
pattern. The location, content, and distance corresponding to the
window with the best match are returned.
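The sliding-window idea can be sketched in base R. This is an illustration only, not the package's implementation: best_window is a hypothetical helper, it handles a single string and a single pattern, it assumes the window width equals the pattern length, and it uses utils::adist (Levenshtein distance) in place of the methods offered by afind.

```r
# Hypothetical helper, not part of stringdist: brute-force window search
# for one string x and one pattern.
best_window <- function(x, pattern, window = nchar(pattern)) {
  # Number of windows; at least one, even if x is shorter than the window.
  n_windows <- max(nchar(x) - window + 1L, 1L)
  starts <- seq_len(n_windows)
  # Cut x into all consecutive windows of the given width.
  windows <- substring(x, starts, starts + window - 1L)
  # Levenshtein distance between every window and the pattern.
  d <- drop(utils::adist(windows, pattern))
  best <- which.min(d)  # index of the first window with the smallest distance
  list(location = starts[best], distance = d[best], match = windows[best])
}

res <- best_window("I want to be a fisherman", "fish")
res$location  # start position of the best window
res$match     # content of the best window
```

The three list elements mirror the location, distance, and content that afind reports for the best window.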
afind(
x,
pattern,
window = NULL,
value = TRUE,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "running_cosine",
"jaccard", "jw", "soundex"),
useBytes = FALSE,
weight = c(d = 1, i = 1, s = 1, t = 1),
q = 1,
p = 0,
bt = 0,
nthread = getOption("sd_num_thread")
)
grab(x, pattern, maxDist = Inf, value = FALSE, ...)
grabl(x, pattern, maxDist = Inf, ...)
extract(x, pattern, maxDist = Inf, ...)
x: strings to search in.
pattern: strings to find (not a regular expression). For grab,
grabl, and extract this must be a single string.
window: width of the moving window.
value: toggle return matrix with matched strings.
method: Matching algorithm to use. See stringdist-metrics.
useBytes: Perform byte-wise comparison. See stringdist-encoding.
weight: For method='osa' or 'dl', the penalty for
deletion, insertion, substitution and transposition, in that order. When
method='lv', the penalty for transposition is ignored. When
method='jw', the weights associated with characters of a,
characters from b and the transposition weight, in that order.
Weights must be positive and not exceed 1. weight is ignored
completely when method='hamming', 'qgram', 'cosine',
'jaccard', 'lcs', or 'soundex'.
q: q-gram size, only when method is 'qgram', 'jaccard',
or 'cosine'.
p: Winkler's 'prefix' parameter for Jaro-Winkler distance, with
\(0\leq p\leq0.25\). Only when method is 'jw'.
bt: Winkler's boost threshold. Winkler's prefix factor is
only applied when the Jaro distance is larger than bt.
Applies only to method='jw' and p>0.
nthread: Number of threads used by the underlying C-code. A sensible
default is chosen, see stringdist-parallelization.
maxDist: Only windows with distance <= maxDist are considered a match.
...: passed to afind.
For afind: a list of three matrices, each with
length(x) rows and length(pattern) columns. In each matrix,
element \((i,j)\) corresponds to x[i] and pattern[j]. The
names and descriptions of the matrices are as follows.
location. [integer], location of the start of the best matching window.
When useBytes=FALSE, this corresponds to the location of a UTF code point
in x, possibly after conversion from its original encoding.
distance. [numeric], the string distance between pattern and
the best matching window.
match. [character], the first, best matching window.
For grab, an integer vector, indicating in which elements of
x a match was found with a distance <= maxDist. When value=TRUE,
the matched values are returned instead (analogous to grep).
For grabl, a logical vector, indicating in which elements of
x a match was found with a distance <= maxDist (analogous
to grepl).
For extract, a character matrix with length(x) rows and
length(pattern) columns. If a match was found, element \((i,j)\)
contains the match, otherwise it is set to NA.
The "running_cosine" method gains efficiency by using the fact that two
consecutive windows have a large overlap in their q-gram profiles. It gives
the same result as the "cosine" distance, but is much faster.
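The overlap between consecutive windows can be exploited as sketched below. This is a base R illustration of the idea under stated assumptions, not the package's C code: qgrams and running_cosine_sketch are hypothetical helpers, a single string and a single pattern are assumed, and the window is assumed to be at least q characters wide and no wider than x. When the window moves one position, exactly one q-gram leaves and one enters, so the dot product with the pattern profile and the squared norm of the window profile can be updated in constant time.

```r
# Hypothetical helper: all q-grams of s, in order of position.
qgrams <- function(s, q) {
  n <- nchar(s) - q + 1L
  if (n < 1L) character(0) else substring(s, seq_len(n), seq_len(n) + q - 1L)
}

# Hypothetical helper: cosine distance between the pattern and every window
# of x, updating the window's q-gram profile incrementally.
running_cosine_sketch <- function(x, pattern, q = 2L, window = nchar(pattern)) {
  stopifnot(nchar(pattern) >= q, window >= q, nchar(x) >= window)
  gx <- qgrams(x, q)
  gp <- qgrams(pattern, q)
  keys <- unique(c(gx, gp))
  p <- setNames(numeric(length(keys)), keys)  # pattern profile (counts)
  for (g in gp) p[g] <- p[g] + 1
  pp <- sum(p^2)                              # squared norm of pattern profile
  per_window <- window - q + 1L               # q-grams per window
  n_windows <- nchar(x) - window + 1L
  w <- setNames(numeric(length(keys)), keys)  # profile of the current window
  for (g in gx[seq_len(per_window)]) w[g] <- w[g] + 1
  dot <- sum(p * w)                           # inner product with the pattern
  ww <- sum(w^2)                              # squared norm of window profile
  d <- numeric(n_windows)
  d[1] <- 1 - dot / sqrt(pp * ww)
  for (i in seq_len(n_windows - 1L)) {
    out <- gx[i]               # q-gram leaving the window
    inn <- gx[i + per_window]  # q-gram entering the window
    dot <- dot - p[[out]]
    ww <- ww - (2 * w[[out]] - 1)  # c^2 - (c - 1)^2 = 2c - 1
    w[out] <- w[out] - 1
    dot <- dot + p[[inn]]
    ww <- ww + (2 * w[[inn]] + 1)  # (c + 1)^2 - c^2 = 2c + 1
    w[inn] <- w[inn] + 1
    d[i + 1L] <- 1 - dot / sqrt(pp * ww)
  }
  d
}

d <- running_cosine_sketch("I want to be a fisherman", "fish", q = 2L)
which.min(d)  # start of the window whose profile is closest to the pattern
```

Recomputing each window's profile from scratch would give the same distances; the incremental update only avoids repeating work on the shared q-grams.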
Matching is case-sensitive. Both x and pattern are converted
to UTF-8 prior to search, unless useBytes=TRUE, in which case
the distances are measured bytewise.
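The difference between counting in code points and counting in bytes, and the effect of case, can be seen with base R alone. The snippet below is a small illustration with made-up example strings; it does not call afind. Folding case with tolower before searching is one possible workaround when case should be ignored, offered here as a suggestion rather than as package behaviour.

```r
# "\u00e9" is a single code point (e with acute accent) that takes two bytes in UTF-8.
x <- enc2utf8("Caf\u00e9 au lait")
n_chars <- nchar(x, type = "chars")  # length counted in code points
n_bytes <- nchar(x, type = "bytes")  # length counted in bytes

# Exact comparison is case-sensitive ...
same_raw <- identical("Fisherman", "fisherman")
# ... so fold case on both sides first if case should be ignored.
same_folded <- identical(tolower("Fisherman"), tolower("fisherman"))
```

Because the byte count exceeds the character count whenever multi-byte characters are present, positions measured bytewise can differ from positions measured in code points.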
Code is parallelized over the x variable: each value of x
is scanned for every element in pattern using a separate thread (when nthread
is larger than 1).
The functions grab and grabl are approximate string matching
functions that somewhat resemble base R's grep and
grepl. They are implemented as convenience wrappers
of afind.
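How such wrappers can be layered on top of a best-window search is sketched below in base R. All function names ending in _sketch are hypothetical stand-ins, not the package's code: afind_sketch handles a single pattern only, fixes the window width to the pattern length, and uses utils::adist (Levenshtein distance) instead of the methods afind supports.

```r
# Stand-in for afind(), restricted to one pattern and Levenshtein distance.
afind_sketch <- function(x, pattern) {
  w <- nchar(pattern)
  rows <- lapply(x, function(s) {
    starts <- seq_len(max(nchar(s) - w + 1L, 1L))
    windows <- substring(s, starts, starts + w - 1L)
    d <- drop(utils::adist(windows, pattern))
    i <- which.min(d)  # first best-matching window
    data.frame(location = starts[i], distance = d[i], match = windows[i],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# grep-like: indices of matching elements, or the matched windows.
grab_sketch <- function(x, pattern, maxDist = Inf, value = FALSE) {
  res <- afind_sketch(x, pattern)
  hit <- res$distance <= maxDist
  if (value) res$match[hit] else which(hit)
}

# grepl-like: one logical per element of x.
grabl_sketch <- function(x, pattern, maxDist = Inf) {
  afind_sketch(x, pattern)$distance <= maxDist
}

# Matched window where the distance is small enough, NA otherwise.
extract_sketch <- function(x, pattern, maxDist = Inf) {
  res <- afind_sketch(x, pattern)
  matrix(ifelse(res$distance <= maxDist, res$match, NA_character_), ncol = 1)
}

texts <- c("When I grow up, I want to be", "I want to be a fisherman")
grabl_sketch(texts, "grew", maxDist = 1)
grab_sketch(texts, "grew", maxDist = 1, value = TRUE)
```

Each wrapper only thresholds the distance of the best window against maxDist and reshapes the result, which is why they share all remaining arguments with the underlying search.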
Other matching:
amatch()
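library(stringdist)

texts <- c("When I grow up, I want to be",
           "one of the harvesters of the sea",
           "I think before my days are gone",
           "I want to be a fisherman")
patterns <- c("fish", "gone", "to be")

# Best-matching window for every text/pattern pair (running cosine, 3-grams)
afind(texts, patterns, method = "running_cosine", q = 3)

# Which texts contain a window within distance 1 of "grew"?
grabl(texts, "grew", maxDist = 1)

# Extract the best match for "harvested", allowing a distance of up to 3
extract(texts, "harvested", maxDist = 3)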