
seq_dist computes pairwise string distances between elements of a and b, where the argument with fewer elements is recycled.
seq_distmatrix computes the distance matrix with rows according to a and columns according to b.
seq_dist(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram",
  "cosine", "jaccard", "jw"), weight = c(d = 1, i = 1, s = 1, t = 1), q = 1,
  p = 0, bt = 0, nthread = getOption("sd_num_thread"))

seq_distmatrix(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram",
  "cosine", "jaccard", "jw"), weight = c(d = 1, i = 1, s = 1, t = 1), q = 1,
  p = 0, bt = 0, useNames = c("names", "none"),
  nthread = getOption("sd_num_thread"))
a: (list of) integer or numeric vector(s). Will be converted with as.integer (target).
b: (list of) integer or numeric vector(s). Will be converted with as.integer (source). Optional for seq_distmatrix.
method: Distance metric. See stringdist-metrics.
weight: For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored. When method='jw', the weights associated with characters of a, characters from b and the transposition weight, in that order. Weights must be positive and not exceed 1. weight is ignored completely when method='hamming', 'qgram', 'cosine', 'jaccard', or 'lcs'.
q: Size of the q-gram. Applies only to method='qgram', 'jaccard' or 'cosine'.
p: Penalty factor for Jaro-Winkler distance. The valid range for p is 0 <= p <= 0.25. If p=0 (default), the Jaro distance is returned. Applies only to method='jw'.
bt: Winkler's boost threshold. Winkler's penalty factor is only applied when the Jaro distance is larger than bt. Applies only to method='jw' and p>0.
nthread: Maximum number of threads to use. By default, a sensible number of threads is chosen, see stringdist-parallelization.
useNames: Label the output matrix with names(a) and names(b)?
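The effect of the weight, q and p arguments can be tried on a single pair of sequences. The following is a minimal sketch, assuming the stringdist package is installed; the strings "martha" and "marhta" are illustrative inputs chosen here, not taken from the package documentation.

```r
library(stringdist)  # assumption: the package is installed

# Two sequences that differ by one adjacent transposition ("th" vs "ht")
a <- list(utf8ToInt("martha"))
b <- list(utf8ToInt("marhta"))

# weight order for 'osa': deletion, insertion, substitution, transposition
d_default <- seq_dist(a, b, method = "osa")   # one transposition at unit cost
d_cheap_t <- seq_dist(a, b, method = "osa",
                      weight = c(d = 1, i = 1, s = 1, t = 0.5))

# q sets the q-gram size for 'qgram', 'jaccard' and 'cosine'
d_q1 <- seq_dist(a, b, method = "qgram", q = 1)  # same symbols, same counts
d_q2 <- seq_dist(a, b, method = "qgram", q = 2)  # bigrams differ

# p = 0 gives the Jaro distance; p > 0 applies Winkler's prefix factor
d_jaro <- seq_dist(a, b, method = "jw", p = 0)
d_jw   <- seq_dist(a, b, method = "jw", p = 0.1)

print(c(osa = d_default, osa_cheap_t = d_cheap_t,
        qgram1 = d_q1, qgram2 = d_q2, jaro = d_jaro, jw = d_jw))
```

Because the two sequences share the prefix "mar", the Jaro-Winkler distance with p = 0.1 comes out smaller than the plain Jaro distance.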
seq_dist returns a numeric vector with pairwise distances between a and b of length max(length(a), length(b)).
For seq_distmatrix there are two options. If b is missing, the dist object corresponding to the length(a) X length(a) distance matrix is returned. If b is specified, the length(a) X length(b) distance matrix is returned.
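The three return shapes described above can be checked directly. This is a small sketch, assuming the stringdist package is installed; the input strings are illustrative only.

```r
library(stringdist)  # assumption: the package is installed

a <- lapply(c("foo", "bar", "baz"), utf8ToInt)
b <- lapply(c("foo", "boo"), utf8ToInt)

# seq_dist: one distance per pair, the single element of b[1] is recycled
v <- seq_dist(a, b[1])
length(v)            # max(length(a), length(b[1]))

# seq_distmatrix with b missing: a 'dist' object for a against itself
d_self <- seq_distmatrix(a)
class(d_self)

# seq_distmatrix with b given: a length(a) x length(b) matrix
d_ab <- seq_distmatrix(a, b)
dim(d_ab)
```

A dist object can be passed on to functions such as hclust, or expanded with as.matrix when the full square matrix is needed.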
If any element of a or b is NA_integer_, the distance with any matched integer vector will result in NA. Missing values in the sequences themselves are treated as a number and not treated specially (also see the examples).
Input vectors are converted with as.integer. This causes truncation for numeric vectors (e.g. pi will be treated as 3L).
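The truncation can hide real differences between numeric inputs. A brief sketch, assuming the stringdist package is installed; the values are made up for illustration.

```r
library(stringdist)  # assumption: the package is installed

# as.integer() truncates toward zero; it does not round
trunc_vals <- as.integer(c(pi, 2.9, -1.7))
trunc_vals

# Both sequences become c(3L, 2L) after conversion,
# so they are treated as identical
x <- list(c(pi, 2.9))
y <- list(c(3.2, 2.1))
d <- seq_dist(x, y)
d
```

If fractional parts matter, one option is to map the values to integer codes yourself (for example by rounding or binning) before calling seq_dist.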
# NOT RUN {
# Distances between lists of integer vectors. Note the postfix 'L' to force
# integer storage. The shorter argument (a) is recycled.
a <- list(c(102L, 107L)) # fk
b <- list(c(102L,111L,111L),c(102L,111L,111L)) # foo, foo
seq_dist(a,b)
# translate strings to a list of integer sequences
a <- lapply(c("foo","bar","baz"),utf8ToInt)
seq_distmatrix(a)
# Note how missing values are treated. NA's as part of the sequence are treated
# as an integer (the representation of NA_integer_).
a <- list(NA_integer_,c(102L, 107L))
b <- list(c(102L,111L,111L),c(102L,111L,NA_integer_))
seq_dist(a,b)
# }
# NOT RUN {
# Distance between sentences based on word order. Note: words must match exactly or they
# are treated as completely different.
#
# For this example you need to have the 'hashr' package installed.
a <- "Mary had a little lamb"
a.words <- strsplit(a,"[[:blank:]]+")
a.int <- hashr::hash(a.words)
b <- c("a little lamb had Mary",
"had Mary a little lamb")
b.int <- hashr::hash(strsplit(b,"[[:blank:]]+"))
seq_dist(a.int,b.int)
# }