Compute semantic distances (or similarities) between pairs of target terms based on a scored DSM matrix M, according to any of the distance measures supported by dist.matrix.
If one of the terms in a pair is not represented in the DSM, the distance is set to Inf (or to -Inf in the case of a similarity measure).
pair.distances(w1, w2, M, ..., transform = NULL,
rank = c("none", "fwd", "bwd", "avg"),
avg.method = c("arithmetic", "geometric", "harmonic"),
batchsize = 10e6, verbose = FALSE)
If rank="none" (the default), a numeric vector of the same length as w1 and w2 specifying the distances or similarities between the term pairs, according to the metric selected with the extra arguments (...).
Otherwise, an integer or numeric vector of the same length as w1 and w2 specifying forward, backward or average neighbour rank for the two terms.
In either case, a distance or rank of Inf (or a similarity of -Inf) is returned for any term pair not represented in the DSM.
Attribute similarity is set to TRUE if the returned values are similarity scores rather than distances.
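The conventions for missing terms and for the similarity attribute can be illustrated with a minimal base-R sketch. The function toy_pair_scores, its similarity switch and the toy matrix are hypothetical and serve only as an illustration; the choice of Euclidean distance and cosine similarity is an assumption and this is not the package's implementation.

```r
# Toy row-vector matrix with three known terms (hypothetical data)
M <- matrix(c(1, 0,
              0, 1,
              3, 4),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("cat", "dog", "car"), c("f1", "f2")))

# Hypothetical stand-in: Euclidean distance or cosine similarity per pair,
# with Inf / -Inf for pairs that contain a term missing from M
toy_pair_scores <- function(w1, w2, M, similarity = FALSE) {
  stopifnot(length(w1) == length(w2))
  known <- w1 %in% rownames(M) & w2 %in% rownames(M)
  res <- rep(if (similarity) -Inf else Inf, length(w1))
  a <- M[w1[known], , drop = FALSE]
  b <- M[w2[known], , drop = FALSE]
  res[known] <- if (similarity) {
    rowSums(a * b) / sqrt(rowSums(a^2) * rowSums(b^2))  # cosine similarity
  } else {
    sqrt(rowSums((a - b)^2))                            # Euclidean distance
  }
  if (similarity) attr(res, "similarity") <- TRUE       # mark similarity scores
  res
}

w1 <- c("cat", "cat", "cat")
w2 <- c("dog", "car", "unicorn")  # "unicorn" is not a row of M
d <- toy_pair_scores(w1, w2, M)
s <- toy_pair_scores(w1, w2, M, similarity = TRUE)

is.finite(d)                   # FALSE only for the pair with the missing term
isTRUE(attr(s, "similarity"))  # TRUE: values are similarities, not distances
```

Calling code can use the same isTRUE(attr(x, "similarity")) test on a result to decide whether larger or smaller values indicate closer terms, and is.finite() to filter out pairs that are not covered by the DSM.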
w1: a character vector specifying the first term of each pair
w2: a character vector of the same length as w1, specifying the second term of each pair
M: a sparse or dense DSM matrix, suitable for passing to dist.matrix, or an object of class dsm. Alternatively, M can be a pre-computed distance or similarity matrix returned by dist.matrix or marked as such with as.distmat.
...: further arguments are passed to dist.matrix and determine the distance or similarity measure to be used (see dist.matrix for details)
rank: whether to return the distance between the two terms ("none") or the neighbour rank (see “Details” below)
transform: an optional transformation function applied to the distance, similarity or rank values (e.g. transform=log10 for logarithmic ranks). This option is provided as a convenience for evaluation code that calls pair.distances with user-specified arguments.
avg.method: with rank="avg", whether to compute the arithmetic, geometric or harmonic mean of forward and backward rank
batchsize: maximum number of similarity values to compute per batch. This parameter strongly affects the efficiency and memory use of the algorithm and has to be tuned carefully for optimal performance.
verbose: if TRUE, display some progress messages indicating how data are split into batches
Stephanie Evert (https://purl.org/stephanie.evert)
The rank argument controls whether semantic distance is measured directly by geometric distance (none), by forward neighbour rank (fwd), by backward neighbour rank (bwd), or by the average of forward and backward rank (avg).
Forward neighbour rank is the rank of w2 among the nearest neighbours of w1.
Backward neighbour rank is the rank of w1 among the nearest neighbours of w2.
The average can be computed as an arithmetic, geometric or harmonic mean, depending on avg.method.
Note that a transformation function is applied after averaging.
In order to compute the arithmetic mean of log ranks, set transform=log10, rank="avg" and avg.method="geometric".
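The reason for combining transform=log10 with avg.method="geometric" is that the logarithm of a geometric mean equals the arithmetic mean of the logarithms. A short base-R check illustrates this; the rank values below are made up for illustration and are not computed from a DSM.

```r
# Hypothetical forward and backward neighbour ranks for three term pairs
fwd <- c(1, 4, 10)
bwd <- c(3, 25, 10)

# The three averaging methods
arith <- (fwd + bwd) / 2
geom  <- sqrt(fwd * bwd)
harm  <- 2 / (1 / fwd + 1 / bwd)

# The transformation is applied after averaging
log_of_geom  <- log10(geom)
mean_of_logs <- (log10(fwd) + log10(bwd)) / 2

isTRUE(all.equal(log_of_geom, mean_of_logs))   # TRUE
isTRUE(all.equal(log10(arith), mean_of_logs))  # FALSE: arithmetic mean differs
```

Taking log10 of the arithmetic mean would therefore not give the mean of log ranks, which is why the geometric mean is the appropriate averaging method in this case.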
Neighbour ranks assume that each target term is its own nearest neighbour and adjust ranks to account for this (i.e. w1 == w2 should return a rank of 0).
If M is a pre-computed distance matrix, the adjustment is only applied if it is also marked as symmetric (because otherwise w1 might not appear in the list of neighbours at all). This might lead to unexpected results once asymmetric measures are implemented in dist.matrix.
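Forward and backward neighbour ranks, including the adjustment for a term being its own nearest neighbour, can be sketched in base R as follows. The helper toy_rank and the toy data are hypothetical; the sketch assumes a symmetric distance matrix with a zero diagonal and does not treat ties specially, so it should not be read as the package's algorithm.

```r
# Four toy terms placed on a line; distance = absolute difference
x <- c(a = 0, b = 1, c = 3, d = 7)
D <- abs(outer(x, x, "-"))
dimnames(D) <- list(names(x), names(x))

# Rank of `to` among the neighbours of `from`; subtracting 1 accounts for
# `from` being its own nearest neighbour, so a term has rank 0 to itself
toy_rank <- function(from, to, D) {
  mapply(function(u, v) unname(rank(D[u, ])[v]) - 1,
         from, to, USE.NAMES = FALSE)
}

w1 <- c("d", "a", "a")
w2 <- c("c", "b", "a")

fwd <- toy_rank(w1, w2, D)  # rank of w2 among neighbours of w1: 1 1 0
bwd <- toy_rank(w2, w1, D)  # rank of w1 among neighbours of w2: 3 1 0
```

The first pair shows that neighbour rank is not symmetric: c is the nearest neighbour of d, but d is only the third-nearest neighbour of c.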
For a sparse pre-computed similarity matrix M, only non-zero cells are considered as neighbours and all other ranks are set to Inf. This is consistent with the behaviour of nearest.neighbours.
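The treatment of zero cells can be illustrated with a small base-R sketch. A dense matrix stands in for a sparse one here, and the helper toy_sim_rank is hypothetical; the sketch assumes that the diagonal holds each term's similarity to itself and is only meant to show the idea, not the actual implementation.

```r
terms <- c("tea", "coffee", "car")
# Toy similarity matrix; zero cells mean "no similarity stored"
S <- matrix(c(1.0, 0.8, 0.0,
              0.8, 1.0, 0.3,
              0.0, 0.3, 1.0),
            nrow = 3, byrow = TRUE, dimnames = list(terms, terms))

# Rank of `to` among the non-zero neighbours of `from` (higher similarity
# = closer neighbour); zero cells are not neighbours and get rank Inf
toy_sim_rank <- function(from, to, S) {
  mapply(function(u, v) {
    sims <- S[u, ]
    if (sims[v] == 0) return(Inf)
    nonzero <- sims[sims != 0]
    unname(rank(-nonzero)[v]) - 1  # -1: `from` is its own nearest neighbour
  }, from, to, USE.NAMES = FALSE)
}

r <- toy_sim_rank(c("tea", "tea", "coffee"),
                  c("coffee", "car", "car"), S)
r  # 1 Inf 2
```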
pair.distances is used as a default callback in several evaluation functions, which rely on the attribute similarity to distinguish between distance measures and similarity scores. For this reason, transformation functions should always be isotonic (order-preserving) so as not to mislead the evaluation procedure.
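Whether a candidate transformation is order-preserving can be checked on a few sample values before passing it as transform. The values below are made up; log10 is the example from the argument description, while the reciprocal is shown only as a counter-example that reverses the ordering.

```r
# Hypothetical neighbour ranks for four term pairs
ranks <- c(12, 1, 150, 3)

# log10 is isotonic: the ordering of the pairs is unchanged
keeps_order_log <- identical(order(ranks), order(log10(ranks)))

# the reciprocal reverses the ordering, so small ranks would turn into
# large values although the result is still treated as a distance
keeps_order_inv <- identical(order(ranks), order(1 / ranks))

keeps_order_log  # TRUE
keeps_order_inv  # FALSE
```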
dist.matrix, eval.similarity.correlation, eval.multiple.choice, nearest.neighbours
library(wordspace)  # provides pair.distances, RG65 and DSM_Vectors
transform(RG65, angle=pair.distances(word1, word2, DSM_Vectors))