pair.distances: Semantic Distances Between Word Pairs (wordspace)

Description

Compute semantic distances (or similarities) between pairs of target terms based on a scored DSM matrix M, according to any of the distance measures supported by dist.matrix. If one of the terms in a pair is not represented in the DSM, the distance is set to Inf (or to -Inf in the case of a similarity measure).

Usage

pair.distances(w1, w2, M, ..., transform = NULL, 
               rank = c("none", "fwd", "bwd", "avg"),
               avg.method = c("arithmetic", "geometric", "harmonic"),
               batchsize = 10e6, verbose = FALSE)

Value

If rank="none" (the default), a numeric vector of the same length as w1 and w2 specifying the distances or similarities between the term pairs, according to the metric selected with the extra arguments (...).

Otherwise, an integer or numeric vector of the same length as w1 and w2 specifying forward, backward or average neighbour rank for the two terms.

In either case, a distance or rank of Inf (or a similarity of -Inf) is returned for any term pair not represented in the DSM. Attribute similarity is set to TRUE if the returned values are similarity scores rather than distances.
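The return conventions described above can be checked with a short sketch. It assumes that the wordspace package is installed, that the terms apple_N and walk_V are row labels of the bundled DSM_Vectors matrix, and that convert = FALSE (an argument of dist.matrix, passed through ...) requests cosine similarities instead of angular distances. The term no_such_term_X is made up so that one pair is not represented in the DSM.

```r
library(wordspace)  # assumption: the wordspace package is installed

w1 <- c("apple_N", "apple_N")
w2 <- c("walk_V", "no_such_term_X")  # second term is deliberately made up

# Default settings: distances between the two terms of each pair
d <- pair.distances(w1, w2, DSM_Vectors)

# convert = FALSE is passed on to dist.matrix and asks for similarities
s <- pair.distances(w1, w2, DSM_Vectors, convert = FALSE)

print(d)                      # unknown pair should show up as Inf
print(s)                      # unknown pair should show up as -Inf
print(attr(s, "similarity"))  # marks the values as similarity scores
```

Code that consumes the result can test the similarity attribute to decide whether larger values mean "closer" or "further apart".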

Arguments

w1: a character vector specifying the first term of each pair
w2: a character vector of the same length as w1, specifying the second term of each pair
M: a sparse or dense DSM matrix, suitable for passing to dist.matrix, or an object of class dsm. Alternatively, M can be a pre-computed distance or similarity matrix returned by dist.matrix or marked as such with as.distmat.
...: further arguments are passed to dist.matrix and determine the distance or similarity measure to be used (see dist.matrix for details)
rank: whether to return the distance between the two terms ("none") or the neighbour rank (see “Details” below)
transform: an optional transformation function applied to the distance, similarity or rank values (e.g. transform=log10 for logarithmic ranks). This option is provided as a convenience for evaluation code that calls pair.distances with user-specified arguments.
avg.method: with rank="avg", whether to compute the arithmetic, geometric or harmonic mean of forward and backward rank
batchsize: maximum number of similarity values to compute per batch. This parameter strongly affects the efficiency and memory use of the algorithm and should be tuned carefully for optimal performance.
verbose: if TRUE, display some progress messages indicating how data are split into batches

Author

Stephanie Evert (https://purl.org/stephanie.evert)

Details

The rank argument controls whether semantic distance is measured directly by geometric distance (none), by forward neighbour rank (fwd), by backward neighbour rank (bwd), or by the average of forward and backward rank (avg). Forward neighbour rank is the rank of w2 among the nearest neighbours of w1. Backward neighbour rank is the rank of w1 among the nearest neighbours of w2. The average can be computed as an arithmetic, geometric or harmonic mean, depending on avg.method.
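The definitions of forward and backward neighbour rank can be illustrated with a toy example in base R. This is only a sketch of the definitions, not the package's implementation: the matrix M, its made-up row labels and the helper neighbour_rank are hypothetical, and Euclidean distance is used purely for convenience.

```r
# Toy vectors; the row names are made-up terms
M <- rbind(cat = c(1, 0), dog = c(2, 0), car = c(5, 0), bus = c(9, 0))
D <- as.matrix(dist(M))  # Euclidean distances between all rows

# Rank of `to` among the nearest neighbours of `from`.
# Subtracting 1 accounts for `from` being its own nearest neighbour,
# so that a term paired with itself gets rank 0.
neighbour_rank <- function(from, to, D) {
  r <- rank(D[from, ], ties.method = "min")
  unname(r[to]) - 1
}

fwd <- neighbour_rank("dog", "car", D)  # rank of car among neighbours of dog
bwd <- neighbour_rank("car", "dog", D)  # rank of dog among neighbours of car

# The three averaging options for combining the two ranks
avg_arithmetic <- (fwd + bwd) / 2
avg_geometric  <- sqrt(fwd * bwd)
avg_harmonic   <- 2 / (1 / fwd + 1 / bwd)

print(c(fwd = fwd, bwd = bwd))
print(c(arithmetic = avg_arithmetic, geometric = avg_geometric,
        harmonic = avg_harmonic))
```

In this toy space the forward and backward ranks differ (cat lies between dog and car from dog's point of view, but not from car's), which is why the direction of the rank matters.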

Note that a transformation function is applied after averaging. In order to compute the arithmetic mean of log ranks, set transform=log10, rank="avg" and avg.method="geometric".
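The reason this combination works is that the logarithm of a geometric mean equals the arithmetic mean of the logarithms. A minimal base R check of the identity, using made-up rank values rather than output from pair.distances:

```r
# Made-up forward and backward ranks for three hypothetical term pairs
fwd <- c(1, 4, 100)
bwd <- c(9, 25, 1)

# Geometric mean first, then the log10 transformation (the order described above)
log_of_geometric_mean <- log10(sqrt(fwd * bwd))

# Arithmetic mean of the individual log ranks
mean_of_logs <- (log10(fwd) + log10(bwd)) / 2

print(all.equal(log_of_geometric_mean, mean_of_logs))
```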

Neighbour ranks assume that each target term is its own nearest neighbour and adjust ranks to account for this (i.e. w1 == w2 should return a rank of 0). If M is a pre-computed distance matrix, the adjustment is only applied if it is also marked as symmetric (because otherwise w1 might not appear in the list of neighbours at all). This might lead to unexpected results once asymmetric measures are implemented in dist.matrix.

For a sparse pre-computed similarity matrix M, only non-zero cells are considered as neighbours and all other ranks are set to Inf. This is consistent with the behaviour of nearest.neighbours.

pair.distances is used as a default callback in several evaluation functions, which rely on the attribute similarity to distinguish between distance measures and similarity scores. For this reason, transformation functions should always be isotonic (order-preserving) so as not to mislead the evaluation procedure.
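A small base R sketch shows why the transformation should preserve order. The distance values are made up, and the two transformation functions are only examples: log10 is isotonic on positive values, while negation reverses the order, so the transformed values would still be treated as distances even though small values no longer indicate close pairs.

```r
# Made-up distances for three hypothetical term pairs
d <- c(0.2, 0.9, 0.5)

ord_raw     <- order(d)         # pairs sorted from closest to most distant
ord_log     <- order(log10(d))  # isotonic transformation: same ordering
ord_negated <- order(-d)        # order-reversing transformation: flipped

print(identical(ord_raw, ord_log))
print(identical(ord_raw, rev(ord_negated)))
```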

Examples

library(wordspace)
transform(RG65, angle=pair.distances(word1, word2, DSM_Vectors))
