textcat_xdist: Cross-Distances Between $N$-Gram Profiles

Description

Compute cross-distances between collections of $n$-gram profiles.

Usage

textcat_xdist(x, p = NULL, method = "CT", ..., options = list())

Arguments

x

a textcat profile db (see textcat_profile_db), or an R object of text documents extractable via as.character.

p

NULL (default), or as for x. The default is equivalent to taking p as x (but more efficient).

method

a character string specifying a built-in method, or a user-defined function for computing distances between $n$-gram profiles, or NULL (corresponding to the current value of the textcat option xdist_method, see textcat_options). See Details for the available built-in methods.

...

options to be passed to the method for computing distances.

options

a list of such options.

Details

If x (or p) is not a profile db, the $n$-gram profiles of the individual text documents extracted from it are computed using the profile method and options in p if this is a profile db, and using the current textcat profile method and options otherwise.

Currently, the following distance methods for $n$-gram profiles are available.

"CT":

the out-of-place measure of Cavnar and Trenkle.

"ranks":

a variant of the Cavnar/Trenkle measure based on the aggregated absolute difference of the ranks of the combined $n$-grams in the two profiles.

"ALPD":

the sum of the absolute differences in $n$-gram log frequencies.

"KLI":

the Kullback-Leibler I-divergence $I(p, q) = \sum_i p_i \log(p_i / q_i)$ of the $n$-gram frequency distributions $p$ and $q$ of the two profiles.

"KLJ":

the Kullback-Leibler J-divergence $J(p, q) = \sum_i (p_i - q_i) \log(p_i / q_i)$, the symmetrized variant $I(p, q) + I(q, p)$ of the I-divergences.

"JS":

the Jensen-Shannon divergence between the $n$-gram frequency distributions.

"cosine":

the cosine dissimilarity between the profiles, i.e., one minus the inner product of the frequency vectors normalized to Euclidean length one (and filled with zeros for entries missing in one of the vectors).

"Dice":

the Dice dissimilarity, i.e., the fraction of $n$-grams present in one of the profiles only.
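
To make the last two measures concrete, here is a minimal base R sketch on two toy profiles. The vectors doc and ref and the helper align are hypothetical illustrations and not part of textcat. The Dice computation uses the standard set-based definition, which is an assumption about how the package scales the measure.

```r
# Toy n-gram profiles as named frequency vectors (hypothetical data).
doc <- c(th = 4, he = 3, an = 1)
ref <- c(th = 2, he = 1, er = 2)

# Align a profile on a common set of n-grams, filling missing entries with zeros.
align <- function(v, grams) {
  out <- setNames(numeric(length(grams)), grams)
  out[names(v)] <- v
  out
}
grams <- union(names(doc), names(ref))
a <- align(doc, grams)
b <- align(ref, grams)

# Cosine dissimilarity: one minus the inner product of the unit-length vectors.
cosine_dissim <- 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Dice dissimilarity in its standard set-based form (assumed scaling).
shared <- length(intersect(names(doc), names(ref)))
dice_dissim <- 1 - 2 * shared / (length(doc) + length(ref))
```

Both values lie between 0 (identical profiles) and 1 (no shared $n$-grams).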

For the measures based on distances of frequency distributions, the $n$-grams of the two profiles are combined, and $n$-grams missing from one profile are given a small positive absolute frequency. This value is controlled by the option eps and defaults to 1e-6.
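
The effect of eps can be illustrated with a small base R sketch of the I- and J-divergences defined above. The data and the helper fill_and_normalize are hypothetical. Normalizing to relative frequencies after filling is an assumption about the order of operations, not a statement about the package internals.

```r
# Toy n-gram profiles as named absolute frequencies (hypothetical data).
doc <- c(th = 4, he = 3, an = 1)
ref <- c(th = 2, he = 1, er = 2)
eps <- 1e-6

# Give n-grams missing from a profile the frequency eps, then normalize
# to a probability distribution (assumed order of operations).
fill_and_normalize <- function(v, grams, eps) {
  out <- setNames(rep(eps, length(grams)), grams)
  out[names(v)] <- v
  out / sum(out)
}
grams <- union(names(doc), names(ref))
p <- fill_and_normalize(doc, grams, eps)
q <- fill_and_normalize(ref, grams, eps)

# I(p, q) = sum_i p_i log(p_i / q_i)
kli <- function(p, q) sum(p * log(p / q))
# J(p, q) = sum_i (p_i - q_i) log(p_i / q_i)
klj <- function(p, q) sum((p - q) * log(p / q))

i_pq <- kli(p, q)
j_pq <- klj(p, q)
```

Without the eps fill, any $n$-gram present in only one profile would produce a zero frequency and hence an infinite or undefined logarithm term.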

Options given in ... and options are combined, and merged with the default xdist options specified by the textcat option xdist_options using exact name matching.
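
The merging rule can be pictured with the following base R sketch. The function merge_xdist_options is a hypothetical stand-in written for illustration. It is not the package's own code, and the default list used here contains only the eps option mentioned above.

```r
# Hypothetical helper: combine options from ... and from `options`,
# then overlay them on the defaults by exact name.
merge_xdist_options <- function(defaults, ..., options = list()) {
  given <- c(list(...), options)
  modifyList(defaults, given)
}

defaults <- list(eps = 1e-6)

# Passing eps through ... or through `options` gives the same result,
# mirroring calls such as textcat_xdist(x, method = "KLI", eps = 1e-4)
# and textcat_xdist(x, method = "KLI", options = list(eps = 1e-4)).
merged_dots <- merge_xdist_options(defaults, eps = 1e-4)
merged_list <- merge_xdist_options(defaults, options = list(eps = 1e-4))
```

Because names are matched exactly in this sketch, an abbreviated name would be added as a separate entry and would not override eps.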

Examples

library("textcat")
## Compute cross-distances between the TextCat byte profiles using the
## CT out-of-place measure.
d <- textcat_xdist(TC_byte_profiles)
## Visualize results of hierarchical cluster analysis on the distances.
plot(hclust(as.dist(d)), cex = 0.7)
