textstat_simil: Similarity and distance computation between documents or features

Description

These functions compute matrices of distances and similarities between documents or features from a dfm and return a dist object (or a matrix if specific targets are selected). They are fast and robust because they operate directly on the sparse dfm objects.

Usage

textstat_simil(x, selection = NULL, margin = c("documents",
  "features"), method = c("correlation", "cosine", "jaccard", "ejaccard",
  "dice", "edice", "hamman", "simple matching", "faith"), upper = FALSE,
  diag = FALSE)
textstat_dist(x, selection = NULL, margin = c("documents", "features"),
  method = c("euclidean", "kullback", "manhattan", "maximum", "canberra",
  "minkowski"), upper = FALSE, diag = FALSE, p = 2)

Arguments

x

a dfm object

selection

a valid index for document or feature names (depending on margin) from x, to be selected for comparison

margin

identifies the margin of the dfm on which similarity or difference will be computed: "documents" for documents or "features" for word/term features.

method

the similarity or distance measure to be used; see Details.

upper

whether the upper triangle of the symmetric $V \times V$ matrix is recorded. Only used when value = "dist".

diag

whether the diagonal of the distance matrix should be recorded. Only used when value = "dist".

p

The power of the Minkowski distance.
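As an illustration of how the power p shapes the Minkowski distance, the following is a minimal base R sketch. The helper `minkowski()` and the toy matrix `m` are hypothetical and are not part of this package; the sketch only shows the textbook definition, in which p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.

```r
# Two toy "documents" as rows of term counts (hypothetical data).
m <- rbind(doc1 = c(1, 0, 3),
           doc2 = c(4, 4, 3))

# Hypothetical helper: textbook Minkowski distance of order p.
minkowski <- function(a, b, p = 2) {
  sum(abs(a - b)^p)^(1 / p)
}

d1 <- minkowski(m["doc1", ], m["doc2", ], p = 1)  # Manhattan: 3 + 4 + 0 = 7
d2 <- minkowski(m["doc1", ], m["doc2", ], p = 2)  # Euclidean: sqrt(9 + 16) = 5

# Base R's stats::dist() uses the same definition and accepts the same p.
d2_base <- as.numeric(dist(m, method = "minkowski", p = 2))
```

Larger values of p give more weight to the largest single coordinate difference, so the result approaches the "maximum" distance as p grows.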

Value

By default, textstat_simil and textstat_dist return dist class objects if selection is NULL; otherwise, a matrix is returned matching distances to the documents or features identified in the selection.

These can be transformed into a list format using as.list.dist, if that format is preferred.

Details

textstat_simil options are: "correlation" (default), "cosine", "jaccard", "ejaccard", "dice", "edice", "simple matching", "hamman", and "faith".

textstat_dist options are: "euclidean" (default), "kullback", "manhattan", "maximum", "canberra", and "minkowski".
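To make a few of the listed similarity measures concrete, here is a minimal base R sketch using their textbook definitions on two toy count vectors. The helper names and the data are hypothetical, and the package's own implementations may differ in details such as how counts are weighted or converted to presence/absence.

```r
# Toy term-count vectors for two documents (hypothetical data).
a <- c(1, 0, 1)
b <- c(1, 1, 0)

# Cosine similarity: dot product divided by the product of vector norms.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Jaccard similarity on presence/absence: shared terms over all terms used.
jaccard_sim <- function(a, b) {
  sum(a > 0 & b > 0) / sum(a > 0 | b > 0)
}

sim_cosine      <- cosine_sim(a, b)   # 1 / (sqrt(2) * sqrt(2)) = 0.5
sim_correlation <- cor(a, b)          # Pearson correlation = -0.5
sim_jaccard     <- jaccard_sim(a, b)  # 1 shared term out of 3 = 1/3
```

The same pair of vectors scores quite differently under each measure, which is why the choice of method matters when comparing documents or features.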

References

"kullback" is the Kullback-Leibler distance, which assumes that $P(x_i) = 0$ implies $P(y_i) = 0$. Where either $P(x_i)$ or $P(y_i)$ equals zero, the term $P(x_i) \log(P(x_i)/P(y_i))$ is taken to be zero, its limit value. The formula is: $$\sum_i P(x_i) \log\frac{P(x_i)}{P(y_i)}$$

All other measures are described in the proxy package.
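The Kullback-Leibler formula and its zero convention can be sketched in base R as follows. The helper `kl_distance()` is hypothetical and not part of this package, and it assumes that raw counts are first converted to proportions; the package's internal computation may differ.

```r
# Hypothetical helper: Kullback-Leibler distance between two count vectors.
kl_distance <- function(x, y) {
  p <- x / sum(x)  # assumption: counts are converted to proportions
  q <- y / sum(y)
  # Zero convention: a term is counted as zero when either probability is zero.
  keep <- p > 0 & q > 0
  sum(p[keep] * log(p[keep] / q[keep]))
}

x <- c(2, 1, 1, 0)
y <- c(1, 1, 1, 1)

kl_xy <- kl_distance(x, y)  # 0.5 * log(2), about 0.347
kl_yx <- kl_distance(y, x)  # differs from kl_xy: the measure is not symmetric
kl_xx <- kl_distance(x, x)  # 0 for identical distributions
```

Because the measure is not symmetric, the order of the two documents or features matters when interpreting the result.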

Examples


# NOT RUN {
# similarities for documents
dfmat <- dfm(corpus_subset(data_corpus_inaugural, Year > 1980), 
          remove_punct = TRUE, remove = stopwords("english"))
(tstat1 <- textstat_simil(dfmat, method = "cosine", margin = "documents"))
as.matrix(tstat1)
as.list(tstat1)

# similarities for specific documents
textstat_simil(dfmat, selection = "2017-Trump", margin = "documents")
textstat_simil(dfmat, selection = "2017-Trump", method = "cosine", margin = "documents")
textstat_simil(dfmat, selection = c("2009-Obama" , "2013-Obama"), margin = "documents")

# compute some term similarities
tstat2 <- textstat_simil(dfmat, selection = c("fair", "health", "terror"), method = "cosine",
                      margin = "features")
head(as.matrix(tstat2), 10)
as.list(tstat2, n = 8)

# create a dfm from inaugural addresses delivered after 1990
dfmat <- dfm(corpus_subset(data_corpus_inaugural, Year > 1990), 
               remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
               
# distances for documents 
(tstat1 <- textstat_dist(dfmat, margin = "documents"))
as.matrix(tstat1)

# distances for specific documents
textstat_dist(dfmat, "2017-Trump", margin = "documents")
(tstat2 <- textstat_dist(dfmat, c("2009-Obama" , "2013-Obama"), margin = "documents"))
as.list(tstat2)

# }
