quanteda (version 0.9.2-0)

similarity: compute similarities between documents and/or features

Description

Compute similarities between documents and/or features from a dfm, using the similarity measures defined in proxy::simil. See proxy::pr_DB for the available measures, or for how to create your own.

Usage

similarity(x, selection = NULL, n = NULL, margin = c("features",
  "documents"), method = "correlation", sorted = TRUE, normalize = FALSE)

## S3 method for class 'dfm'
similarity(x, selection = NULL, n = NULL, margin = c("features",
  "documents"), method = "correlation", sorted = TRUE, normalize = FALSE)

## S3 method for class 'similMatrix'
as.matrix(x, ...)

## S3 method for class 'similMatrix'
print(x, digits = 4, ...)

Arguments

x
a dfm object
selection
character or character vector of document names or feature labels from the dfm
n
the top n most similar items will be returned, sorted in descending order. If n is NULL, return all items.
margin
the margin of the dfm on which similarity is computed: "features" to compare word/term features, or "documents" to compare documents.
method
a valid method for computing similarity from pr_DB
sorted
sort results in descending order if TRUE
normalize
if TRUE, normalize the dfm by term frequency within document (so that the dfm values will be relative term frequency within each document)
...
unused
digits
decimal places to display similarity values
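
As a rough illustration of what the normalize argument describes, the sketch below converts raw counts to relative term frequencies within each document using base R only. The toy matrix counts and its labels are hypothetical, and this is not necessarily how quanteda performs the step internally.

```r
# Toy document-feature counts (hypothetical data): rows are documents,
# columns are features.
counts <- matrix(c(2, 1, 1,
                   0, 3, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"), c("oil", "price", "opec")))

# Divide each row by its total so that the values in every document sum to 1.
relfreq <- counts / rowSums(counts)
relfreq
```

After this step a long document and a short one are on the same scale, which is the reason one might set normalize = TRUE before comparing documents.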

Value

  • a named list with one element per selection label; each element is a named vector of similarity values, sorted in descending order when sorted = TRUE.
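
To make the shape of this return value concrete, here is a minimal base R sketch that builds a comparable structure by hand: a named list holding one sorted, named vector per selected item. It uses cor() on a hypothetical toy matrix and a hypothetical helper, top_similar(); it illustrates the structure only and is not quanteda's implementation.

```r
# Toy document-feature counts (hypothetical data).
counts <- matrix(c(2, 1, 1, 0,
                   1, 2, 0, 1,
                   0, 1, 3, 2),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("oil", "price", "opec", "barrel")))

# Hypothetical helper: correlation of each selected document with the others.
top_similar <- function(m, selection, n = NULL) {
  sims <- cor(t(m))  # documents are rows, so correlate the transpose
  out <- lapply(selection, function(s) {
    v <- sims[s, colnames(sims) != s]  # drop the item's similarity with itself
    v <- sort(v, decreasing = TRUE)    # most similar first
    if (!is.null(n)) v <- head(v, n)   # keep only the top n, if requested
    v
  })
  names(out) <- selection
  out
}

result <- top_similar(counts, c("doc1", "doc3"), n = 1)
str(result)
```

Replacing cor(t(m)) with cor(m) in the helper would compare columns instead of rows, which corresponds to the idea behind margin = "features".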

Examples

# create a dfm from inaugural addresses from Reagan onwards
presDfm <- dfm(subset(inaugCorpus, Year > 1980), ignoredFeatures = stopwords("english"),
               stem = TRUE)

# compute some document similarities
(tmp <- similarity(presDfm, margin = "documents"))
# output as a matrix
as.matrix(tmp)
# for specific comparisons
similarity(presDfm, "1985-Reagan", n = 5, margin = "documents")
similarity(presDfm, c("2009-Obama", "2013-Obama"), n = 5, margin = "documents")
similarity(presDfm, c("2009-Obama", "2013-Obama"), margin = "documents")
similarity(presDfm, c("2009-Obama", "2013-Obama"), margin = "documents", method = "cosine")
similarity(presDfm, "2005-Bush", margin = "documents", method = "eJaccard", sorted = FALSE)

# compute some term similarities
similarity(presDfm, c("fair", "health", "terror"), method = "cosine")

# compare to tm
require(tm)
data("crude")
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, stemDocument)
tdm <- TermDocumentMatrix(crude)
findAssocs(tdm, c("oil", "opec", "xyz"), c(0.75, 0.82, 0.1))
# in quanteda
# Matrix() comes from the Matrix package; qualify it in case that package is not attached
quantedaDfm <- new("dfmSparse", Matrix::Matrix(t(as.matrix(tdm))))
similarity(quantedaDfm, c("oil", "opec", "xyz"), n = 14)
corMat <- as.matrix(proxy::simil(as.matrix(quantedaDfm), by_rows = FALSE))
round(head(sort(corMat[, "oil"], decreasing = TRUE), 14), 2)
round(head(sort(corMat[, "opec"], decreasing = TRUE), 9), 2)
