documents.compare: Compare the documents in two corpora/dtms

Description

Compare the documents in corpus dtm with the documents in reference corpus dtm.y.

Usage

documents.compare(dtm, dtm.y = NULL, measure = "cosine", min.similarity = 0, n.topsim = NULL, return.zeros = FALSE)

Arguments

dtm

A document-term matrix in the tm DocumentTermMatrix class. It is recommended to weight the DTM beforehand, for instance using weightTfIdf.

dtm.y

Optional. If given, documents from dtm will only be compared to the documents in dtm.y.

measure

The measure used to calculate similarity/distance/adjacency. Currently supports the symmetrical measure "cosine", for cosine similarity. Also supports the asymmetrical measures "percentage.from" and "percentage.to", which give the percentage of overlapping terms (taking term scores into account). Here "percentage.from" gives the percentage relative to the document that is compared to the other, whereas "percentage.to" gives the percentage relative to the document to which it is compared.

min.similarity

A threshold for similarity. Pairs with lower values are deleted. Set to 0 by default.

n.topsim

An alternative or additional kind of threshold for similarity. Only keep the [n.topsim] highest similarity scores for x. Can return more than [n.topsim] similarity scores if several scores are tied.

return.zeros

If TRUE, all comparison results are returned, including those with zero similarity (rarely useful and problematic with large data).
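
To make the asymmetrical options of measure more concrete, the following base R sketch computes an overlap share for two toy documents. It assumes one plausible reading of the description above: the share of a document's total term score that falls on terms also present in the other document. The matrix m, the helper overlap.share and the formula itself are illustrative assumptions, not the package's own code.

```r
# Toy document-term matrix: rows are documents, columns are terms.
# The values stand in for (weighted) term scores.
m <- rbind(
  doc1 = c(apple = 2, banana = 1, cherry = 1, date = 0),
  doc2 = c(apple = 1, banana = 0, cherry = 0, date = 3)
)

# Hypothetical helper (an assumption, not part of the package):
# share of a's total score that lies on terms which also occur in b.
overlap.share <- function(a, b) sum(a[b > 0]) / sum(a)

# The result depends on which document serves as the base,
# which is what makes the measure asymmetrical.
from.1.to.2 <- overlap.share(m["doc1", ], m["doc2", ])  # 2 / 4
from.2.to.1 <- overlap.share(m["doc2", ], m["doc1", ])  # 1 / 4

print(c(from.1.to.2, from.2.to.1))
```

Under this reading, comparing doc1 to doc2 would give 0.5 when doc1 is the base and 0.25 when doc2 is the base, which is the kind of difference that "percentage.from" and "percentage.to" are meant to expose.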

Value

A data frame with pairs of documents and their similarities.
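
As a rough illustration of how min.similarity and n.topsim narrow down such a result, the base R sketch below filters a hand-made data frame of document pairs. The column names x, y and similarity, as well as the toy values, are assumptions made for this example; the actual column names of the returned data frame are not stated here.

```r
# Hand-made stand-in for a comparison result (column names are assumed).
pairs <- data.frame(
  x = c("a", "a", "a", "b", "b"),
  y = c("p", "q", "r", "p", "q"),
  similarity = c(0.9, 0.5, 0.5, 0.3, 0.7),
  stringsAsFactors = FALSE
)

# In the spirit of min.similarity = 0.4: drop pairs scoring below the threshold.
kept <- pairs[pairs$similarity >= 0.4, ]

# In the spirit of n.topsim = 2: rank the scores within each x (highest first)
# and keep ranks up to n. Tied scores share a rank, so more than n rows
# can survive for a single x.
n <- 2
rnk <- ave(-kept$similarity, kept$x,
           FUN = function(s) rank(s, ties.method = "min"))
top <- kept[rnk <= n, ]

print(top)
```

Document "a" keeps three pairs here even though n is 2, because its second and third scores are tied.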

Details

Document similarity is calculated using a vector space model approach, in which each document is represented as a vector of term scores. Inner-product based similarity measures are used, such as cosine similarity. It is recommended to weight the DTM beforehand, for instance using term frequency-inverse document frequency (tf-idf).
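
The inner-product idea can be sketched in a few lines of base R. This is a minimal illustration on a small dense matrix with made-up values, not the function's actual implementation: cosine similarity is the inner product of two document vectors divided by the product of their lengths.

```r
# Toy document-term matrix: rows are documents, columns are terms.
m <- rbind(
  doc1 = c(2, 1, 0),
  doc2 = c(4, 2, 0),
  doc3 = c(0, 0, 5)
)

# Inner products between all pairs of document vectors.
inner <- tcrossprod(m)

# Euclidean length (norm) of each document vector.
norms <- sqrt(rowSums(m^2))

# Cosine similarity: inner product divided by the product of the norms.
cosine <- inner / outer(norms, norms)

print(round(cosine, 3))
```

Here doc1 and doc2 point in the same direction and so have a cosine similarity of 1, while doc3 shares no terms with either and scores 0.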

Examples


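# Load the example document-term matrix
data(dtm)

# Keep only document pairs with a similarity of at least 0.4 (cosine by default)
comp = documents.compare(dtm, min.similarity=0.4)
head(comp)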
