delete.duplicates: Delete duplicate (or similar) documents from a document term matrix

Description

Delete duplicate (or similar) documents from a document term matrix. Duplicates are defined by having high content similarity, occurring within a given time distance, and being published by the same source.

Usage

delete.duplicates(dtm, meta, id.var = "document_id", date.var = "date", source.var = "source", hour.window = c(-24, 24), measure = "cosine", similarity = 1, keep = "first", tf.idf = FALSE)

Arguments

dtm

A document-term matrix in the tm DocumentTermMatrix class. It is recommended to weight the DTM beforehand, for instance using weightTfIdf.

meta

A data.frame where rows are documents and columns are document meta information. Should contain 3 columns: the document name/id, date and source. The name/id column should match the document names/ids of the dtm, and its label is specified in the `id.var` argument. The date column should be interpretable with as.POSIXct, and its label is specified in the `date.var` argument. The label of the source column is specified in the `source.var` argument.
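As an illustration of the expected layout, the following is a minimal sketch of a `meta` data.frame using the default column labels. The object name `meta_example` and all values are made up for this example; they are not part of the package data.

```r
# Hypothetical metadata for three documents (all values are illustrative).
meta_example <- data.frame(
  document_id = c("d1", "d2", "d3"),
  date = c("2013-06-01 08:00:00", "2013-06-01 12:30:00", "2013-06-02 09:15:00"),
  source = c("Paper A", "Paper A", "Paper B"),
  stringsAsFactors = FALSE
)

# The date column should convert cleanly with as.POSIXct.
parsed <- as.POSIXct(meta_example$date, tz = "UTC")

# Unparseable dates become NA, so this check fails early on bad input.
stopifnot(!any(is.na(parsed)))
```

The ids in `document_id` would need to match the document names of the DTM passed as `dtm`.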

id.var

The label for the document name/id column in the `meta` data.frame. Default is "document_id".

date.var

The label for the document date column in the `meta` data.frame. Default is "date".

source.var

The label for the document source column in the `meta` data.frame. Default is "source".

hour.window

A vector of length 2, in which the first and second value determine the left and right side of the window, respectively. By default c(-24,24), which compares each document to all other documents within a 24 hour time distance.
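The window logic can be sketched in base R as follows. This is only an illustration of the idea, not the package's internal implementation; the dates are made up, and treating the window boundaries as inclusive is an assumption.

```r
# Hypothetical publication times for three documents.
dates <- as.POSIXct(
  c("2013-06-01 08:00:00", "2013-06-01 20:00:00", "2013-06-03 08:00:00"),
  tz = "UTC"
)
hour.window <- c(-24, 24)

# Convert to hours since the epoch, then take all pairwise differences:
# hourdiff[i, j] is how many hours document j was published after document i.
hours <- as.numeric(dates) / 3600
hourdiff <- outer(hours, hours, function(x, y) y - x)

# A pair is compared only if the difference falls inside the window
# (inclusive boundaries are an assumption of this sketch).
in_window <- hourdiff >= hour.window[1] & hourdiff <= hour.window[2]
diag(in_window) <- FALSE  # a document is not compared with itself
```

Here only the first two documents, published 12 hours apart, fall within each other's window.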

measure

The measure that should be used to calculate similarity/distance/adjacency. Currently supports the symmetrical measure "cosine" (cosine similarity), and the asymmetrical measure "overlap_pct" (percentage of term scores in the document that also occur in the other document).
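To make the difference between the two measures concrete, here is a small base R sketch on two made-up term vectors. The helper `overlap_pct` below is a hypothetical function written for this example; it reflects one reading of the description above (expressed as a proportion) and is not the package's own code.

```r
# Hypothetical term scores for two documents over the same three terms.
a <- c(economy = 2, election = 1, football = 0)
b <- c(economy = 1, election = 0, football = 0)

# Cosine similarity: symmetrical, the same value in both directions.
cosine <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# One reading of "overlap_pct": the share of x's term scores that belong
# to terms also occurring in y. This is asymmetrical.
overlap_pct <- function(x, y) sum(x[y > 0]) / sum(x)

overlap_a_in_b <- overlap_pct(a, b)  # 2 of a's 3 score points are shared
overlap_b_in_a <- overlap_pct(b, a)  # all of b's score points are shared
```

Because document `b` is fully contained in `a` but not the reverse, the two overlap values differ, whereas the cosine value does not depend on direction.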

similarity

A threshold for similarity. Documents whose similarity is equal to or higher than this threshold are deleted.

keep

A character string indicating whether to keep the 'first' or 'last' published of the duplicate documents.
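The effect of `keep` on a single pair of duplicates can be sketched as below. The helper `pick_kept` and the data are hypothetical and only illustrate the choice between the earlier and the later document; the package may resolve larger groups of duplicates differently.

```r
# Two hypothetical documents that were judged to be duplicates.
pair <- data.frame(
  document_id = c("d1", "d2"),
  date = as.POSIXct(c("2013-06-01 08:00:00", "2013-06-01 14:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Return the id of the document that is retained for a given `keep` value.
pick_kept <- function(pair, keep = c("first", "last")) {
  keep <- match.arg(keep)
  ord <- order(pair$date)  # row indices from earliest to latest
  idx <- if (keep == "first") ord[1] else ord[length(ord)]
  pair$document_id[idx]
}

kept_first <- pick_kept(pair, "first")  # the earlier document, "d1"
kept_last <- pick_kept(pair, "last")    # the later document, "d2"
```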

tf.idf

If TRUE, weight the DTM with tf-idf before comparing documents. The weighting is used only for the comparison: the original (non-weighted) DTM is returned.

Value

A DTM with the duplicate documents deleted.

Details

Note that this can also be used to delete "updates" of articles (e.g., on news sites, news agencies). This should be considered if the temporal order of publications is relevant for the analysis.

Examples


data(dtm)
data(meta)

## example with very low similarity threshold (normally not recommended!)
dtm2 = delete.duplicates(dtm, meta, similarity = 0.5, keep='first', tf.idf = TRUE)
