Usage
delete.duplicates(dtm, meta, id.var = "document_id", date.var = "date", source.var = "source", hour.window = c(-24, 24), measure = "cosine", similarity = 1, keep = "first", tf.idf = FALSE)
Arguments
meta
A data.frame where rows are documents and columns are document meta information.
Should contain 3 columns: the document name/id, date and source.
The name/id column should match the document names/ids of the DTM, and its label is specified in the `id.var` argument.
The date column should be interpretable with as.POSIXct, and its label is specified in the `date.var` argument.
The label of the source column is specified in the `source.var` argument.
id.var
The label for the document name/id column in the `meta` data.frame. Default is "document_id".
date.var
The label for the document date column in the `meta` data.frame. Default is "date".
source.var
The label for the document source column in the `meta` data.frame. Default is "source".
hour.window
A vector of length 2, in which the first and second value determine the left and right side of the window, respectively, in hours. The default, c(-24, 24), compares each document to all other documents published up to 24 hours before or after it.
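The effect of the window can be sketched in base R. This is only an illustration, not the package's implementation: the document ids and dates are made up, and both the inclusive boundaries and the direction of the time difference (other document minus current document) are assumptions.

```r
# Hypothetical document ids and publication times (not from the package)
ids   <- c("doc1", "doc2", "doc3")
dates <- as.POSIXct(c("2013-06-01 08:00:00",
                      "2013-06-01 20:00:00",
                      "2013-06-03 09:00:00"), tz = "UTC")

hour.window <- c(-24, 24)

# Hours from the row document to the column document
# (assumed direction: column date minus row date)
secs <- as.numeric(dates)
hour_diff <- outer(secs, secs, function(x, y) (y - x) / 3600)
dimnames(hour_diff) <- list(ids, ids)

# A pair is compared only if the difference falls inside the window
# (inclusive boundaries are an assumption)
in_window <- hour_diff >= hour.window[1] & hour_diff <= hour.window[2]
diag(in_window) <- FALSE  # a document is never compared to itself

in_window
```

Here doc1 and doc2 are 12 hours apart and fall inside the window, while doc3 is more than 24 hours away from both and would not be compared to either.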
measure
The measure that should be used to calculate similarity/distance/adjacency. Currently supports the symmetrical measure "cosine" (cosine similarity), and the asymmetrical measure "overlap_pct" (percentage of term scores in the document that also occur in the other document).
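The difference between the symmetrical and asymmetrical measure can be illustrated with two small term vectors. The helper functions `cosine` and `overlap_pct` below are hypothetical and written only for this sketch; in particular, `overlap_pct` is one reading of the description above, expressed as a proportion, and may not match the package's exact calculation.

```r
# Hypothetical term scores for two documents over the same four terms
x <- c(a = 2, b = 1, c = 0, d = 1)
y <- c(a = 2, b = 0, c = 3, d = 1)

# Cosine similarity: the same value in both directions
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Assumed reading of "overlap_pct": share of the term scores of x
# that belong to terms also occurring in y
overlap_pct <- function(x, y) sum(x[y > 0]) / sum(x)

cosine(x, y)       # equal to cosine(y, x)
overlap_pct(x, y)  # from the perspective of x
overlap_pct(y, x)  # from the perspective of y, generally a different value
```

With an asymmetrical measure, the score of a pair depends on which document is taken as the starting point, so a short document can overlap almost completely with a long one while the reverse is not true.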
similarity
A threshold for similarity. Documents with a similarity equal to or higher than this threshold are deleted.
keep
A character string indicating whether to keep the "first" or "last" published document of a set of duplicates.
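How the `similarity` threshold and `keep` interact can be sketched as follows. This is a simplified base R illustration under stated assumptions, not the function's actual code: the `pairs` table, the ids and the publication times are all made up, and any additional conditions the function applies are ignored.

```r
# Hypothetical pairwise similarities between three documents
pairs <- data.frame(
  x = c("doc1", "doc1", "doc2"),
  y = c("doc2", "doc3", "doc3"),
  similarity = c(1.00, 0.40, 0.60),
  stringsAsFactors = FALSE
)

# Hypothetical publication times, stored as seconds and named by document id
published <- as.numeric(as.POSIXct(c("2013-06-01 08:00:00",
                                     "2013-06-01 20:00:00",
                                     "2013-06-02 06:00:00"), tz = "UTC"))
names(published) <- c("doc1", "doc2", "doc3")

threshold <- 0.95
keep <- "first"

# Pairs at or above the threshold are treated as duplicates
dup <- pairs[pairs$similarity >= threshold, ]

# Within each duplicate pair, find the earlier and the later document
x_is_later <- published[dup$x] > published[dup$y]
later   <- ifelse(x_is_later, dup$x, dup$y)
earlier <- ifelse(x_is_later, dup$y, dup$x)

# keep = "first" drops the later document, keep = "last" drops the earlier one
to_delete <- unname(unique(if (keep == "first") later else earlier))
to_delete
```

In this sketch only doc1 and doc2 reach the threshold, so with keep = "first" the later document, doc2, is marked for deletion; with keep = "last" it would be doc1.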
tf.idf
If TRUE, weight the DTM with tf-idf before comparing documents. The weighting is used only for the comparison: the DTM that is returned keeps its original (non-weighted) values.