Learn R Programming

textreuse (version 0.1.3)

TextReuseTextDocument: TextReuseTextDocument

Description

This is the constructor function for TextReuseTextDocument objects. This class is used for comparing documents.

Usage

TextReuseTextDocument(text, file = NULL, meta = list(),
  tokenizer = tokenize_ngrams, ..., hash_func = hash_string,
  minhash_func = NULL, keep_tokens = FALSE, keep_text = TRUE,
  skip_short = TRUE)

is.TextReuseTextDocument(x)

has_content(x)

has_tokens(x)

has_hashes(x)

has_minhashes(x)

Arguments

Value

  • An object of class TextReuseTextDocument. This object inherits from the virtual S3 class TextDocument in the NLP package. It contains the following elements: [object Object],[object Object],[object Object],[object Object],[object Object]

Details

This constructor function follows a three-step process. It reads in the text, either from a file or from memory. It then tokenizes that text. Then it hashes the tokens. Most of the comparison functions in this package rely only on the hashes to make the comparison. By passing FALSE to keep_tokens and keep_text, you can avoid saving those objects, which can result in significant memory savings for large corpora.

If skip_short = TRUE, this function will return NULL for very short or empty documents. A very short document is one where there are two few words to create at least two n-grams. For example, if five-grams are desired, then a document must be at least six words long. If no value of n is provided, then the function assumes a value of n = 3. A warning will be printed with the document ID of a skipped document.

See Also

Accessors for TextReuse objects.

Examples

Run this code
file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
doc  <- TextReuseTextDocument(file = file, meta = list(id = "ny1850"))
print(doc)
meta(doc)
head(tokens(doc))
head(hashes(doc))
content(doc)

Run the code above in your browser using DataLab