TextReuseTextDocument objects

This is the constructor function for TextReuseTextDocument objects. This
class is used for comparing documents.

Usage

TextReuseTextDocument(text, file = NULL, meta = list(),
  tokenizer = tokenize_ngrams, ..., hash_func = hash_string,
  minhash_func = NULL, keep_tokens = FALSE, keep_text = TRUE,
  skip_short = TRUE)

is.TextReuseTextDocument(x)
has_content(x)
has_tokens(x)
has_hashes(x)
has_minhashes(x)
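As a rough illustration (not part of the package's documented usage), the
defaults in the signature above imply the following behavior for the
predicate helpers; the text and id here are invented:

library(textreuse)

doc <- TextReuseTextDocument(text = "This string has more than enough words for trigrams.",
                             meta = list(id = "my_id"))

is.TextReuseTextDocument(doc)  # TRUE
has_content(doc)               # TRUE: keep_text = TRUE by default
has_tokens(doc)                # FALSE: keep_tokens = FALSE by default
has_hashes(doc)                # TRUE: hashes are computed and kept
has_minhashes(doc)             # FALSE: no minhash_func was supplied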
Arguments

text          A character vector containing the text of the document. This
              argument can be skipped if supplying file.

file          The path to a text file, if text is not provided.

meta          A list with named elements for the metadata associated with
              this document. If a document is created using the text
              parameter, then you must provide an id field, e.g.,
              meta = list(id = "my_id"). If the document is created from a
              file, then the ID will be created from the file name.

tokenizer     A function to split the text into tokens. See tokenizers. If
              value is NULL, then tokenizing and hashing will be skipped.

...           Arguments passed on to the tokenizer.

hash_func     A function to hash the tokens. See hash_string.

minhash_func  A function to create minhash signatures of the document. See
              minhash_generator.

keep_tokens   Should the tokens be saved in the document that is returned,
              or discarded?

keep_text     Should the text be saved in the document that is returned, or
              discarded?

skip_short    Should very short documents be skipped? (See Details.)

Value

An object of class TextReuseTextDocument. This object inherits from the
virtual S3 class TextDocument in the NLP package. It contains the following
elements:

content       The text of the document.
tokens        The tokens created from the text.
hashes        Hashes of the tokens.
minhashes     The minhash signature of the document.
meta          A list with named elements of metadata.

Details

By passing FALSE to
keep_tokens and keep_text, you can avoid saving those
objects, which can result in significant memory savings for large corpora. If skip_short = TRUE, this function will return NULL for very
short or empty documents. A very short document is one where there are too
few words to create at least two n-grams. For example, if five-grams are
desired, then a document must be at least six words long. If no value of
n is provided, then the function assumes a value of n = 3. A
warning will be printed with the document ID of a skipped document.
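A sketch of the two behaviors described above (the document texts and ids
are invented for illustration):

# Discard both text and tokens to save memory; hashes are retained
lean <- TextReuseTextDocument(text = "one two three four five six seven eight",
                              meta = list(id = "lean"),
                              keep_tokens = FALSE, keep_text = FALSE)
has_content(lean)  # FALSE: text was discarded
has_hashes(lean)   # TRUE

# With the default trigram tokenizer, at least four words are needed to
# produce two n-grams, so this three-word document is skipped
short <- TextReuseTextDocument(text = "too few words", meta = list(id = "tiny"))
is.null(short)     # TRUE, after a warning naming the skipped ID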
Examples

file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
doc <- TextReuseTextDocument(file = file, meta = list(id = "ny1850"))
print(doc)
meta(doc)
head(tokens(doc))
head(hashes(doc))
content(doc)
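A possible extension of the example above using the package's
minhash_generator() helper; the number of minhashes and the seed are
arbitrary choices:

minhash <- minhash_generator(n = 240, seed = 253)
doc_mh <- TextReuseTextDocument(file = file, meta = list(id = "ny1850"),
                                minhash_func = minhash)
head(minhashes(doc_mh))
has_minhashes(doc_mh)  # TRUE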