This is the constructor function for TextReuseTextDocument objects,
the class used for comparing documents.
TextReuseTextDocument(
  text,
  file = NULL,
  meta = list(),
  tokenizer = tokenize_ngrams,
  ...,
  hash_func = hash_string,
  minhash_func = NULL,
  keep_tokens = FALSE,
  keep_text = TRUE,
  skip_short = TRUE
)

is.TextReuseTextDocument(x)

has_content(x)

has_tokens(x)

has_hashes(x)

has_minhashes(x)
text
    A character vector containing the text of the document. This
    argument can be skipped if supplying file.

file
    The path to a text file, if text is not provided.

meta
    A list with named elements for the metadata associated with this
    document. If a document is created using the text parameter, then
    you must provide an id field, e.g., meta = list(id = "my_id").
    If the document is created using file, then the ID will be created
    from the file name.

tokenizer
    A function to split the text into tokens. See tokenizers. If the
    value is NULL, then tokenizing and hashing will be skipped.

...
    Arguments passed on to the tokenizer.

hash_func
    A function to hash the tokens. See hash_string.

minhash_func
    A function to create minhash signatures of the document. See
    minhash_generator.

keep_tokens
    Should the tokens be saved in the returned document or discarded?

keep_text
    Should the text be saved in the returned document or discarded?

skip_short
    Should short documents be skipped? (See details.)

x
    An R object to check.
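
For example, a document can be created directly from an in-memory string, in which case an id must be supplied (the text and ID below are invented for illustration):

doc <- TextReuseTextDocument(text = "This is a sample document created from a string in memory.",
                             meta = list(id = "sample-doc"))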
An object of class TextReuseTextDocument. This object inherits from
the virtual S3 class TextDocument in the NLP package. It contains the
following elements:

content
    The text of the document.

tokens
    The tokens created from the text.

hashes
    Hashes created from the tokens.

minhashes
    The minhash signature of the document.

meta
    The document metadata, including the filename (if any) in file.
This constructor function follows a three-step process: it reads in
the text, either from a file or from memory; it tokenizes that text;
and it hashes the tokens. Most of the comparison functions in this
package rely only on the hashes to make the comparison. By passing
FALSE to keep_tokens and keep_text, you can avoid saving those
objects, which can result in significant memory savings for large
corpora.
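
As a sketch of that memory-saving pattern (assuming file holds a path to a plain-text document, as in the examples below):

doc <- TextReuseTextDocument(file = file, meta = list(id = "ny1850"),
                             keep_tokens = FALSE, keep_text = FALSE)
has_content(doc)  # the text was discarded at construction
has_hashes(doc)   # the hashes are retained for comparisons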
If skip_short = TRUE, this function will return NULL for very short or
empty documents. A very short document is one with too few words to
create at least two n-grams. For example, if five-grams are desired,
then a document must be at least six words long. If no value of n is
provided, then the function assumes a value of n = 3. A warning will
be printed with the document ID of a skipped document.
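
The skipping behavior can be seen with an invented two-word string, which is too short for the default trigrams:

short <- TextReuseTextDocument(text = "Too short", meta = list(id = "short-doc"))
is.null(short)  # TRUE; a warning names the skipped document ID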
file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
doc <- TextReuseTextDocument(file = file, meta = list(id = "ny1850"))
print(doc)
meta(doc)
head(tokens(doc))
head(hashes(doc))
content(doc)
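
The predicate functions can be used to check which elements a document contains, e.g. for the doc object created above:

is.TextReuseTextDocument(doc)
has_content(doc)
has_tokens(doc)
has_hashes(doc)
has_minhashes(doc)  # FALSE unless a minhash_func was supplied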