token_index

Build an inverted index that maps each token to the documents containing it.
Use it to find document pairs that share one or more n-grams without
comparing every pair of documents. The corpus must be created with
<code>keep_tokens = TRUE</code> so that its tokens are retained.
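
The idea of an inverted index can be sketched in base R. This is an illustration only, not the package's implementation: the toy <code>docs</code> list and the variable names are assumptions, and the structure that <code>token_index</code> actually returns may differ.

```r
# Toy corpus (hypothetical data): document id -> character vector of tokens.
docs <- list(
  a = c("the quick brown", "quick brown fox"),
  b = c("quick brown fox", "brown fox jumps"),
  c = c("a lazy dog")
)

# Inverted index: token -> ids of the documents that contain it.
doc_ids <- rep(names(docs), lengths(docs))
index <- split(doc_ids, unlist(docs, use.names = FALSE))
index <- lapply(index, unique)

# Candidate pairs: only documents that share at least one token,
# so documents with nothing in common are never compared.
shared <- Filter(function(ids) length(ids) >= 2, index)
pairs <- unique(do.call(
  rbind,
  lapply(shared, function(ids) t(combn(sort(ids), 2)))
))
```

Here only documents <code>a</code> and <code>b</code> share a token, so they form the single candidate pair, and document <code>c</code> is never compared with anything.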

Tools for measuring similarity among documents and detecting
passages which have been reused. Implements shingled n-gram, skip n-gram,
and other tokenizers; similarity/dissimilarity functions; pairwise
comparisons; minhash and locality sensitive hashing algorithms; and a
version of the Smith-Waterman local alignment algorithm suitable for
natural language.

Yaoxiang Li

textreuse

Detect Text Reuse and Document Similarity

Lincoln Mullen

token_index function

<dl><dt>corpus</dt>
<dd>A <code>TextReuseCorpus</code> with retained tokens.</dd>
<dt>min_doc_count</dt>
<dd>Minimum number of documents a token must appear in to
be retained. Increase this to remove rare tokens.</dd>
<dt>max_doc_count</dt>
<dd>Maximum number of documents a token may appear in to be
retained. Decrease this to remove very common tokens.</dd></dl>
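
The effect of the two document-count thresholds can be sketched in base R. This is a sketch under stated assumptions: <code>filter_index</code> is a hypothetical helper, not a textreuse function, the toy index is invented for illustration, and the default values shown are guesses rather than the package's defaults.

```r
# Hypothetical index: token -> ids of the documents that contain it.
index <- list(
  "of the"         = c("a", "b", "c", "d"),
  "text reuse"     = c("a", "c"),
  "smith waterman" = c("b")
)

# Hypothetical helper (not part of textreuse): keep only tokens whose
# document count lies between min_doc_count and max_doc_count, inclusive.
# The defaults here are assumptions chosen so that nothing is filtered.
filter_index <- function(index, min_doc_count = 1, max_doc_count = Inf) {
  n_docs <- lengths(index)
  index[n_docs >= min_doc_count & n_docs <= max_doc_count]
}

# Drop the rare token (1 document) and the very common one (4 documents).
kept <- filter_index(index, min_doc_count = 2, max_doc_count = 3)
names(kept)
```

Raising <code>min_doc_count</code> removes tokens that cannot link two documents, and lowering <code>max_doc_count</code> removes tokens so common that they would pair nearly every document with every other.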

Arguments

Build an index of tokens and documents — token_index

