textreuse (version 1.0.1)

token_index: Build an index of tokens and documents

Description

Build an inverted index from tokens to the documents that contain them. This is useful for finding document pairs that share one or more n-grams without comparing every document pair. The corpus must be created with keep_tokens = TRUE.
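As a rough illustration of the idea, the following base R sketch builds an inverted index by hand and derives candidate pairs from it. It does not call the package; the toy data and the names doc_tokens, pairs, index, shared, and candidates are assumptions made for this example only.

```r
# Toy corpus (hypothetical data): document id -> character vector of n-grams.
doc_tokens <- list(
  a = c("the quick", "quick brown", "brown fox"),
  b = c("a quick", "quick brown", "brown dog"),
  c = c("lazy dog", "brown dog")
)

# Long table with one row per unique (document, token) combination.
pairs <- unique(data.frame(
  doc   = rep(names(doc_tokens), lengths(doc_tokens)),
  token = unlist(doc_tokens, use.names = FALSE),
  stringsAsFactors = FALSE
))

# Inverted index: token -> ids of the documents that contain it.
index <- split(pairs$doc, pairs$token)

# Only tokens found in two or more documents can link a pair.
shared <- index[lengths(index) >= 2]

# Candidate pairs: every pair of documents listed under the same token.
candidates <- unique(do.call(
  rbind,
  lapply(shared, function(d) t(combn(sort(d), 2)))
))
print(candidates)
```

Only the documents that appear together under some token are paired, so documents with no shared n-gram are never compared.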

Usage

token_index(corpus, min_doc_count = 2, max_doc_count = Inf)

Value

A textreuse_token_index data frame with columns token, docs, and n_docs.

Arguments

corpus

A TextReuseCorpus with retained tokens.

min_doc_count

Minimum number of documents a token must appear in to be retained. The default of 2 drops tokens found in only one document, since they cannot link a document pair. Increase this to remove rare tokens.

max_doc_count

Maximum number of documents a token may appear in to be retained. Decrease this to remove very common tokens.
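The effect of the two thresholds can be sketched in base R on a hand-built index. This is an illustration rather than the package's implementation: filter_index is a hypothetical helper, the toy index is invented, and both bounds are assumed to be inclusive.

```r
# Hypothetical toy index: token -> ids of the documents that contain it.
index <- list(
  "of the"      = c("a", "b", "c", "d"),
  "quick brown" = c("a", "b"),
  "brown fox"   = "a"
)

# Hypothetical helper mirroring the two thresholds described above.
# Assumption: both bounds are inclusive.
filter_index <- function(index, min_doc_count = 2, max_doc_count = Inf) {
  n_docs <- lengths(index)
  index[n_docs >= min_doc_count & n_docs <= max_doc_count]
}

# With the defaults, the token seen in a single document is dropped.
print(names(filter_index(index)))

# Lowering the maximum also drops the token shared by all four documents.
print(names(filter_index(index, max_doc_count = 3)))
```

Capping max_doc_count matters because a token shared by k documents contributes k * (k - 1) / 2 candidate pairs, so very common tokens dominate the candidate list.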