
textpress (version 1.1.0)

nlp_tokenize_text: Tokenize Text Data (mostly) Non-Destructively

Description

Tokenizes text from a corpus data frame, preserving structure like capitalization and punctuation.

Usage

nlp_tokenize_text(
  corpus,
  by = c("doc_id", "paragraph_id", "sentence_id"),
  include_spans = TRUE,
  method = "word"
)

Value

A named list of token vectors, or, if include_spans = TRUE, a list containing both the tokens and their character spans.

Arguments

corpus

A data frame or data.table containing a text column and the identifiers specified in by.

by

A character vector of column names used as unique identifiers. The last column determines the unit of tokenization (e.g., if by = c("doc_id", "paragraph_id"), tokens are grouped at the paragraph level).

include_spans

Logical. If TRUE (the default), character start/end spans are returned for each token.

method

Character. The tokenization method: "word" (the default) or "biber".

Examples

corpus <- data.frame(doc_id = c('1', '1', '2'),
                     sentence_id = c('1', '2', '1'),
                     text = c("Hello world.",
                              "This is an example.",
                              "This is a party!"))
tokens <- nlp_tokenize_text(corpus, by = c('doc_id', 'sentence_id'))
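A sketch extending the example above, showing the include_spans argument; the element-naming scheme (the by identifiers joined by a period, e.g. "1.1") is an assumption to verify against your installed version:

```r
# Tokens only, without character spans
tokens_only <- nlp_tokenize_text(corpus,
                                 by = c('doc_id', 'sentence_id'),
                                 include_spans = FALSE)

# Inspect the tokens for the first unit; names are assumed to
# combine the `by` identifiers (e.g. "1.1" = doc 1, sentence 1)
tokens_only[[1]]
```

Because tokenization is (mostly) non-destructive, capitalization and punctuation such as "Hello" and "." are preserved as separate tokens rather than stripped.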
