nlp_split_sentences: Split Text into Sentences

This function splits the text column of a data frame into individual sentences, using the identifier columns given in by, and accounts for a supplied list of abbreviations during splitting.

A lightweight toolkit for text retrieval and NLP with a consistent and
predictable API organized around four actions: fetching, reading,
processing, and searching. Functions cover the full pipeline from web
data acquisition to text processing and indexing. Multiple search
strategies are supported, including regex, BM25 keyword ranking, cosine
similarity, and dictionary matching. The package is pipe-friendly, has
no heavy dependencies, and returns all outputs as plain data frames. It
is also useful as a building block for retrieval-augmented generation
pipelines and autonomous agent workflows.

Jason Timm

textpress

A Lightweight and Versatile NLP Toolkit

Arguments

<dl>

<dt>corpus</dt>
<dd>A data frame or data.table containing a <code>text</code> column and the identifiers specified in <code>by</code>.</dd>


<dt>by</dt>
<dd>A character vector of column names used as unique identifiers.
The last column determines the search unit (e.g., if <code>by = c("doc_id", "para_id")</code>,
the search returns matches at the paragraph level).</dd>


<dt>abbreviations</dt>
<dd>A character vector of abbreviations to handle during sentence splitting. Defaults to <code>textpress::abbreviations</code>.</dd>

</dl>

Examples
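
The following is a minimal usage sketch, not an example taken from the package documentation. It assumes the textpress package is installed and that the function accepts the three arguments described above; the toy corpus and the doc_id identifier column are hypothetical. The structure of the returned data frame is not documented in this section, so the sketch only prints the result.

```r
# Hypothetical two-document corpus: a `text` column plus the identifier
# column named in `by`, as the argument descriptions require.
corpus <- data.frame(
  doc_id = c("1", "2"),
  text = c(
    "Dr. Smith arrived at 9 a.m. on Monday. The meeting started late.",
    "The U.S. delegation left early. Nobody objected."
  ),
  stringsAsFactors = FALSE
)

# Split each document into sentences. The abbreviation list is the
# documented default, written out explicitly for clarity.
sentences <- textpress::nlp_split_sentences(
  corpus,
  by = "doc_id",
  abbreviations = textpress::abbreviations
)

# Inspect the result; the exact output columns are an open question here.
print(sentences)
```

Passing a custom character vector to abbreviations should let domain-specific abbreviations be handled during splitting, per the argument description.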