search_dict: Exact n-gram matcher (vector of terms)

Finds a long list of multi-word expressions (MWEs) or terms without regex
overhead or the risk of partial matches. The function tokenizes the corpus,
builds n-grams of <code>n_min</code> to <code>n_max</code> tokens, and
exact-joins them against <code>terms</code>. Because matching operates on
whole tokens, word boundaries are respected by design. To attach categories
(e.g. term = "R Project", category = "Software"), <code>left_join</code> your
metadata onto the result using <code>ngram</code> or <code>term</code> as the key.
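The tokenize, n-gram, exact-join idea can be sketched in base R as follows. This is only an illustration under stated assumptions: <code>make_ngrams</code>, the toy corpus, and the "Country" category are made up for the example, the tokenizer is a naive whitespace split, and matching is case-sensitive. None of this is taken from the textpress source, and the package's own tokenization and case handling may differ.

```r
# Illustrative sketch only: hypothetical helper, not textpress internals.

# Build every n-gram of size n_min..n_max from a character vector of tokens.
make_ngrams <- function(tokens, n_min = 1L, n_max = 5L) {
  out <- character(0)
  for (n in seq(n_min, n_max)) {
    if (length(tokens) < n) next
    starts <- seq_len(length(tokens) - n + 1L)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1L)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

# Toy corpus (made up for this example).
corpus <- data.frame(
  doc_id = c("1", "2"),
  text = c("The R Project is based in the United States",
           "Projects in the united kingdom"),
  stringsAsFactors = FALSE
)
terms <- c("United States", "R Project")

# Naive whitespace tokenization; a real tokenizer would also handle punctuation.
tokens <- strsplit(corpus$text, "\\s+")

# One row per (doc_id, ngram), using n-grams of one or two tokens.
ngrams <- do.call(rbind, lapply(seq_along(tokens), function(i) {
  g <- make_ngrams(tokens[[i]], n_min = 1L, n_max = 2L)
  data.frame(doc_id = rep(corpus$doc_id[i], length(g)),
             ngram = g,
             stringsAsFactors = FALSE)
}))

# Exact join: keep only n-grams that equal a term in full, so "Projects"
# does not match "R Project" and "united kingdom" does not match anything.
hits <- ngrams[ngrams$ngram %in% terms, , drop = FALSE]

# Attach categories with a left join on the matched n-gram
# (base-R merge with all.x = TRUE plays the role of left_join here).
categories <- data.frame(
  term = c("R Project", "United States"),
  category = c("Software", "Country"),
  stringsAsFactors = FALSE
)
tagged <- merge(hits, categories, by.x = "ngram", by.y = "term", all.x = TRUE)
```

Because every candidate is a whole-token n-gram compared for equality, no regular expression is compiled per term, and the cost of adding more terms is the cost of a larger lookup table.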

A lightweight toolkit for text retrieval and NLP with a consistent and
predictable API organized around four actions: fetching, reading,
processing, and searching. Functions cover the full pipeline from web
data acquisition to text processing and indexing. Multiple search
strategies are supported, including regex, BM25 keyword ranking, cosine
similarity, and dictionary matching. The package is pipe-friendly, has no
heavy dependencies, and returns all outputs as plain data frames. It is
also useful as a building block for retrieval-augmented generation
pipelines and autonomous agent workflows.

Jason Timm

textpress

A Lightweight and Versatile NLP Toolkit

Arguments

<dl>

<dt>corpus</dt>
<dd>The text data (data frame or data.table with <code>text</code> and <code>by</code> columns).</dd>


<dt>by</dt>
<dd>Identifier columns (e.g. <code>c("doc_id", "sentence_id")</code>).</dd>


<dt>terms</dt>
<dd>A character vector of terms/variants to find (e.g. <code>c("United States", "R Project")</code>).</dd>


<dt>n_min</dt>
<dd>Integer. Minimum n-gram size (default 1).</dd>


<dt>n_max</dt>
<dd>Integer. Maximum n-gram size (default 5).</dd>

</dl>
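A call using the arguments listed above might look like the sketch below. The toy corpus is made up, <code>by = "doc_id"</code> assumes a single identifier column, and <code>n_max = 2</code> is chosen only because the longest term here has two words. The call is guarded so the snippet still runs when textpress is not installed, and the shape of the returned data frame is not shown because it is not documented in this section.

```r
# Toy input: a data frame with an identifier column and a `text` column.
corpus <- data.frame(
  doc_id = c("1", "2"),
  text = c("The R Project is based in the United States",
           "Projects in the united kingdom"),
  stringsAsFactors = FALSE
)
terms <- c("United States", "R Project")

# Only call the package if it is available on this machine.
if (requireNamespace("textpress", quietly = TRUE)) {
  hits <- textpress::search_dict(
    corpus = corpus,
    by = "doc_id",   # identifier column(s) that define a text unit
    terms = terms,   # dictionary of terms to match exactly
    n_min = 1,       # smallest n-gram size to build
    n_max = 2        # largest n-gram size; cover your longest term
  )
  print(hits)
}
```

Setting <code>n_max</code> no larger than the longest term in the dictionary keeps the number of generated n-grams down.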
