textpress (version 1.1.0)

search_dict: Exact n-gram matcher (vector of terms)

Description

Match a long list of multi-word expressions (MWEs) or other terms against a corpus without regex overhead or the risk of partial matches. The corpus is tokenized, n-grams are built, and the n-grams are exact-joined against terms, so word boundaries are respected by design. To attach categories (e.g. term = "R Project", category = "Software"), left_join your metadata onto the result using ngram or term as the key.
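The tokenize, build n-grams, exact join idea can be sketched in base R. This is an illustration of the general technique only, not the package's implementation: the helper names tokenize and build_ngrams are hypothetical, and the lowercasing, the split on non-alphanumeric characters, and the use of token positions for start and end are all assumptions.

```r
# Illustrative sketch only; NOT the textpress implementation.
# Assumption: lowercase the text and split on non-alphanumeric characters.
tokenize <- function(x) {
  toks <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
  toks[nzchar(toks)]
}

# Build every n-gram of size n_min..n_max, recording token positions
# (assumption: start/end are token indices, not character offsets).
build_ngrams <- function(tokens, n_min = 1, n_max = 5) {
  sizes <- n_min:n_max
  sizes <- sizes[sizes <= length(tokens)]
  rows <- lapply(sizes, function(n) {
    starts <- seq_len(length(tokens) - n + 1)
    data.frame(
      start = starts,
      end = starts + n - 1,
      n = n,
      ngram = vapply(
        starts,
        function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
        character(1)
      ),
      stringsAsFactors = FALSE
    )
  })
  if (length(rows) == 0) {
    return(data.frame(start = integer(0), end = integer(0),
                      n = integer(0), ngram = character(0),
                      stringsAsFactors = FALSE))
  }
  do.call(rbind, rows)
}

text <- "Gen Z and Millennials use social media."
terms <- c("Gen Z", "Millennials", "social media")

ngrams <- build_ngrams(tokenize(text))

# Dictionary keyed by the normalized form of each term.
dict <- data.frame(ngram = tolower(terms), term = terms,
                   stringsAsFactors = FALSE)

# Exact join: an n-gram matches only if it equals a whole term.
hits <- merge(ngrams, dict, by = "ngram")
hits <- hits[order(hits$start), ]
```

Because whole n-grams are compared for equality, a term such as "art" cannot match inside the token "smart", which is the partial-match risk that substring or loosely anchored regex searches carry.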

Usage

search_dict(corpus, by = c("doc_id"), terms, n_min = 1, n_max = 5)

Value

A data.table with columns id, start, end, n, ngram, and term (the matched entry from terms).
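The term (or ngram) column can serve as the key for attaching categories, as the Description suggests. A minimal base R sketch follows, using merge with all.x = TRUE as a left join. Both tables here are hypothetical: hits is a hand-built stand-in for a search_dict() result (only the ngram and term columns are shown), and the "Country" category is invented for illustration.

```r
# Hypothetical stand-in for a search_dict() result; values are illustrative,
# not real output. Only the columns needed for the join are included.
hits <- data.frame(
  ngram = c("r project", "united states", "gen z"),
  term = c("R Project", "United States", "Gen Z"),
  stringsAsFactors = FALSE
)

# Hypothetical metadata table keyed by term.
meta <- data.frame(
  term = c("R Project", "United States"),
  category = c("Software", "Country"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every hit (left join); hits without metadata get NA.
labeled <- merge(hits, meta, by = "term", all.x = TRUE)
```

With dplyr loaded, dplyr::left_join(hits, meta, by = "term") expresses the same join.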

Arguments

corpus

The text data: a data frame or data.table containing a text column and the identifier columns named in by.

by

Identifier columns (e.g. c("doc_id", "sentence_id")).

terms

A character vector of terms/variants to find (e.g. c("United States", "R Project")).

n_min

Integer. Minimum n-gram size (default 1).

n_max

Integer. Maximum n-gram size (default 5).

Examples

corpus <- data.frame(doc_id = "1", text = "Gen Z and Millennials use social media.")
search_dict(corpus, by = "doc_id", terms = c("Gen Z", "Millennials", "social media"))