tall (version 0.5.2)

txt_recode_ngram_fast: Fast n-gram recoding for multiword detection

Description

Combines consecutive tokens into multiword expressions using compiled C++ code. The function scans the token vector sequentially and merges every run of tokens that matches one of the supplied n-gram patterns.

Usage

txt_recode_ngram_fast(x, compound, ngram, sep = " ")

Value

Character vector in which each matched n-gram is combined into a single element at its first position, and the subsequent tokens that were merged into it are set to NA.
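If the NA placeholders are not needed downstream, they can be dropped with ordinary subsetting. The short sketch below does not call the package; it starts from the return value shown in the Examples section of this page, and the variable names `recoded` and `compact` are illustrative only.

```r
# Return value copied from the Examples section of this page.
recoded <- c("machine learning", NA, "is", "cool", "machine learning", NA)

# Keep only the positions that still hold a token.
compact <- recoded[!is.na(recoded)]
compact
# "machine learning" "is" "cool" "machine learning"
```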

Arguments

x

Character vector of tokens (e.g., lemmas or word forms)

compound

Character vector of multiword expressions to match

ngram

Integer vector giving the length, in tokens, of each element of compound

sep

String separator to use when joining tokens (default: " ")
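The ngram argument can usually be derived from compound rather than typed by hand. The sketch below assumes that each compound is simply its tokens joined by sep, which matches the example on this page but is not stated explicitly in the documentation; the compound strings are made up for illustration.

```r
# Assumption: each compound is its tokens joined by `sep`, and `ngram`
# holds the number of tokens in the corresponding compound.
sep <- " "
compound <- c("machine learning", "natural language processing", "deep learning")

# Split each compound on the literal separator and count the pieces.
ngram <- lengths(strsplit(compound, sep, fixed = TRUE))
ngram
# 2 3 2
```

Using `fixed = TRUE` keeps the separator from being interpreted as a regular expression, which matters for separators such as "." or "|".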

Details

When a multiword match is found:

  • The first position gets the combined multiword expression

  • Subsequent positions that were merged are set to NA

The function checks n-grams from longest to shortest to prioritize longer matches.
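The matching rule described above can be sketched in plain R. This is a hypothetical reference implementation written for illustration, not the package's C++ code: the name `recode_ngram_sketch` is invented here, and it assumes that matches do not overlap, so the scan resumes after the last merged token.

```r
# Hypothetical pure R sketch of the behaviour described in Details;
# not the package's actual C++ implementation.
recode_ngram_sketch <- function(x, compound, ngram, sep = " ") {
  out <- x
  n <- length(x)
  # Candidate lengths, longest first, so longer matches take priority.
  sizes <- sort(unique(ngram), decreasing = TRUE)
  i <- 1L
  while (i <= n) {
    matched <- FALSE
    for (k in sizes) {
      if (i + k - 1L > n) next          # window would run past the end
      window <- x[i:(i + k - 1L)]
      if (anyNA(window)) next           # never merge across missing tokens
      candidate <- paste(window, collapse = sep)
      if (candidate %in% compound[ngram == k]) {
        out[i] <- candidate             # first position gets the compound
        if (k > 1L) out[(i + 1L):(i + k - 1L)] <- NA_character_
        i <- i + k                      # assumption: matches do not overlap
        matched <- TRUE
        break
      }
    }
    if (!matched) i <- i + 1L
  }
  out
}

tokens <- c("new", "york", "city", "is", "in", "new", "york")
result <- recode_ngram_sketch(tokens, c("new york", "new york city"), c(2, 3))
result
# "new york city" NA NA "is" "in" "new york" NA
```

Because the 3-gram is tried before the 2-gram, the first three tokens become "new york city" rather than "new york" followed by "city".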

Performance: roughly 80 to 150 times faster than a pure R implementation on typical text data.

Examples

tokens <- c("machine", "learning", "is", "cool", "machine", "learning")
compounds <- c("machine learning")
ngrams <- c(2)
txt_recode_ngram_fast(tokens, compounds, ngrams, " ")
# Returns: c("machine learning", NA, "is", "cool", "machine learning", NA)