extract: Extract pattern matches from text

Description

Uses a regex lookup table to extract all pattern matches.

Usage

extract(
  data,
  col_name = "text",
  regex_table,
  pattern_col = "pattern",
  data_return_cols = NULL,
  regex_return_cols = NULL,
  date_col = NULL,
  date_start = NULL,
  date_end = NULL,
  remove_acronyms = FALSE,
  do_clean_text = TRUE,
  verbose = TRUE,
  cl = NULL
)

Value

A tibble (data frame) with columns:

row_id: Integer row identifier corresponding to the row of the input data.
Additional columns from data, if data_return_cols is specified.
Additional columns from regex_table, if regex_return_cols is specified.
pattern: The matched regex pattern(s).
match: The specific text extracted from the data (original casing preserved).

Arguments

data: A data frame or character vector containing the text to search.
col_name: Name of the column in 'data' that contains the text to search.
regex_table: A regex lookup table with a pattern column.
pattern_col: Name of the regex pattern column in regex_table.
data_return_cols: Optional vector of column names to include from 'data'.
regex_return_cols: Optional vector of column names to include from 'regex_table'.
date_col: Optional column in 'data' for date filtering.
date_start: Optional start date for filtering 'data'.
date_end: Optional end date for filtering 'data'.
remove_acronyms: Logical; if TRUE, removes all-uppercase patterns from regex_table.
do_clean_text: Logical; if TRUE, applies basic text cleaning to the input before matching.
verbose: Logical; if TRUE, displays progress messages.
cl: A cluster object created by parallel::makeCluster(), or an integer to indicate number of child-processes (integer values are ignored on Windows) for parallel evaluations. Passed to pbapply::pblapply().
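
The date and acronym arguments both narrow the inputs before matching. The base R sketch below shows one way such filtering could work. It is an illustration under stated assumptions, not the package's implementation: the date bounds are assumed to be inclusive, an "all-uppercase pattern" is assumed to mean a pattern made only of uppercase letters, and the sample data and variable names (filtered_data, filtered_patterns) are hypothetical.

```r
# Illustrative inputs; the column names and values are assumptions for this sketch.
data <- data.frame(
  text = c("NASA launch", "apples on sale", "late entry"),
  date = as.Date(c("2024-01-15", "2024-03-01", "2024-06-30")),
  stringsAsFactors = FALSE
)
regex_table <- data.frame(
  pattern = c("NASA", "apples", "entry"),
  stringsAsFactors = FALSE
)

# Date filtering: keep rows whose date lies between date_start and date_end
# (inclusive bounds are an assumption).
date_start <- as.Date("2024-01-01")
date_end <- as.Date("2024-03-31")
in_range <- data$date >= date_start & data$date <= date_end
filtered_data <- data[in_range, , drop = FALSE]

# Acronym removal: drop patterns consisting only of uppercase letters
# (assumed definition of "all-uppercase").
is_acronym <- grepl("^[[:upper:]]+$", regex_table$pattern)
filtered_patterns <- regex_table[!is_acronym, , drop = FALSE]

print(filtered_data)
print(filtered_patterns)
```

In this sketch the row dated 2024-06-30 falls outside the window and the pattern "NASA" is dropped, so only the remaining rows and patterns would be passed on to matching.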

Details

Pattern matching is performed using R's regular expression engine and is case-insensitive by default. For each input row, the function checks every pattern in regex_table and returns the first match of each pattern.

The output contains one row per pattern match per input row. If multiple patterns match the same text, multiple rows will be returned for that text.
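
The row-per-match behaviour described above can be sketched in base R. This is a minimal illustration, not the function's actual implementation: sketch_extract is a hypothetical helper, and it assumes that "first match" corresponds to what regexpr() returns and that case-insensitivity corresponds to ignore.case = TRUE.

```r
# Hypothetical helper that mimics the documented output shape:
# one row per (input row, pattern) pair that matches.
sketch_extract <- function(text, patterns) {
  rows <- list()
  for (i in seq_along(text)) {
    for (p in patterns) {
      # regexpr() locates only the first match of the pattern in the string
      m <- regexpr(p, text[i], ignore.case = TRUE)
      if (m != -1) {
        rows[[length(rows) + 1]] <- data.frame(
          row_id = i,
          pattern = p,
          # regmatches() returns the matched text with its original casing
          match = regmatches(text[i], m),
          stringsAsFactors = FALSE
        )
      }
    }
  }
  if (length(rows) == 0) {
    # No matches at all: return an empty data frame with the same columns
    return(data.frame(row_id = integer(), pattern = character(),
                      match = character(), stringsAsFactors = FALSE))
  }
  do.call(rbind, rows)
}

text <- c("I love apples", "Bananas are great", "Oranges and apples")
result <- sketch_extract(text, c("apples", "bananas", "oranges"))
print(result)
```

The third string matches two patterns, so it contributes two rows (row_id 3 appears twice), while the lowercase pattern "bananas" yields the match "Bananas" with the casing of the input text.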

Examples

# Create sample data
data <- data.frame(
  id = 1:3,
  text = c("I love apples", "Bananas are great", "Oranges and apples"),
  stringsAsFactors = FALSE
)

# Create regex patterns
patterns <- data.frame(
  pattern = c("apples", "bananas", "oranges"),
  category = c("fruit", "fruit", "fruit")
)

# Extract matches
extract(data, col_name = "text", regex_table = patterns)
