
regextable (version 0.1.1)

extract: Extract pattern matches from text

Description

Uses a regex lookup table to find and extract pattern matches from text, returning one row per matching pattern per input row.

Usage

extract(
  data,
  col_name = "text",
  regex_table,
  pattern_col = "pattern",
  data_return_cols = NULL,
  regex_return_cols = NULL,
  date_col = NULL,
  date_start = NULL,
  date_end = NULL,
  remove_acronyms = FALSE,
  do_clean_text = TRUE,
  verbose = TRUE,
  cl = NULL
)

Value

A tibble (data frame) with columns:

  • row_id: Integer row identifier corresponding to the input data.

  • Additional columns from data, if data_return_cols is specified.

  • Additional columns from regex_table, if regex_return_cols is specified.

  • pattern: The regex pattern that matched.

  • match: The specific text extracted from the data (original casing preserved).

Arguments

data

A data frame or character vector containing the text to search.

col_name

Name of the column in data that contains the text to search.

regex_table

A regex lookup table with a pattern column.

pattern_col

Name of the regex pattern column in regex_table.

data_return_cols

Optional vector of column names to include from 'data'.

regex_return_cols

Optional vector of column names to include from 'regex_table'.

date_col

Optional column in 'data' for date filtering.

date_start

Optional start date for filtering 'data'.

date_end

Optional end date for filtering 'data'.

remove_acronyms

Logical; if TRUE, removes all-uppercase patterns from regex_table.

do_clean_text

Logical; if TRUE, applies basic text cleaning to the input before matching.

verbose

Logical; if TRUE, displays progress messages.

cl

A cluster object created by parallel::makeCluster(), or an integer giving the number of child processes to use for parallel evaluation (integer values are ignored on Windows). Passed to pbapply::pblapply().
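
The filtering arguments (date_col, date_start, date_end, and remove_acronyms) can be pictured as pre-filtering steps applied before matching. The base-R sketch below is only an illustration of that idea, not the package's code: the sample data, the inclusive date bounds, and the rule that an "all-uppercase pattern" consists solely of uppercase letters are assumptions.

```r
# Hypothetical sample inputs; the column names are illustrative only
data <- data.frame(
  text = c("NASA launch", "apple harvest", "banana import"),
  date = as.Date(c("2023-01-15", "2023-06-01", "2024-02-10")),
  stringsAsFactors = FALSE
)
regex_table <- data.frame(
  pattern = c("NASA", "apple", "banana"),
  stringsAsFactors = FALSE
)

# Date filtering: keep rows whose date falls within the range.
# Inclusive bounds are an assumption.
date_start <- as.Date("2023-01-01")
date_end <- as.Date("2023-12-31")
in_range <- data$date >= date_start & data$date <= date_end
filtered_data <- data[in_range, , drop = FALSE]

# Acronym removal: drop patterns made up entirely of uppercase letters.
# The exact rule used by the package is an assumption.
is_acronym <- grepl("^[[:upper:]]+$", regex_table$pattern)
filtered_patterns <- regex_table[!is_acronym, , drop = FALSE]

print(filtered_data)
print(filtered_patterns)
```

In this sketch the 2024 row falls outside the date range and the "NASA" pattern is dropped, so matching would proceed on two rows of text and two patterns.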

Details

Pattern matching is performed using R's regular expression engine and is case-insensitive by default. For each input row, the function checks every pattern in regex_table and returns the first match of each pattern.

The output contains one row per pattern match per input row. If multiple patterns match the same text, multiple rows will be returned for that text.
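
The matching rules described above can be approximated in base R. The following is a minimal sketch under stated assumptions, not the package's implementation: first_matches() is a hypothetical helper, it skips the text-cleaning step controlled by do_clean_text, and it does not handle date filtering or extra return columns.

```r
# Hypothetical helper approximating the documented behavior:
# case-insensitive matching, first match per pattern, one output row
# per (input row, matching pattern) pair.
first_matches <- function(texts, patterns) {
  rows <- list()
  for (i in seq_along(texts)) {
    for (p in patterns) {
      # regexpr() locates only the first match; ignore.case = TRUE makes
      # the search case-insensitive
      m <- regexpr(p, texts[i], ignore.case = TRUE)
      if (m[1] != -1) {
        rows[[length(rows) + 1]] <- data.frame(
          row_id = i,
          pattern = p,
          # regmatches() returns the text as it appears in the input,
          # so the original casing is preserved
          match = regmatches(texts[i], m),
          stringsAsFactors = FALSE
        )
      }
    }
  }
  # Returns NULL when nothing matched
  do.call(rbind, rows)
}

texts <- c("I love apples", "Bananas are great", "Oranges and apples")
result <- first_matches(texts, c("apples", "bananas", "oranges"))
print(result)
```

The third text matches two patterns, so it contributes two rows to the result, while the first and second texts contribute one row each.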

Examples

library(regextable)

# Create sample data
data <- data.frame(
  id = 1:3,
  text = c("I love apples", "Bananas are great", "Oranges and apples"),
  stringsAsFactors = FALSE
)

# Create regex patterns
patterns <- data.frame(
  pattern = c("apples", "bananas", "oranges"),
  category = c("fruit", "fruit", "fruit")
)

# Extract matches
extract(data, "text", patterns)
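
A variation on the call above may help show how the optional return-column arguments fit together. It uses only arguments listed in the Usage section. The guard around the call is an addition so that the snippet still runs when regextable is not installed, and no output is shown because the exact result layout is not reproduced here.

```r
# Same sample inputs as in the example above
data <- data.frame(
  id = 1:3,
  text = c("I love apples", "Bananas are great", "Oranges and apples"),
  stringsAsFactors = FALSE
)
patterns <- data.frame(
  pattern = c("apples", "bananas", "oranges"),
  category = c("fruit", "fruit", "fruit"),
  stringsAsFactors = FALSE
)

result <- NULL
# Only call the package if it is available in this R session
if (requireNamespace("regextable", quietly = TRUE)) {
  result <- regextable::extract(
    data,
    col_name = "text",
    regex_table = patterns,
    data_return_cols = "id",         # carry the id column from data
    regex_return_cols = "category",  # carry the category column from patterns
    verbose = FALSE                  # suppress progress messages
  )
  print(result)
}
```

Naming the arguments explicitly, rather than relying on position, keeps the call readable as more options are added.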
