quanteda (version 4.0.1)

pattern2id: Match patterns against token types

Description

Developer function to match regex, fixed or glob patterns against token types. This allows C++ functions to perform fast searches in a tokens object. The C++ functions use a list of type IDs to construct a hash table, against which sub-vectors of the tokens object are matched. This function also constructs an index of glob patterns for faster matching.

pattern2fixed converts regex and glob patterns to fixed patterns.
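
As a rough illustration of what "matching a pattern against types" produces, the base-R sketch below resolves a single glob pattern to the integer positions of the types it matches. It is not quanteda's implementation: the helper name glob_to_ids is hypothetical, and it relies only on utils::glob2rx() and grep().

```r
# Hypothetical helper (not part of quanteda): resolve one glob pattern
# to the integer indices of the types it matches.
glob_to_ids <- function(glob, types, case_insensitive = TRUE) {
  rx <- utils::glob2rx(glob)  # translate the glob into an anchored regex
  grep(rx, types, ignore.case = case_insensitive)
}

types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")
ids <- glob_to_ids("a*", types)
ids         # 1 2
types[ids]  # "A" "AA"
```

Working with integer indices rather than strings is what lets later lookups against a tokens object be done on type IDs alone.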

Usage

pattern2id(
  pattern,
  types,
  valuetype = c("glob", "fixed", "regex"),
  case_insensitive = TRUE,
  keep_nomatch = FALSE,
  use_index = TRUE
)

pattern2fixed(
  pattern,
  types,
  valuetype = c("glob", "fixed", "regex"),
  case_insensitive = TRUE,
  keep_nomatch = FALSE,
  use_index = TRUE
)

Value

pattern2id returns a list of integer vectors containing the indices of matched types

pattern2fixed returns a list of character vectors containing types

Arguments

pattern

a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.

types

token types against which patterns are matched

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

logical; if TRUE, ignore case when matching a pattern or dictionary values

keep_nomatch

logical; if TRUE, keep patterns that did not match any type

use_index

logical; if TRUE, construct an index of types for quick search

Examples

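library(quanteda)

# token types to match against
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")

# regex patterns; each list element is one pattern (possibly a sequence)
pats_regex <- list(c("^a$", "^b"), c("c"), c("d"))
pattern2id(pats_regex, types, "regex", case_insensitive = TRUE)

# glob patterns, matched against the same types
pats_glob <- list(c("a*", "b*"), c("c"), c("d"))
pattern2id(pats_glob, types, "glob", case_insensitive = TRUE)

# convert regex patterns to fixed patterns (character types, not indices)
pattern <- list(c("^a$", "^b"), c("c"), c("d"))
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")
pattern2fixed(pattern, types, "regex", case_insensitive = TRUE)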
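
To make the idea of converting a regex to fixed patterns concrete, the following base-R sketch expands single regular expressions into the literal types they match. The helper regex_to_fixed is hypothetical and covers only single-element patterns. It does not reproduce how pattern2fixed handles multi-element sequences or the keep_nomatch argument.

```r
# Hypothetical helper (not part of quanteda): expand one regex into the
# fixed (literal) types that it matches.
regex_to_fixed <- function(regex, types, case_insensitive = TRUE) {
  grep(regex, types, ignore.case = case_insensitive, value = TRUE)
}

types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")

# apply the helper to several single-element patterns
fixed <- lapply(c("^a$", "^b", "d"), regex_to_fixed, types = types)
fixed[[1]]  # "A"
fixed[[2]]  # "B" "BB" "BBB"
fixed[[3]]  # character(0), since no type contains "d"
```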
