regex_supplement: Supplemental Canned Regular Expressions

Description

A data set containing a list of supplemental, canned regular expressions. The regular expressions in this data set are considered useful but have not been included in a formal function (of the type rm_XXX). Users can use the rm_ function to generate functions that substitute, replace, or extract as desired.

Usage

data(regex_supplement)

Format

A list with 24 elements

Warning

Note that regexes containing %s are templates: each %s is replaced via sprintf, and the template is not a valid regex on its own. The S function is useful for supplying these missing %s parameters.
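As a minimal sketch of what this warning means, the base R snippet below fills a %s template with sprintf before matching. The template string is a hypothetical stand-in written for this illustration, not the stored value of any regex_supplement element.

```r
# Hypothetical template for illustration; not taken from regex_supplement.
template <- "(?<=\\b%s\\s)\\w+"

# The raw template still contains "%s", so fill it in before matching.
pattern <- sprintf(template, "the")

x <- "I saw the dog chase the cat"

# perl = TRUE is needed because the pattern uses a lookbehind.
words_after <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
words_after
# "dog" "cat"
```

Passing the unfilled template to a matching function would search for a literal "%s", which is why such entries need their parameters supplied first.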

Details

The following canned regular expressions are included:

after_a: single word after the word "a"
after_the: single word after the word "the"
after_: find single word after ? word (? = user defined); note contains "%s" that is replaced by sprintf and is not a valid regex on its own (user supplies the word)
around_: find n words (not including punctuation) before or after ? word (? = user defined); note contains "%s" that is replaced by sprintf and is not a valid regex on its own (user supplies (1) n before, (2) the point, & (3) n after)
around2_: find n words (plus punctuation) before or after ? word (? = user defined); note contains "%s" that is replaced by sprintf and is not a valid regex on its own
before_: find single word before ? word (? = user defined); note contains "%s" that is replaced by sprintf and is not a valid regex on its own
except_first: find all occurrences of a substring except the first; regex pattern retrieved from StackOverflow's akrun: http://stackoverflow.com/a/31458261/1000343
hexadecimal: substring beginning with hash (#) followed by either 3 or 6 select characters (a-f, A-F, and 0-9)
ip_address: substring of four chunks of 1-3 consecutive digits separated with dots (.)
last_occurrence: last occurrence of a delimiter; note contains "%s" that is replaced by sprintf and is not a valid regex on its own (user supplies the delimiter)
pages: substring with "pp." or "p.", optionally followed by a space, followed by 1 or more digits, optionally followed by a dash, optionally followed by 1 or more digits, optionally followed by a semicolon, optionally followed by a space, optionally followed by 1 or more digits; intended for extraction/removal purposes
pages2: substring of 1 or more digits, optionally followed by a dash, optionally followed by 1 or more digits, optionally followed by a semicolon, optionally followed by a space, optionally followed by 1 or more digits; intended for validation purposes
punctuation: punctuation characters ([:punct:]) with the ability to negate; note contains "%s" that is replaced by sprintf and is not a valid regex on its own
run_split: a regex that is useful for splitting strings into runs of the same character (e.g., "wwxyyyzz" becomes "ww", "x", "yyy", "zz"); regex pattern retrieved from StackOverflow's Robert Redd: http://stackoverflow.com/a/29383435/1000343
split_keep_delim: regex string that splits on a delimiter and retains the delimiter
thousands_separator: chunks digits > 4 into groups of 3 from right to left allowing for easy insertion of thousands separator; regex pattern retrieved from StackOverflow's stema: http://stackoverflow.com/a/10612685/1000343
time_12_hours: substring of valid hours (1-12) followed by a colon (:) followed by valid minutes (00-59), followed by an optional space and the character chunk am or pm
version: substring starting with "v" or "version" optionally followed by a space and then period separated digits for <major>.<minor>.<release>.<build>; the build sequence is optional and the "version"/"v" IS NOT contained in the substring
version2: substring starting with "v" or "version" optionally followed by a space and then period separated digits for <major>.<minor>.<release>.<build>; the build sequence is optional and the "version"/"v" IS contained in the substring
white_after_comma: substring of white space after a comma
word_boundary: A true word boundary that only includes alphabetic characters; based on www.rexegg.com's suggestion taken from discussion of true word boundaries; note contains "%s" that is replaced by sprintf and is not a valid regex on its own
word_boundary_left: A true left word boundary that only includes alphabetic characters; based on www.rexegg.com's suggestion taken from discussion of true word boundaries
word_boundary_right: A true right word boundary that only includes alphabetic characters; based on www.rexegg.com's suggestion taken from discussion of true word boundaries
youtube_id: substring of the video id from a YouTube video; taken from Jacob Overgaard's submission found at https://regex101.com/r/kU7bP8/1
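
To make a few of the descriptions above concrete, the base R sketch below uses approximate patterns written from those descriptions alone. The pattern strings, the helper extract_all, and the variable names (hex_sketch, ip_sketch, thousands_sketch) are assumptions for illustration; the values actually stored in regex_supplement may differ.

```r
# Approximations written from the descriptions; not the stored patterns.
hex_sketch       <- "#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})"
ip_sketch        <- "\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b"
thousands_sketch <- "(\\d)(?=(?:\\d{3})+(?!\\d))"

# Hypothetical helper: return every match of a PCRE pattern in one string.
extract_all <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

hex_found <- extract_all(hex_sketch, "Use #ff0000 for red and #0f0 for green")
ip_found  <- extract_all(ip_sketch, "Hosts 192.168.0.1 and 10.0.0.255 replied")

# Put a comma after each digit that is followed by whole groups of three.
grouped <- gsub(thousands_sketch, "\\1,", "About 1000000 rows and 52000 keys",
                perl = TRUE)

hex_found  # "#ff0000" "#0f0"
ip_found   # "192.168.0.1" "10.0.0.255"
grouped    # "About 1,000,000 rows and 52,000 keys"
```

Like the ip_address description, ip_sketch only checks for four dot-separated chunks of 1-3 digits; it does not verify that each chunk falls in the 0-255 range.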

Regexes from this data set can be added to the pattern argument of any rm_XXX function via an at sign (@) followed by a regex name from this data set (e.g., pattern = "@after_the"), provided the regular expression does not contain non-regex elements such as the sprintf placeholder %s.
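
A hedged usage sketch of the @ lookup follows. It assumes the qdapRegex package is installed and relies on rm_default and rm_ with their pattern and extract arguments; the input strings are made up for this example, and no output is shown because it depends on the stored patterns.

```r
# Runs only if qdapRegex is installed; otherwise the results stay NULL.
versions <- NULL
times <- NULL
x <- c("v6.0.156 for Windows", "Server version 1.1.20")

if (requireNamespace("qdapRegex", quietly = TRUE)) {
  library(qdapRegex)

  # "@version" is looked up by name in the regex dictionaries.
  versions <- rm_default(x, pattern = "@version", extract = TRUE)

  # rm_ builds a reusable function around the same kind of lookup.
  get_times <- rm_(pattern = "@time_12_hours", extract = TRUE)
  times <- get_times("I will go at 12:35 pm")
}
```

Names whose entries still contain %s (for example "@around_") are not suitable for this direct lookup, since their parameters have to be supplied first.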

Use qdapRegex:::examine_regex(regex_supplement) to interactively explore the regular expressions in regex_supplement. This will provide a browser- and console-based breakdown of each regex in the dictionary.