qdapRegex (version 0.7.2)

regex_supplement: Supplemental Canned Regular Expressions

Description

A data set containing a list of supplemental, canned regular expressions. The regular expressions in this data set are considered useful but have not been included in a formal function (of the type rm_XXX). Users can use the rm_ function to generate functions that substitute, replace, or extract as desired.

Usage

data(regex_supplement)

Format

A list with 24 elements

Warning

Note that regexes containing %s are templates whose %s placeholders must be filled in with sprintf; they are not valid regexes on their own. The S function is useful for supplying these missing %s parameters.
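The fill-in step can be sketched in base R. The template below is a hypothetical stand-in written in the style of the "%s" entries; it is an assumption for illustration and is not copied from regex_supplement.

```r
# Hypothetical template in the style of the "%s" entries; it is NOT copied
# from regex_supplement and only illustrates the fill-in step.
template <- "(?<=\\b%s\\s)\\w+"

# Unfilled, the template still holds the literal placeholder "%s".
# sprintf() substitutes the user-supplied word to produce a usable regex.
pattern <- sprintf(template, "the")

x <- "I saw the dog chase the cat."

# perl = TRUE is needed because the pattern uses a lookbehind.
words_after <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
print(words_after)  # "dog" "cat"
```

Filling the placeholder with a different word (for example "a") gives a different regex from the same template, which is why such entries cannot be used until the missing parameters are supplied.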

Details

The following canned regular expressions are included:

after_a

single word after the word "a"

after_the

single word after the word "the"

after_

find single word after ? word (? = user defined); note contains "%s" that is replaced by sprintf and is not a valid regex on its own (user supplies the word)

around_

find n words (not including punctuation) before or after ? word (? = user defined); note contains "%s" that is replaced by sprintf and is not a valid regex on its own (user supplies (1) n before, (2) the point, & (3) n after)

around2_

find n words (plus punctuation) before or after ? word (? = user defined); note contains "%s" that is replaced by sprintf and is not a valid regex on its own

before_

find single word before ? word (? = user defined); note contains "%s" that is replaced by sprintf and is not a valid regex on its own

except_first

find all occurrences of a substring except the first; regex pattern retrieved from StackOverflow's akrun: http://stackoverflow.com/a/31458261/1000343

hexadecimal

substring beginning with hash (#) followed by either 3 or 6 select characters (a-f, A-F, and 0-9)

ip_address

substring of four chunks of 1-3 consecutive digits separated with dots (.)

last_occurrence

last occurrence of a delimiter; note contains "%s" that is replaced by sprintf and is not a valid regex on its own (user supplies the delimiter)

pages

substring with "pp." or "p.", optionally followed by a space, followed by 1 or more digits, optionally followed by a dash, optionally followed by 1 or more digits, optionally followed by a semicolon, optionally followed by a space, optionally followed by 1 or more digits; intended for extraction/removal purposes

pages2

substring of 1 or more digits, optionally followed by a dash, optionally followed by 1 or more digits, optionally followed by a semicolon, optionally followed by a space, optionally followed by 1 or more digits; intended for validation purposes

punctuation

punctuation characters ([:punct:]) with the ability to negate; note contains "%s" that is replaced by sprintf and is not a valid regex on its own

run_split

a regex that is useful for splitting strings into runs of identical characters (e.g., "wwxyyyzz" becomes "ww", "x", "yyy", "zz"); regex pattern retrieved from Robert Redd: http://stackoverflow.com/a/29383435/1000343

split_keep_delim

regex string that splits on a delimiter and retains the delimiter

thousands_separator

chunks digits > 4 into groups of 3 from right to left allowing for easy insertion of thousands separator; regex pattern retrieved from StackOverflow's stema: http://stackoverflow.com/a/10612685/1000343

time_12_hours

substring of valid hours (1-12) followed by a colon (:) followed by valid minutes (00-59), followed by an optional space and the character chunk am or pm

version

substring starting with "v" or "version" optionally followed by a space and then period separated digits for <major>.<minor>.<release>.<build>; the build sequence is optional and the "version"/"v" IS NOT contained in the substring

version2

substring starting with "v" or "version" optionally followed by a space and then period separated digits for <major>.<minor>.<release>.<build>; the build sequence is optional and the "version"/"v" IS contained in the substring

white_after_comma

substring of white space after a comma

word_boundary

A true word boundary that only includes alphabetic characters; based on www.rexegg.com's suggestion taken from discussion of true word boundaries; note contains "%s" that is replaced by sprintf and is not a valid regex on its own

word_boundary_left

A true left word boundary that only includes alphabetic characters; based on www.rexegg.com's suggestion taken from discussion of true word boundaries

word_boundary_right

A true right word boundary that only includes alphabetic characters; based on www.rexegg.com's suggestion taken from discussion of true word boundaries

youtube_id

substring of the video id from a YouTube video; taken from Jacob Overgaard's submission found https://regex101.com/r/kU7bP8/1
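A few of the descriptions above can be made concrete in base R. The patterns in this sketch are hand-written stand-ins that follow the descriptions; they are assumptions for illustration and are not copied from regex_supplement, whose actual regexes may differ.

```r
# NOTE: every pattern here is a hypothetical stand-in written from the
# descriptions above; the regexes stored in regex_supplement may differ.

# run_split idea: recover runs of identical characters. This sketch matches
# each run (a character followed by repeats of itself) rather than splitting
# between runs.
s <- "wwxyyyzz"
runs <- regmatches(s, gregexpr("(.)\\1*", s, perl = TRUE))[[1]]
print(runs)  # "ww" "x" "yyy" "zz"

# thousands_separator idea: a digit followed by one or more complete groups
# of three digits up to the end of the string gets a comma after it.
with_commas <- gsub("(\\d)(?=(\\d{3})+$)", "\\1,", c("1234567", "123"), perl = TRUE)
print(with_commas)  # "1,234,567" "123"

# time_12_hours idea: hours 1-12, a colon, minutes 00-59, an optional space,
# then am or pm.
times <- c("Meet at 9:30 am", "Ends 13:45 pm", "12:05pm sharp")
is_12h <- grepl("\\b(0?[1-9]|1[0-2]):[0-5]\\d\\s?[ap]m\\b", times, perl = TRUE)
print(is_12h)  # TRUE FALSE TRUE

# ip_address idea: four chunks of 1-3 digits separated by dots.
log_line <- "Server 192.168.0.1 replied"
ip <- regmatches(log_line, regexpr("\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b", log_line, perl = TRUE))
print(ip)  # "192.168.0.1"
```

All four calls pass perl = TRUE so that backreferences, lookaheads, and \d behave the same way across the examples.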

Regexes from this data set can be passed to the pattern argument of any rm_XXX function via an at sign (@) followed by a regex name from this data set (e.g., pattern = "@after_the"), provided the regular expression does not contain non-regex content such as the sprintf placeholder %s.
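A minimal sketch of the at-sign lookup, assuming qdapRegex is installed and that rm_default accepts pattern and extract arguments in the same way as the other rm_XXX functions. The call is guarded so the snippet does nothing when the package is unavailable, and no output is shown because it depends on the installed version.

```r
x <- "I will go at 12:35 pm"

# Guarded so the sketch is a no-op when qdapRegex is not installed.
if (requireNamespace("qdapRegex", quietly = TRUE)) {
  suppressPackageStartupMessages(library(qdapRegex))

  # "@time_12_hours" is looked up by name in the package's regex dictionaries
  # (assumed to include regex_supplement by default); extract = TRUE asks for
  # the matched substrings instead of removing them from the text.
  print(rm_default(x, pattern = "@time_12_hours", extract = TRUE))
}
```

Entries that still contain %s, such as after_ or around_, would need their parameters filled in before being used this way.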

Use qdapRegex:::examine_regex(regex_supplement) to interactively explore the regular expressions in regex_supplement. This will provide a browser + console based breakdown of each regex in the dictionary.