Detect or locate potential issues with text data. This family of functions generates a collection of detection/location functions that can be accessed via the dollar sign or square bracket operators. Accessible functions include:
which_are()
is_it()
which_are
returns an environment of functions, each of which returns the integer
locations of the elements that contain the particular non-normalized text
named by the function.
is_it
returns an environment of functions, each of which returns a logical atomic
vector of equal length to the input vector (except for meta functions, which
return a logical of length one), indicating which elements contain the
particular non-normalized text named by the function.
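The difference between the two return styles can be sketched in base R. This is a minimal illustration only, not the package's implementation: make_checkers() and has_digit() are hypothetical names, and the handling of NA here (treated as "no match") is an assumption that may differ from the real functions.

```r
# Hypothetical stand-in: builds an environment holding one checker function.
make_checkers <- function(locate = FALSE) {
  # Element-wise test for a digit; grepl() reports FALSE for NA input.
  has_digit <- function(x) grepl("[0-9]", x)
  env <- new.env()
  env$digit <- if (locate) {
    function(x) which(has_digit(x))  # integer locations of matches
  } else {
    has_digit                        # logical vector, one value per element
  }
  env
}

detect <- make_checkers()
locate <- make_checkers(locate = TRUE)

x <- c("The dog", "I like 2", NA)
detect$digit(x)        # accessed with the dollar sign operator
locate[["digit"]](x)   # accessed with the square bracket operator
```

Both access styles work because an environment, like a list, supports lookup by name.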
Contains contractions
Contains dates
Contains digits
Contains email addresses
Contains emoticons
Contains just white space
Contains escaped backslash character
Contains Twitter style hash tags
Contains html mark-up
Contains incomplete sentences (e.g., ends with ...)
Contains kerning (e.g. "The B O M B!")
Is a list of atomic vectors (not provided by which_are())
Contains potentially misspelled words
Contains a sentence with no ending punctuation
Contains commas with no space after them
Contains non-ASCII characters
Is a non-character vector (not provided by which_are())
Contains non split sentences
Contains a Twitter style handle used to tag others (use of the at symbol)
Contains a time stamp
Contains a URL
The functions above that have a description starting with 'is' rather than 'contains'
are meta functions that describe the attribute of the column/vector being passed
rather than attributes about the individual elements of the column/vector. The
meta functions will return a logical of length one and are not available under
which_are().
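The contrast between an element-wise check and a meta check can be sketched as follows. Both helper functions are hypothetical base R stand-ins written for illustration, not the package's own code.

```r
# Hypothetical element-wise check: one logical per element of the input.
has_digit <- function(x) grepl("[0-9]", x)

# Hypothetical meta checks: each describes the vector as a whole.
is_non_character <- function(x) !is.character(x)
is_list_column <- function(x) {
  is.list(x) && all(vapply(x, is.atomic, logical(1)))
}

x <- c("the dog", "ate 2 chickens")
length(has_digit(x))         # same length as x
length(is_non_character(x))  # always length one
length(is_list_column(x))    # always length one
```

Because a meta check yields a single value for the whole vector, there are no element positions to report, which fits with these checks not being offered by which_are().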
# NOT RUN {
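library(textclean)  # assumption: which_are() and is_it() come from the textclean package
# Build the two environments of checker functions
wa <- which_are()
it <- is_it()
# Integer locations of the elements that contain digits
wa$digit(c('The dog', "I like 2", NA))
# Logical vector with one value per input element
it$digit(c('The dog', "I like 2", NA))
# Meta function: a single logical describing the whole vector
is_it()$list_column(c('the dog', 'ate the chicken'))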
# }