- dict
A dictionary. Can be either a data.frame or a quanteda dictionary. If a data.frame is given, it has to
have a column named "string" (or use the string_col argument) that contains the dictionary terms, and a column named "code" (or use the code_col argument) that contains the
label/code represented by this string. Each row holds a single string, which can be
a single word or a sequence of words separated by whitespace (e.g., "not bad"), and can contain the common ? and * wildcards.
If a quanteda dictionary is given, it is automatically converted to this type of data.frame with the
melt_quanteda_dict
function. This can also be done manually for more control over labels.
Finally, you can also pass a character vector, in which case all multi-word strings (like emoticons) are
collapsed into single tokens. See the data.frame sketch after this list for a basic example.
- token_col
The feature in tc that contains the token text.
- string_col
If dict is a data.frame, the name of the column in dict with the dictionary lookup string. Default is "string".
- code_col
The name of the column in dict with the dictionary code/label. Default is "code".
If dict is a quanteda dictionary with multiple levels, "code_l2", "code_l3", etc. can be used to select levels.
- replace_cols
The names of the columns in tc$tokens that will be replaced by the dictionary code. Default is the column on which the dictionary is applied,
but in some cases it might make sense to replace multiple columns (e.g., both token and lemma).
- sep
A regular expression for separating multi-word lookup strings (default is " ", which is what quanteda dictionaries use).
For example, if the dictionary contains "Barack Obama", sep should be " " so that it matches the consecutive tokens "Barack" and "Obama".
Some dictionaries instead use "Barack+Obama", in which case sep = "\\+" should be used (the + has to be escaped because sep is a regular expression). See the sep sketch after this list.
- code_from_features
If TRUE, instead of replacing the matched features with the dictionary code, use the most frequently occurring string among the matched features as the code.
- code_sep
If code_from_features is TRUE, the separator for pasting features together. Default is an underscore, which is recommended because it gets special
treatment in corpustools. Most importantly, if a query or dictionary search is performed, multi-word tokens concatenated with an underscore are treated
as separate consecutive words, so "Bob_Smith" would still match a lookup for the two consecutive words "bob smith".
- decrement_ids
If TRUE (default), renumber (decrement) token ids after concatenating multi-token matches. So, if the tokens c(":", ")", "yay") have token_id c(1,2,3),
then after concatenating the ASCII emoticon, the tokens will be c(":)", "yay") with token_id c(1,2). See the emoticon sketch after this list.
- case_sensitive
Logical. Should the lookup be case sensitive?
- use_wildcards
If TRUE, the wildcards * (matching any number of characters, including none) and ? (matching a single character or none) can be used in the lookup strings. If FALSE, exact string matching is used.
- ascii
If TRUE, convert the text to ASCII before matching.
- verbose
If TRUE, report progress.
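
The sketches below tie the arguments together. They are illustrative only: they assume this arguments list belongs to the replace_dict method of a corpustools tCorpus (called as tc$replace_dict()) and that create_tcorpus() is used to build a toy corpus; the example texts and codes are made up.

```r
library(corpustools)

## toy corpus (create_tcorpus is assumed to be the corpus constructor)
tc <- create_tcorpus("Barack Obama was not bad at this.")

## dictionary as a data.frame with the default "string" and "code" columns
dict <- data.frame(
  string = c("barack obama", "not bad"),   # multi-word lookup strings, words separated by whitespace
  code   = c("Barack_Obama", "NOT_BAD")    # codes that replace the matched tokens
)

## replace matched token sequences with their code (case-insensitive lookup)
tc$replace_dict(dict, case_sensitive = FALSE)
tc$tokens
```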
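
A second sketch for the sep argument, under the same assumptions, showing how a dictionary that glues words together with "+" can be matched by escaping the separator in the regular expression.

```r
library(corpustools)

## the same lookup, but with '+' between the words of the dictionary string
dict_plus <- data.frame(string = "Barack+Obama", code = "Barack_Obama")

tc2 <- create_tcorpus("Barack Obama met the press.")
## sep is a regular expression, so the '+' has to be escaped ("\\+" in R source)
tc2$replace_dict(dict_plus, sep = "\\+", case_sensitive = FALSE)
tc2$tokens
```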
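
A final sketch for the character-vector shorthand and decrement_ids, again under the same assumptions; it additionally assumes the default tokenizer splits ": )" into the two tokens ":" and ")".

```r
library(corpustools)

## a character vector dictionary: multi-token strings are collapsed into single tokens
tc3 <- create_tcorpus(": ) yay")
tc3$replace_dict(c(": )"))

## with decrement_ids = TRUE (the default) the remaining tokens are renumbered,
## e.g. the emoticon becomes one token and "yay" shifts from token_id 3 to 2
tc3$tokens
```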