- text
Texts to be processed. This can be a vector (such as a column in a data frame)
or a list. If a list, it can be in the form returned when tokens.only = TRUE,
or a list of named vectors, where names are tokens and values are frequencies or similar weights.
- exclude
A character vector of words to be excluded. If exclude is a single string
matching 'function', lma_dict(1:9) will be used.
- context
A character vector used to reformat text based on look-ahead/behind. For example,
you might attempt to disambiguate "like" by reformatting certain instances of it
(e.g., context = c('(i) like*', '(you) like*', '(do) like'), where words in parentheses are
the context for the target word, and asterisks denote partial matching). Each entry would be converted
to a regular expression (e.g., '(?<=i) like\\b') which, if matched, would be
replaced with a coded version of the word (e.g., "Hey, i like that!" would become
"Hey, i i-like that!"). This would probably only be useful for categorization, where a
dictionary would only include one or another version of a word (e.g., the LIWC 2015 dictionary
does something like this with "like", and LIWC 2007 did something like this with
"kind (of)", both to try to clean up the posemo category).
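
The look-behind replacement described above can be sketched in base R. This is only an illustration of the regular expression quoted in the description, not the package's own code; the variable names and the use of gsub() are assumptions made for the example.

```r
# Illustration only: base R, not the package's internal implementation.
text <- "Hey, i like that!"

# Look-behind: match " like" only when the preceding character is "i".
# perl = TRUE is required for look-behind support in gsub().
pattern <- "(?<=i) like\\b"

recoded <- gsub(pattern, " i-like", text, perl = TRUE)
print(recoded)  # "Hey, i i-like that!"

# A "like" in a different context is left untouched.
other <- gsub(pattern, " i-like", "They like that.", perl = TRUE)
print(other)  # "They like that."
```

Note that this pattern only checks the single character before the space, so any preceding word ending in "i" would also match.
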
- replace.special
Logical: if TRUE, special characters are replaced with regular
equivalents using the lma_dict special function.
- numbers
Logical: if TRUE, numbers are preserved.
- punct
Logical: if TRUE, punctuation is preserved.
- urls
Logical: if FALSE, attempts to replace all URLs with "repurl".
- emojis
Logical: if TRUE, attempts to replace emojis (e.g., ":(" would be replaced
with "repfrown").
- to.lower
Logical: if FALSE, words with different capitalization are treated as
different terms.
- word.break
A regular expression string determining the way words are split. Default is
' +' which breaks words at one or more blank spaces. You may also like to break by
dashes or slashes ('[ /-]+'), depending on the text.
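
To see the difference between the two patterns mentioned above, here is a small base R comparison using strsplit(). It only illustrates how the regular expressions split a string; how the function applies word.break internally is not shown here, and the sample text is made up.

```r
# Illustration only: how the two patterns split the same (made-up) text.
text <- "well-known and/or used"

# Default pattern: break at one or more spaces.
by_space <- strsplit(text, " +")[[1]]
print(by_space)  # "well-known" "and/or" "used"

# Alternative pattern: also break at dashes and slashes.
by_sep <- strsplit(text, "[ /-]+")[[1]]
print(by_sep)  # "well" "known" "and" "or" "used"
```
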
- dc.min
Numeric: excludes terms appearing in the set number or fewer documents.
Default is 0 (no limit).
- dc.max
Numeric: excludes terms appearing in the set number or more documents. Default
is Inf (no limit).
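
The document-count thresholds for dc.min and dc.max can be sketched with a small dense matrix in base R. This is a minimal sketch of the rule as worded above (drop terms appearing in dc.min or fewer documents, or in dc.max or more); the matrix, its labels, and the variable names are made up for the example and are not the package's code.

```r
# Illustration only: a made-up 3-document by 3-term count matrix.
counts <- matrix(
  c(1, 0, 2,   # term "a": appears in 2 documents
    1, 1, 1,   # term "b": appears in 3 documents
    0, 0, 3),  # term "c": appears in 1 document
  nrow = 3,
  dimnames = list(paste0("doc", 1:3), c("a", "b", "c"))
)

# Number of documents each term appears in.
doc_counts <- colSums(counts > 0)

dc.min <- 1
dc.max <- 3

# Keep terms appearing in more than dc.min and fewer than dc.max documents.
keep <- doc_counts > dc.min & doc_counts < dc.max
filtered <- counts[, keep, drop = FALSE]
print(colnames(filtered))  # "a"
```

With the defaults (dc.min = 0, dc.max = Inf), every term that appears in at least one document passes both comparisons.
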
- sparse
Logical: if FALSE, a regular dense matrix is returned.
- tokens.only
Logical: if TRUE, returns a list rather than a matrix, with these entries:

| Entry | Description |
| --- | --- |
| tokens | A vector of indices with terms as names. |
| frequencies | A vector of counts with terms as names. |
| WC | A vector of term counts for each document. |
| indices | A list with a vector of token indices for each document. |
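
As a rough picture of what a list with these four entries could look like, the following base R sketch builds one by hand from two short texts. It is an assumption-laden illustration: the term ordering, the storage types, and the simple space-based tokenization are choices made for the example and may differ from what the function actually returns.

```r
# Illustration only: hand-built list mirroring the entries described above.
docs <- c("the cat sat", "the dog")
split_docs <- strsplit(docs, " +")

# Unique terms (sorted here for a stable order; an assumption of this sketch).
terms <- sort(unique(unlist(split_docs)))

# tokens: term indices, named by term.
tokens <- setNames(seq_along(terms), terms)

# frequencies: total count of each term across documents, named by term.
frequencies <- setNames(
  tabulate(match(unlist(split_docs), terms), nbins = length(terms)),
  terms
)

# WC: number of tokens in each document.
WC <- lengths(split_docs)

# indices: for each document, the index of each of its tokens.
indices <- lapply(split_docs, match, table = terms)

result <- list(tokens = tokens, frequencies = frequencies, WC = WC, indices = indices)
str(result)
```

The original token sequence of a document can be recovered from this structure with names(result$tokens)[result$indices[[1]]].
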