- x
A character or factor vector in which each element is a token (i.e. a tokenized text)
- context
Optionally, a character vector of the same length as x, specifying the context of each token (e.g., document, sentence). Has to be given if ngrams > 1
- language
The language used for stemming and removing stopwords
- use_stemming
Logical. If TRUE, tokens are stemmed (Make sure to specify the right language!)
- lowercase
Logical. If TRUE, tokens are converted to lowercase
- ngrams
A number, specifying the number of tokens per ngram. Default is unigrams (1).
- replace_whitespace
Logical. If TRUE, all whitespace is replaced by underscores
- as_ascii
Logical. If TRUE, tokens will be forced to ASCII
- remove_punctuation
Logical. If TRUE, punctuation is removed
- remove_stopwords
Logical. If TRUE, stopwords are removed (Make sure to specify the right language!)
- remove_numbers
Logical. If TRUE, features that consist only of numbers are removed
- min_freq
an integer, specifying minimum token frequency.
- min_docfreq
an integer, specifying minimum document frequency.
- max_freq
an integer, specifying maximum token frequency.
- max_docfreq
an integer, specifying maximum document frequency.
- min_char
an integer, specifying minimum number of characters in a term
- max_char
an integer, specifying maximum number of characters in a term
- ngram_skip_empty
If ngrams are used, determines whether empty (filtered out) terms are skipped (e.g., c("this", NA, "test") becomes "this_test") or