- column
the column containing the feature to be used as the input
- new_column
the column in which to save the preprocessed feature. Can be a new column or an existing column (which will be overwritten).
- lowercase
make feature lowercase
- ngrams
create ngrams. The ngrams match the rows in the token data, with the feature in each row being the last token of the ngram. For example, given the features "this is an example", the third feature ("an") will have the trigram "this_is_an". Ngrams at the beginning of a context are padded with empty slots. Thus, in the previous example, the second feature ("is") will have the trigram "_this_is" (see the ngram sketch after this list).
- ngram_context
Ngrams will not be created across contexts, which can be documents or sentences. For example, if the context_level is sentences, then the last token of sentence 1 will not form an ngram with the first token of sentence 2.
- as_ascii
convert characters to ASCII. This is particularly useful for dealing with special characters (see the cleanup sketch after this list).
- remove_punctuation
remove (i.e. make NA) any features that consist only of punctuation (e.g., periods, commas)
- remove_stopwords
remove (i.e. make NA) stopwords. (!) Make sure to set the language argument correctly.
- remove_numbers
remove (i.e. make NA) features that consist only of numbers
- use_stemming
reduce features (tokens) to their stem
- language
The language used for stopwords and stemming
- min_freq
an integer, specifying the minimum token frequency (see the frequency-filter sketch after this list).
- min_docfreq
an integer, specifying the minimum document frequency (the number of documents in which a token occurs).
- max_freq
an integer, specifying the maximum token frequency.
- max_docfreq
an integer, specifying the maximum document frequency.
- min_char
an integer, specifying minimum number of characters in a term
- max_char
an integer, specifying maximum number of characters in a term
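The ngram padding and context behavior described above can be made concrete with a small example. The following is a minimal Python sketch of that behavior, not the actual implementation behind the option; the function name `ngrams_per_token` and the list-based data layout are assumptions made purely for illustration.

```python
def ngrams_per_token(tokens, contexts, n=3, sep="_"):
    """Illustrative sketch: build one ngram per token, with the token as the
    last element. Tokens near the start of a context are padded with empty
    slots, and ngrams never cross context boundaries."""
    out = []
    for i, (token, ctx) in enumerate(zip(tokens, contexts)):
        parts = []
        for j in range(i - n + 1, i + 1):
            # pad with an empty slot if position j falls outside this token's context
            if j < 0 or contexts[j] != ctx:
                parts.append("")
            else:
                parts.append(tokens[j])
        out.append(sep.join(parts))
    return out

tokens   = ["this", "is", "an", "example", "new", "sentence"]
contexts = [1, 1, 1, 1, 2, 2]   # e.g. sentence ids
print(ngrams_per_token(tokens, contexts, n=3))
# ['__this', '_this_is', 'this_is_an', 'is_an_example', '__new', '_new_sentence']
```

Note how "new" starts a fresh context, so its trigram is padded rather than borrowing tokens from the previous sentence.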
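The cleanup options (lowercase, as_ascii, remove_punctuation, remove_stopwords, remove_numbers) all operate per token, and "removing" a feature means setting it to NA rather than deleting the row. The sketch below is a hypothetical Python illustration of that behavior; the function name and the placeholder stopword set are assumptions, and a real implementation would select language-specific stopword and stemming resources via the language argument.

```python
import string
import unicodedata

# Placeholder stopword set for illustration only; in practice this would be
# a language-specific list chosen via the language argument.
STOPWORDS = {"the", "is", "an", "a", "and"}

def preprocess_token(token, lowercase=True, as_ascii=True,
                     remove_punctuation=True, remove_stopwords=False,
                     remove_numbers=False):
    """Illustrative sketch of the per-token cleanup steps described above.
    Returns None (i.e. NA) when a token is removed."""
    if lowercase:
        token = token.lower()
    if as_ascii:
        # strip accents and other non-ASCII characters
        token = unicodedata.normalize("NFKD", token).encode("ascii", "ignore").decode()
    if remove_punctuation and all(ch in string.punctuation for ch in token):
        return None
    if remove_stopwords and token in STOPWORDS:
        return None
    if remove_numbers and token.isdigit():
        return None
    return token

print([preprocess_token(t) for t in ["Café", "is", "GREAT", "...", "42"]])
# ['cafe', 'is', 'great', None, '42']
```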
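The four frequency bounds (min_freq, max_freq, min_docfreq, max_docfreq) can likewise be illustrated with a small sketch. Presumably, tokens whose corpus frequency or document frequency falls outside the bounds are removed (made NA), in line with the other remove options; the function name and the nested-list data layout below are assumptions for the example.

```python
from collections import Counter

def filter_by_frequency(doc_tokens, min_freq=None, max_freq=None,
                        min_docfreq=None, max_docfreq=None):
    """Illustrative sketch: replace tokens with None (i.e. NA) when their
    corpus frequency or document frequency falls outside the given bounds.
    doc_tokens is a list of documents, each a list of tokens."""
    freq = Counter(t for doc in doc_tokens for t in doc)
    docfreq = Counter(t for doc in doc_tokens for t in set(doc))

    def keep(t):
        if min_freq is not None and freq[t] < min_freq:
            return False
        if max_freq is not None and freq[t] > max_freq:
            return False
        if min_docfreq is not None and docfreq[t] < min_docfreq:
            return False
        if max_docfreq is not None and docfreq[t] > max_docfreq:
            return False
        return True

    return [[t if keep(t) else None for t in doc] for doc in doc_tokens]

docs = [["apple", "pie", "apple"], ["apple", "cake"], ["pie"]]
print(filter_by_frequency(docs, min_docfreq=2))
# [['apple', 'pie', 'apple'], ['apple', None], ['pie']]
```

Here "cake" occurs in only one document, so with min_docfreq=2 it is set to None while the more common tokens are kept.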