- column
the column containing the feature to be used as the input
- new_column
the column in which to save the preprocessed feature. Can be a new column or an existing column (which will be overwritten).
- lowercase
make feature lowercase
- ngrams
create ngrams. The ngrams match the rows in the token data, with the feature in each row being the last token of the ngram. For example, given the features "this is an example", the third feature ("an") will have the trigram "this_is_an". Ngrams at the beginning of a context are padded with empty slots. Thus, in the previous example, the second feature ("is") will have the trigram "_this_is" (see the ngram sketch after this list).
- ngram_context
Ngrams will not be created across contexts, which can be documents or sentences. For example, if the context_level is sentences, then the last token of sentence 1 will not form an ngram with the first token of sentence 2.
- as_ascii
convert characters to ASCII. This is particularly useful for dealing with special characters (see the cleanup sketch after this list).
- remove_punctuation
remove (i.e. make NA) any features that consist only of punctuation (e.g., periods, commas)
- remove_stopwords
remove (i.e. make NA) stopwords. (!) Make sure to set the language argument correctly.
- remove_numbers
remove (i.e. make NA) features that consist only of numbers
- use_stemming
reduce features (tokens) to their stem
- language
The language used for stopwords and stemming
- min_freq
an integer, specifying the minimum token frequency (see the frequency-filter sketch after this list).
- min_docfreq
an integer, specifying the minimum document frequency (the number of documents in which a token occurs).
- max_freq
an integer, specifying the maximum token frequency.
- max_docfreq
an integer, specifying the maximum document frequency.
- min_char
an integer, specifying minimum number of characters in a term
- max_char
an integer, specifying maximum number of characters in a term
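The ngram padding and context behavior described above can be made concrete with a small example. The following is a minimal Python sketch of that behavior, not the actual implementation behind the option; the function name `ngrams_per_token` and the list-based data layout are assumptions made purely for illustration.

```python
def ngrams_per_token(tokens, contexts, n=3, sep="_"):
    """Illustrative sketch: build one ngram per token, with the token as the
    last element. Tokens near the start of a context are padded with empty
    slots, and ngrams never cross context boundaries."""
    out = []
    for i, (token, ctx) in enumerate(zip(tokens, contexts)):
        parts = []
        for j in range(i - n + 1, i + 1):
            # pad with an empty slot if position j falls outside this token's context
            if j < 0 or contexts[j] != ctx:
                parts.append("")
            else:
                parts.append(tokens[j])
        out.append(sep.join(parts))
    return out

tokens   = ["this", "is", "an", "example", "new", "sentence"]
contexts = [1, 1, 1, 1, 2, 2]   # e.g. sentence ids
print(ngrams_per_token(tokens, contexts, n=3))
# ['__this', '_this_is', 'this_is_an', 'is_an_example', '__new', '_new_sentence']
```

Note how "new" starts a fresh context, so its trigram is padded rather than borrowing tokens from the previous sentence.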
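The cleanup options (lowercase, as_ascii, remove_punctuation, remove_stopwords, remove_numbers) all operate per token, and "removing" a feature means setting it to NA rather than deleting the row. The sketch below is a hypothetical Python illustration of that behavior; the function name and the placeholder stopword set are assumptions, and a real implementation would select language-specific stopword and stemming resources via the language argument.

```python
import string
import unicodedata

# Placeholder stopword set for illustration only; in practice this would be
# a language-specific list chosen via the language argument.
STOPWORDS = {"the", "is", "an", "a", "and"}

def preprocess_token(token, lowercase=True, as_ascii=True,
                     remove_punctuation=True, remove_stopwords=False,
                     remove_numbers=False):
    """Illustrative sketch of the per-token cleanup steps described above.
    Returns None (i.e. NA) when a token is removed."""
    if lowercase:
        token = token.lower()
    if as_ascii:
        # strip accents and other non-ASCII characters
        token = unicodedata.normalize("NFKD", token).encode("ascii", "ignore").decode()
    if remove_punctuation and all(ch in string.punctuation for ch in token):
        return None
    if remove_stopwords and token in STOPWORDS:
        return None
    if remove_numbers and token.isdigit():
        return None
    return token

print([preprocess_token(t) for t in ["Café", "is", "GREAT", "...", "42"]])
# ['cafe', 'is', 'great', None, '42']
```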
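The four frequency bounds (min_freq, max_freq, min_docfreq, max_docfreq) can likewise be illustrated with a small sketch. Presumably, tokens whose corpus frequency or document frequency falls outside the bounds are removed (made NA), in line with the other remove options; the function name and the nested-list data layout below are assumptions for the example.

```python
from collections import Counter

def filter_by_frequency(doc_tokens, min_freq=None, max_freq=None,
                        min_docfreq=None, max_docfreq=None):
    """Illustrative sketch: replace tokens with None (i.e. NA) when their
    corpus frequency or document frequency falls outside the given bounds.
    doc_tokens is a list of documents, each a list of tokens."""
    freq = Counter(t for doc in doc_tokens for t in doc)
    docfreq = Counter(t for doc in doc_tokens for t in set(doc))

    def keep(t):
        if min_freq is not None and freq[t] < min_freq:
            return False
        if max_freq is not None and freq[t] > max_freq:
            return False
        if min_docfreq is not None and docfreq[t] < min_docfreq:
            return False
        if max_docfreq is not None and docfreq[t] > max_docfreq:
            return False
        return True

    return [[t if keep(t) else None for t in doc] for doc in doc_tokens]

docs = [["apple", "pie", "apple"], ["apple", "cake"], ["pie"]]
print(filter_by_frequency(docs, min_docfreq=2))
# [['apple', 'pie', 'apple'], ['apple', None], ['pie']]
```

Here "cake" occurs in only one document, so with min_docfreq=2 it is set to None while the more common tokens are kept.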