preprocess_tokens: Preprocess tokens in a character vector

Description

Preprocess tokens in a character vector

Usage

preprocess_tokens(
  x,
  context = NULL,
  language = "english",
  use_stemming = F,
  lowercase = T,
  ngrams = 1,
  replace_whitespace = F,
  as_ascii = F,
  remove_punctuation = T,
  remove_stopwords = F,
  remove_numbers = F,
  min_freq = NULL,
  min_docfreq = NULL,
  max_freq = NULL,
  max_docfreq = NULL,
  min_char = NULL,
  max_char = NULL,
  ngram_skip_empty = T
)

Value

a factor vector

Arguments

x: A character or factor vector in which each element is a token (i.e. a tokenized text)
context: Optionally, a character vector of the same length as x, specifying the context of each token (e.g., document, sentence). Has to be given if ngrams > 1
language: The language used for stemming and removing stopwords
use_stemming: Logical. If TRUE, tokens are stemmed (Make sure to specify the right language!)
lowercase: Logical. If TRUE, tokens are converted to lowercase
ngrams: A number, specifying the number of tokens per ngram. Default is unigrams (1).
replace_whitespace: Logical. If TRUE, all whitespace is replaced by underscores
as_ascii: Logical. If TRUE, tokens will be forced to ASCII
remove_punctuation: Logical. If TRUE, punctuation is removed
remove_stopwords: Logical. If TRUE, stopwords are removed (Make sure to specify the right language!)
remove_numbers: Logical. If TRUE, tokens that consist only of numbers are removed
min_freq: An integer, specifying the minimum token frequency.
min_docfreq: An integer, specifying the minimum document frequency.
max_freq: An integer, specifying the maximum token frequency.
max_docfreq: An integer, specifying the maximum document frequency.
min_char: An integer, specifying the minimum number of characters in a term
max_char: An integer, specifying the maximum number of characters in a term
ngram_skip_empty: Logical. If ngrams are used, determines whether empty (filtered out) terms are skipped (e.g., c("this", NA, "test") becomes "this_test") or not
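
The difference between token frequency (min_freq, max_freq) and document frequency (min_docfreq, max_docfreq) can be illustrated with a small base R sketch. This is not the package's implementation; the toy data and the variable names (doc_id, token_freq, doc_freq, keep, filtered) are made up for illustration only.

```r
# Hypothetical toy data: one token per element, with a parallel document id
# (playing the role of the `context` argument).
tokens <- c("a", "b", "a", "c", "a", "b")
doc_id <- c(1, 1, 1, 2, 2, 2)

# Token frequency: total number of occurrences of each token.
token_freq <- table(tokens)

# Document frequency: number of distinct documents each token occurs in.
doc_freq <- tapply(doc_id, tokens, function(d) length(unique(d)))

# In this sketch, a minimum document frequency of 2 keeps only the tokens
# that occur in at least two documents; all other tokens become NA.
keep <- names(doc_freq)[doc_freq >= 2]
filtered <- ifelse(tokens %in% keep, tokens, NA)
```

Here "c" occurs once and in only one document, so it is the only token dropped. Document frequency depends on knowing which document each token belongs to, which is why this sketch needs doc_id alongside the tokens.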

Examples

tokens = c('I', 'am', 'a', 'SHORT', 'example', 'sentence', '!')

## default is lowercase without punctuation
preprocess_tokens(tokens)

## optionally, delete stopwords, perform stemming, and make ngrams
preprocess_tokens(tokens, remove_stopwords = TRUE, use_stemming = TRUE)
preprocess_tokens(tokens, context = NA, ngrams = 3)
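
A few more calls, using only the arguments documented above, might look like the following. This is a sketch that assumes the corpustools package is installed and that these argument combinations behave as described in the Arguments section; the context vector (all tokens in one hypothetical document) is an assumption made for illustration.

```r
library(corpustools)

tokens <- c('I', 'am', 'a', 'SHORT', 'example', 'sentence', '!')

## hypothetical context: treat all tokens as belonging to one document
context <- rep('doc1', length(tokens))

## bigrams, with an explicit context vector of the same length as the tokens
bigrams <- preprocess_tokens(tokens, context = context, ngrams = 2)

## keep the original case, and drop tokens shorter than two characters
long_tokens <- preprocess_tokens(tokens, lowercase = FALSE, min_char = 2)
```

Both calls should return a factor vector, as stated under Value.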
