ngramTokens: Ngram Tokenizer

Description

Tally bag-of-words ngram features

Usage

ngramTokens(
  texts,
  wstem = "all",
  ngrams = 1,
  language = "english",
  punct = TRUE,
  stop.words = TRUE,
  number.words = TRUE,
  overlap = 1,
  sparse = 0.995,
  verbose = FALSE,
  vocabmatch = NULL,
  num.mc.cores = 1
)

Arguments

texts

character vector of texts.

wstem

character Which words should be stemmed? Defaults to "all".

ngrams

numeric Vector of ngram lengths to be included. Default is 1 (i.e. unigrams only).

language

character Language for stemming. Default is "english".

punct

logical Should punctuation be kept as tokens? Default is TRUE.

stop.words

logical Should stop words be kept? Default is TRUE.

number.words

logical Should numbers be kept as words? Default is TRUE.

overlap

numeric Threshold (as cosine distance) for including ngrams that constitute other included phrases. Default is 1 (i.e. all ngrams included).

sparse

numeric Maximum feature sparsity for inclusion (1 = include all features). Default is 0.995.

verbose

logical Should the package report token counts after each ngram level? Useful for long-running code. Default is FALSE.

vocabmatch

matrix Should the new token count matrix be coerced to include the same tokens as a previous count matrix? Default is NULL (i.e. no token match).

num.mc.cores

numeric Number of cores for parallel processing; see parallel::detectCores(). Default is 1.
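
As a rough illustration of what a sparsity cap such as sparse does, the base R sketch below drops features that are absent from too large a share of documents. It assumes that "sparsity" means the proportion of documents in which a feature does not occur and that features at or below the cap are kept; the package's exact rule may differ. The helper drop_sparse and the toy matrix are hypothetical and not part of the package.

```r
# Toy document-feature count matrix: 5 documents, 3 features (made-up data).
# matrix() fills column by column, so each row of numbers below is one feature.
counts <- matrix(
  c(1, 0, 2, 1, 1,   # "the": present in 4 of 5 documents
    0, 0, 1, 0, 0,   # "rare": present in 1 of 5 documents
    0, 0, 0, 0, 0),  # "never": present in no document
  nrow = 5,
  dimnames = list(paste0("doc", 1:5), c("the", "rare", "never"))
)

# Assumed definition: sparsity = share of documents where the feature is absent.
feature_sparsity <- colMeans(counts == 0)

# Hypothetical helper (not a package function): keep features whose
# sparsity is at or below the cap.
drop_sparse <- function(m, sparse) {
  m[, colMeans(m == 0) <= sparse, drop = FALSE]
}

kept <- drop_sparse(counts, sparse = 0.5)
colnames(kept)
```

Under this reading, a cap of 1 keeps every column, which matches the "1 = include all features" note in the argument description.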

Value

A matrix of feature counts.
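
To make the shape of such a count matrix concrete, here is a minimal base R sketch that tallies unigrams and bigrams for two toy texts. It is not the package's implementation (which builds on quanteda); it only assumes whitespace tokenization, lowercasing, and an underscore as the joiner for multi-word ngrams. The helper make_ngrams is hypothetical.

```r
# Hypothetical helper: all ngrams of length n from one whitespace-tokenized text.
make_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = "_"),
         character(1))
}

texts <- c("good work team", "good work")

# Collect unigrams and bigrams per document, as ngrams = 1:2 would request.
doc_grams <- lapply(texts, function(tx) c(make_ngrams(tx, 1), make_ngrams(tx, 2)))
vocab <- sort(unique(unlist(doc_grams)))

# One row per document, one column per ngram; cells hold counts.
counts <- t(vapply(doc_grams,
                   function(g) as.numeric(table(factor(g, levels = vocab))),
                   numeric(length(vocab))))
colnames(counts) <- vocab
dim(counts)
```

Each text becomes one row and each distinct ngram one column, so the column count grows as longer ngram lengths are added.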

Details

This function produces ngram featurizations of text based on the quanteda package. It complements the doc2concrete function by demonstrating how to build a feature set for training a new detection algorithm in other contexts.
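
When a feature set is used to train a model, new texts generally need the same columns, in the same order, as the training matrix, which is the situation the vocabmatch argument addresses. The base R sketch below shows one way such an alignment could work; it is an assumption about the general idea, not the package's code. The helper align_vocab and the toy matrices are hypothetical.

```r
# Hypothetical helper: give `new` exactly the columns of `old`, in the same order.
# Tokens missing from `new` become all-zero columns; tokens unseen in `old` are dropped.
align_vocab <- function(new, old) {
  out <- matrix(0, nrow = nrow(new), ncol = ncol(old),
                dimnames = list(rownames(new), colnames(old)))
  shared <- intersect(colnames(old), colnames(new))
  out[, shared] <- new[, shared]
  out
}

# Toy count matrices (made-up data): training vocabulary vs. new-text vocabulary.
train <- matrix(c(1, 0, 2, 1), nrow = 2,
                dimnames = list(NULL, c("good", "work")))
test  <- matrix(c(3, 1, 0, 4), nrow = 2,
                dimnames = list(NULL, c("work", "unseen")))

aligned <- align_vocab(test, train)
colnames(aligned)
```

With the package itself, the equivalent step would presumably be to pass the earlier count matrix as vocabmatch when calling ngramTokens on the new texts.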

Examples

# NOT RUN {
library(doc2concrete)
# Unigram features only
dim(ngramTokens(feedback_dat$feedback, ngrams=1))
# Unigrams, bigrams and trigrams
dim(ngramTokens(feedback_dat$feedback, ngrams=1:3))
# }