character Which words should be stemmed? Defaults to "all".
ngrams
numeric Vector of ngram lengths to be included. Default is 1 (i.e. unigrams only).
language
character Language for stemming. Default is "english".
punct
logical Should punctuation be kept as tokens? Default is TRUE.
stop.words
logical Should stop words be kept? Default is TRUE.
number.words
logical Should numbers be kept as words? Default is TRUE.
overlap
numeric Threshold (as cosine distance) for including ngrams that constitute other included phrases. Default is 1 (i.e. all ngrams included).
sparse
numeric Maximum feature sparsity for inclusion (1 = include all features).
verbose
logical Should the package report token counts after each ngram level? Useful for long-running code. Default is FALSE.
vocabmatch
matrix Should the new token count matrix be coerced to include the same tokens as a previous count matrix? Default is NULL (i.e. no token match).
num.mc.cores
numeric Number of cores for parallel processing; see parallel::detectCores(). Default is 1.
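The sparse and vocabmatch arguments are easiest to picture on a small count matrix. The following base R sketch only illustrates the two ideas and is not the package's internal code. The helper names keep_sparse and match_vocab are hypothetical, the counts are made up, and the sketch assumes that sparsity means the share of documents in which a token has a zero count.

```r
# Toy document-by-token count matrix (3 documents, 4 tokens); values are made up.
counts <- matrix(
  c(1, 0, 2,   # "thank"
    0, 0, 1,   # "please"
    3, 1, 0,   # "you"
    0, 0, 1),  # "rare"
  nrow = 3,
  dimnames = list(paste0("doc", 1:3), c("thank", "please", "you", "rare"))
)

# Hypothetical helper: keep tokens whose share of zero-count documents
# is at most `sparse`. With sparse = 1 every token is kept.
keep_sparse <- function(m, sparse) {
  m[, colMeans(m == 0) <= sparse, drop = FALSE]
}

# Hypothetical helper: coerce `new` to the columns of `old`.
# Tokens missing from `new` are filled with zeros; tokens not in `old` are dropped.
match_vocab <- function(new, old) {
  out <- matrix(0, nrow = nrow(new), ncol = ncol(old),
                dimnames = list(rownames(new), colnames(old)))
  shared <- intersect(colnames(old), colnames(new))
  out[, shared] <- new[, shared, drop = FALSE]
  out
}

trimmed <- keep_sparse(counts, sparse = 0.5)

# Counts from a second, made-up corpus with a partly different vocabulary.
new_counts <- matrix(
  c(2, 1,   # "you"
    1, 0),  # "welcome"
  nrow = 2,
  dimnames = list(paste0("new", 1:2), c("you", "welcome"))
)

matched <- match_vocab(new_counts, trimmed)
print(colnames(trimmed))
print(matched)
```

Matching the vocabulary in this way keeps the columns of a new count matrix aligned with those of the matrix a model was trained on, which is the usual reason to supply a previous matrix.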
Value
A matrix of feature counts.
Details
This function produces ngram featurizations of text based on the quanteda package. This provides a complement to the doc2concrete function by demonstrating how to build a feature set for training a new detection algorithm in other contexts.
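As a rough sketch of that workflow, the code below builds unigram and bigram counts for a small training set and then coerces a second set of texts to the same vocabulary. It assumes that the function documented here is ngramTokens in the doc2concrete package and that it takes the texts as its first argument; neither detail is stated in this section. The example texts are made up, and the call is skipped when the package is not installed.

```r
# Made-up example texts; any character vector of documents would do.
train_texts <- c("Thank you so much for your help!",
                 "Could you send the report by Friday?",
                 "We need 3 more examples, please.")
new_texts <- c("Thank you for the report.",
               "Please send 2 examples.")

if (requireNamespace("doc2concrete", quietly = TRUE)) {
  # Assumption: the function is doc2concrete::ngramTokens, texts passed first.
  train_counts <- doc2concrete::ngramTokens(
    train_texts,
    ngrams = 1:2,        # unigrams and bigrams
    punct = FALSE,       # do not keep punctuation as tokens
    stop.words = TRUE,   # keep stop words
    sparse = 1           # include all features, however rare
  )

  # Coerce the new counts to the training vocabulary.
  new_counts <- doc2concrete::ngramTokens(
    new_texts,
    ngrams = 1:2,
    punct = FALSE,
    stop.words = TRUE,
    sparse = 1,
    vocabmatch = train_counts
  )

  print(dim(train_counts))
  print(dim(new_counts))
}
```

With vocabmatch supplied, the second matrix should have the same columns as the first, so a model fitted on train_counts can be applied to new_counts directly.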