tokenize(x, ...)
## S3 method for class 'character':
tokenize(x, what = c("word", "sentence", "character",
"fastestword", "fasterword"), removeNumbers = FALSE, removePunct = FALSE,
removeSeparators = TRUE, removeTwitter = FALSE, removeHyphens = FALSE,
ngrams = 1L, skip = 0L, concatenator = "_", simplify = FALSE,
verbose = FALSE, ...)
## S3 method for class 'corpus':
tokenize(x, ...)
is.tokenizedTexts(x)
"word"
.
Available alternatives are c("character", "word", "line_break",
"sentence")
. See stringi-search-boundaries.
removePunct=FALSE
. Only
applicable for what = "character"
(when you wish to keep the Twitter characters @
and #); set to
FALSE
if you wish to eliminate these.
TRUE
, split words that are connected by
hyphenation and hyphenation-like characters in bet