Construct a tokens object, either by importing a named list of characters from an external tokenizer, or by calling the internal quanteda tokenizer.
tokens(
  x,
  what = "word",
  remove_punct = FALSE,
  remove_symbols = FALSE,
  remove_numbers = FALSE,
  remove_url = FALSE,
  remove_separators = TRUE,
  split_hyphens = FALSE,
  include_docvars = TRUE,
  padding = FALSE,
  verbose = quanteda_options("verbose"),
  ...
)
what
character; which tokenizer to use. The default what = "word" is the version 2 quanteda tokenizer. Legacy tokenizers (version < 2) are also supported, including what = "word1", which approximates the pre-version 2 default. See the Details and quanteda Tokenizers below.
remove_punct
logical; if TRUE, remove all characters in the Unicode "Punctuation" [P] class, with exceptions for those used as prefixes for valid social media tags if preserve_tags = TRUE.
remove_symbols
logical; if TRUE, remove all characters in the Unicode "Symbol" [S] class.
remove_numbers
logical; if TRUE, remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day.
remove_url
logical; if TRUE, find and eliminate URLs beginning with http(s).
remove_separators
logical; if TRUE, remove separators and separator characters (Unicode "Separator" [Z] and "Control" [C] categories).
split_hyphens
logical; if TRUE, split words that are connected by hyphenation and hyphenation-like characters in between words, e.g. "self-aware" becomes c("self", "-", "aware").
include_docvars
if TRUE, pass docvars through to the tokens object. Does not apply when the input is a character vector or a list of characters.
padding
if TRUE, leave an empty string where the removed tokens previously existed. This is useful if a positional match is needed between the pre- and post-selected tokens, for instance if a window of adjacency needs to be computed.
verbose
if TRUE, print timing messages to the console.
...
used to pass arguments among the functions.
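The effect of padding is easiest to see side by side with a plain removal. The following is a minimal sketch, assuming quanteda (version 2 or later) is installed; the text and object names are illustrative only.

```r
library(quanteda)

# Illustrative input; any character vector would do
txt <- c(doc1 = "One, two; three.")

# Punctuation tokens are dropped entirely, so positions shift
toks_plain <- tokens(txt, remove_punct = TRUE)

# Punctuation tokens are replaced by empty strings, so positions are kept
toks_pad <- tokens(txt, remove_punct = TRUE, padding = TRUE)

as.list(toks_plain)$doc1
as.list(toks_pad)$doc1
```

With padding = TRUE, the padded object keeps one slot per original token, which is what makes window-based computations line up with the unprocessed text.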
A quanteda tokens class object, by default a serialized list of integers corresponding to a vector of types.
The default word tokenizer what = "word" splits tokens using stri_split_boundaries(x, type = "word") but by default preserves infix hyphens (e.g. "self-funding"), URLs, social media "tag" characters (#hashtags and @usernames), and email addresses. The rules defining a valid "tag" can be found here for hashtags and here for usernames.
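To see what the default tokenizer adds on top of the raw boundary split, the two can be run on the same string. This is a sketch only, assuming both stringi and quanteda (version 2 or later) are installed; the sample text is made up for illustration.

```r
library(quanteda)

# Illustrative input containing an infix hyphen and a hashtag
txt <- "A self-funding project #textanalysis"

# Raw ICU word-boundary split, the starting point named above
raw <- stringi::stri_split_boundaries(txt, type = "word")

# Default quanteda tokenizer, which keeps infix hyphens and tags intact
toks <- tokens(txt)

raw[[1]]
as.list(toks)[[1]]
```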
In versions < 2, the argument remove_twitter controlled whether social media tags were preserved or removed, even when remove_punct = TRUE. This argument is no longer functional in versions >= 2. If greater control over social media tags is desired, you should use an alternative tokenizer, including non-quanteda options.
For backward compatibility, the following older tokenizers are also supported through what:
"word1"
(legacy) implements similar behaviour to the version of what = "word" found in pre-version 2. (It preserves social media tags and infix hyphens, but splits URLs.) "word1" is also slower than "word".
"fasterword"
(legacy) splits on whitespace and control characters, using stringi::stri_split_charclass(x, "[\\p{Z}\\p{C}]+")
"fastestword"
(legacy) splits on the space character, using stringi::stri_split_fixed(x, " ")
"character"
tokenization into individual characters
"sentence"
sentence segmenter based on stri_split_boundaries, but with additional rules to avoid splits on words like "Mr." that would otherwise incorrectly be detected as sentence boundaries. For better sentence tokenization, consider using spacyr.
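The tokenizers listed above can be compared directly by varying what on the same input. A minimal sketch, assuming quanteda (version 2 or later) is installed; the sample sentence and object names are illustrative.

```r
library(quanteda)

# Illustrative input with an abbreviation and an infix hyphen
txt <- "Mr. Smith is self-aware. He said so."

toks_word     <- tokens(txt, what = "word")         # default tokenizer
toks_faster   <- tokens(txt, what = "fasterword")   # split on whitespace and control characters
toks_fastest  <- tokens(txt, what = "fastestword")  # split on the space character only
toks_char     <- tokens(txt, what = "character")    # individual characters
toks_sentence <- tokens(txt, what = "sentence")     # sentence segmentation

# Print each result to inspect the differences
toks_word
toks_faster
toks_fastest
toks_char
toks_sentence
```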
As of version 2, the choice of tokenizer is left more to the user, and tokens() is treated more as a constructor (from a named list) than a tokenizer. This allows users to use any other tokenizer that returns a named list, and to use this as an input to tokens(), with removal and splitting rules applied after this has been constructed (passed as arguments). These removal and splitting rules are conservative and will not remove or split anything, however, unless the user requests it.
Using external tokenizers is best done by piping the output from these other tokenizers into the tokens() constructor, with additional removal and splitting options applied at the construction stage. These options only have an effect, however, if the tokens whose removal is specified in the tokens() call actually exist in the input. For instance, it is impossible to remove punctuation if the input list to tokens() already had its punctuation tokens removed at the external tokenization stage.
To construct a tokens object from a list with no additional processing, call as.tokens() instead of tokens().
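The difference between the two entry points can be sketched with a hand-built named list standing in for the output of an external tokenizer. This assumes quanteda (version 2 or later) is installed; the list contents and object names are hypothetical.

```r
library(quanteda)

# Hypothetical output of an external tokenizer: a named list of character vectors
ext <- list(doc1 = c("Hello", ",", "world", "!"),
            doc2 = c("No", "punctuation", "here"))

# as.tokens() wraps the list as-is, with no removal or splitting
toks_asis <- as.tokens(ext)

# tokens() constructs the object and then applies the requested removal
toks_clean <- tokens(ext, remove_punct = TRUE)

as.list(toks_asis)$doc1
as.list(toks_clean)$doc1
```

Note that doc2 contains no punctuation tokens, so remove_punct = TRUE has nothing to act on there, which is the situation described above for input that was already stripped externally.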
Recommended tokenizers are those from the tokenizers package, which are generally faster than the default (built-in) tokenizer but always split infix hyphens, or those from spacyr.
tokens_ngrams(), tokens_skipgrams(), as.list.tokens(), as.tokens()
# NOT RUN {
txt <- c(doc1 = "A sentence, showing how tokens() works.",
         doc2 = "@quantedainit and #textanalysis https://example.com?p=123.",
         doc3 = "Self-documenting code??",
         doc4 = "£1,000,000 for 50¢ is gr8 4ever \U0001f600")
tokens(txt)
tokens(txt, what = "word1")
# removing punctuation marks but keeping tags and URLs
tokens(txt[1:2], remove_punct = TRUE)
# splitting hyphenated words
tokens(txt[3])
tokens(txt[3], split_hyphens = TRUE)
# symbols and numbers
tokens(txt[4])
tokens(txt[4], remove_numbers = TRUE)
tokens(txt[4], remove_numbers = TRUE, remove_symbols = TRUE)
# }
# NOT RUN {
# using other tokenizers
tokens(tokenizers::tokenize_words(txt[4]), remove_symbols = TRUE)
tokenizers::tokenize_words(txt, lowercase = FALSE, strip_punct = FALSE) %>%
    tokens(remove_symbols = TRUE)
tokenizers::tokenize_characters(txt[3], strip_non_alphanum = FALSE) %>%
    tokens(remove_punct = TRUE)
tokenizers::tokenize_sentences(
    "The quick brown fox. It jumped over the lazy dog.") %>%
    tokens()
# }