unnest_tokens
Split a column into tokens using the tokenizers package
Split a column into tokens using the tokenizers package, splitting the table into one-token-per-row. unnest_tokens_ is the standard evaluation version.
Usage
unnest_tokens(tbl, output, input, token = "words", format = c("text", "man",
  "latex", "html", "xml"), to_lower = TRUE, drop = TRUE, collapse = NULL,
  ...)

unnest_tokens_(tbl, output, input, token = "words", format = c("text",
  "man", "latex", "html", "xml"), to_lower = TRUE, drop = TRUE,
  collapse = NULL, ...)
Arguments
- tbl
Data frame
- output
Output column to be created, given as a bare name.
- input
Input column to be split, given as a bare name.
- token
Unit for tokenizing, or a custom tokenizing function. Built-in options are "words" (default), "characters", "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", and "regex". If a function, it should take a character vector and return a list of character vectors of the same length.
- format
Either "text", "man", "latex", "html", or "xml". If not "text", this uses the hunspell tokenizer and can tokenize only by "words".
- to_lower
Whether to convert the tokens in the output column to lowercase.
- drop
Whether the original input column should be dropped. Ignored if the original input and new output column have the same name.
- collapse
Whether to combine text with newlines first in case tokens (such as sentences or paragraphs) span multiple lines. If NULL, collapses when token method is "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", or "regex".
- ...
Extra arguments passed on to the tokenizer, such as n and k for "ngrams" and "skip_ngrams" or pattern for "regex".
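The contract for a custom tokenizing function described under token (a character vector in, a list of character vectors of the same length out) can be sketched in base R. This is a minimal illustration only: the name split_on_spaces is hypothetical and is not part of tidytext or tokenizers.

```r
# Hypothetical custom tokenizer (the name is illustrative only).
# It takes a character vector and returns a list holding one
# character vector of tokens per input element.
split_on_spaces <- function(x, ...) {
  strsplit(x, " ", fixed = TRUE)
}

txt <- c("it is a truth", "universally acknowledged")
tokens <- split_on_spaces(txt)

# The result is a list with the same length as the input.
is.list(tokens)
length(tokens) == length(txt)
```

Assuming it satisfies that contract, a function like this could presumably be supplied as token = split_on_spaces in place of one of the built-in unit names.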
Details
If the unit for tokenizing is ngrams, skip_ngrams, sentences, lines, paragraphs, or regex, the entire input will be collapsed together before tokenizing.
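To see why collapsing matters for units that can span rows, here is a rough base R sketch. It only approximates the idea by joining rows with newlines, as the collapse argument describes; it is not the package's actual implementation, and the variable names are made up for illustration.

```r
# Two input rows that together hold a single sentence.
rows <- c("This sentence starts on one row",
          "and ends on the next row.")

# Assumption: collapsing is approximated by joining rows with newlines.
collapsed <- paste(rows, collapse = "\n")

length(rows)       # 2 separate strings before collapsing
length(collapsed)  # 1 string after collapsing
```

Tokenizing each row separately would treat the two halves as unrelated text, while the collapsed string lets a sentence or paragraph tokenizer see the whole unit at once.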
If format is anything other than "text", this uses the hunspell_parse tokenizer instead of the tokenizers package. This does not yet have support for tokenizing by any unit other than words.
Examples
# NOT RUN {
library(dplyr)
library(janeaustenr)

d <- data_frame(txt = prideprejudice)
d

d %>%
  unnest_tokens(word, txt)

d %>%
  unnest_tokens(sentence, txt, token = "sentences")

d %>%
  unnest_tokens(ngram, txt, token = "ngrams", n = 2)

d %>%
  unnest_tokens(ngram, txt, token = "skip_ngrams", n = 4, k = 2)

d %>%
  unnest_tokens(chapter, txt, token = "regex", pattern = "Chapter [\\d]")

# custom function
d %>%
  unnest_tokens(word, txt, token = stringr::str_split, pattern = " ")

# tokenize HTML
h <- data_frame(row = 1:2,
                text = c("<h1>Text <b>is<b>", "<a href='example.com'>here</a>"))

h %>%
  unnest_tokens(word, text, format = "html")
# }