Segment tokens by splitting on a pattern match. This is useful for breaking the tokenized texts into smaller document units, based on a regular pattern or a user-supplied annotation. While it normally makes more sense to do this at the corpus level (see corpus_segment), tokens_segment provides the option to perform this operation on tokens.
tokens_segment(x, pattern, valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE, extract_pattern = FALSE,
pattern_position = c("before", "after"), use_docvars = TRUE)
x	tokens object whose token elements will be segmented

pattern	a character vector, list of character vectors, dictionary, collocations, or dfm. See pattern for details.

valuetype	the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive	ignore case when matching, if TRUE

extract_pattern	remove matched patterns from the texts and save them in docvars, if TRUE

pattern_position	either "before" or "after", depending on whether the pattern precedes the text (as with a tag) or follows the text (as with punctuation delimiters)

use_docvars	if TRUE, repeat the docvar values for each segmented text; if FALSE, drop the docvars in the segmented tokens object. Dropping the docvars might be useful in order to conserve space or if these are not desired for the segmented result.
tokens_segment returns a tokens object whose documents have been split by patterns.
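As a minimal sketch of the returned object (assuming quanteda is installed; the segment document names shown in the comment assume quanteda's usual docname.segment naming convention):

```r
library(quanteda)

toks <- tokens("One sentence. Another sentence.")
segs <- tokens_segment(toks, ".", valuetype = "fixed",
                       pattern_position = "after")
# each sentence becomes its own document in the returned tokens object,
# e.g. names like "text1.1" and "text1.2"
ndoc(segs)
docnames(segs)
```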
txts <- "Fellow citizens, I am again called upon by the voice of my country to
execute the functions of its Chief Magistrate. When the occasion proper for
it shall arrive, I shall endeavor to express the high sense I entertain of
this distinguished honor."
toks <- tokens(txts)
# split by any punctuation
toks_punc <- tokens_segment(toks, c(".", "?", "!"), valuetype = "fixed",
pattern_position = "after")
# the same segmentation, using a regular expression for
# sentence-terminal punctuation
toks_punc <- tokens_segment(toks, "^\\p{Sterm}$", valuetype = "regex",
                            extract_pattern = FALSE,
                            pattern_position = "after")
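A hedged sketch of extract_pattern, continuing from the toks object above (assuming the matched delimiters are stored in a docvar named "pattern", as they are for corpus_segment):

```r
# remove the sentence-ending punctuation from the tokens themselves
# and keep it as a document variable instead
toks_extract <- tokens_segment(toks, "^\\p{Sterm}$", valuetype = "regex",
                               extract_pattern = TRUE,
                               pattern_position = "after")
docvars(toks_extract)   # the matched delimiter for each segment
```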