tokens_segment: Segment tokens object by patterns

Description

Segment a tokens object by splitting its documents wherever a pattern matches. This is useful for breaking tokenized texts into smaller document units, based on a regular pattern or a user-supplied annotation. While it normally makes more sense to do this at the corpus level (see corpus_segment()), tokens_segment() provides the option to perform this operation on tokens.

Usage

tokens_segment(
  x,
  pattern,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  extract_pattern = FALSE,
  pattern_position = c("before", "after"),
  use_docvars = TRUE
)

Value

tokens_segment returns a tokens object whose documents have been split at the pattern matches.

Arguments

x: tokens object whose token elements will be segmented
pattern: a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.
valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
case_insensitive: logical; if TRUE, ignore case when matching a pattern or dictionary values
extract_pattern: logical; if TRUE, remove the matched patterns from the tokens and save them in the docvars
pattern_position: either "before" or "after", depending on whether the pattern precedes the text (as with a tag) or follows the text (as with punctuation delimiters)
use_docvars: logical; if TRUE, repeat the docvar values for each segment; if FALSE, drop the docvars from the segmented tokens object. Dropping the docvars can save space, or is useful when they are not wanted in the segmented object.
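The effect of extract_pattern and use_docvars can be sketched on a toy input. This is a minimal illustration, not part of the original examples: the document d1 and the docvar year are made up, and the tokens are built with as.tokens() so that the result does not depend on tokenizer settings.

```r
library(quanteda)

# Hypothetical pre-tokenized document (names and values are illustrative)
toks <- as.tokens(list(d1 = c("One", "two", ".", "Three", "four", "?")))
docvars(toks, "year") <- 1793

# The delimiters follow each segment, so pattern_position = "after";
# with the default extract_pattern = FALSE they stay in the tokens
kept <- tokens_segment(toks, c(".", "?"), valuetype = "fixed",
                       pattern_position = "after")

# extract_pattern = TRUE removes the delimiters from the tokens
extracted <- tokens_segment(toks, c(".", "?"), valuetype = "fixed",
                            extract_pattern = TRUE,
                            pattern_position = "after")

# use_docvars = FALSE drops the user docvars instead of repeating them
bare <- tokens_segment(toks, c(".", "?"), valuetype = "fixed",
                       pattern_position = "after",
                       use_docvars = FALSE)

ntoken(kept)            # token counts per segment, delimiters included
ntoken(extracted)       # token counts per segment, delimiters removed
docvars(kept, "year")   # docvar repeated for each segment
names(docvars(bare))    # "year" should no longer be present
```

Comparing ntoken(kept) with ntoken(extracted) shows whether the delimiters were retained, and inspecting the docvars of each result shows how use_docvars changes what is carried over to the segments.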

Examples


library(quanteda)

txts <- "Fellow citizens, I am again called upon by the voice of my country to
execute the functions of its Chief Magistrate. When the occasion proper for
it shall arrive, I shall endeavor to express the high sense I entertain of
this distinguished honor."
toks <- tokens(txts)

# split after sentence-terminating punctuation
tokens_segment(toks, "^\\p{Sterm}$", valuetype = "regex",
               extract_pattern = TRUE,
               pattern_position = "after")
tokens_segment(toks, c(".", "?", "!"), valuetype = "fixed",
               extract_pattern = TRUE,
               pattern_position = "after")
