
textpress (version 1.1.0)

nlp_tokenize_text: Tokenize Text Data (mostly) Non-Destructively

Description

Tokenizes text from a corpus data frame, preserving structure like capitalization and punctuation.

Usage

nlp_tokenize_text(
  corpus,
  by = c("doc_id", "paragraph_id", "sentence_id"),
  include_spans = TRUE,
  method = "word"
)

Value

A named list of token vectors, or, if include_spans = TRUE, a list containing both the tokens and their character spans.

Arguments

corpus

A data frame or data.table containing a text column and the identifiers specified in by.

by

A character vector of column names used as unique identifiers. The last column determines the unit of tokenization (e.g., if by = c("doc_id", "paragraph_id"), tokens are grouped at the paragraph level).

include_spans

Logical. If TRUE (the default), character start/end spans are returned for each token.

method

Character. The tokenization method: "word" (the default) or "biber".

Examples

corpus <- data.frame(doc_id = c('1', '1', '2'),
                     sentence_id = c('1', '2', '1'),
                     text = c("Hello world.",
                              "This is an example.",
                              "This is a party!"))
tokens <- nlp_tokenize_text(corpus, by = c('doc_id', 'sentence_id'))
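A sketch extending the example above, showing the include_spans argument; the element-naming scheme (the by identifiers joined by a period, e.g. "1.1") is an assumption to verify against your installed version:

```r
# Tokens only, without character spans
tokens_only <- nlp_tokenize_text(corpus,
                                 by = c('doc_id', 'sentence_id'),
                                 include_spans = FALSE)

# Inspect the tokens for the first unit; names are assumed to
# combine the `by` identifiers (e.g. "1.1" = doc 1, sentence 1)
tokens_only[[1]]
```

Because tokenization is (mostly) non-destructive, capitalization and punctuation such as "Hello" and "." are preserved as separate tokens rather than stripped.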
