Splits the text column of a data frame into individual sentences, keeping the identifier columns named in by. Abbreviations listed in abbreviations are taken into account during splitting.
Usage
nlp_split_sentences(
corpus,
by = c("doc_id"),
abbreviations = textpress::abbreviations
)
Value
A data.table with the columns named in by, plus sentence_id, text, start, and end.
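A minimal sketch of a basic call, assuming the textpress package is installed. The toy corpus and the object names corpus and sentences are made up for illustration; only the function name, its arguments, and the returned column names come from this page.

```r
library(textpress)

# Hypothetical two-document corpus with the default identifier column
# (doc_id) and the required text column.
corpus <- data.frame(
  doc_id = c("1", "2"),
  text = c(
    "Dr. Smith arrived at noon. The meeting started late.",
    "Prices rose sharply. Analysts were surprised."
  ),
  stringsAsFactors = FALSE
)

# Split each document into sentences using the default abbreviation list.
sentences <- nlp_split_sentences(corpus, by = c("doc_id"))

# Per the Value section, the result should carry doc_id plus
# sentence_id, text, start, and end.
names(sentences)
```

Inspecting names(sentences) is a quick way to confirm that the identifier columns passed in by were carried through to the result.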
Arguments
corpus
A data frame or data.table containing a text column and the identifiers specified in by.
by
A character vector of column names used as unique identifiers.
The last column determines the unit that is split (e.g., with by = c("doc_id", "para_id"),
each paragraph is split into sentences separately).
abbreviations
A character vector of abbreviations to handle during sentence splitting. Defaults to textpress::abbreviations.
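The arguments above can be combined as in the following sketch, which assumes the textpress package is installed. The paragraph-level corpus, the para_id values, and the extra abbreviation "approx." are hypothetical. The exact format of entries in textpress::abbreviations (for example, whether the trailing period is included) is also an assumption, so it is worth checking head(textpress::abbreviations) before extending the list.

```r
library(textpress)

# Hypothetical corpus identified at the paragraph level.
para_corpus <- data.frame(
  doc_id = c("a", "a", "b"),
  para_id = c(1L, 2L, 1L),
  text = c(
    "The sample weighed approx. 5 kg. It was stored overnight.",
    "Results were recorded the next day.",
    "A second batch arrived later. It was processed the same way."
  ),
  stringsAsFactors = FALSE
)

# Extend the default abbreviation list with one extra entry.
# The entry format used here is an assumption, not taken from the package.
custom_abbreviations <- c(textpress::abbreviations, "approx.")

# Use doc_id and para_id together as the identifier.
para_sentences <- nlp_split_sentences(
  para_corpus,
  by = c("doc_id", "para_id"),
  abbreviations = custom_abbreviations
)

# Both identifier columns should appear alongside the sentence columns.
names(para_sentences)
```

Passing the full vector c(textpress::abbreviations, ...) rather than only the new entries keeps the package defaults in effect, since the argument replaces the default list instead of adding to it.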