Removes sentences from a corpus or a character vector that are shorter than a specified minimum length, longer than a specified maximum length, or that match an exclusion pattern.
corpus_trimsentences(x, min_length = 1, max_length = 10000,
  exclude_pattern = NULL, return_tokens = FALSE)

char_trimsentences(x, min_length = 1, max_length = 10000,
  exclude_pattern = NULL)
x: corpus or character object whose sentences will be selected.
min_length, max_length: minimum and maximum sentence lengths, in word tokens (excluding punctuation).
exclude_pattern: a stringi regular expression; any sentence it matches (matching is done at the sentence level) is excluded.
return_tokens: if TRUE, return a tokens object of the sentences that remain after trimming; otherwise return the input object type with the trimmed sentences removed.
a corpus or character vector equal in length to the input, or
a tokenized set of sentences if return_tokens = TRUE. If the input was a corpus, then all
docvars and metadata are preserved. For documents whose sentences have
been removed entirely, a null string ("") will be returned.
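The trimming rules described above can be approximated in base R. The sketch below is only an illustration, not the package's implementation: trim_sentences_sketch is a hypothetical helper, its sentence splitting (break after ., ! or ? followed by whitespace) and word counting (strip punctuation, split on whitespace) are simplifying assumptions, and it uses base R's PCRE regular expressions rather than stringi. It keeps one output element per input document and returns "" when every sentence in a document is removed.

```r
# Hypothetical helper, NOT part of quanteda: approximates the trimming rules in base R.
trim_sentences_sketch <- function(x, min_length = 1, max_length = 10000,
                                  exclude_pattern = NULL) {
  vapply(x, function(doc) {
    # Assumption: a sentence ends at ., ! or ? followed by whitespace.
    sents <- strsplit(doc, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
    # Count word tokens per sentence after stripping punctuation.
    n_words <- vapply(sents, function(s) {
      words <- strsplit(gsub("[[:punct:]]", "", s), "\\s+")[[1]]
      sum(nzchar(words))
    }, integer(1))
    # Keep sentences whose length falls inside [min_length, max_length].
    keep <- n_words >= min_length & n_words <= max_length
    # Drop sentences that match the exclusion pattern, if one was given.
    if (!is.null(exclude_pattern)) {
      keep <- keep & !grepl(exclude_pattern, sents, perl = TRUE)
    }
    # Reassemble the document; yields "" when no sentence survives.
    paste(sents[keep], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

txt <- c("PAGE 1. A single sentence. Short sentence. Three word sentence.",
         "PAGE 2. Very short! Shorter.",
         "Very long sentence, with three parts, separated by commas. PAGE 3.")
trim_sentences_sketch(txt, min_length = 3)
trim_sentences_sketch(txt, exclude_pattern = "^PAGE \\d+")
```

Because the sketch uses its own simplified splitting and counting, its results may differ from those of corpus_trimsentences() or char_trimsentences() on text with abbreviations, numbers, or unusual punctuation.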
# NOT RUN {
txt <- c("PAGE 1. A single sentence. Short sentence. Three word sentence.",
"PAGE 2. Very short! Shorter.",
"Very long sentence, with three parts, separated by commas. PAGE 3.")
mycorp <- corpus(txt, docvars = data.frame(serial = 1:3))
texts(mycorp)
# exclude sentences shorter than 3 tokens
texts(corpus_trimsentences(mycorp, min_length = 3))
# exclude sentences that start with "PAGE <digit(s)>"
texts(corpus_trimsentences(mycorp, exclude_pattern = "^PAGE \\d+"))
# on a character
char_trimsentences(txt, min_length = 3)
char_trimsentences(txt, exclude_pattern = "sentence\\.")
# }