idiolect (version 1.0.1)

tokenize_sents: Tokenize to sentences

Description

This function turns a corpus of texts into a quanteda tokens object of sentences.

Usage

tokenize_sents(corpus, model = "en_core_web_sm")

Value

A quanteda tokens object where each token is a sentence.

Arguments

corpus

A quanteda corpus object, typically the output of the create_corpus() function or the output of contentmask().

model

The spaCy model to use. The default is "en_core_web_sm".

Details

The function first splits each text into paragraphs at newline characters and then uses spaCy to tokenize each paragraph into sentences. The input can be a plain-text corpus or the output of contentmask(). This step is required to prepare the data for lambdaG().
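
The paragraph-splitting step described above can be approximated in base R. This is only a minimal sketch of the idea, not the package's actual implementation: split_paragraphs() is a hypothetical helper, and trimming whitespace and dropping empty paragraphs are assumptions made for illustration. The spaCy sentence segmentation that follows is not reproduced here.

```r
# Illustrative only: a base-R approximation of the paragraph-splitting step.
# split_paragraphs() is a hypothetical helper, not part of idiolect.
split_paragraphs <- function(text) {
  parts <- strsplit(text, "\n", fixed = TRUE)[[1]]  # split at newline characters
  parts <- trimws(parts)                            # assumption: trim surrounding whitespace
  parts[nzchar(parts)]                              # assumption: discard empty paragraphs
}

toy <- "the N was on the N . he did n't move \n N ; \n N N"
paragraphs <- split_paragraphs(toy)
print(paragraphs)
```

For the toy text this yields three paragraphs, each of which would then be passed to the sentence tokenizer separately, so a sentence never spans a line break.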

Examples

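if (FALSE) {
# Not run automatically: requires spaCy and the "en_core_web_sm" model
library(idiolect)
# A toy content-masked text with two line breaks, i.e. three paragraphs
toy.pos <- corpus("the N was on the N . he did n't move \n N ; \n N N")
# Returns a quanteda tokens object in which each token is a sentence
tokenize_sents(toy.pos)
}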
