
textpress (version 1.1.0)

nlp_roll_chunks: Roll units into fixed-size chunks with optional context

Description

Groups consecutive rows at the finest level of by (e.g., sentences) into fixed-size chunks and optionally adds surrounding context, like a rolling window over the leaf units.

Usage

nlp_roll_chunks(corpus, by, chunk_size, context_size)

Value

A data.table with chunk_id, chunk (concatenated text), and chunk_plus_context.

Arguments

corpus

A data frame or data.table containing a text column and the identifiers specified in by.

by

A character vector of column names used as unique identifiers. The last column determines the search unit and is the level rolled into chunks (e.g., if by = c("doc_id", "sentence_id"), sentences are rolled into chunks).

chunk_size

Integer. Number of units per chunk.

context_size

Integer. Number of units of context around each chunk.

Examples

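library(textpress)

# Toy corpus: two sentences in document 1, one sentence in document 2
corpus <- data.frame(doc_id = c('1', '1', '2'),
                    sentence_id = c('1', '2', '1'),
                    text = c("Hello world.",
                             "This is an example.",
                             "This is a party!"))

# Roll sentences into chunks of 2 units, with 1 unit of context
chunks <- nlp_roll_chunks(corpus, by = c('doc_id', 'sentence_id'),
                          chunk_size = 2, context_size = 1)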
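
To make the windowing idea concrete, here is a minimal base-R sketch of rolling units into chunks with context. It is an illustration only, not the textpress implementation: the helper roll_units is hypothetical, it handles a single document's units, and it assumes non-overlapping chunks with context taken symmetrically from both sides and clipped at the document boundaries. Whether nlp_roll_chunks behaves the same way in each of these respects is not stated above.

```r
# Hypothetical helper for illustration; NOT part of textpress.
# Assumes: one document, non-overlapping chunks, symmetric context.
roll_units <- function(units, chunk_size, context_size) {
  stopifnot(length(units) > 0, chunk_size >= 1, context_size >= 0)
  n <- length(units)
  starts <- seq(1, n, by = chunk_size)  # first unit of each chunk

  rows <- lapply(seq_along(starts), function(i) {
    s  <- starts[i]
    e  <- min(s + chunk_size - 1, n)    # last unit of the chunk
    cs <- max(s - context_size, 1)      # context start, clipped at 1
    ce <- min(e + context_size, n)      # context end, clipped at n
    data.frame(
      chunk_id           = i,
      chunk              = paste(units[s:e], collapse = " "),
      chunk_plus_context = paste(units[cs:ce], collapse = " "),
      stringsAsFactors   = FALSE
    )
  })
  do.call(rbind, rows)
}

sentences <- c("A.", "B.", "C.", "D.", "E.")
rolled <- roll_units(sentences, chunk_size = 2, context_size = 1)
print(rolled)
```

With five units and chunk_size = 2, the sketch yields three chunks, the last holding the single leftover unit. To process several documents in this sketch, one could split the units by doc_id and apply the helper to each group.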
