tidy_text: Tidy and Split Narrative Text

Description

Processes narrative data either by splitting the text into sentences or by subsetting the data by comment type. It treats the comment types consistently and removes unwanted columns and duplicates.

Usage

tidy_text(narratives, split_in_sentences = TRUE)

Value

A data.table with one row per sentence (or per narrative entry when split_in_sentences = FALSE) and the following columns:

document: The document ID.
submissionid: The submission ID.
competencyid: The competency ID.
assistant: The assistant information.
portfolioid: The portfolio ID.
sentenceid: A unique identifier for each sentence (or narrative entry).
sentence: The cleaned-up sentence text.

Arguments

narratives: A data frame or data.table containing the narratives to be processed. The dataset should include columns representing different types of comments (e.g., sterk, verbeter, feedback).
split_in_sentences: A logical value indicating whether to split the text into individual sentences. If TRUE (the default), the narratives are split into sentences using regular expressions. If FALSE, the data are only subset by comment type.

Details

The tidy_text function processes a dataset of narratives in one of two modes, selected by split_in_sentences. The comment types are predefined as sterk (Dutch for "strong"), verbeter ("improve"), and feedback.

When split_in_sentences = TRUE, the function unnests the narrative data into sentences using regular expressions to identify sentence boundaries (periods, question marks, and exclamation marks).
When split_in_sentences = FALSE, the function subsets the data by comment type and keeps only the relevant columns.
The function also cleans the text by squishing whitespace and removing HTML tags.
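
The cleaning and sentence-splitting steps described here can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper names clean_text and split_sentences, the regular expressions, and the sample text are all assumptions made for this example, and the real function may detect boundaries differently.

```r
# Illustrative sketch only; helper names and patterns are hypothetical.

# Remove HTML tags, then squish whitespace (collapse runs, trim the ends).
clean_text <- function(x) {
  x <- gsub("<[^>]+>", " ", x)  # replace each HTML tag with a space
  x <- gsub("\\s+", " ", x)     # collapse runs of whitespace
  trimws(x)
}

# Split after a period, question mark, or exclamation mark
# that is followed by whitespace.
split_sentences <- function(x) {
  unlist(strsplit(x, "(?<=[.!?])\\s+", perl = TRUE))
}

txt <- "<p>Goed  gedaan.</p> Wat kan beter?   Meer oefenen!"
sentences <- split_sentences(clean_text(txt))

# One row per sentence, with a running identifier.
result <- data.frame(
  sentenceid = seq_along(sentences),
  sentence = sentences,
  stringsAsFactors = FALSE
)
print(result)
```

Splitting only where whitespace follows the punctuation keeps the terminating mark attached to its sentence. A simple pattern like this one will also split after abbreviations such as "e.g.", so a production splitter would probably need extra rules.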