lemmatize: Lemmatize Sentences Using a UDPipe Model

Description

This function processes a data frame of sentences with a UDPipe model to lemmatize them, applies corrections to the lemmas, and associates metadata (e.g., submission ID, document) with the processed sentences. If UDPipe is not available, the NLP package is used instead.

Usage

lemmatize(
  sentences,
  udpipe_model_file,
  corrections_file,
  language,
  use_seeds = TRUE
)

Value

A data frame containing the lemmatized sentences with the following columns:

doc_id: Document identifier.
lemma: The lemmatized form of each word.
upos: Universal part-of-speech tag.
sentenceid: The sentence ID for the sentence from which the lemma was extracted.
document: Document associated with the sentence.
submissionid: Submission ID associated with the sentence.
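
As a rough illustration of how the returned data frame might be used, the sketch below builds a small mock result with the columns listed above and counts noun lemmas per document. All values are invented for illustration; a real call to lemmatize() would supply them.

```r
# Mock result with the documented columns; the values are invented.
lemmas <- data.frame(
  doc_id       = c("doc1", "doc1", "doc1", "doc2"),
  lemma        = c("student", "read", "book", "book"),
  upos         = c("NOUN", "VERB", "NOUN", "NOUN"),
  sentenceid   = c(1L, 1L, 1L, 2L),
  document     = c("report_a.pdf", "report_a.pdf", "report_a.pdf", "report_b.pdf"),
  submissionid = c("S001", "S001", "S001", "S002"),
  stringsAsFactors = FALSE
)

# Keep noun lemmas and count how often each occurs per document.
nouns <- lemmas[lemmas$upos == "NOUN", ]
noun_counts <- as.data.frame(
  table(lemma = nouns$lemma, document = nouns$document),
  stringsAsFactors = FALSE
)

# Drop lemma/document combinations that never occur.
noun_counts <- noun_counts[noun_counts$Freq > 0, ]
```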

Arguments

sentences: A data frame containing sentences, with at least the columns sentenceid, sentence, document, and submissionid.
udpipe_model_file: A character string representing the path to the UDPipe model file used for annotation.
corrections_file: A character string representing the path to a CSV file containing corrections to be applied to the lemmas.
language: A character string representing the language of the dataset ('nl', 'en', 'de', or 'fr').
use_seeds: A logical value indicating whether to use the seeds (e.g., of an educational framework). Defaults to TRUE.
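
The input expected by sentences can be sketched as follows. Only the four column names come from the argument description above; the example values and the helper check_sentences() are hypothetical and not part of the package.

```r
# Hypothetical input; only the column names are taken from the documentation.
sentences <- data.frame(
  sentenceid   = c(1L, 2L),
  sentence     = c("The students read books.", "Teachers wrote reports."),
  document     = c("report_a.pdf", "report_b.pdf"),
  submissionid = c("S001", "S002"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stop early if a required column is missing.
check_sentences <- function(df) {
  required <- c("sentenceid", "sentence", "document", "submissionid")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

check_sentences(sentences)
```

Checking the columns before annotation may save time, since a missing column would otherwise only surface after the model has run.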

Details

This function loads a UDPipe model, annotates the input sentences, corrects lemmas based on the provided corrections file, and optionally keeps only noun tokens. It returns a data frame of lemmas, each linked to the sentence it was extracted from and to that sentence's metadata.
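
The correction, filtering, and metadata steps described above can be sketched in base R on a mock annotation table. This is an illustrative sketch, not the package's implementation: the corrections layout (columns lemma and correction), the use of doc_id to carry the sentence ID, and the helper apply_corrections_and_join() are all assumptions. In a real run, the annotation table would presumably come from the udpipe package rather than being typed in by hand.

```r
# Mock annotation output standing in for the UDPipe step (values invented).
annotated <- data.frame(
  doc_id = c("1", "1", "1", "2"),
  lemma  = c("datum", "show", "growth", "report"),
  upos   = c("NOUN", "VERB", "NOUN", "NOUN"),
  stringsAsFactors = FALSE
)

# Assumed corrections layout: one row per unwanted lemma and its replacement.
corrections <- data.frame(
  lemma      = "datum",
  correction = "data",
  stringsAsFactors = FALSE
)

# Sentence metadata with the documented columns.
sentences <- data.frame(
  sentenceid   = c(1L, 2L),
  sentence     = c("The data show growth.", "A report."),
  document     = c("report_a.pdf", "report_b.pdf"),
  submissionid = c("S001", "S002"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: correct lemmas, optionally keep nouns, attach metadata.
apply_corrections_and_join <- function(annotated, corrections, sentences,
                                       nouns_only = FALSE) {
  # Replace every lemma that has an entry in the corrections table.
  idx <- match(annotated$lemma, corrections$lemma)
  fix <- !is.na(idx)
  annotated$lemma[fix] <- corrections$correction[idx[fix]]

  if (nouns_only) {
    annotated <- annotated[annotated$upos == "NOUN", ]
  }

  # Assumption: doc_id holds the sentence ID as text.
  annotated$sentenceid <- as.integer(annotated$doc_id)
  meta <- sentences[, c("sentenceid", "document", "submissionid")]
  merge(annotated, meta, by = "sentenceid", sort = FALSE)
}

result <- apply_corrections_and_join(annotated, corrections, sentences,
                                     nouns_only = TRUE)
```

Using match() keeps the correction step vectorized, and merge() attaches document and submissionid to every remaining lemma by its sentence ID.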