This function processes and curates ENNTT (European Parliament) data from a
specified directory.
It handles both .dat files (containing XML metadata) and .tok files
(containing text content).
Usage
curate_enntt_data(dir_path)
Value
A tibble containing the curated ENNTT data with columns:
session_id: Parliamentary session identifier
speaker_id: Speaker's MEP ID
state: Representative's state/country
session_seq: Sequential position in session
text: Speech content
type: Corpus type identifier
Arguments
dir_path
A string. The path to the directory containing the ENNTT
data files. Must be an existing directory.
Details
The function expects a directory containing paired .dat and .tok files with
matching names, as found in the raw ENNTT data available at
https://github.com/senisioi/enntt-release.
The .dat files should contain XML-formatted metadata with attributes:
session_id: Unique identifier for the parliamentary session
mepid: Member of European Parliament ID
state: Country or state representation
seq_speaker_id: Sequential ID within the session
The .tok files should contain the corresponding text content, one entry per
line.
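The pairing described above can be illustrated with a minimal base R sketch. It is not the implementation used by curate_enntt_data(). The sample file names, the LINE element name, the lower-case attribute spelling, the attribute values, and the get_attr() helper are all assumptions made for illustration. The mapping of mepid to speaker_id and of seq_speaker_id to session_seq is inferred from the column descriptions, and the type column is left out because its derivation is not described here.

```r
# Hypothetical sample files; element name, attribute casing, and values
# are assumptions, not taken from the real ENNTT release.
tmp_dir <- file.path(tempdir(), "enntt_sketch")
dir.create(tmp_dir, showWarnings = FALSE)

writeLines(
  c(
    '<LINE session_id="ep-00-01-17" mepid="1234" state="Poland" seq_speaker_id="1"/>',
    '<LINE session_id="ep-00-01-17" mepid="5678" state="Spain" seq_speaker_id="2"/>'
  ),
  file.path(tmp_dir, "sample.dat")
)
writeLines(
  c("Thank you , Madam President .", "I agree with the previous speaker ."),
  file.path(tmp_dir, "sample.tok")
)

# Hypothetical helper: pull the value of one attribute from each metadata line.
get_attr <- function(lines, name) {
  pattern <- paste0(name, '="([^"]*)"')
  hits <- regmatches(lines, regexpr(pattern, lines))
  sub(pattern, "\\1", hits)
}

dat_lines <- readLines(file.path(tmp_dir, "sample.dat"))
tok_lines <- readLines(file.path(tmp_dir, "sample.tok"))

# Line i of the .dat file is assumed to describe line i of the .tok file.
stopifnot(length(dat_lines) == length(tok_lines))

sketch <- data.frame(
  session_id = get_attr(dat_lines, "session_id"),
  speaker_id = get_attr(dat_lines, "mepid"),
  state = get_attr(dat_lines, "state"),
  session_seq = as.integer(get_attr(dat_lines, "seq_speaker_id")),
  text = tok_lines,
  stringsAsFactors = FALSE
)
```

The line-count check reflects the one-entry-per-line pairing: if the two files differ in length, metadata and text can no longer be aligned by position.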
Examples
# Example using simulated data bundled with the package
example_data <- system.file("extdata", "simul_enntt", package = "qtkit")
curated_data <- curate_enntt_data(example_data)
str(curated_data)