data_corpus_EPcoaldebate: Crowd-labelled sentence corpus from a 2010 EP debate on coal subsidies

Description

A multilingual text corpus of speeches from a European Parliament debate on coal subsidies in 2010, with individual crowd codings as the unit of observation. The sentences are drawn from officially translated speeches from a debate over a European Parliament debate concerning a Commission report proposing an extension to a regulation permitting state aid to uncompetitive coal mines.

Each speech is available in six languages: English, German, Greek, Italian, Polish and Spanish. The unit of observation is the individual crowd coding of each natural sentence. For more information on the coding approach see Benoit et al. (2016).

Usage

data_corpus_EPcoaldebate

Arguments

Format

The corpus consists of 16,806 documents (i.e. codings of a sentence) and includes the following document-level variables:

sentence_id: character; a unique identifier for each sentence
crowd_subsidy_label: factor; whether a coder labelled the sentence as "Pro-Subsidy", "Anti-Subsidy" or "Neutral or inapplicable"
language: factor; the language (translation) of the speech
name_last: character; speaker's last name
name_first: character; speaker's first name
ep_group: factor; abbreviation of the EP party group of the speaker
country: factor; the speaker's country of origin
vote: factor; the speaker's vote on the proposal (For/Against/Abstain/NA)
coder_id: character; a unique identifier for each crowd coder
coder_trust: numeric; the "trust score" from the Crowdflower platform used to code the sentences, which can theoretically range between 0 and 1. Only coders with trust scores above 0.8 are included in the corpus.

References

Benoit, K., Conway, D., Lauderdale, B.E., Laver, M., & Mikhaylov, S. (2016). Crowd-sourced Text Analysis: Reproducible and Agile Production of Political Data. American Political Science Review, 100,(2), 278--295.