VSS: A small corpus of very short stories with linguistic annotations

Description

This data set contains a small corpus (8043 tokens) of short stories from the collection Very Short Stories (VSS, see http://www.schtepf.de/History/pages/stories.html). The text was automatically segmented (tokenised) and annotated with part-of-speech tags (from the Penn tagset) and lemmas (base forms), using the IMS TreeTagger (Schmid 1994) and a custom lemmatiser.

Usage

VSS

Format

A data set with 8043 rows, one per token, and the following columns:

word:: the word form (or surface form) of the token
pos:: the part-of-speech tag of the token (Penn tagset)
lemma:: the lemma (or base form) of the token
sentence:: number of the sentence in which the token occurs (integer)
story:: title of the story to which the token belongs (factor)
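
The column layout above can be illustrated with a minimal sketch. The rows below are invented for illustration and are not taken from the corpus, and the name `toy` is hypothetical; only the five column names and the two stated types (integer `sentence`, factor `story`) come from the description. Sentences are counted within each story, so the sketch does not assume whether sentence numbers restart per story.

```r
# Toy stand-in with the same five columns as VSS (rows are invented).
toy <- data.frame(
  word     = c("The", "cats", "slept", "Dogs", "bark"),
  pos      = c("DT", "NNS", "VBD", "NNS", "VBP"),
  lemma    = c("the", "cat", "sleep", "dog", "bark"),
  sentence = c(1L, 1L, 1L, 2L, 2L),
  story    = factor(c("Story A", "Story A", "Story A", "Story B", "Story B")),
  stringsAsFactors = FALSE
)

# Number of tokens (rows) per story.
tokens_per_story <- table(toy$story)

# Number of distinct sentences per story.
sentences_per_story <- tapply(toy$sentence, toy$story,
                              function(s) length(unique(s)))

# Distinct lemmas, i.e. the vocabulary after lemmatisation.
lemma_types <- sort(unique(toy$lemma))
```

The same three expressions should carry over to the full data set by replacing `toy` with `VSS`, provided the data set is loaded as a data frame.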

Details

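The Penn tagset, in the variant used by the TreeTagger (which writes `NP`, `NPS`, `PP` and `PP$` where the original Penn Treebank tagset has `NNP`, `NNPS`, `PRP` and `PRP$`), defines the following part-of-speech tags: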

`CC`	Coordinating conjunction
`CD`	Cardinal number
`DT`	Determiner
`EX`	Existential there
`FW`	Foreign word
`IN`	Preposition or subordinating conjunction
`JJ`	Adjective
`JJR`	Adjective, comparative
`JJS`	Adjective, superlative
`LS`	List item marker
`MD`	Modal
`NN`	Noun, singular or mass
`NNS`	Noun, plural
`NP`	Proper noun, singular
`NPS`	Proper noun, plural
`PDT`	Predeterminer
`POS`	Possessive ending
`PP`	Personal pronoun
`PP$`	Possessive pronoun
`RB`	Adverb
`RBR`	Adverb, comparative
`RBS`	Adverb, superlative
`RP`	Particle
`SYM`	Symbol
`TO`	to
`UH`	Interjection
`VB`	Verb, base form
`VBD`	Verb, past tense
`VBG`	Verb, gerund or present participle
`VBN`	Verb, past participle
`VBP`	Verb, non-3rd person singular present
`VBZ`	Verb, 3rd person singular present
`WDT`	Wh-determiner
`WP`	Wh-pronoun
`WP$`	Possessive wh-pronoun
`WRB`	Wh-adverb
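
Because related tags in the list share a prefix (`VB` for verbs, `JJ` for adjectives, `RB` for adverbs, `NN` for common nouns), the fine-grained tags can be collapsed into coarse word classes by pattern matching. The following is a minimal sketch under stated assumptions: the function name `coarse_class`, the choice of classes, and the example tag vector are illustrative and not part of the data set. Proper nouns (`NP`, `NPS`) are grouped with nouns, and everything else, including modals (`MD`), falls into "other".

```r
# Map fine-grained tags to coarse classes by prefix (illustrative grouping).
coarse_class <- function(tag) {
  ifelse(grepl("^VB", tag), "verb",
  ifelse(grepl("^(NN|NP)", tag), "noun",
  ifelse(grepl("^JJ", tag), "adjective",
  ifelse(grepl("^RB", tag), "adverb", "other"))))
}

# Invented tag sequence, standing in for a pos column.
tags <- c("DT", "NNS", "VBD", "NP", "VBZ", "JJ", "NN", "RB", "PP", "MD")

# Frequency of each coarse class in the sequence.
class_counts <- table(coarse_class(tags))
```

Applied to the real data, the same call would take the `pos` column as its argument, for example `table(coarse_class(VSS$pos))`.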

References

Schmid, Helmut (1994). Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing (NeMLaP), pages 44-49.