This data set contains a small corpus (8043 tokens) of short stories from the collection Very Short Stories (VSS, see http://www.schtepf.de/History/pages/stories.html). The text was automatically segmented (tokenised) and annotated with part-of-speech tags (from the Penn tagset) and lemmas (base forms), using the IMS TreeTagger (Schmid 1994) and a custom lemmatizer.
VSS
A data set with 8043 rows corresponding to tokens and the following columns:
word
: the word form (or surface form) of the token
pos
: the part-of-speech tag of the token (Penn tagset)
lemma
: the lemma (or base form) of the token
sentence
: number of the sentence in which the token occurs (integer)
story
: title of the story to which the token belongs (factor)
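As a rough illustration of this layout, the sketch below models a few tokens as records with the five columns and reassembles sentences from them. It is written in Python on the assumption that the table has been exported from its original environment. The `Token` class, the `sentences` helper and the sample rows (including the story title) are invented for illustration and are not taken from the corpus.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    """One row of the table: a single token with its annotation."""
    word: str      # surface form
    pos: str       # part-of-speech tag
    lemma: str     # base form
    sentence: int  # sentence number
    story: str     # story title


# Hypothetical sample rows, NOT taken from the actual corpus.
rows = [
    Token("The", "DT", "the", 1, "Example Story"),
    Token("dogs", "NNS", "dog", 1, "Example Story"),
    Token("barked", "VBD", "bark", 1, "Example Story"),
    Token("They", "PP", "they", 2, "Example Story"),
    Token("ran", "VBD", "run", 2, "Example Story"),
]


def sentences(tokens):
    """Group word forms by (story, sentence number), keeping token order."""
    grouped = defaultdict(list)
    for tok in tokens:
        grouped[(tok.story, tok.sentence)].append(tok.word)
    return dict(grouped)


by_sentence = sentences(rows)
print(by_sentence[("Example Story", 1)])
```

Grouping on the pair (story, sentence) rather than on the sentence number alone is a cautious choice: it gives the right result whether sentence numbers restart with each story or run through the whole corpus.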
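The data set uses the TreeTagger variant of the Penn tagset, which includes the following part-of-speech tags (NP, NPS, PP and PP$ correspond to NNP, NNPS, PRP and PRP$ in the standard Penn Treebank tagset):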
| Tag | Description |
|-----|-------------|
| CC | Coordinating conjunction |
| CD | Cardinal number |
| DT | Determiner |
| EX | Existential there |
| FW | Foreign word |
| IN | Preposition or subordinating conjunction |
| JJ | Adjective |
| JJR | Adjective, comparative |
| JJS | Adjective, superlative |
| LS | List item marker |
| MD | Modal |
| NN | Noun, singular or mass |
| NNS | Noun, plural |
| NP | Proper noun, singular |
| NPS | Proper noun, plural |
| PDT | Predeterminer |
| POS | Possessive ending |
| PP | Personal pronoun |
| PP$ | Possessive pronoun |
| RB | Adverb |
| RBR | Adverb, comparative |
| RBS | Adverb, superlative |
| RP | Particle |
| SYM | Symbol |
| TO | to |
| UH | Interjection |
| VB | Verb, base form |
| VBD | Verb, past tense |
| VBG | Verb, gerund or present participle |
| VBN | Verb, past participle |
| VBP | Verb, non-3rd person singular present |
| VBZ | Verb, 3rd person singular present |
| WDT | Wh-determiner |
| WP | Wh-pronoun |
| WP$ | Possessive wh-pronoun |
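Because related tags share a prefix (for example, all verb tags begin with VB), they can be collapsed into coarser word classes for simple frequency counts. The following Python sketch shows one way this might be done. The prefix-to-class grouping in `COARSE`, the `coarse_class` helper and the sample tag sequence are assumptions made for illustration; they are not part of the tagset or the data set.

```python
from collections import Counter

# Hypothetical sequence of tags, NOT taken from the actual corpus.
tags = ["DT", "NNS", "VBD", "PP", "VBD", "JJ", "NN", "NP"]

# Assumed coarse classes, keyed by tag prefix. This grouping is an
# illustrative choice and not something the tagset itself defines.
COARSE = [
    ("VB", "verb"),       # VB, VBD, VBG, VBN, VBP, VBZ
    ("NN", "noun"),       # NN, NNS
    ("NP", "noun"),       # NP, NPS (proper nouns)
    ("JJ", "adjective"),  # JJ, JJR, JJS
    ("RB", "adverb"),     # RB, RBR, RBS
]


def coarse_class(tag):
    """Return the coarse class for a tag, or "other" if no prefix matches."""
    for prefix, label in COARSE:
        if tag.startswith(prefix):
            return label
    return "other"


counts = Counter(coarse_class(t) for t in tags)
print(counts.most_common())
```

Tags without a listed prefix, such as DT, PP or MD, fall into the catch-all class "other" in this sketch.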
Schmid, Helmut (1994). Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing (NeMLaP), pages 44-49.