Create text documents from CoNLL-style files.
CoNLLTextDocument(con, encoding = "unknown", meta = list())
con: a connection object or a character string. See scan() for details.
encoding: the encoding to be assumed for input strings. See scan() for details.
meta: a named or empty list of document metadata tag-value pairs.
An object inheriting from "CoNLLTextDocument" and "TextDocument".
CoNLL-style files use an extended tabular format where empty lines
separate sentences, and non-empty lines consist of whitespace-separated
columns giving the word tokens and their annotations.
In principle, these annotations can vary from corpus to corpus: the
current version of CoNLLTextDocument() assumes a fixed set of 3
columns giving, respectively, the word token and its POS and chunk
tags.
The lines are read from the given connection and split into fields
using scan(). From this, a suitable representation of
the provided information is obtained, and returned as a CoNLL text
document object inheriting from classes "CoNLLTextDocument" and
"TextDocument".
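The reading step described above can be sketched in base R. This is only an illustration of the idea and not necessarily how the package implements it: the sample lines, the column names word, pos and chunk, and the variable names are all assumptions made for this example.

```r
# Hypothetical sample in the three-column layout: token, POS tag, chunk tag.
# The empty string stands for the blank line that separates two sentences.
sample_lines <- c("He PRP B-NP",
                  "reckons VBZ B-VP",
                  "",
                  "Stocks NNS B-NP",
                  "fell VBD B-VP")

# Split every line into three fields; keep blank lines as empty records.
con <- textConnection(sample_lines)
records <- scan(con,
                what = list(word = "", pos = "", chunk = ""),
                quote = "", fill = TRUE, blank.lines.skip = FALSE,
                quiet = TRUE)
close(con)

# Records with an empty token mark sentence boundaries: number the
# sentences with a running count, then drop the boundary records.
blank <- records$word == ""
tab <- data.frame(sent = cumsum(blank) + 1L,
                  word = records$word,
                  pos = records$pos,
                  chunk = records$chunk,
                  stringsAsFactors = FALSE)[!blank, ]

# Tokens grouped by sentence.
sentences <- split(tab$word, tab$sent)
```

Keeping the blank lines as records (blank.lines.skip = FALSE together with fill = TRUE) is what preserves the sentence boundaries; with the default settings scan() would drop them.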
There are methods for generics words(), sents(), tagged_words(),
tagged_sents(), and chunked_sents() (as well as as.character())
for class "CoNLLTextDocument",
which should be used to access the text in such text document
objects.
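A minimal usage sketch of these accessors follows. It assumes that package NLP is installed; the sample sentences and the temporary file are made up for illustration, and the exact printed form of the results may differ between package versions.

```r
library(NLP)

# Write a small, hypothetical CoNLL-style sample to a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("He PRP B-NP",
             "reckons VBZ B-VP",
             "the DT B-NP",
             "deficit NN I-NP",
             ". . O",
             "",
             "Stocks NNS B-NP",
             "fell VBD B-VP",
             ". . O"),
           tmp)

# Read the file; a character string is treated as a file name.
doc <- CoNLLTextDocument(tmp)

w  <- words(doc)          # all word tokens
s  <- sents(doc)          # word tokens grouped by sentence
tw <- tagged_words(doc)   # word tokens with their POS tags
ts <- tagged_sents(doc)   # tagged tokens grouped by sentence
cs <- chunked_sents(doc)  # one chunk tree per sentence

unlink(tmp)
```

Going through the accessors rather than the internal representation keeps code independent of how the document object stores its data.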
The methods for generics tagged_words() and tagged_sents()
provide a mechanism for mapping POS tags via the map argument,
see section Details in the help page for tagged_words()
for more information.
The POS tagset used will be inferred from the POS_tagset
metadata element of the CoNLL-style text document.
TextDocument for basic information on the text document
infrastructure employed by package NLP.
http://ifarm.nl/signll/conll/ for general information about CoNLL (Conference on Natural Language Learning), the yearly meeting of the Special Interest Group on Natural Language Learning of the Association for Computational Linguistics.
http://www.cnts.ua.ac.be/conll2000/chunking/ for the CoNLL 2000
chunking task, and training and test data sets which can be read in
using CoNLLTextDocument().