AnnotatedPlainTextDocument: Annotated Plain Text Documents

Description

Create annotated plain text documents from plain text and collections of annotations for this text.

Usage

AnnotatedPlainTextDocument(s, a, meta = list())
annotation(x)

Arguments

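s: a String object, or something coercible to this using as.String() (e.g., a character string with appropriate encoding information).

a: an Annotation object with annotations for s.

meta: a list of document metadata (empty by default).

x: an annotated plain text document, i.e., an object inheriting from "AnnotatedPlainTextDocument".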

Value

For AnnotatedPlainTextDocument(), an annotated plain text document object inheriting from "AnnotatedPlainTextDocument" and "TextDocument".

For annotation(), an Annotation object.

Details

Annotated plain text documents combine plain text with annotations for the text.

A typical workflow is to use annotate() with suitable annotator pipelines to obtain the annotations, and then use AnnotatedPlainTextDocument() to combine these with the text being annotated. This yields an object inheriting from "AnnotatedPlainTextDocument" and "TextDocument", from which the text and annotations can be obtained using, respectively, as.character() and annotation().
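The workflow above can be sketched without any external annotator, using only the simple annotator generators shipped with package 'NLP'. This is a minimal illustration, not the recommended pipeline: the sentence tokenizer below is a deliberately naive, hypothetical helper that treats every period as a sentence end, and words are split on whitespace only.

```r
library(NLP)

s <- paste("Stanford University is located in California.",
           "It is a great university.")

## Naive sentence tokenizer (illustrative assumption): a sentence is a
## run of text starting at a non-space character and ending at a period.
sent_tokenizer <- function(s) {
    s <- as.String(s)
    m <- gregexpr("[^[:space:]][^.]*\\.", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}

## Wrap the tokenizers into annotators.  The word token annotator
## needs sentence annotations, so the sentence annotator comes first.
pipeline <- list(Simple_Sent_Token_Annotator(sent_tokenizer),
                 Simple_Word_Token_Annotator(whitespace_tokenizer))

## Compute the annotations, then combine them with the text.
a <- annotate(s, pipeline)
doc <- AnnotatedPlainTextDocument(s, a)

## Recover the text and the annotations from the document.
as.character(doc)
annotation(doc)

## Structured views that need only sentence and word annotations.
sents(doc)
words(doc)
```

Views such as tagged_sents() or parsed_sents() would fail on this document, because the simple pipeline provides no POS or parse annotations.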

There are methods for class "AnnotatedPlainTextDocument" and generics words(), sents(), paras(), tagged_words(), tagged_sents(), tagged_paras(), chunked_sents(), parsed_sents() and parsed_paras() providing structured views of the text in such documents. These all require the necessary annotations to be available in the annotation object used.

The methods for generics tagged_words(), tagged_sents() and tagged_paras() provide a mechanism for mapping POS tags via the map argument; see section Details in the help page for tagged_words() for more information. The POS tagset used will be inferred from the POS_tagset metadata element of the annotation object used.
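As a small sketch of the tag mapping, the following inspects the tagset recorded in the annotation metadata and compares unmapped and mapped views. It assumes the pre-built document "stanford.rds" shipped with package 'NLP' (the same one used in the Examples) and that its annotation object carries a POS_tagset metadata element.

```r
library(NLP)

## Pre-built annotated document shipped with package 'NLP'.
doc <- readRDS(system.file("texts", "stanford.rds", package = "NLP"))

## Tagset name recorded in the annotation metadata (assumed present);
## this is what the tagged_*() methods consult when mapping tags.
meta(annotation(doc), "POS_tagset")

## Original tags versus tags mapped to the universal POS tagset.
original <- tagged_sents(doc)
mapped <- tagged_sents(doc, map = Universal_POS_tags_map)

original[[1L]]
mapped[[1L]]
```

Mapping changes only the tags, so both views contain the same number of sentences and the same words.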

Examples

# NOT RUN {
## Use a pre-built annotated plain text document obtained by employing an
## annotator pipeline from package 'StanfordCoreNLP', available from the
## repository at <https://datacube.wu.ac.at>, using the following code:
##   require("StanfordCoreNLP")
##   s <- paste("Stanford University is located in California.",
##              "It is a great university.")
##   p <- StanfordCoreNLP_Pipeline(c("pos", "lemma", "parse"))
##   doc <- AnnotatedPlainTextDocument(s, p(s))

doc <- readRDS(system.file("texts", "stanford.rds", package = "NLP"))

doc

## Extract available annotation:
a <- annotation(doc)
a

## Structured views:
sents(doc)
tagged_sents(doc)
tagged_sents(doc, map = Universal_POS_tags_map)
parsed_sents(doc)

## Add (trivial) paragraph annotation:
s <- as.character(doc)
a <- annotate(s, Simple_Para_Token_Annotator(blankline_tokenizer), a)
doc <- AnnotatedPlainTextDocument(s, a)
## Structured view:
paras(doc)
# }