cleanNLP (version 2.3.0)

cnlp_annotate: Run the annotation pipeline on a set of documents

Description

Runs the cleanNLP annotators over a given corpus of text using either the R, Java, or Python backend. The annotators to run, and how to run them, are specified by first calling one of: cnlp_init_tokenizers, cnlp_init_spacy, cnlp_init_udpipe, or cnlp_init_corenlp.
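A minimal sketch of the typical workflow, assuming the udpipe backend (which runs entirely in R, downloading a language model on first use) is available:

```r
library(cleanNLP)

# initialize the udpipe annotators; this selects the backend
# that cnlp_annotate will use by default
cnlp_init_udpipe()

# annotate a character vector of documents given as raw strings
annotation <- cnlp_annotate(c("Hello world.", "A second document."),
                            as_strings = TRUE)

# the annotation object exposes tables such as the token table
tokens <- cnlp_get_token(annotation)
head(tokens)
```

The same call works with a vector of file paths instead of strings; see the as_strings argument below.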

Usage

cnlp_annotate(input, as_strings = NULL, doc_ids = NULL,
  backend = NULL, meta = NULL, doc_var = "doc_id",
  text_var = "text")

Arguments

input

either a vector of file names to parse, a character vector with one document in each element, or a data frame. If a data frame, use doc_var and text_var to specify which columns contain the document ids and the text

as_strings

logical. Does input contain the document text itself (TRUE) or file names (FALSE)? If NULL, the default, it is set to FALSE when input points to existing files and TRUE otherwise.

doc_ids

optional character vector of document names

backend

which backend to use. Defaults to the last backend to be initialized.

meta

an optional data frame to bind to the document table

doc_var

if passing a data frame, character description of the column containing the document identifier; if this variable does not exist in the dataset, automatic names will be given (or set to NULL to force automatic names)

text_var

if passing a data frame, character description of the column containing the text
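A hedged sketch of passing a data frame, assuming hypothetical column names "id" and "content" that are mapped to the document identifier and text via doc_var and text_var:

```r
library(cleanNLP)

# the tokenizers backend is lightweight and has no external dependencies
cnlp_init_tokenizers()

corpus <- data.frame(
  id      = c("doc1", "doc2"),
  content = c("First document text.", "Second document text."),
  stringsAsFactors = FALSE
)

annotation <- cnlp_annotate(corpus, doc_var = "id", text_var = "content")
```

If doc_var names a column that does not exist, automatic document names are generated instead.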

Value

an object of class annotation

References

Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

Examples

## Not run:
annotation <- cnlp_annotate("path/to/corpus/directory")
## End(Not run)