cleanNLP (version 2.3.0)

cnlp_annotate: Run the annotation pipeline on a set of documents

Description

Runs the cleanNLP annotators over a given corpus of text using either the R, Java, or Python backend. The annotators to run, and how to run them, are specified by first calling one of: cnlp_init_tokenizers, cnlp_init_spacy, cnlp_init_udpipe, or cnlp_init_corenlp.
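A minimal sketch of the typical workflow, assuming the udpipe backend (which runs entirely in R, downloading a language model on first use) is available:

```r
library(cleanNLP)

# initialize the udpipe annotators; this selects the backend
# that cnlp_annotate will use by default
cnlp_init_udpipe()

# annotate a character vector of documents given as raw strings
annotation <- cnlp_annotate(c("Hello world.", "A second document."),
                            as_strings = TRUE)

# the annotation object exposes tables such as the token table
tokens <- cnlp_get_token(annotation)
head(tokens)
```

The same call works with a vector of file paths instead of strings; see the as_strings argument below.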

Usage

cnlp_annotate(input, as_strings = NULL, doc_ids = NULL,
  backend = NULL, meta = NULL, doc_var = "doc_id",
  text_var = "text")

Arguments

input

either a vector of file names to parse, a character vector with one document in each element, or a data frame. If a data frame, use doc_var and text_var to specify which columns contain the document ids and the text

as_strings

logical. Does input contain the document text itself (TRUE) or file names (FALSE)? If NULL, the default, it is set to FALSE when input points to existing files and TRUE otherwise.

doc_ids

optional character vector of document names

backend

which backend to use. Defaults to the last backend to be initialized.

meta

an optional data frame to bind to the document table

doc_var

if passing a data frame, character description of the column containing the document identifier; if this variable does not exist in the dataset, automatic names will be given (or set to NULL to force automatic names)

text_var

if passing a data frame, character description of the column containing the text
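A hedged sketch of passing a data frame, assuming hypothetical column names "id" and "content" that are mapped to the document identifier and text via doc_var and text_var:

```r
library(cleanNLP)

# the tokenizers backend is lightweight and has no external dependencies
cnlp_init_tokenizers()

corpus <- data.frame(
  id      = c("doc1", "doc2"),
  content = c("First document text.", "Second document text."),
  stringsAsFactors = FALSE
)

annotation <- cnlp_annotate(corpus, doc_var = "id", text_var = "content")
```

If doc_var names a column that does not exist, automatic document names are generated instead.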

Value

an object of class annotation

References

Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

Examples

## Not run:
annotation <- cnlp_annotate("path/to/corpus/directory")
## End(Not run)