cleanNLP (version 1.10.0)

cleanNLP-package: cleanNLP: A Tidy Data Model for Natural Language Processing

Description

Provides a set of fast tools for converting a textual corpus into a set of normalized tables. The underlying natural language processing pipeline utilizes either the Python module spaCy or the Java-based Stanford CoreNLP library. The Python option is faster and generally easier to install; the Java option has additional annotators that are not available in spaCy.

Details

Once the package is set up, run one of init_tokenizers, init_spaCy, or init_coreNLP to load the desired NLP backend. After the chosen backend has loaded, use run_annotators to run the annotation engine over a corpus of text. Functions are then available to extract data tables from the annotation object: get_token, get_dependency, get_document, get_coreference, get_entity, get_sentence, and get_vector. See their documentation for further details. The package vignettes provide more detailed set-up information.
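A minimal sketch of the three backend choices; init_tokenizers runs in R alone, while the other two assume the spaCy Python module or the Stanford CoreNLP Java library has already been installed (see the vignettes for set-up details):

library(cleanNLP)

# pure-R backend: no external dependencies, tokenization only
init_tokenizers()

# Python backend: faster and generally easier to install
# init_spaCy()

# Java backend: adds annotators (e.g. coreference) not available in spaCy
# init_coreNLP()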

If loading annotations that have previously been saved to disk, they can be pulled back into R using read_annotation. This requires neither Java nor Python, nor does it require initializing the annotation pipeline.
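As a brief sketch of that round trip, assuming write_annotation as the complementary function for saving an annotation to disk (check its documentation for the exact arguments):

# save an annotation object as a directory of normalized tables
write_annotation(annotation, "path/to/saved/annotation")

# in a later session: reload without Java, Python, or any init_* call
annotation <- read_annotation("path/to/saved/annotation")
token <- get_token(annotation)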

Examples

# NOT RUN {
# load the annotation engine (can also use spaCy and coreNLP backends)
init_tokenizers()

# annotate your text
annotation <- run_annotators("path/to/corpus/directory")

# pull off data tables
token <- get_token(annotation)
dependency <- get_dependency(annotation)
document <- get_document(annotation)
coreference <- get_coreference(annotation)
entity <- get_entity(annotation)
sentence <- get_sentence(annotation)
vector <- get_vector(annotation)
# }