cleanNLP v1.10.0

A Tidy Data Model for Natural Language Processing

Provides a set of fast tools for converting a textual corpus into a set of normalized tables. Users may make use of a Python back end with 'spaCy' or the Java back end 'CoreNLP'. A minimal back end with no external dependencies is also provided. Exposed annotation tasks include tokenization, part of speech tagging, named entity recognition, entity linking, sentiment analysis, dependency parsing, coreference resolution, and word embeddings. Summary statistics regarding token unigram, part of speech tag, and dependency type frequencies are also included to assist with analyses.
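
To make the idea of "normalized tables" concrete, here is a minimal, language-neutral sketch in Python. It is not cleanNLP's implementation (the package itself is written in R), and the function names `annotate` and `unigram_counts`, the naive regex-based splitting, and the column names `id`, `sid`, `tid`, and `word` are all assumptions chosen for illustration. It shows a corpus turned into a document table and a token table linked by a shared document id, followed by a unigram frequency summary.

```python
import re


def annotate(texts):
    """Split raw texts into two normalized tables: documents and tokens.

    Hypothetical helper for illustration only; not part of cleanNLP.
    """
    documents, tokens = [], []
    for doc_id, text in enumerate(texts):
        # One row per document.
        documents.append({"id": doc_id, "n_chars": len(text)})
        # Naive sentence split: whitespace that follows ., ! or ?
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        for sid, sentence in enumerate(sentences, start=1):
            # Words and punctuation marks become separate tokens.
            words = re.findall(r"\w+|[^\w\s]", sentence)
            for tid, word in enumerate(words, start=1):
                # One row per token, keyed by document, sentence, and token id.
                tokens.append({"id": doc_id, "sid": sid, "tid": tid, "word": word})
    return documents, tokens


def unigram_counts(tokens):
    """Count lower-cased token frequencies from the token table."""
    counts = {}
    for row in tokens:
        key = row["word"].lower()
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    docs, toks = annotate(["Hello world. Bye!", "One more."])
    for row in toks:
        print(row)
    print(unigram_counts(toks))
```

Because every token row carries the document id, the token table can be joined back to document-level metadata, which is the point of keeping the tables normalized.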

Functions in cleanNLP

Name Description
download_core_nlp Download Java files needed for CoreNLP
extract_documents Extract documents from an annotation object
get_coreference Access coreferences from an annotation object
get_dependency Access dependencies from an annotation object
get_token Access tokens from an annotation object
get_vector Access word embedding vector from an annotation object
cleanNLP-package cleanNLP: A Tidy Data Model for Natural Language Processing
combine_documents Combine a set of annotations
get_sentence Access sentence-level annotations
get_tfidf Construct the TF-IDF Matrix from Annotation or Data Frame
read_annotation Read annotation files from disk
run_annotators Run the annotation pipeline on a set of documents
init_coreNLP Interface for initializing the CoreNLP backend
init_spaCy Interface for initializing the spaCy backend
pos_frequency Universal Part of Speech Code Frequencies
print.annotation Print a summary of an annotation object
dep_frequency Universal Dependency Frequencies
doc_id_reset Reset document ids
init_tokenizers Interface for initializing the tokenizers backend
from_CoNLL Reads a CoNLL-U or CoNLL-X File
get_combine One Table Summary of an Annotation Object
get_document Access document metadata from an annotation object
get_entity Access named entities from an annotation object
obama Annotation of Barack Obama's State of the Union Addresses
tidy_pca Compute Principal Components and store as a Data Frame
to_CoNNL Returns a CoNLL-U Document
word_frequency Most frequent English words
write_annotation Write annotation files to disk

Type Package
SystemRequirements Python (>= 2.7.0); spaCy (>= 1.8); Java (>= 7.0); Stanford CoreNLP (>= 3.7.0)
License LGPL-2
LazyData true
VignetteBuilder knitr
RoxygenNote 6.0.1
NeedsCompilation no
Packaged 2017-07-01 10:38:36 UTC; taylor
Repository CRAN
Date/Publication 2017-07-01 14:48:49 UTC