cleanNLP (version 1.5.1)

get_token: Access tokens from an annotation object

Description

This function grabs the table of tokens from an annotation object. There is exactly one row for each token found in the raw text. Tokens include words as well as punctuation marks. A token called ROOT is also added to each sentence; it is particularly useful when interacting with the table of dependencies.

Usage

get_token(annotation)

Arguments

annotation

an annotation object

Value

Returns an object of class c("tbl_df", "tbl", "data.frame") containing one row for every token in the corpus. The root of each sentence is included as its own token.

The returned data frame includes at a minimum the following columns:

  • "id" - integer. Id of the source document.

  • "sid" - integer. Sentence id, starting from 0.

  • "tid" - integer. Token id, with the root of the sentence starting at 0.

  • "word" - character. Raw word in the input text.

  • "lemma" - character. Lemmatized form of the token.

  • "upos" - character. Universal part of speech code.

  • "pos" - character. Language-specific part of speech code; uses the Penn Treebank codes.

  • "cid" - integer. Character offset at the start of the word in the original document.
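
As a rough illustration of the column layout described above, the following sketch builds a small hand-made data frame that stands in for the output of get_token(). The table and its values (including the tag strings and the empty fields on the ROOT rows) are assumptions made for illustration, not output from the package. It then drops the ROOT rows by relying on the documented convention that the root of each sentence has tid equal to 0.

```r
# Hypothetical stand-in for the result of get_token(); all values are
# made up to mirror the documented columns, not produced by cleanNLP.
tokens <- data.frame(
  id    = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
  sid   = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L),
  tid   = c(0L, 1L, 2L, 3L, 0L, 1L, 2L, 3L),
  word  = c("ROOT", "Dogs", "bark", ".", "ROOT", "Cats", "sleep", "."),
  lemma = c("ROOT", "dog", "bark", ".", "ROOT", "cat", "sleep", "."),
  upos  = c("", "NOUN", "VERB", "PUNCT", "", "NOUN", "VERB", "PUNCT"),
  pos   = c("", "NNS", "VBP", ".", "", "NNS", "VBP", "."),
  cid   = c(NA, 0L, 5L, 9L, NA, 11L, 16L, 21L),
  stringsAsFactors = FALSE
)

# The root of each sentence has tid == 0, so keeping tid > 0 leaves
# only the tokens that actually occur in the raw text.
words <- tokens[tokens$tid > 0, ]

# Count the remaining tokens by universal part of speech code.
upos_counts <- table(words$upos)
print(upos_counts)
```

Filtering on tid rather than on the word column avoids dropping a genuine token that happens to be spelled "ROOT" in the input text.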

References

Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

Toutanova, Kristina, and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.

Toutanova, Kristina, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In: Proceedings of HLT-NAACL 2003, pp. 252-259.

Examples

# NOT RUN {
library(cleanNLP)
library(dplyr)  # provides %>%, group_by() and summarize()

data(obama)

# find average sentence length from each address
get_token(obama) %>%
  group_by(id, sid) %>%
  summarize(sentence_length = max(tid)) %>%
  summarize(avg_sentence_length = mean(sentence_length))
# }
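
The example above uses max(tid) as the sentence length. This works because the root of each sentence has tid 0, so the largest token id in a sentence equals the number of non-root tokens. The sketch below checks that reasoning in base R on a small hand-made table; the table is a hypothetical stand-in for get_token() output that keeps only the id, sid and tid columns, and its values are invented for illustration.

```r
# Hypothetical token table: document 1 has two sentences (3 and 2 tokens),
# document 2 has one sentence (4 tokens). Each sentence starts with a
# root row at tid == 0.
tokens <- data.frame(
  id  = c(1L, 1L, 1L, 1L,  1L, 1L, 1L,  2L, 2L, 2L, 2L, 2L),
  sid = c(0L, 0L, 0L, 0L,  1L, 1L, 1L,  0L, 0L, 0L, 0L, 0L),
  tid = c(0L, 1L, 2L, 3L,  0L, 1L, 2L,  0L, 1L, 2L, 3L, 4L)
)

# Sentence length as the largest token id within each (id, sid) pair.
sentence_length <- aggregate(tid ~ id + sid, data = tokens, FUN = max)
names(sentence_length)[3] <- "sentence_length"

# Cross-check: count the non-root rows in each sentence directly.
non_root <- aggregate(tid ~ id + sid, data = tokens[tokens$tid > 0, ],
                      FUN = length)

# Average sentence length per document.
avg_sentence_length <- aggregate(sentence_length ~ id,
                                 data = sentence_length, FUN = mean)
names(avg_sentence_length)[2] <- "avg_sentence_length"
print(avg_sentence_length)
```

Here the two approaches agree, which is what makes max(tid) a convenient shortcut; counting rows without first removing the root would overstate each sentence length by one.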