Access tokens from an annotation object

This function grabs the table of tokens from an annotation object. There is exactly one row for each token found in the raw text. Tokens include words as well as punctuation marks. If include_root is set to TRUE, a token called ROOT is also added to each sentence; it is particularly useful when interacting with the table of dependencies.

get_token(annotation, include_root = FALSE, combine = FALSE,
  remove_na = combine, spaces = FALSE)

annotation - an annotation object.


include_root - boolean. Should the sentence root be included? Set to FALSE by default.


combine - boolean. Should other tables (dependencies, sentences, and entities) be merged with the tokens? Set to FALSE by default.


remove_na - boolean. Should columns with only missing values be removed? This is mostly useful when working with the combine option, and by default is equal to whatever combine is set to.


spaces - boolean. Should a column be included that gives the number of spaces that should come after the word? Useful for reconstructing the original text. Set to FALSE by default.
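
As a rough illustration of how a per-token space count can be used to rebuild text, the sketch below works on a small hand-built table rather than on real get_token output. The column name spaces and all of the values are assumptions made for this example.

```r
# Toy token table; the `spaces` column name and all values are assumed
# for illustration and are not real get_token() output.
tokens <- data.frame(
  word   = c("Hello", ",", "world", "!"),
  spaces = c(0L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Append the stated number of spaces after each token, then concatenate.
rebuilt <- paste0(tokens$word, strrep(" ", tokens$spaces), collapse = "")
print(rebuilt)
```

Storing a count rather than the whitespace itself keeps the column integer-valued; text whose original whitespace was not plain spaces would not round-trip exactly under this approach.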


Returns an object of class c("tbl_df", "tbl", "data.frame") containing one row for every token in the corpus. When include_root is TRUE, the root of each sentence is included as its own token.

The returned data frame includes at least the following columns. If remove_na has been selected, only the first four columns are guaranteed to be in the output; the rest depend on which annotators were run:

  • "id" - integer. Id of the source document.

  • "sid" - integer. Sentence id, starting from 0.

  • "tid" - integer. Token id, with the root of the sentence starting at 0.

  • "word" - character. Raw word in the input text.

  • "lemma" - character. Lemmatized form of the token.

  • "upos" - character. Universal part of speech code.

  • "pos" - character. Language-specific part of speech code; uses the Penn Treebank codes.

  • "cid" - integer. Character offset at the start of the word in the original document.
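
To make the effect of remove_na on this column list more concrete, here is a minimal base R sketch on a hand-built table in which the lemma column is entirely missing, as if no lemmatizer had been run. The values are invented, and the column-dropping step only mimics the documented behaviour; it should not be read as the package's implementation.

```r
# Toy token table with a subset of the documented columns; values are
# invented and `lemma` is deliberately all NA.
tokens <- data.frame(
  id    = c(1L, 1L, 1L),
  sid   = c(0L, 0L, 0L),
  tid   = c(1L, 2L, 3L),
  word  = c("Dogs", "bark", "."),
  lemma = NA_character_,
  upos  = c("NOUN", "VERB", "PUNCT"),
  stringsAsFactors = FALSE
)

# Keep only columns that have at least one non-missing value.
keep <- colSums(!is.na(tokens)) > 0
trimmed <- tokens[, keep, drop = FALSE]
print(names(trimmed))
```

Code that consumes the table after such trimming may want to check for a column with `"lemma" %in% names(trimmed)` before using it.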


Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.

Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In: Proceedings of HLT-NAACL 2003, pp. 252-259.


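library(cleanNLP)
library(dplyr)

# load the obama example annotation shipped with the package
data(obama)

# find average sentence length from each address
get_token(obama) %>%
  group_by(id, sid) %>%
  summarize(sentence_length = max(tid)) %>%
  summarize(avg_sentence_length = mean(sentence_length))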
Documentation reproduced from package cleanNLP, version 1.10.0, License: LGPL-2
