get_entity: Access named entities from an annotation object

Description

Named entity recognition attempts to find mentions of various categories of entities within a corpus of text. Common examples include proper references to locations (e.g., "Boston" or "England") or people (e.g., "Winston Churchill"), as well as specific dates (e.g., "tomorrow" or "September 19th"), times, or numbers.

Usage

get_entity(annotation)

Arguments

annotation

an annotation object

Value

Returns an object of class c("tbl_df", "tbl", "data.frame") containing one row for every named entity mention in the corpus.

The returned data frame includes the following columns:

"id" - integer. Id of the source document.
"sid" - integer. Sentence id of the entity mention.
"tid" - integer. Token id at the start of the entity mention.
"tid_end" - integer. Token id at the end of the entity mention.
"entity_type" - character. Type of the entity; see Details below.
"entity" - character. Raw words of the named entity in the text.
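
As a rough illustration of this layout, the sketch below builds a small hand-made data frame with the same six columns and derives the length of each mention in tokens. The values are invented for illustration and are not output from get_entity. The sketch also assumes that tid_end is inclusive, which the column description suggests but does not state outright.

```r
# Hypothetical stand-in for the result of get_entity(); all values are made up.
ents <- data.frame(
  id = c(1L, 1L, 2L),
  sid = c(1L, 3L, 1L),
  tid = c(4L, 2L, 7L),
  tid_end = c(5L, 2L, 9L),
  entity_type = c("PERSON", "LOCATION", "ORGANIZATION"),
  entity = c("Winston Churchill", "Boston", "Bank of England"),
  stringsAsFactors = FALSE
)

# Number of tokens in each mention, assuming tid_end is inclusive.
ents$n_tokens <- ents$tid_end - ents$tid + 1L

# Number of entity mentions per source document.
mentions_per_doc <- table(ents$id)
```

Because each row is a single mention, per-document or per-type summaries reduce to ordinary grouping and counting on the "id" and "entity_type" columns.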

Details

When using CoreNLP, the default entity types are:

"LOCATION" - Countries, cities, states, locations, mountain ranges, bodies of water.
"PERSON" - People, including fictional.
"ORGANIZATION" - Companies, agencies, institutions, etc.
"MONEY" - Monetary values, including unit.
"PERCENT" - Percentages.
"DATE" - Absolute or relative dates or periods.
"TIME" - Times smaller than a day.

With the spaCy engine there is no generic LOCATION type, ORGANIZATION is shortened to ORG, and the following categories are added:

"NORP" - Nationalities or religious or political groups.
"FACILITY" - Buildings, airports, highways, bridges, etc.
"GPE" - Countries, cities, states.
"LOC" - Non-GPE locations, mountain ranges, bodies of water.
"PRODUCT" - Objects, vehicles, foods, etc. (Not services.)
"EVENT" - Named hurricanes, battles, wars, sports events, etc.
"WORK_OF_ART" - Titles of books, songs, etc.
"LANGUAGE" - Any named language.
"QUANTITY" - Measurements, as of weight or distance.
"ORDINAL" - "first", "second", etc.
"CARDINAL" - Numerals that do not fall under another type.
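
Because the two engines use different label sets, comparing results across them usually needs some harmonization first. The following is a minimal sketch, assuming you want to fold spaCy labels into the CoreNLP-style names. The helper harmonize_entity_type is hypothetical, and the choice to treat GPE, LOC, and FACILITY as LOCATION is an illustrative assumption rather than anything the package defines.

```r
# Hypothetical helper: map spaCy-style labels onto CoreNLP-style names.
# The mapping itself is an illustrative choice, not part of the package.
harmonize_entity_type <- function(entity_type) {
  lookup <- c(
    ORG = "ORGANIZATION",
    GPE = "LOCATION",
    LOC = "LOCATION",
    FACILITY = "LOCATION"
  )
  # Labels missing from the lookup come back as NA here.
  mapped <- unname(lookup[entity_type])
  # Keep the original label wherever no mapping exists.
  ifelse(is.na(mapped), entity_type, mapped)
}

harmonize_entity_type(c("ORG", "GPE", "PERSON", "NORP"))
```

Labels with no CoreNLP counterpart, such as NORP or WORK_OF_ART, pass through unchanged, so they can still be filtered out or kept explicitly afterwards.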

References

Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

Finkel, Jenny Rose, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370.

Examples

# NOT RUN {
library(dplyr)  # provides %>% and filter()
data(obama)

# what are the most common entity types used in the addresses?
get_entity(obama)$entity_type %>%
  table()

# what are the most common locations mentioned?
res <- get_entity(obama) %>%
  filter(entity_type == "LOCATION")
res$entity %>%
  table() %>%
  sort(decreasing = TRUE) %>%
  head(n = 25)

# what are the most common organizations mentioned?
res <- get_entity(obama) %>%
  filter(entity_type == "ORGANIZATION")
res$entity %>%
  table() %>%
  sort(decreasing = TRUE) %>%
  head(n = 25)

# }
