spacy_parse: Parse a text using spaCy

Description

The spacy_parse() function calls spaCy to both tokenize and tag the texts, and returns a data.table of the results. The function provides options on the types of tagsets (tagset_ options) either "google" or "detailed", as well as lemmatization (lemma). It provides a functionalities of dependency parsing and named entity recognition as an option. If "full_parse = TRUE" is provided, the function returns the most extensive list of the parsing results from spaCy.

Usage

spacy_parse(x, pos = TRUE, tag = FALSE, lemma = TRUE,
  entity = TRUE, dependency = FALSE, nounphrase = FALSE,
  multithread = TRUE, additional_attributes = NULL, ...)

Arguments

a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropensci/tif)

pos

logical whether to return universal dependency POS tagset http://universaldependencies.org/u/pos/)

tag

logical whether to return detailed part-of-speech tags, for the language model en, it uses the OntoNotes 5 version of the Penn Treebank tag set (https://spacy.io/docs/usage/pos-tagging#pos-schemes). Annotation specifications for other available languages are available on the spaCy website (https://spacy.io/api/annotation).

lemma

logical; include lemmatized tokens in the output (lemmatization may not work properly for non-English models)

entity

logical; if TRUE, report named entities

dependency

logical; if TRUE, analyse and tag dependencies

nounphrase

logical; if TRUE, analyse and tag noun phrases tags

multithread

logical; If TRUE, the processing is parallelized using pipe functionality of spaCy (https://spacy.io/api/pipe)

additional_attributes

a character vector; this option is for extracting additional attributes of tokens from spaCy. When the names of attributes are supplied, the output data.frame will contain additional variables corresponding to the names of the attributes. For instance, when additional_attributes = c("is_punct"), the output will include an additional variable named is_punct, which is a Boolean (in R, logical) variable indicating whether the token is a punctuation. A full list of available attributes is available from https://spacy.io/api/token#attributes.

...

not used directly

Value

a data.frame of tokenized, parsed, and annotated tokens

Examples

Run this code

# NOT RUN {
spacy_initialize()
# See Chap 5.1 of the NLTK book, http://www.nltk.org/book/ch05.html
txt <- "And now for something completely different."
spacy_parse(txt)
spacy_parse(txt, pos = TRUE, tag = TRUE)
spacy_parse(txt, dependency = TRUE)

txt2 <- c(doc1 = "The fast cat catches mice.\\nThe quick brown dog jumped.", 
          doc2 = "This is the second document.",
          doc3 = "This is a \\\"quoted\\\" text." )
spacy_parse(txt2, entity = TRUE, dependency = TRUE)

txt3 <- "We analyzed the Supreme Court with three natural language processing tools." 
spacy_parse(txt3, entity = TRUE, nounphrase = TRUE)
spacy_parse(txt3, additional_attributes = c("like_num", "is_punct"))
# }

Run the code above in your browser using DataLab