spacy_parse: Parse a text using spaCy

Description

The spacy_parse() function calls spaCy to both tokenize and tag the texts, and returns a data.table of the results. The function provides options on the types of tagsets (tagset_ options) either "google" or "detailed", as well as lemmatization (lemma). It provides a functionalities of dependency parsing and named entity recognition as an option. If "full_parse = TRUE" is provided, the function returns the most extensive list of the parsing results from spaCy.

Usage

spacy_parse(x, pos = TRUE, tag = FALSE, lemma = TRUE, entity = TRUE,
  dependency = FALSE, multithread = TRUE, ...)

Arguments

a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropensci/tif)

pos

logical whether to return universal dependency POS tagset http://universaldependencies.org/u/pos/)

tag

logical whether to return detailed part-of-speech tags, for the langage model en, it uses the OntoNotes 5 version of the Penn Treebank tag set (https://spacy.io/docs/usage/pos-tagging#pos-schemes). Annotation specifications for other available languages are available on the spaCy website (https://spacy.io/api/annotation).

lemma

logical; inlucde lemmatized tokens in the output (lemmatization may not work properly for non-English models)

entity

logical; if TRUE, report named entities

dependency

logical; if TRUE, analyze and return dependencies

multithread

logical; If true, the processing is parallelized using pipe functionality of spacy (https://spacy.io/api/pipe).

...

not used directly

Value

a data.frame of tokenized, parsed, and annotated tokens

Examples

Run this code

# NOT RUN {
spacy_initialize()
# See Chap 5.1 of the NLTK book, http://www.nltk.org/book/ch05.html
txt <- "And now for something completely different."
spacy_parse(txt)
spacy_parse(txt, pos = TRUE, tag = TRUE)
spacy_parse(txt, dependency = TRUE)

txt2 <- c(doc1 = "The fast cat catches mice.\\nThe quick brown dog jumped.", 
          doc2 = "This is the second document.",
          doc3 = "This is a \\\"quoted\\\" text." )
spacy_parse(txt2, entity = TRUE, dependency = TRUE)
# }

Run the code above in your browser using DataLab