extract_phrases: Extract Phrases

Description

Extracts phrases from a list of POS-tagged documents using the "FilterFSA" method described in Handler et al. 2016.

Usage

extract_phrases(POS_tagged_documents, regex = "(A|N)*N(PD*(A|N)*N)*",
  maximum_ngram_length = 8, minimum_ngram_length = 2,
  return_phrase_vectors = TRUE, return_tag_sequences = FALSE)

Arguments

POS_tagged_documents

A list object of the form produced by the `POS_tag_documents()` function, with either Penn TreeBank or Petrov/Gimpel style tags.

regex

The regular expression used to find phrases. Defaults to "(A|N)*N(PD*(A|N)*N)*", the "SimpleNP" grammar in Handler et al. 2016. A vector of regular expressions may also be provided if the user wishes to match more than one.

maximum_ngram_length

The maximum length, in tokens, of the phrases returned. Defaults to 8. Increasing this number can greatly increase runtime.

minimum_ngram_length

The minimum length, in tokens, of the phrases returned. Defaults to 2. Increase it to drop shorter phrases, or set it to 1 to include unigrams.

return_phrase_vectors

Logical indicating whether a list of phrase vectors (with each entry containing a vector of the phrases in one document) should be returned, or whether the phrases should be combined into a single space-separated string. Defaults to TRUE.

return_tag_sequences

Logical indicating whether tag sequences should be returned along with phrases. Defaults to FALSE.
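
The interaction between `regex`, `maximum_ngram_length`, and `minimum_ngram_length` can be sketched in base R. This is not the package's implementation: instead of the "FilterFSA" method it enumerates every candidate span by brute force and keeps the spans whose tag string matches the pattern. The helper name `match_tag_spans()` is hypothetical, and the sketch assumes each token already carries a one-letter coarse tag (A = adjective, N = noun, P = preposition, D = determiner).

```r
# Hypothetical helper (not part of the package): brute-force span matching.
# `tokens` and `tags` are parallel character vectors; each tag is assumed to
# be a single letter (A = adjective, N = noun, P = preposition, D = determiner).
match_tag_spans <- function(tokens, tags,
                            regex = "(A|N)*N(PD*(A|N)*N)*",
                            maximum_ngram_length = 8,
                            minimum_ngram_length = 2) {
  stopifnot(length(tokens) == length(tags))
  # Anchor each pattern so a span counts only if the whole span matches.
  anchored <- paste0("^(", regex, ")$")
  phrases <- character(0)
  n <- length(tags)
  for (start in seq_len(n)) {
    first_end <- start + minimum_ngram_length - 1
    last_end <- min(n, start + maximum_ngram_length - 1)
    if (first_end > last_end) next
    for (end in first_end:last_end) {
      tag_string <- paste(tags[start:end], collapse = "")
      # `regex` may be a vector; keep the span if any pattern matches.
      matched <- any(vapply(anchored, grepl, logical(1), x = tag_string))
      if (matched) {
        phrases <- c(phrases, paste(tokens[start:end], collapse = " "))
      }
    }
  }
  phrases
}

tokens <- c("the", "red", "car", "of", "the", "mayor")
tags   <- c("D",   "A",   "N",   "P",  "D",   "N")
match_tag_spans(tokens, tags)
# "red car" "red car of the mayor" "car of the mayor"
```

Because every start position is paired with every permitted end position, the number of candidate spans grows with `maximum_ngram_length`, which is consistent with the runtime warning above. Overlapping matches are all kept, so a long phrase and the shorter phrases nested inside it can both appear in the result.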

Value

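A list object. When `return_phrase_vectors = TRUE` (the default), each entry holds the vector of phrases extracted from one document.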

Examples


## Not run: ------------------------------------
# # make sure quanteda is installed
# requireNamespace("quanteda", quietly = TRUE)
# # load in U.S. presidential inaugural speeches from Quanteda example data.
# documents <- quanteda::data_corpus_inaugural
# # use first 10 documents for example
# documents <- documents[1:10,]
# 
# # run tagger
# tagged_documents <- POS_tag_documents(documents)
# 
# phrases <- extract_phrases(tagged_documents,
#                            regex = "(A|N)*N(PD*(A|N)*N)*",
#                            maximum_ngram_length = 8,
#                            minimum_ngram_length = 1)
## ---------------------------------------------
