phrasemachine (version 1.1.2)

extract_phrases: Extract Phrases

Description

Extracts phrases from a list of POS-tagged documents using the "FilterFSA" method in Handler et al. 2016.

Usage

extract_phrases(POS_tagged_documents, regex = "(A|N)*N(PD*(A|N)*N)*",
  maximum_ngram_length = 8, minimum_ngram_length = 2,
  return_phrase_vectors = TRUE, return_tag_sequences = FALSE)

Arguments

POS_tagged_documents
A list object of the form produced by the `POS_tag_documents()` function, with either Penn TreeBank or Petrov/Gimpel style tags.
regex
The regular expression used to find phrases, matched against each candidate span's sequence of tags. Defaults to "(A|N)*N(PD*(A|N)*N)*", the "SimpleNP" grammar in Handler et al. 2016. A vector of regular expressions may also be provided if the user wishes to match more than one pattern.
maximum_ngram_length
The maximum length of the phrases returned. Defaults to 8. Increasing this number can greatly increase runtime.
minimum_ngram_length
The minimum length of the phrases returned. Defaults to 2. Can be increased to remove shorter phrases, or decreased to 1 to include unigrams.
return_phrase_vectors
Logical indicating whether a list of phrase vectors (with each entry containing a vector of the phrases in one document) should be returned, or whether the phrases should be combined into a single space-separated string. Defaults to TRUE.
return_tag_sequences
Logical indicating whether tag sequences should be returned along with phrases. Defaults to FALSE.
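
To make the interaction between `regex`, `minimum_ngram_length`, and `maximum_ngram_length` concrete, here is a minimal base-R sketch of span filtering: it enumerates every token span within the length bounds and keeps those whose tag string matches the pattern in full. It is an illustration only, not the package's implementation. The helper `match_tag_spans()` is hypothetical, the one-letter tags are assumed to mean A = adjective, N = noun, P = preposition, D = determiner, and joining tokens with an underscore is a choice made for this example.

```r
# Hypothetical helper, NOT part of phrasemachine: a base-R illustration of
# filtering token spans by a regular expression over one-letter coarse tags.
match_tag_spans <- function(tokens, tags,
                            regex = "(A|N)*N(PD*(A|N)*N)*",
                            minimum_ngram_length = 2,
                            maximum_ngram_length = 8) {
  stopifnot(length(tokens) == length(tags))
  # Anchor the pattern so that the whole span must match, not a substring.
  anchored <- paste0("^(", regex, ")$")
  phrases <- character(0)
  n <- length(tags)
  for (start in seq_len(n)) {
    first_end <- start + minimum_ngram_length - 1
    last_end <- min(n, start + maximum_ngram_length - 1)
    if (first_end > last_end) next  # not enough tokens left for a span
    for (end in first_end:last_end) {
      tag_string <- paste(tags[start:end], collapse = "")
      if (grepl(anchored, tag_string)) {
        # Underscore joining is an assumption made for this sketch.
        phrases <- c(phrases, paste(tokens[start:end], collapse = "_"))
      }
    }
  }
  phrases
}

# Toy input with hand-assigned coarse tags (assumed tag meanings, see above).
tokens <- c("the", "red", "car", "of", "the", "mayor")
tags   <- c("D",   "A",   "N",   "P",  "D",   "N")

result <- match_tag_spans(tokens, tags)
print(result)

# Lowering the minimum length to 1 also admits single nouns.
with_unigrams <- match_tag_spans(tokens, tags, minimum_ngram_length = 1)
print(with_unigrams)
```

Because every span up to `maximum_ngram_length` is tested at every start position, the number of candidate spans grows with that bound, which is consistent with the runtime warning above. Overlapping matches are all kept, so a long phrase and the shorter phrases nested inside it can appear together in the output.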

Value

A list object.

Examples

## Not run: ------------------------------------
# # make sure quanteda is installed
# requireNamespace("quanteda", quietly = TRUE)
# # load in U.S. presidential inaugural speeches from Quanteda example data.
# documents <- quanteda::data_corpus_inaugural
# # use first 10 documents for example
# documents <- documents[1:10,]
# 
# # run tagger
# tagged_documents <- POS_tag_documents(documents)
# 
# phrases <- extract_phrases(tagged_documents,
#                            regex = "(A|N)*N(PD*(A|N)*N)*",
#                            maximum_ngram_length = 8,
#                            minimum_ngram_length = 1)
## ---------------------------------------------