report_term_matches: Generate a Report of Term Matches

Description

Extract matches to fuzzy terms (globs/wildcards or regular expressions) from provided text, in order to assess their appropriateness for inclusion in a dictionary.

Usage

report_term_matches(dict, text = NULL, space = NULL, glob = TRUE,
  parse_phrases = TRUE, tolower = TRUE, punct = TRUE, special = TRUE,
  as_terms = FALSE, bysentence = FALSE, as_string = TRUE,
  term_map_freq = 1, term_map_spaces = 1, outFile = NULL,
  space_dir = getOption("lingmatch.lspace.dir"), verbose = TRUE)

Value

A data.frame of results, with a row for each unique term, and the following columns:

term: The originally entered term.
regex: The converted and applied regular expression form of the term.
categories: Comma-separated category names, if dict is a list with named entries.
count: Total number of matches to the term.
max_count: Number of matches to the most representative (that with the highest average similarity) variant of the term.
variants: Number of variants of the term.
space: Name of the latent semantic space, if one was used.
mean_sim: Average similarity to the most representative variant among terms found in the space, if one was used.
min_sim: Minimum similarity to the most representative variant.
matches: Variants, with counts and similarity (Pearson's r) to the most representative term (if a space was specified). Either in the form of a comma-separated string or a data.frame (if as_string is FALSE).

Arguments

dict: A vector of terms, list of such vectors, or a matrix-like object to be categorized by read.dic.
text: A vector of text to extract matches from. If not specified, will use the terms in the term_map retrieved from select.lspace.
space: A vector space used to calculate similarities between term matches. Either the name of a space (see select.lspace), a matrix with terms as row names, or TRUE to auto-select a space based on matched terms.
glob: Logical; if TRUE, converts globs (asterisk wildcards) to regular expressions. If not specified, this will be set automatically.
parse_phrases: Logical; if TRUE (default) and space is specified, will break unmatched phrases into single terms, and average across any matched vectors.
tolower: Logical; if FALSE, will retain text's case.
punct: Logical; if FALSE, will remove punctuation markings in text.
special: Logical; if FALSE, will attempt to replace special characters in text.
as_terms: Logical; if TRUE, will treat text as terms, meaning dict terms will only count as matches when matching the complete text.
bysentence: Logical; if TRUE, will split text into sentences, and only consider unique sentences.
as_string: Logical; if FALSE, returns matches as tables rather than a string.
term_map_freq: Proportion of terms to include when using the term map as a source of terms. Applies when text is not specified.
term_map_spaces: Number of spaces in which a term has to appear to be included. Applies when text is not specified.
outFile: File path to write results to, always ending in .csv.
space_dir: Directory from which space should be loaded.
verbose: Logical; if FALSE, will not display status messages.
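
To make the glob handling more concrete, here is a minimal base-R sketch of the general idea: turn each glob into a regular expression, match it against tokens, and tabulate the variants found. It is an illustration only and not lingmatch's internal implementation. It assumes a simple letters-only tokenizer and uses utils::glob2rx for the conversion, so the regular expressions it produces may differ from those in the regex column of the real report.

```r
# Base-R sketch only; not lingmatch's internal implementation.
text <- c(
  "I am sadly homeless, and suffering from depression :(",
  "I feel weightless now that my sadness has been depressed! :()"
)
globs <- c("*less", "sad*", "depres*")

# Assumed tokenizer: lowercase, then split on anything that is not
# a letter or an apostrophe.
tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
tokens <- tokens[nzchar(tokens)]

# utils::glob2rx() converts a glob to an anchored regular expression,
# e.g. "sad*" becomes "^sad".
regex <- utils::glob2rx(globs)

# One row per term, loosely mirroring the term, regex, count,
# variants, and matches columns described under Value.
report <- do.call(rbind, lapply(seq_along(globs), function(i) {
  hits <- table(tokens[grepl(regex[i], tokens)])
  data.frame(
    term = globs[i],
    regex = regex[i],
    count = sum(hits),
    variants = length(hits),
    matches = paste0(names(hits), " (", hits, ")", collapse = ", "),
    stringsAsFactors = FALSE
  )
}))
print(report)
```

In this toy input, each glob picks up two distinct variants (for example, "sad*" matches "sadly" and "sadness"), which is the kind of listing the report is meant to help review before a term goes into a dictionary.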

Examples

library(lingmatch)

text <- c(
  "I am sadly homeless, and suffering from depression :(",
  "This wholesome happiness brings joy to my heart! :D:D:D",
  "They are joyous in these fearsome happenings D:",
  "I feel weightless now that my sadness has been depressed! :()"
)
dict <- list(
  sad = c("*less", "sad*", "depres*", ":("),
  happy = c("*some", "happ*", "joy*", "d:"),
  self = c("i *", "my *")
)

report_term_matches(dict, text)
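
As a possible variation on the example, the sketch below requests matches as tables rather than strings and prints a few of the columns listed under Value. It uses only arguments shown in Usage, but it assumes the lingmatch package is installed (the call is skipped otherwise), and the exact layout of the matches column when as_string is FALSE is not shown here because it is not specified in detail on this page.

```r
# Assumes the lingmatch package is installed; the call is skipped otherwise.
text <- c(
  "I am sadly homeless, and suffering from depression :(",
  "This wholesome happiness brings joy to my heart! :D:D:D"
)
dict <- list(
  sad = c("*less", "sad*", "depres*"),
  happy = c("*some", "happ*", "joy*")
)

report <- NULL
if (requireNamespace("lingmatch", quietly = TRUE)) {
  # as_string = FALSE returns matches as tables rather than a string;
  # verbose = FALSE suppresses status messages.
  report <- lingmatch::report_term_matches(
    dict, text, as_string = FALSE, verbose = FALSE
  )
  # A few of the columns described in the Value section.
  print(report[, c("term", "categories", "count", "variants")])
}
```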
