get_grouped_similarity: Get averaged similarity scores between target word(s) and one or two vectors of candidate words.

Description

Get similarity scores between a target word (or words) and a comparison vector of one or more candidate words, computed separately for each level of the grouping variable. When two vectors of candidate words are provided (second_vec is not NULL), the function instead returns a composite score that contrasts the two vectors: the mean similarity of the target to the terms in first_vec minus the mean similarity of the target to the terms in second_vec. Positive values indicate that the target is closer to first_vec, negative values that it is closer to second_vec.
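
The composite score described above can be sketched in base R. This is only an illustration of the arithmetic, not the package's internal code: the 2-dimensional embeddings and the helper cos_sim() are invented for the example, whereas the real function works with embeddings derived from pre_trained and transform_matrix.

```r
# Hypothetical helper: cosine similarity between two numeric vectors.
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy 2-dimensional embeddings, invented for illustration only.
target_vec <- c(1, 0)
first_emb  <- rbind(left = c(1, 0), lefty = c(1, 1))
second_emb <- rbind(right = c(0, 1), rightwinger = c(-1, 0))

# Mean similarity of the target to each set of candidate words.
mean_first  <- mean(apply(first_emb, 1, cos_sim, b = target_vec))
mean_second <- mean(apply(second_emb, 1, cos_sim, b = target_vec))

# Composite score: mean similarity to the first set
# plus (-1) times the mean similarity to the second set.
composite <- mean_first - mean_second
composite
```

With norm = "l2" each term is a cosine similarity lying in [-1, 1], so the composite score lies in [-2, 2].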

Usage

get_grouped_similarity(
  x,
  target,
  first_vec,
  second_vec,
  pre_trained,
  transform_matrix,
  group_var,
  window = window,
  norm = "l2",
  remove_punct = FALSE,
  remove_symbols = FALSE,
  remove_numbers = FALSE,
  remove_separators = FALSE,
  valuetype = "fixed",
  hard_cut = FALSE,
  case_insensitive = TRUE
)

Value

A data.frame with one row per group and the following columns:

group: the grouping variable specified for the analysis
val: (numeric) cosine similarity scores

Arguments

x: a (quanteda) corpus object
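target: (character) vector of target words
first_vec: (character) vector of candidate words to compare the target against
second_vec: (character) second vector of candidate words, or NULL to compare against first_vec only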
pre_trained: (numeric) a F x D matrix corresponding to pretrained embeddings, usually trained on the same corpus as that used for x. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding
transform_matrix: (numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
group_var: (character) variable name in corpus object defining grouping variable
window: (numeric) - defines the size of a context (words around the target)
norm: (character) - "l2" for l2 normalized cosine similarity and "none" for dot product
remove_punct: (logical) - if TRUE remove all characters in the Unicode "Punctuation" [P] class
remove_symbols: (logical) - if TRUE remove all characters in the Unicode "Symbol" [S] class
remove_numbers: (logical) - if TRUE remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day
remove_separators: (logical) - if TRUE remove separators and separator characters (Unicode "Separator" [Z] and "Control" [C] categories)
valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching
hard_cut: (logical) - if TRUE then a context must have window x 2 tokens, if FALSE it can have window x 2 or fewer (e.g. if a doc begins with a target word, then context will have window tokens rather than window x 2)
case_insensitive: (logical) - if TRUE, ignore case when matching a target pattern
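
The interaction between window and hard_cut can be sketched as follows. The helper get_context() is hypothetical and written only for this illustration. It is not a conText function, and the assumption that short contexts are discarded when hard_cut = TRUE is inferred from the argument description above.

```r
# Hypothetical helper (not part of conText): return the tokens that lie
# within `window` positions on either side of the target at position `pos`.
get_context <- function(tokens, pos, window) {
  idx <- seq(max(1, pos - window), min(length(tokens), pos + window))
  tokens[setdiff(idx, pos)]
}

tokens <- c("immigration", "reform", "was", "debated", "in", "congress")
window <- 2

# The target opens the document, so only the right-hand side has context.
ctx <- get_context(tokens, pos = 1, window = window)
length(ctx)  # fewer than window * 2 tokens

# Under the assumed hard_cut = TRUE behaviour this context would not qualify.
keep_if_hard_cut <- length(ctx) == window * 2
```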

Examples

library(conText)

# Add a grouping variable to the sample corpus: 50 documents per year.
quanteda::docvars(cr_sample_corpus, 'year') <- rep(2011:2014, each = 50)

# For each year, score "immigration" on a left-minus-right composite index.
cos_simsdf <- get_grouped_similarity(
  cr_sample_corpus,
  group_var = "year",
  target = "immigration",
  first_vec = c("left", "lefty"),
  second_vec = c("right", "rightwinger"),
  pre_trained = cr_glove_subset,
  transform_matrix = cr_transform,
  window = 12L,
  norm = "l2"
)
