conText (version 3.0.0)

get_grouped_similarity: Get averaged similarity scores between target word(s) and one or two vectors of candidate words.

Description

Get similarity scores between one or more target words and a comparison vector of one or more candidate words. When two vectors of candidate words are provided (second_vec is not NULL), the function instead scores the target against a composite index of the two vectors. This is operationalized as the mean similarity of the target word to the first vector of terms minus the mean similarity to the second vector of terms.
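
The composite index described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the two-dimensional embeddings are made up, and toy_similarity is a hypothetical helper whose norm switch mirrors the documented norm argument ("l2" for cosine similarity, "none" for the dot product).

```r
# Hypothetical helper: cosine similarity ("l2") or plain dot product ("none").
toy_similarity <- function(a, b, norm = "l2") {
  dot <- sum(a * b)
  if (norm == "l2") dot / (sqrt(sum(a^2)) * sqrt(sum(b^2))) else dot
}

# Made-up 2-dimensional embeddings, for illustration only.
target_embedding <- c(1, 0)
first_vec_embeddings <- rbind(left = c(1, 0), lefty = c(0, 1))
second_vec_embeddings <- rbind(right = c(-1, 0), rightwinger = c(0, 1))

# Mean similarity of the target to each candidate vector.
mean_first <- mean(apply(first_vec_embeddings, 1, toy_similarity, b = target_embedding))
mean_second <- mean(apply(second_vec_embeddings, 1, toy_similarity, b = target_embedding))

# Composite index: mean similarity to first_vec minus mean similarity to second_vec.
composite <- mean_first - mean_second
```

With these toy values, a positive composite means the target sits closer to the first vector of terms and a negative one means it sits closer to the second.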

Usage

get_grouped_similarity(
  x,
  target,
  first_vec,
  second_vec,
  pre_trained,
  transform_matrix,
  group_var,
  window = window,
  norm = "l2",
  remove_punct = FALSE,
  remove_symbols = FALSE,
  remove_numbers = FALSE,
  remove_separators = FALSE,
  valuetype = "fixed",
  hard_cut = FALSE,
  case_insensitive = TRUE
)

Value

a data.frame with the following columns:

group

the grouping variable specified for the analysis

val

(numeric) cosine similarity scores

Arguments

x

a (quanteda) corpus object

target

(character) vector of words

first_vec

(character) vector of words

second_vec

(character) vector of words

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings, usually trained on the same corpus as that used for x. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

group_var

(character) variable name in corpus object defining grouping variable

window

(numeric) - defines the size of a context (words around the target)

norm

(character) - "l2" for l2 normalized cosine similarity and "none" for dot product

remove_punct

(logical) - if TRUE remove all characters in the Unicode "Punctuation" [P] class

remove_symbols

(logical) - if TRUE remove all characters in the Unicode "Symbol" [S] class

remove_numbers

(logical) - if TRUE remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day

remove_separators

(logical) - if TRUE remove separators and separator characters (Unicode "Separator" [Z] and "Control" [C] categories)

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching

hard_cut

(logical) - if TRUE then a context must have window x 2 tokens, if FALSE it can have window x 2 or fewer (e.g. if a doc begins with a target word, then context will have window tokens rather than window x 2)

case_insensitive

(logical) - if TRUE, ignore case when matching a target pattern
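
The interaction between window and hard_cut described above can be illustrated with a small base R sketch. It assumes a single tokenized document and exact matching; toy_context is a hypothetical helper written for this illustration and is not part of conText.

```r
# Hypothetical helper: collect up to `window` tokens on each side of every
# occurrence of `target`; with hard_cut = TRUE keep only full-size contexts.
toy_context <- function(tokens, target, window, hard_cut = FALSE) {
  idx <- seq_along(tokens)
  positions <- which(tokens == target)
  contexts <- lapply(positions, function(i) {
    tokens[idx != i & abs(idx - i) <= window]
  })
  if (hard_cut) {
    # A context must contain exactly window x 2 tokens to be kept.
    contexts <- Filter(function(ctx) length(ctx) == 2 * window, contexts)
  }
  contexts
}

# The document begins with the target word, so only the right-hand side
# of the window can be filled.
tokens <- c("immigration", "reform", "is", "debated", "in", "congress")

soft <- toy_context(tokens, "immigration", window = 2, hard_cut = FALSE)
hard <- toy_context(tokens, "immigration", window = 2, hard_cut = TRUE)
```

Here soft holds one context with two tokens ("reform", "is"), while hard is empty because that context falls short of window x 2 tokens.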

Examples

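library(conText)

# add a grouping variable: 200 documents split into four years of 50 each
quanteda::docvars(cr_sample_corpus, 'year') <- rep(2011:2014, each = 50)

# similarity of "immigration" to the first_vec terms minus the second_vec terms, by year
cos_simsdf <- get_grouped_similarity(cr_sample_corpus,
                                    group_var = "year",
                                    target = "immigration",
                                    first_vec = c("left", "lefty"),
                                    second_vec = c("right", "rightwinger"),
                                    pre_trained = cr_glove_subset,
                                    transform_matrix = cr_transform,
                                    window = 12L,
                                    norm = "l2")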