contrast_nns: Contrast nearest neighbors

Description

Computes the ratio of cosine similarities between group embeddings and features. For any given feature, it first computes the cosine similarity between that feature and each of the two group embeddings, and then takes the ratio of these two similarities. This ratio captures how strongly a feature discriminates between the groups: a value far from 1 indicates a feature that is much closer to one group than to the other.
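The computation described above can be sketched in base R with toy vectors. This is an illustration only, not the package's internal code: the embeddings, the feature names, and the helper cos_sim are all hypothetical, and which group ends up in the numerator is an assumption made here for the example.

```r
# Toy illustration only: hypothetical 3-dimensional embeddings, not package internals.
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical group embeddings (e.g. averaged context embeddings per group).
group_a <- c(1, 0, 0)
group_b <- c(0, 1, 0)

# Hypothetical feature embeddings, one row per feature.
features <- rbind(
  border = c(3, 1, 0),
  reform = c(1, 3, 0),
  policy = c(1, 1, 0)
)

# Cosine similarity of every feature with each group embedding.
sim_a <- apply(features, 1, cos_sim, b = group_a)
sim_b <- apply(features, 1, cos_sim, b = group_b)

# Ratio above 1: closer to group A; below 1: closer to group B (assumed ordering).
ratio <- sim_a / sim_b
ratio
```

In this toy setup "border" has a ratio of 3, "reform" a ratio of 1/3, and "policy" a ratio of 1, so only the first two features distinguish the groups.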

Usage

contrast_nns(
  x,
  groups = NULL,
  pre_trained = NULL,
  transform = TRUE,
  transform_matrix = NULL,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  permute = TRUE,
  num_permutations = 100,
  candidates = NULL,
  N = 20,
  verbose = TRUE
)

Value

A data.frame with the following columns:

feature: (character) vector of feature terms corresponding to the nearest neighbors.
value: (numeric) ratio of cosine similarities. Averaged over bootstrapped samples if bootstrap = TRUE.
std.error: (numeric) standard error of the ratio of cosine similarities. Column is dropped if bootstrap = FALSE.
lower.ci: (numeric) (if bootstrap = TRUE) lower bound of the confidence interval.
upper.ci: (numeric) (if bootstrap = TRUE) upper bound of the confidence interval.
p.value: (numeric) empirical p-value. Column is dropped if permute = FALSE.
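To make the std.error, lower.ci, upper.ci and p.value columns more concrete, here is a minimal base R sketch on simulated data. It is not the package's implementation. The simulated document embeddings, the helper cos_ratio, the within-group resampling, the percentile interval, and the p-value defined as the share of permuted ratios at least as far from 1 as the observed ratio are all assumptions chosen for illustration.

```r
set.seed(1)
cos_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Hypothetical document embeddings: 40 documents, 5 dimensions, two groups.
n <- 40
d <- 5
groups <- rep(c("A", "B"), each = n / 2)
docs <- matrix(rnorm(n * d), nrow = n)
docs[groups == "A", 1] <- docs[groups == "A", 1] + 2  # shift group A along dimension 1
docs[groups == "B", 2] <- docs[groups == "B", 2] + 2  # shift group B along dimension 2
feature <- c(1, 0.5, 0, 0, 0)                         # hypothetical feature embedding

# Ratio of the feature's cosine similarity with each group's mean embedding.
cos_ratio <- function(docs, groups, feature) {
  emb_a <- colMeans(docs[groups == "A", , drop = FALSE])
  emb_b <- colMeans(docs[groups == "B", , drop = FALSE])
  cos_sim(feature, emb_a) / cos_sim(feature, emb_b)
}

observed <- cos_ratio(docs, groups, feature)

# Bootstrap: resample documents with replacement within each group (assumed scheme).
boot <- replicate(200, {
  idx <- unlist(lapply(split(seq_len(n), groups),
                       function(i) sample(i, replace = TRUE)))
  cos_ratio(docs[idx, , drop = FALSE], groups[idx], feature)
})
std_error <- sd(boot)
ci <- quantile(boot, c(0.025, 0.975))  # percentile interval at the 0.95 level

# Permutation test: shuffle group labels and recompute the ratio each time.
perm <- replicate(200, cos_ratio(docs, sample(groups), feature))
p_value <- mean(abs(perm - 1) >= abs(observed - 1))

data.frame(value = mean(boot), std.error = std_error,
           lower.ci = ci[[1]], upper.ci = ci[[2]], p.value = p_value)
```

With only a small number of permutations, as in the package example further down (num_permutations = 10), an empirical p-value of this kind can take only a handful of distinct values, so larger values are preferable for real analyses.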

Arguments

x: (quanteda) tokens-class object
groups: (numeric, factor, character) a binary variable of the same length as x
pre_trained: (numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
transform: (logical) if TRUE (default), apply the 'a la carte' transformation; if FALSE, output untransformed averaged embeddings.
transform_matrix: (numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
bootstrap: (logical) if TRUE, use bootstrapping: sample from texts with replacement and re-estimate cosine ratios for each sample. Required to get standard errors.
num_bootstraps: (numeric) number of bootstraps to use.
confidence_level: (numeric in (0,1)) confidence level, e.g. 0.95.
permute: (logical) if TRUE, compute empirical p-values using a permutation test.
num_permutations: (numeric) number of permutations to use.
candidates: (character) vector of candidate features for nearest neighbors
N: (numeric) nearest neighbors are subset to the union of the N nearest neighbors of each group (if NULL, the ratio is computed for all features).
verbose: (logical) if TRUE, report the documents that had no overlapping features with the pretrained embeddings provided.
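The effect of N can be illustrated with a small base R sketch. The similarity values and feature names below are made up, and the code only mirrors the description of N given above; it is not taken from the package source.

```r
# Hypothetical cosine similarities of five features to each group embedding.
sim_a <- c(border = 0.9, reform = 0.4, policy = 0.7, wall = 0.8, visa = 0.2)
sim_b <- c(border = 0.3, reform = 0.9, policy = 0.8, wall = 0.1, visa = 0.5)

N <- 2
top_a <- names(sort(sim_a, decreasing = TRUE))[seq_len(N)]  # border, wall
top_b <- names(sort(sim_b, decreasing = TRUE))[seq_len(N)]  # reform, policy

# Keep the union of each group's N nearest neighbors, then compute the ratio.
keep <- union(top_a, top_b)
ratio <- sim_a[keep] / sim_b[keep]
ratio
```

Here "visa" is dropped because it is not among the top 2 neighbors of either group, so the result can contain up to 2 * N features, and fewer when the two neighbor sets overlap.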

Examples

library(quanteda)
library(conText)  # provides contrast_nns() and the cr_* example data

cr_toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding the target term "immigration"
immig_toks <- tokens_context(x = cr_toks,
                             pattern = "immigration", window = 6L,
                             hard_cut = FALSE, verbose = TRUE)

# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))

set.seed(42L)
party_nns <- contrast_nns(x = immig_toks,
                          groups = docvars(immig_toks, 'party'),
                          pre_trained = cr_glove_subset,
                          transform = TRUE, transform_matrix = cr_transform,
                          bootstrap = TRUE,
                          num_bootstraps = 100,
                          confidence_level = 0.95,
                          permute = TRUE, num_permutations = 10,
                          candidates = NULL, N = 20,
                          verbose = FALSE)

head(party_nns)
