cos_sim: Compute the cosine similarity between one or more ALC embeddings and a set of features.

Description

Compute the cosine similarity between one or more ALC embeddings and a set of features.

Usage

cos_sim(
  x,
  pre_trained,
  features = NULL,
  stem = FALSE,
  language = "porter",
  as_list = TRUE,
  show_language = TRUE
)

Value

a data.frame or list of data.frames (one for each target) with the following columns:

target: (character) rownames of x, the labels of the ALC embeddings. NA if is.null(rownames(x)).
feature: (character) feature terms defined in the features argument.
value: (numeric) cosine similarity between x and feature.

Arguments

x: a (quanteda) dem-class or fem-class object.
pre_trained: (numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
features: (character) features of interest.
stem: (logical) - If TRUE, both features and rownames(pre_trained) are stemmed and average cosine similarities are reported. We recommend you remove misspelled words from pre_trained as these can significantly influence the average.
language: the name of a recognized language, as returned by getStemLanguages, or a two- or three-letter ISO-639 code corresponding to one of these languages (see references for the list of codes).
as_list: (logical) if FALSE all results are combined into a single data.frame If TRUE, a list of data.frames is returned with one data.frame per feature.
show_language: (logical) if TRUE print out message with language used for stemming.

Examples

Run this code


library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts sorrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)

# build document-feature matrix
immig_dfm <- dfm(immig_toks)

# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)

# to get group-specific embeddings, average within party
immig_wv_party <- dem_group(immig_dem, groups = immig_dem@docvars$party)

# compute the cosine similarity between each party's embedding and a specific set of features
cos_sim(x = immig_wv_party, pre_trained = cr_glove_subset,
features = c('reform', 'enforcement'), as_list = FALSE)

Run the code above in your browser using DataLab