About

conText provides a fast, flexible and transparent framework to estimate context-specific word and short document embeddings using the 'a la carte' embeddings approach developed by Khodak et al. (2018) and evaluate hypotheses about covariate effects on embeddings using the regression framework developed by Rodriguez et al. (2021).

How to Install

install.packages("conText")

Datasets

To use conText you will need three objects:

  1. A (quanteda) corpus with the documents and corresponding document variables you want to evaluate.
  2. A set of (GloVe) pre-trained embeddings.
  3. A transformation matrix specific to the pre-trained embeddings.

conText includes sample objects for all three, but keep in mind that these are only meant to illustrate how the functions work. In this Dropbox folder we have included the raw versions of these objects, including the full Stanford GloVe 300-dimensional embeddings (labeled glove.rds) and the corresponding transformation matrix estimated by Khodak et al. (2018) (labeled khodakA.rds). We also provide an equivalent RDS file for the 2024 GloVe embeddings released in July 2025 (labeled _glove_2024.rds).
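
As a rough sketch of how these three objects are typically loaded, the R snippet below uses the bundled sample objects (cr_sample_corpus, cr_glove_subset, cr_transform); the commented readRDS() calls are illustrative only and assume you have downloaded the full files from the Dropbox folder into your working directory.

library(conText)

# 1. a quanteda corpus with document variables (bundled sample)
corp <- cr_sample_corpus

# 2. pre-trained (GloVe) embeddings (bundled subset of the full embeddings)
pre_trained <- cr_glove_subset

# 3. transformation matrix matching those embeddings (bundled sample)
transform_matrix <- cr_transform

# for real applications, swap in the full objects from the Dropbox folder,
# adjusting the paths to wherever you saved them:
# pre_trained <- readRDS("glove.rds")
# transform_matrix <- readRDS("khodakA.rds")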

Quick Start Guides

Check out this Quick Start Guide to get going with conText (last updated: 07/28/2025).

Latest Updates

As noted in Rodriguez et al. (2023) (p. 1272), distance measures typically used to compare representations in high-dimensional space (such as embedding vectors) exhibit statistical bias. In Green et al. (2025), we explore the severity of this problem for text-as-data applications and provide and validate a bias correction for the squared Euclidean distance. We implement this estimator and other recommendations from the paper in the latest update to the conText() function. Please refer to the Bias in Distance Measures vignette for additional information and the Quick Start Guide for examples of how to use the new version of the function and a description of changes in the output.
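
As a minimal sketch of an embedding regression call using the bundled sample objects (argument names follow earlier releases of conText(); the exact interface and output of the updated function are described in the Quick Start Guide):

library(conText)
library(quanteda)

# tokenize the sample corpus
toks <- tokens(cr_sample_corpus, remove_punct = TRUE, remove_symbols = TRUE)

# regress ALC embeddings of "immigration" on party and gender
set.seed(2025)
model <- conText(formula = immigration ~ party + gender,
                 data = toks,
                 pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform,
                 bootstrap = TRUE, num_bootstraps = 100,
                 permute = TRUE, num_permutations = 100,
                 window = 6, case_insensitive = TRUE,
                 verbose = FALSE)

# normed coefficients (one per covariate level); slot names may differ in 3.0.0
model@normed_coefficients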

Multilanguage Resources

For those working in languages other than English, we have a set of data and code resources here.

Monthly Downloads: 231
Version: 3.0.0
License: GPL-3
Maintainer: Sofia Avila
Last Published: September 3rd, 2025

Functions in conText (3.0.0)

dem-class: Virtual class "dem" for a document-embedding matrix
cr_sample_corpus: Congressional Record sample corpus
conText: Embedding regression
contrast_nns: Contrast nearest neighbors
cos_sim: Compute the cosine similarity between one or more ALC embeddings and a set of features
dem_sample: Randomly sample documents from a dem
get_context: Get context words (words within a symmetric window around the target word/phrase) surrounding a user-defined target
find_nns: Return nearest neighbors based on cosine similarity
get_grouped_similarity: Get averaged similarity scores between target word(s) and one or two vectors of candidate words
get_cos_sim: Given a tokenized corpus, compute the cosine similarities of the resulting ALC embeddings and a defined set of features
feature_sim: Given two feature-embedding matrices, compute "parallel" cosine similarities between overlapping features
fem-class: Virtual class "fem" for a feature-embedding matrix
get_seq_cos_sim: Calculate cosine similarities between a target word and candidate words over a sequenced variable using the ALC embedding approach
ncs: Given a set of embeddings and a set of tokenized contexts, find the top N nearest contexts
tokens_context: Get the tokens of contexts surrounding user-defined patterns
get_ncs: Given a set of tokenized contexts, find the top N nearest contexts
get_local_vocab: Identify words common to a collection of texts and a set of pre-trained embeddings
find_cos_sim: Find cosine similarities between target and candidate words
fem: Create a feature-embedding matrix
get_nns: Given a tokenized corpus and a set of candidate neighbors, find the top N nearest neighbors
get_nns_ratio: Given a corpus and a binary grouping variable, compute the ratio of cosine similarities over the union of their respective N nearest neighbors
permute_contrast: Permute similarity and ratio computations
plot_nns_ratio: Plot the output of get_nns_ratio()
prototypical_context: Find the most "prototypical" contexts
nns: Given a set of embeddings and a set of candidate neighbors, find the top N nearest neighbors
nns_ratio: Compute the ratio of cosine similarities for two embeddings over the union of their respective top N nearest neighbors
embed_target: Embed a target using either (a) the 'a la carte' transformation or (b) simple (untransformed) averaging of context embeddings
run_ols: Run OLS
bootstrap_contrast: Bootstrap similarity and ratio computations
compute_similarity: Compute a similarity vector (sub-function of bootstrap_similarity)
bootstrap_similarity: Bootstrap a similarity vector
build_conText: Build a conText-class object
build_fem: Build a fem-class object
compute_transform: Compute the transformation matrix A
compute_contrast: Compute similarity and similarity ratios
build_dem: Build a dem-class object
cr_transform: Transformation matrix
dem: Build a document-embedding matrix
cr_glove_subset: GloVe subset
bootstrap_nns: Bootstrap nearest neighbors
conText-package: conText: 'a la Carte' on Text (ConText) Embedding Regression
dem_group: Average document embeddings in a dem by a grouping variable
conText-class: Virtual class "conText" for a conText regression output
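
To give a sense of how several of these functions fit together, here is a rough workflow sketch using the bundled sample objects (argument names follow earlier releases and may differ slightly in 3.0.0):

library(conText)
library(quanteda)

toks <- tokens(cr_sample_corpus, remove_punct = TRUE)

# tokens of the contexts surrounding mentions of "immigration"
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)

# document-embedding matrix: one ALC embedding per context
immig_dem <- dem(x = immig_toks, pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)

# average the context embeddings by party to get one embedding per group
immig_by_party <- dem_group(immig_dem, groups = immig_dem@docvars$party)

# nearest neighbors of each party's "immigration" embedding
nns(immig_by_party, pre_trained = cr_glove_subset, N = 10,
    candidates = immig_by_party@features)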