text2map (version 0.2.0)

get_centroid: Word embedding semantic centroid extractor

Description

Returns the average of the word vectors for a set of anchor terms. This average is roughly equivalent to the intersection of the contexts in which each word is used. The resulting semantic centroid can serve several purposes, most notably as input to CMDist(). get_centroid() accepts a list of terms, a string of terms, a data.frame, or a matrix; for a data.frame or matrix, the first column is used. The vectors are aggregated with a simple average, so a term that appears more than once is "weighted" by its count.
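The averaging described above can be sketched in base R. This is only an illustration and not the package's implementation: the toy matrix `wv` and its values are made up, and it assumes the anchors are already a character vector whose entries all appear as row names.

```r
# Toy embedding matrix: rows are words, columns are dimensions (made-up values)
wv <- matrix(
  c(1, 0,
    0, 1,
    1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("rocket", "moon", "spacecraft"), NULL)
)

# Simple average of the anchor rows, kept as a one-row matrix
anchors <- c("rocket", "moon")
centroid <- matrix(colMeans(wv[anchors, , drop = FALSE]), nrow = 1)

# Repeating a term weights it by its count: "rocket" now counts twice
repeated <- c("rocket", "rocket", "moon")
weighted <- matrix(colMeans(wv[repeated, , drop = FALSE]), nrow = 1)
```

With these toy values, `centroid` sits midway between the two anchors, while `weighted` is pulled toward "rocket" because that row enters the average twice.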

Usage

get_centroid(anchors, wv, missing = "stop")

Value

Returns a one-row matrix.

Arguments

anchors

Terms to be averaged: a list of terms, a string of terms, or a data.frame or matrix whose first column holds the terms.

wv

Matrix of word embedding vectors (a.k.a. embedding model) with rows as words.

missing

What action to take if terms are not in the embeddings. If missing = "stop" (default), the function stops and an error message states which terms are missing. If missing = "remove", missing terms, or rows with missing terms, are removed, and the missing terms are printed as a message.
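The two policies for terms that are not in the embeddings can be sketched roughly as follows. `check_anchors()` is a hypothetical helper written for illustration; it is not part of text2map, and the wording of the package's actual errors and messages may differ.

```r
# Hypothetical helper, not part of text2map: illustrates the two policies
check_anchors <- function(anchors, wv, missing = c("stop", "remove")) {
  missing <- match.arg(missing)
  # Anchor terms with no matching row name in the embedding matrix
  absent <- setdiff(anchors, rownames(wv))
  if (length(absent) > 0) {
    listing <- paste(absent, collapse = ", ")
    if (missing == "stop") {
      stop("Terms not in embeddings: ", listing)
    }
    # "remove": report the absent terms, then drop them
    message("Removing terms not in embeddings: ", listing)
    anchors <- anchors[anchors %in% rownames(wv)]
  }
  anchors
}

# Toy embeddings with two known words (made-up values)
wv <- matrix(1:4, nrow = 2, dimnames = list(c("rocket", "moon"), NULL))

# "comet" is absent, so it is reported and dropped
kept <- check_anchors(c("rocket", "comet", "moon"), wv, missing = "remove")
```

Under the default policy, the same call would raise an error naming "comet" instead of returning the remaining terms.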

Author

Dustin Stoltz

Examples

# load example word embeddings
data(ft_wv_sample)

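# anchors as a character vector of terms
space1 <- c("spacecraft", "rocket", "moon")

cen1 <- get_centroid(anchors = space1, wv = ft_wv_sample)

# the same anchors as a single string of terms
space2 <- c("spacecraft rocket moon")
cen2 <- get_centroid(anchors = space2, wv = ft_wv_sample)

# check whether both forms give the same centroid
identical(cen1, cen2)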
