text2vec (version 0.6)

RelaxedWordMoversDistance: Creates Relaxed Word Movers Distance (RWMD) model

Description

The RWMD model can be used to query the "relaxed word mover's distance" from a document to a collection of documents. RWMD measures the distance between a query document and each document in a collection by calculating how hard it is to transform the words of the query document into the words of that document. For more detail see the following article: http://mkusner.github.io/publications/WMD.pdf. In contrast to that article, however, we calculate the "easiness" of converting one word into another using cosine similarity rather than Euclidean distance. In text2vec we've also implemented an efficient RWMD using the tricks from the article "Linear-Complexity Relaxed Word Mover's Distance with GPU Acceleration".
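The idea described above can be illustrated with a small base R sketch. This is not the text2vec implementation (which relies on the linear-complexity tricks mentioned above); it is a naive version under stated assumptions: documents are named term-count vectors, the toy embedding values are made up, the helper names cosine_dist, relaxed_cost and rwmd_sketch are hypothetical, and the two one-directional costs are combined by taking their maximum, as in the article linked above.

```r
# Toy embeddings: one row per word (made-up values, for illustration only).
emb = rbind(
  cat    = c(1.0, 0.1),
  kitten = c(0.9, 0.2),
  car    = c(0.1, 1.0)
)

# Cosine distance between every row of `a` and every row of `b`.
cosine_dist = function(a, b) {
  a = a / sqrt(rowSums(a^2))
  b = b / sqrt(rowSums(b^2))
  1 - a %*% t(b)
}

# One-directional relaxed cost: every word of `from` sends all of its
# (normalized) weight to the closest word present in `to`.
relaxed_cost = function(from, to, emb) {
  words_from = names(from)[from > 0]
  words_to = names(to)[to > 0]
  d = cosine_dist(emb[words_from, , drop = FALSE],
                  emb[words_to, , drop = FALSE])
  w = from[words_from] / sum(from[words_from])
  sum(w * apply(d, 1, min))
}

# Assumption: combine both directions by taking the larger cost.
rwmd_sketch = function(x, y, emb) {
  max(relaxed_cost(x, y, emb), relaxed_cost(y, x, emb))
}

# Three tiny "documents" as term counts over the same vocabulary.
doc_a = c(cat = 2, kitten = 0, car = 0)
doc_b = c(cat = 0, kitten = 1, car = 0)
doc_c = c(cat = 0, kitten = 0, car = 3)

d_ab = rwmd_sketch(doc_a, doc_b, emb)  # "cat" vs "kitten": close vectors
d_ac = rwmd_sketch(doc_a, doc_c, emb)  # "cat" vs "car": distant vectors
```

With these toy vectors d_ab comes out smaller than d_ac, because the "cat" and "kitten" vectors point in nearly the same direction while "cat" and "car" do not.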

Usage

RelaxedWordMoversDistance

RWMD

Arguments

Format

R6Class object.

Usage

For usage details see Methods, Arguments and Examples sections.

rwmd = RelaxedWordMoversDistance$new(x, embeddings)
rwmd$sim2(x)

Methods

$new(x, embeddings)

Constructor for the RWMD model. x - document-term matrix which represents the collection of documents against which you want to perform queries. embeddings - matrix of word embeddings which will be used to calculate similarities between words (each row represents a word vector).

$sim2(x)

Calculates the similarity from the collection of documents to a collection of query documents x. Here x is a document-term matrix which represents the set of query documents.

$dist2(x)

Calculates the distance from the collection of documents to a collection of query documents x. Here x is a document-term matrix which represents the set of query documents.

Examples

# NOT RUN {
library(text2vec)
library(rsparse)
data("movie_review")
tokens = word_tokenizer(tolower(movie_review$review))
v = create_vocabulary(itoken(tokens))
v = prune_vocabulary(v, term_count_min = 5, doc_proportion_max = 0.5)
it = itoken(tokens)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
tcm = create_tcm(it, vectorizer, skip_grams_window = 5)
glove_model = GloVe$new(rank = 50, x_max = 10)
wv = glove_model$fit_transform(tcm, n_iter = 5)
# combine main and context vectors by summing them, as proposed in the GloVe paper
wv = wv + t(glove_model$components)
rwmd_model = RelaxedWordMoversDistance$new(dtm, wv)
rwms = rwmd_model$sim2(dtm[1:10, ])
head(sort(rwms[1, ], decreasing = TRUE))
# }
