text2vec (version 0.4.0)

GlobalVectors: Creates Global Vectors word-embeddings model.

Description

Class for GloVe word-embeddings model. It can be trained via fully asynchronous and parallel AdaGrad with the $fit() method.

Usage

GloVe

Format

R6Class object.

Fields

dump_every_n

integer = 0L by default. Defines how often word vectors are dumped. For example, the user can ask to dump word vectors every 5 iterations.

shuffle

logical = FALSE by default. Defines whether to shuffle before each SGD iteration. Generally shuffling is a good idea for stochastic gradient descent, but from my experience in this particular case it does not improve convergence.

grain_size

integer = 1e5L by default. This is the grain_size for RcppParallel::parallelReduce. For details, see http://rcppcore.github.io/RcppParallel/#grain-size. We don't recommend changing this parameter.

verbose

logical = TRUE. Whether to display training information.

Usage

For usage details see Methods, Arguments and Examples sections.

glove = GlobalVectors$new(word_vectors_size, vocabulary, x_max)
glove$fit(x, n_iter)
glove$get_word_vectors()
glove$dump_model()

Methods

$new(word_vectors_size, vocabulary, x_max, learning_rate = 0.15, max_cost = 10, alpha = 0.75, lambda = 0, shuffle = FALSE, initial = NULL)

Constructor for Global vectors model. For description of arguments see Arguments section.

$fit(x, n_iter, convergence_tol = -1)

fit GloVe model to input matrix x

$get_word_vectors()

get word vectors - obtain GloVe word embeddings

$dump_model()

get model internals - word vectors and biases for main and context words

$get_history

get history of SGD costs and word vectors (if dump_every_n > 0)

Arguments

glove

A GloVe object

x

An input term co-occurrence matrix, preferably in dgTMatrix format.

n_iter

integer number of SGD iterations

word_vectors_size

desired dimension for word vectors

vocabulary

character vector or instance of text2vec_vocabulary class. Each word should correspond to a dimension of the co-occurrence matrix.

x_max

integer maximum number of co-occurrences to use in the weighting function. See the GloVe paper for details: http://nlp.stanford.edu/pubs/glove.pdf

learning_rate

numeric learning rate for SGD. I do not recommend modifying this parameter, since AdaGrad will quickly adjust it to an optimal value.

convergence_tol

numeric = -1. Defines the early stopping strategy. Fitting stops when either of the following conditions is satisfied: (a) all iterations have been used, or (b) cost_previous_iter / cost_current_iter - 1 < convergence_tol. By default all iterations are performed.
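
The stopping condition (b) can be illustrated with a small base-R sketch. The helper name should_stop and the cost values are hypothetical and chosen only for illustration; this is not the package's internal code.

```r
# Hypothetical helper, not part of text2vec: returns TRUE when the relative
# improvement in cost between two iterations falls below convergence_tol.
should_stop = function(cost_previous_iter, cost_current_iter, convergence_tol = -1) {
  (cost_previous_iter / cost_current_iter - 1) < convergence_tol
}

# Made-up cost values, for illustration only.
costs = c(0.100, 0.080, 0.0799)

# Relative improvement of 25%: larger than the tolerance, so keep going.
stop_early_1 = should_stop(costs[1], costs[2], convergence_tol = 0.01)

# Relative improvement of roughly 0.1%: smaller than the tolerance, so stop.
stop_early_2 = should_stop(costs[2], costs[3], convergence_tol = 0.01)

# Default tolerance of -1: for positive costs the ratio minus one is always
# greater than -1, so this check never triggers.
stop_early_3 = should_stop(costs[2], costs[3])

print(c(stop_early_1, stop_early_2, stop_early_3))
```

With the default convergence_tol = -1 the check cannot fire for positive costs, which is consistent with all iterations being performed by default.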

max_cost

numeric = 10. The maximum absolute value of the calculated gradient for any single co-occurrence pair. Try setting this to a smaller value if you have problems with numerical stability.
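
Bounding a value by its absolute magnitude can be sketched in base R as follows. The helper clip_value is hypothetical, and how text2vec applies max_cost internally is an assumption here; the sketch only shows what limiting the absolute value to max_cost means.

```r
# Hypothetical helper, not text2vec internals: limit each value to the
# interval [-max_cost, max_cost].
clip_value = function(g, max_cost = 10) {
  pmax(pmin(g, max_cost), -max_cost)
}

# Made-up gradient values, for illustration only.
gradients = c(-25, -3, 0, 7, 40)
clipped = clip_value(gradients, max_cost = 10)
print(clipped)
```

A smaller max_cost bounds extreme updates more tightly, which is why lowering it may help with numerical stability.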

alpha

numeric = 0.75. The alpha in the weighting function formula: \(f(x) = 1\) if \(x > x_{max}\), else \(f(x) = (x / x_{max})^{\alpha}\)
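
As a rough illustration of this weighting function, the base-R sketch below evaluates it on a few co-occurrence counts. The function name glove_weight and the sample counts are hypothetical and are not part of text2vec.

```r
# Hypothetical helper, not part of text2vec: the weighting function
# f(x) = 1 if x > x_max, else (x / x_max)^alpha.
glove_weight = function(x, x_max = 10, alpha = 0.75) {
  ifelse(x > x_max, 1, (x / x_max)^alpha)
}

# Made-up co-occurrence counts, for illustration only.
counts = c(1, 5, 10, 100)
weights = glove_weight(counts, x_max = 10, alpha = 0.75)
print(round(weights, 3))
```

Counts below x_max are down-weighted, while every count above x_max receives the same weight of 1, so very frequent pairs do not dominate the cost.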

lambda

numeric = 0.0. L1 regularization coefficient. 0 = vanilla GloVe, which corresponds to the original paper and implementation. lambda > 0 corresponds to a new text2vec feature and a different SGD algorithm. In our experience a small lambda (such as lambda = 1e-5) usually produces better results than vanilla GloVe on small corpora.

initial

NULL - word vectors and word biases will be initialized randomly. Alternatively, a named list containing w_i, w_j, b_i, b_j values - initial word vectors and biases. This is useful for fine-tuning. For example, one can pretrain a model on a large corpus (such as a Wikipedia dump) and then fine-tune it on a smaller task-specific dataset.

See Also

http://nlp.stanford.edu/projects/glove/

Examples

# NOT RUN {
library(text2vec)
temp = tempfile()
download.file('http://mattmahoney.net/dc/text8.zip', temp)
text8 = readLines(unz(temp, "text8"))
it = itoken(text8)
vocab = create_vocabulary(it) %>%
 prune_vocabulary(term_count_min = 5)
v_vect = vocab_vectorizer(vocab, grow_dtm = FALSE, skip_grams_window = 5L)
tcm = create_tcm(it, v_vect)

# GloVe is an R6 class, so the model is created with $new()
glove_model = GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10, learning_rate = .25)
# fit model and get word vectors
glove_model$fit(tcm, n_iter = 10)
wv = glove_model$get_word_vectors()
# }