text2vec (version 0.3.0)

glove: Fit a GloVe word embeddings model

Description

This function trains a GloVe word embeddings model via fully asynchronous, parallel AdaGrad.

Usage

glove(tcm, vocabulary_size, word_vectors_size, x_max, num_iters,
  shuffle_seed = NA_integer_, learning_rate = 0.05, verbose = TRUE,
  convergence_threshold = 0, grain_size = 100000L, max_cost = 10,
  alpha = 0.75, ...)

## S3 method for class 'dgTMatrix'
glove(tcm, vocabulary_size = nrow(tcm), word_vectors_size, x_max,
  num_iters, shuffle_seed = NA_integer_, learning_rate = 0.05,
  verbose = TRUE, convergence_threshold = -1, grain_size = 100000L,
  max_cost = 10, alpha = 0.75, ...)

## S3 method for class 'Matrix'
glove(tcm, ...)

Arguments

tcm
an object that represents the term-co-occurrence matrix used in training. At the moment only dgTMatrix objects (or objects coercible to dgTMatrix) are supported. Future releases will add support for out-of-core learning and streaming a TCM from disk.
vocabulary_size
number of words in the term-co-occurrence matrix
word_vectors_size
desired dimension of the word vectors
x_max
maximum number of co-occurrences to use in the weighting function. See the GloVe paper for details: http://nlp.stanford.edu/pubs/glove.pdf.
num_iters
number of AdaGrad epochs
shuffle_seed
integer seed for shuffling the input before each SGD iteration; use NA_integer_ (the default) to turn shuffling off. Shuffling is generally a good idea for stochastic gradient descent, but in my experience it does not improve convergence in this particular case. Please report if you find that shuffling improves your score.
learning_rate
learning rate for SGD. I do not recommend modifying this parameter, since AdaGrad quickly adjusts it to a suitable value.
verbose
logical; whether to display training information
convergence_threshold
defines the early stopping strategy. Fitting stops when one of the two following conditions is satisfied: (a) all iterations have been used, or (b) cost_previous_iter / cost_current_iter - 1 < convergence_threshold. See the sketch after this argument list.
grain_size
the grain size for RcppParallel::parallelReduce; I do not recommend adjusting this parameter. For details, see http://rcppcore.github.io/RcppParallel/#grain-size.
max_cost
the maximum absolute value of the calculated gradient for any single co-occurrence pair. Try a smaller value if you run into numerical stability problems.
alpha
the alpha in the weighting function formula: $f(x) = (x / x_max)^alpha$ if $x < x_max$, and $1$ otherwise. See the sketch after this argument list.
...
arguments passed to other methods (not used at the moment).
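
For illustration, here is a minimal R sketch of the weighting function and the early-stopping rule described above; the helper name glove_weight and the cost values are hypothetical, not part of the text2vec API:

# GloVe weighting function: down-weights rare pairs, caps frequent ones at 1
glove_weight <- function(x, x_max = 10, alpha = 0.75) {
  ifelse(x < x_max, (x / x_max)^alpha, 1)
}
glove_weight(c(1, 5, 10, 100))  # 0.178 0.595 1.000 1.000

# early-stopping rule: stop once the relative cost improvement
# drops below convergence_threshold (hypothetical cost values)
cost_previous_iter <- 0.0520
cost_current_iter  <- 0.0518
cost_previous_iter / cost_current_iter - 1 < 0.005  # TRUE -> stop early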

Methods (by class)

  • dgTMatrix: fits the GloVe model on a dgTMatrix, a sparse matrix in triplet form
  • Matrix: fits the GloVe model on a Matrix input (see the coercion sketch below)
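
For reference, a minimal sketch of building an input by coercing a dense base matrix to triplet form; the toy matrix m is hypothetical, and the coercion idiom is the one used with Matrix package versions contemporary with this release:

library(Matrix)
m <- matrix(c(0, 2, 2, 0), nrow = 2,
            dimnames = list(c("a", "b"), c("a", "b")))
tcm <- as(m, "dgTMatrix")  # sparse matrix in triplet form, ready for glove()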

See Also

http://nlp.stanford.edu/projects/glove/

Examples

## Not run: 
library(readr)
library(text2vec)
temp <- tempfile()
download.file('http://mattmahoney.net/dc/text8.zip', temp)
text8 <- read_lines(unz(temp, "text8"))
it <- itoken(text8, preprocess_function = identity,
             tokenizer = function(x) strsplit(x, " ", TRUE))
vocab <- vocabulary(it) %>%
  prune_vocabulary(term_count_min = 5)

# iterators are single-pass, so recreate it before building the TCM
it <- itoken(text8, preprocess_function = identity,
             tokenizer = function(x) strsplit(x, " ", TRUE))

tcm <- create_tcm(it, vocab_vectorizer(vocab, grow_dtm = FALSE, skip_grams_window = 5L))

# use the following command to manually set the number of threads (if desired);
# by default glove will use all available CPU cores
# RcppParallel::setThreadOptions(numThreads = 8)
fit <- glove(tcm = tcm, shuffle_seed = 1L, word_vectors_size = 50,
             x_max = 10, learning_rate = 0.2,
             num_iters = 50, grain_size = 1e5,
             max_cost = 100, convergence_threshold = 0.005)
# the final embeddings are the sum of the main and context word vectors
word_vectors <- fit$word_vectors[[1]] + fit$word_vectors[[2]]
rownames(word_vectors) <- rownames(tcm)
qlst <- prepare_analogy_questions('./questions-words.txt', rownames(word_vectors))
res <- check_analogy_accuracy(questions_lst = qlst, m_word_vectors = word_vectors)
## End(Not run)
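
As a follow-up (a sketch, not part of the original example), you can query the fitted vectors with a word analogy via plain cosine similarity; this assumes the tokens "paris", "france" and "germany" survived vocabulary pruning:

# query vector for the analogy "paris" - "france" + "germany"
query <- word_vectors["paris", ] - word_vectors["france", ] +
  word_vectors["germany", ]
# cosine similarity between the query and every word vector
cos_sim <- (word_vectors %*% query) /
  (sqrt(rowSums(word_vectors^2)) * sqrt(sum(query^2)))
# "berlin" should appear among the top-ranked words
head(sort(cos_sim[, 1], decreasing = TRUE), 5)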
