wordvector (version 0.1.0)

word2vec: Word2vec model

Description

Train a word2vec model (Mikolov et al., 2013) in different architectures on a quanteda::tokens object.

Usage

word2vec(
  x,
  dim = 50,
  type = c("cbow", "skip-gram"),
  min_count = 5L,
  window = ifelse(type == "cbow", 5L, 10L),
  iter = 10L,
  alpha = 0.05,
  use_ns = TRUE,
  ns_size = 5L,
  sample = 0.001,
  verbose = FALSE,
  ...
)

Value

Returns a textmodel_wordvector object with the following elements:

vectors

a matrix for word vectors.

dim

the size of the word vectors.

type

the architecture of the model.

frequency

the frequency of words in x.

window

the size of the word window.

iter

the number of iterations in model training.

alpha

the initial learning rate.

use_ns

the use of negative sampling.

ns_size

the size of negative samples.

concatenator

the concatenator in x.

call

the command used to execute the function.

version

the version of the wordvector package.
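
The returned object can be inspected like a regular list. A minimal sketch (assuming `w2v` is a model fitted as in the Examples section below):

```r
# Inspect a fitted textmodel_wordvector object
dim(w2v$vectors)    # rows: vocabulary, columns: `dim` dimensions
w2v$type            # the architecture used, "cbow" or "skip-gram"
head(w2v$frequency) # word frequencies in the training tokens
```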

Arguments

x

a quanteda::tokens object.

dim

the size of the word vectors.

type

the architecture of the model; either "cbow" (continuous bag of words) or "skip-gram".

min_count

the minimum frequency of the words. Words less frequent than this in x are removed before training.

window

the size of the word window. Words within this window are considered to be the context of a target word.

iter

the number of iterations in model training.

alpha

the initial learning rate.

use_ns

if TRUE, negative sampling is used. Otherwise, hierarchical softmax is used.

ns_size

the size of negative samples. Only used when use_ns = TRUE.

sample

the rate of sampling of words based on their frequency. Sampling is disabled when sample = 1.0.

verbose

if TRUE, print the progress of training.

...

additional arguments.

Details

Users can change the number of processors used for parallel computing via options(wordvector_threads).
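
For example, to train on four threads (a sketch; the thread count here is illustrative, adjust it to your machine):

```r
# Set the number of threads before calling word2vec()
options(wordvector_threads = 4L)
getOption("wordvector_threads")  # confirm the setting
```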

References

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. https://arxiv.org/abs/1310.4546.

Examples

# \donttest{
library(quanteda)
library(wordvector)

# pre-processing
corp <- data_corpus_news2014 
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>% 
   tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>% 
   tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
                 padding = TRUE) %>% 
   tokens_tolower()

# train word2vec
w2v <- word2vec(toks, dim = 50, type = "cbow", min_count = 5, sample = 0.001)
head(similarity(w2v, c("berlin", "germany", "france"), mode = "word"))
analogy(w2v, ~ berlin - germany + france)
# }