wordvector (version 0.1.0)

word2vec: Word2vec model

Description

Train a word2vec model (Mikolov et al., 2013) in different architectures on a quanteda::tokens object.

Usage

word2vec(
  x,
  dim = 50,
  type = c("cbow", "skip-gram"),
  min_count = 5L,
  window = ifelse(type == "cbow", 5L, 10L),
  iter = 10L,
  alpha = 0.05,
  use_ns = TRUE,
  ns_size = 5L,
  sample = 0.001,
  verbose = FALSE,
  ...
)

Value

Returns a textmodel_wordvector object with the following elements:

vectors

a matrix for word vectors.

dim

the size of the word vectors.

type

the architecture of the model.

frequency

the frequency of words in x.

window

the size of the word window.

iter

the number of iterations in model training.

alpha

the initial learning rate.

use_ns

the use of negative sampling.

ns_size

the size of negative samples.

concatenator

the concatenator in x.

call

the command used to execute the function.

version

the version of the wordvector package.
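
The returned object can be inspected like a regular list. A minimal sketch (assuming `w2v` is a model fitted as in the Examples section below):

```r
# Inspect a fitted textmodel_wordvector object
dim(w2v$vectors)    # rows: vocabulary, columns: `dim` dimensions
w2v$type            # the architecture used, "cbow" or "skip-gram"
head(w2v$frequency) # word frequencies in the training tokens
```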

Arguments

x

a quanteda::tokens object.

dim

the size of the word vectors.

type

the architecture of the model; either "cbow" (continuous bag of words) or "skip-gram".

min_count

the minimum frequency of the words. Words less frequent than this in x are removed before training.

window

the size of the word window. Words within this window are considered to be the context of a target word.

iter

the number of iterations in model training.

alpha

the initial learning rate.

use_ns

if TRUE, negative sampling is used. Otherwise, hierarchical softmax is used.

ns_size

the size of negative samples. Only used when use_ns = TRUE.

sample

the rate of sampling of words based on their frequency. Sampling is disabled when sample = 1.0.

verbose

if TRUE, print the progress of training.

...

additional arguments.

Details

Users can change the number of processors used for parallel computing via options(wordvector_threads).
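
For example, to train on four threads (a sketch; the thread count here is illustrative, adjust it to your machine):

```r
# Set the number of threads before calling word2vec()
options(wordvector_threads = 4L)
getOption("wordvector_threads")  # confirm the setting
```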

References

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. https://arxiv.org/abs/1310.4546.

Examples

# \donttest{
library(quanteda)
library(wordvector)

# pre-processing
corp <- data_corpus_news2014 
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>% 
   tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>% 
   tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
                 padding = TRUE) %>% 
   tokens_tolower()

# train word2vec
w2v <- word2vec(toks, dim = 50, type = "cbow", min_count = 5, sample = 0.001)
head(similarity(w2v, c("berlin", "germany", "france"), mode = "word"))
analogy(w2v, ~ berlin - germany + france)
# }