if (require(tokenizers.bpe) && require(word2vec)) {
library(doc2vec)
library(tokenizers.bpe)
## Take data and standardise it a bit
data(belgium_parliament, package = "tokenizers.bpe")
str(belgium_parliament)
x <- subset(belgium_parliament, language %in% "french")
x$text <- tolower(x$text)
x$text <- gsub("[^[:alpha:]]", " ", x$text)
x$text <- gsub("[[:space:]]+", " ", x$text)
x$text <- trimws(x$text)
x$nwords <- txt_count_words(x$text)
x <- subset(x, nwords < 1000 & nchar(text) > 0)
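## Minimal illustration (assumed, not part of the original example) of the cleaning above:
## lowercase, strip non-letters, collapse whitespace, trim
trimws(gsub("[[:space:]]+", " ", gsub("[^[:alpha:]]", " ", tolower(" The King, 2021!! "))))
# -> "the king"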
## Build the model: first a quick PV-DM model with a small embedding dimension
model <- paragraph2vec(x = x, type = "PV-DM", dim = 15, iter = 5)
# \donttest{
## A larger PV-DBOW model (overwrites the model above, takes longer to train)
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 100, iter = 20)
# }
str(model)
## Get the embedding of the words and of the documents
embedding <- as.matrix(model, which = "words")
embedding <- as.matrix(model, which = "docs")
head(embedding)
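## Hedged add-on (not in the original example), assuming paragraph2vec_similarity()
## accepts two embedding matrices: the 5 documents closest to the first document
paragraph2vec_similarity(embedding[1, , drop = FALSE], embedding, top_n = 5)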
## Get the vocabulary (document identifiers and words)
vocab <- summary(model, type = "vocabulary", which = "docs")
vocab <- summary(model, type = "vocabulary", which = "words")
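## Hedged add-on, assuming predict.paragraph2vec supports type = "nearest" with
## which = "word2word": words closest to the first word in the vocabulary
predict(model, newdata = vocab[1], type = "nearest", which = "word2word", top_n = 5)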
# \donttest{
## Transfer learning using existing word embeddings
library(word2vec)
w2v <- word2vec(x$text, dim = 50, type = "cbow", iter = 20, min_count = 5)
emb <- as.matrix(w2v)
model <- paragraph2vec(x = x, dim = 50, type = "PV-DM", iter = 20, min_count = 5,
                       embeddings = emb)
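## Hedged sanity checks (assumptions, not from the original example): the pretrained
## dimension should equal `dim`, and the pretrained vocabulary should cover the model words
ncol(emb)
sum(summary(model, type = "vocabulary", which = "words") %in% rownames(emb))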
# }
## Transfer learning - proof of concept without training (iter = 0; increase iter to actually learn)
## Supply random 15-dimensional starting vectors for the words 'en' and 'met'
emb <- matrix(rnorm(30), nrow = 2, dimnames = list(c("en", "met")))
model <- paragraph2vec(x = x, type = "PV-DM", dim = 15, iter = 0, embeddings = emb)
embedding <- as.matrix(model, which = "words", normalize = FALSE)
embedding[c("en", "met"), ]
emb
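## Quick check (assumption: with iter = 0 the supplied vectors are passed through
## unchanged, up to floating point storage precision)
all.equal(embedding[c("en", "met"), ], emb, check.attributes = FALSE, tolerance = 1e-5)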
} # End of main if statement running only if the required packages are installed