library(doc2vec)
library(tokenizers.bpe)
## Take the Dutch texts and standardise them a bit
data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language %in% "dutch")
x <- subset(x, nchar(text) > 0 & txt_count_words(text) < 1000)
x$doc_id <- sprintf("doc_%s", 1:nrow(x))
x$text   <- tolower(x$text)
x$text   <- gsub("[^[:alpha:]]", " ", x$text)    # keep letters only
x$text   <- gsub("[[:space:]]+", " ", x$text)    # collapse repeated whitespace
x$text   <- trimws(x$text)
## Build a small PV-DM model (distributed memory), kept small for speed
model <- paragraph2vec(x = x, type = "PV-DM", dim = 15, iter = 5)
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 100, iter = 20)
sentences <- list(
  example = c("geld", "diabetes"),
  hi      = c("geld", "diabetes", "koning"),
  test    = c("geld"),
  nothing = character(),
  repr    = c("geld", "diabetes", "koning"))
## Get embeddings (type = 'embedding')
predict(model, newdata = c("geld", "koning", "unknownword", NA, "</s>", ""),
type = "embedding", which = "words")
predict(model, newdata = c("doc_1", "doc_10", "unknowndoc", NA, "</s>"),
type = "embedding", which = "docs")
predict(model, sentences, type = "embedding")
## Get most similar items (type = 'nearest')
predict(model, newdata = c("doc_1", "doc_10"), type = "nearest", which = "doc2doc")
predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2doc")
predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2word")
predict(model, newdata = sentences, type = "nearest", which = "sent2doc", top_n = 7)
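## Sketch: 'nearest' queries return one ranked result per input, with
## similarity scores; the exact column layout is assumed here, not guaranteed
nn <- predict(model, newdata = "geld", type = "nearest", which = "word2word", top_n = 5)
str(nn)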
## Alternatively, extract the embeddings yourself and compute the similarities
emb      <- predict(model, sentences, type = "embedding")
emb_docs <- as.matrix(model, which = "docs")
paragraph2vec_similarity(emb, emb_docs, top_n = 3)
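## Sketch: paragraph2vec_similarity works on any two embedding matrices,
## e.g. comparing the sentences above against each other; rows with NA
## embeddings (the empty 'nothing' sentence) will yield NA similarities
paragraph2vec_similarity(emb, emb, top_n = 2)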