starspace_embedding: Get the document or ngram embeddings

Description

Get the document or ngram embeddings

Usage

starspace_embedding(object, x, type = c("document", "ngram"))

Value

a matrix of embeddings

Arguments

object

an object of class textspace as returned by starspace or starspace_load_model

x

character vector with text to get the embeddings

If type is set to 'document', will assume that a tab or a space is used as separator of each element of x.
If type is set to 'ngram', will assume that a space is used as separator of each element of x.

type

the type of embedding requested. Either one of 'document' or 'ngram'. In case of document, the function returns the document embedding, in case of ngram the function returns the embedding of the provided ngram term. See the details section

Details

document embeddings look to the features (e.g. words) present in x and summate the embeddings of these to get a document embedding and divide this embedding by size^p in case dot similarity is used and the euclidean norm in case cosine similarity is used. Where size is the number of features (e.g. words) in x. If p=1, it's equivalent to taking average of embeddings while when p=0, it's equivalent to taking sum of embeddings. You can set p and similarity in starspace when you train the model.
for ngram embeddings, starspace is using a hashing trick to find out in which bucket the ngram lies and then retrieves the embedding of that. Note that if you specify ngram, you need to make sure x contains less features (e.g. words) then you've set ngram when you trained your model with starspace.

Examples

Run this code

data(dekamer, package = "ruimtehol")
dekamer$text <- strsplit(dekamer$question, "\\W")
dekamer$text <- lapply(dekamer$text, FUN = function(x) x[x != ""])
dekamer$text <- sapply(dekamer$text, 
                       FUN = function(x) paste(x, collapse = " "))

set.seed(123456789)
model <- embed_tagspace(x = tolower(dekamer$text), 
                        y = dekamer$question_theme_main, 
                        similarity = "dot",
                        early_stopping = 0.8, ngram = 1, p = 0.5,
                        dim = 10, minCount = 5)
embedding <- starspace_embedding(model, "federale politie", type = "document")
embedding_dictionary <- as.matrix(model)
embedding
colSums(embedding_dictionary[c("federale", "politie"), ]) / 2^0.5

if (FALSE) {
set.seed(123456789)
model <- embed_tagspace(x = tolower(dekamer$text), 
                        y = dekamer$question_theme_main, 
                        similarity = "cosine",
                        early_stopping = 0.8, ngram = 1, 
                        dim = 10, minCount = 5)
embedding <- starspace_embedding(model, "federale politie", type = "document")
embedding_dictionary <- as.matrix(model)
euclidean_norm <- function(x) sqrt(sum(x^2))
manual <- colSums(embedding_dictionary[c("federale", "politie"), ])
manual / euclidean_norm(manual)
embedding

set.seed(123456789)
model <- embed_tagspace(x = tolower(dekamer$text), 
                        y = dekamer$question_theme_main, 
                        similarity = "dot",
                        early_stopping = 0.8, ngram = 3, p = 0,
                        dim = 10, minCount = 5, bucket = 1)
starspace_embedding(model, "federale politie", type = "document")
starspace_embedding(model, "federale politie", type = "ngram")
}

Run the code above in your browser using DataLab