Doc2Vec: Conversion of text documents to word-vector-representation features ( Doc2Vec )

Description

Conversion of text documents to word-vector-representation features ( Doc2Vec )

Usage

# utl <- Doc2Vec$new(token_list = NULL, word_vector_FILE = NULL,
#                    print_every_rows = 10000, verbose = FALSE,
#                    copy_data = FALSE)

Arguments

token_list

either NULL or a list of tokenized text documents

word_vector_FILE

a valid path to a text file, where the word-vectors are saved

print_every_rows

a numeric value greater than 1 specifying the print interval (in rows). Frequent output in the R session can slow down the function, especially for big files.

verbose

either TRUE or FALSE. If TRUE then information will be printed out in the R session.

method

a character string specifying the method to use. One of sum_sqrt, min_max_norm or idf. See the details section for more information.

global_term_weights

either NULL or the output of the global_term_weights method of the textTinyR package. See the details section for more information.

threads

a numeric value specifying the number of cores to run in parallel

copy_data

either TRUE or FALSE. If FALSE then a pointer will be created and no copy of the initial data takes place (memory efficient especially for big datasets). This is an alternative way to pre-process the data.

Value

a matrix

Format

An object of class R6ClassGenerator of length 24.

Methods

Doc2Vec$new(token_list = NULL, word_vector_FILE = NULL, print_every_rows = 10000, verbose = FALSE, copy_data = FALSE)

--------------

doc2vec_methods(method = "sum_sqrt", global_term_weights = NULL, threads = 1)

--------------

pre_processed_wv()

Details

The pre_processed_wv method can be used to inspect the pre-processed word-vectors. It should be called after the Doc2Vec class has been initialized with the copy_data parameter set to TRUE.
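
A minimal sketch of that inspection step, reusing the example file shipped with textTinyR. The names init_copy and res_wv are illustrative, and the structure of the returned object is not described on this page, so str() is used only to look at it. The requireNamespace() guard is there so that the snippet is skipped when the package is not installed.

```r
if (requireNamespace("textTinyR", quietly = TRUE)) {
  library(textTinyR)

  # same toy tokens and word-vector file as in the Examples section
  tok_text <- list(c('the', 'result', 'of'),
                   c('doc2vec', 'are', 'vector', 'features'))
  PATH <- system.file("example_files", "word_vecs.txt", package = "textTinyR")

  # copy_data = TRUE keeps a copy of the pre-processed data in the object
  init_copy <- Doc2Vec$new(token_list = tok_text, word_vector_FILE = PATH,
                           copy_data = TRUE)

  # inspect the pre-processed word-vectors before calling doc2vec_methods()
  res_wv <- init_copy$pre_processed_wv()
  str(res_wv)
}
```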

The global_term_weights method is part of the sparse_term_matrix R6 class of the textTinyR package. The correct global_term_weights can be obtained by using the sparse_term_matrix class with the tf_idf parameter set to FALSE and the normalize parameter set to NULL. In the Doc2Vec class, if method equals idf, the global_term_weights parameter must not be NULL.
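
A rough sketch of how the global term weights might be obtained. The sparse_term_matrix calls and argument names below (vector_data, document_term_matrix, Term_Matrix, split_string, to_lower) are written from memory of the textTinyR documentation and should be checked against the sparse_term_matrix help page; the two input sentences and the names stm and gl_term_w are made up for illustration.

```r
if (requireNamespace("textTinyR", quietly = TRUE)) {
  library(textTinyR)

  # hypothetical raw documents, one character string per document
  docs <- c("the result of doc2vec", "doc2vec returns vector features")

  stm <- sparse_term_matrix$new(vector_data = docs, file_data = NULL,
                                document_term_matrix = TRUE)

  # tf_idf = FALSE and normalize = NULL, as required by the Details section
  tm <- stm$Term_Matrix(sort_terms = FALSE, to_lower = TRUE,
                        split_string = TRUE, tf_idf = FALSE,
                        normalize = NULL)

  # terms together with their global (idf) weights
  gl_term_w <- stm$global_term_weights()
  str(gl_term_w)
}
```

The resulting object would then be passed to doc2vec_methods(method = "idf", global_term_weights = gl_term_w). The tokens given to Doc2Vec should be pre-processed in the same way as the terms of the term matrix, so that the weights can be matched to the tokens.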

Explanation of the various methods :

sum_sqrt: Considering a single sublist of the token list : the word-vectors of all words in the sublist are accumulated into one vector whose length equals the length of a word-vector (INITIAL_WORD_VECTOR). A scalar is then computed from INITIAL_WORD_VECTOR in the following way : each element is raised to the power of 2.0, the squared elements are summed and the square root of the sum is taken. Finally, INITIAL_WORD_VECTOR is divided by this scalar.
min_max_norm: Considering a single sublist of the token list : the word-vector of each word in the sublist is first min-max normalized and the normalized vectors are then accumulated into one vector whose length equals the length of the initial word-vector.
idf: Considering a single sublist of the token list : the word-vector of each term in the sublist is multiplied by the corresponding idf value of the global term weights.
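
The three descriptions above can be sketched in base R on a toy example. This is an illustration of the arithmetic as described, not the package's internal implementation. The word-vectors and idf values are invented, the min-max normalization is assumed to be applied to each word-vector over its own elements, and the idf-weighted vectors are assumed to be accumulated in the same way as in the other two methods.

```r
# toy word-vectors (made-up values): one row per term, 3 dimensions
wv <- rbind(
  the    = c(1, 2, 2),
  result = c(2, 0, 1),
  of     = c(0, 2, 3)
)

doc <- c("the", "result", "of")      # a single sublist of the token list
doc_wv <- wv[doc, , drop = FALSE]    # word-vectors of the tokens in the sublist

# sum_sqrt: accumulate, then divide by the square root of the sum of squares
acc <- colSums(doc_wv)                      # INITIAL_WORD_VECTOR
sum_sqrt_vec <- acc / sqrt(sum(acc^2.0))    # result has unit Euclidean length

# min_max_norm: min-max normalize each word-vector, then accumulate
# (assumes no word-vector is constant, otherwise max - min would be 0)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
min_max_vec <- colSums(t(apply(doc_wv, 1, min_max)))

# idf: multiply each word-vector by the idf weight of its term, then accumulate
idf <- c(the = 0.1, result = 1.5, of = 0.2)   # hypothetical global term weights
idf_vec <- colSums(doc_wv * idf[doc])         # row i is scaled by idf of term i

print(sum_sqrt_vec)
print(min_max_vec)
print(idf_vec)
```

Each computation yields one vector per document, so a token list with n sublists gives an n-row matrix, which matches the matrix return value documented above.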

Examples

# NOT RUN {
library(textTinyR)

#---------------------------------
# tokenized text in form of a list
#---------------------------------

tok_text = list(c('the', 'result', 'of'), c('doc2vec', 'are', 'vector', 'features'))

#-------------------------
# path to the word vectors
#-------------------------

PATH = system.file("example_files", "word_vecs.txt", package = "textTinyR")


init = Doc2Vec$new(token_list = tok_text, word_vector_FILE = PATH)


out = init$doc2vec_methods(method = "sum_sqrt")
# }
