Takes a sequence (a list of word indices) and returns a list of couples
(word_index, other_word_index) and labels (1 or 0), where label = 1 if
other_word belongs to the context of word, and label = 0 if other_word is
randomly sampled.
skipgrams(sequence, vocabulary_size, window_size = 4, negative_samples = 1,
  shuffle = TRUE, categorical = FALSE, sampling_table = NULL,
  seed = NULL)

sequence: A word sequence (sentence), encoded as a list of word indices
(integers). If using a sampling_table, word indices are expected to match
the rank of the words in a reference dataset (e.g. 10 would encode the
10th most frequently occurring token). Note that index 0 is expected to be
a non-word and will be skipped.
vocabulary_size: Integer. Maximum possible word index + 1.
window_size: Integer. Actually the half-window: the window of a word w_i
will be [i - window_size, i + window_size + 1].
negative_samples: Float >= 0. 0 for no negative (i.e. random) samples, 1 for
the same number as positive samples, and so on.
shuffle: Whether to shuffle the word couples before returning them.
categorical: Boolean. If FALSE, labels will be integers (e.g. [0, 1, 1, ...]);
if TRUE, labels will be categorical (e.g. [[1,0],[0,1],[0,1], ...]).
sampling_table: 1D array of size vocabulary_size where entry i
encodes the probability to sample a word of rank i.
seed: Random seed.
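To make the roles of window_size and negative_samples concrete, here is a minimal base-R sketch of the idea. It is not the code behind skipgrams(): the function name sketch_skipgrams is hypothetical, positions are 1-based, shuffling and sampling_table are left out, and negative samples are drawn with replacement as a simplification.

```r
# Illustrative sketch only; `sketch_skipgrams` is a hypothetical helper,
# not part of the package.
sketch_skipgrams <- function(sequence, vocabulary_size, window_size = 4,
                             negative_samples = 1, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  couples <- list()
  labels <- integer(0)
  n <- length(sequence)

  # Positive couples: every word paired with its neighbours in the window.
  for (i in seq_len(n)) {
    wi <- sequence[i]
    if (wi == 0) next                    # index 0 is a non-word: skip it
    lo <- max(1, i - window_size)        # half-window to the left
    hi <- min(n, i + window_size)        # half-window to the right
    for (j in lo:hi) {
      if (j == i) next
      wj <- sequence[j]
      if (wj == 0) next
      couples[[length(couples) + 1]] <- c(wi, wj)
      labels <- c(labels, 1L)            # label 1: true context word
    }
  }

  # Negative couples: a known target word paired with a random word index.
  n_pos <- length(labels)
  n_neg <- as.integer(n_pos * negative_samples)
  for (k in seq_len(n_neg)) {
    wi <- couples[[sample.int(n_pos, 1)]][1]
    wj <- sample.int(vocabulary_size - 1, 1)  # random index in 1..vocabulary_size - 1
    couples[[length(couples) + 1]] <- c(wi, wj)
    labels <- c(labels, 0L)              # label 0: randomly sampled word
  }

  list(couples = couples, labels = labels)
}

res <- sketch_skipgrams(c(1, 2, 3), vocabulary_size = 4, window_size = 1,
                        negative_samples = 1, seed = 42)
length(res$couples)  # 4 positive couples plus 4 negative couples
```

With window_size = 1 the sequence 1, 2, 3 yields the positive couples (1,2), (2,1), (2,3) and (3,2); negative_samples = 1 then adds the same number of random couples.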
A list of couples, labels where:
couples is a list of 2-element integer vectors: [word_index, other_word_index].
labels is an integer vector of 0s and 1s, where 1 indicates that other_word_index
was found in the same window as word_index, and 0 indicates that other_word_index
was randomly sampled.
If categorical is set to TRUE, the labels are categorical, i.e. 1 becomes [0,1]
and 0 becomes [1,0].
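The mapping from integer labels to categorical labels can be illustrated with a short base-R sketch. The helper name to_categorical_labels is hypothetical and only mirrors the rule stated above; it is not part of the package.

```r
# Hypothetical helper, for illustration only: maps 1 to c(0, 1) and 0 to c(1, 0).
to_categorical_labels <- function(labels) {
  lapply(labels, function(l) if (l == 1) c(0L, 1L) else c(1L, 0L))
}

labels <- c(0L, 1L, 1L)
cat_labels <- to_categorical_labels(labels)
# cat_labels[[1]] is c(1, 0); cat_labels[[2]] and cat_labels[[3]] are c(0, 1)
```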
Other text preprocessing: make_sampling_table,
pad_sequences,
text_hashing_trick,
text_one_hot,
text_to_word_sequence