paragraph2vec

a data.frame with columns doc_id and text, or the path to a file on disk containing the training data.
Note that the text column should be of type character, that each text should contain fewer than 1000 words, with space or tab 
used as the word separator, and that the text should not contain newline characters, as these are treated as document delimiters.
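
The input constraints above can be checked before training. The following is a minimal base R sketch, not part of the package: the data frame `raw` and the cut-off `max_words` are hypothetical names chosen for illustration, and truncating long texts is only one possible way to stay under the word limit.

```r
# Hypothetical raw input; 'raw' and 'max_words' are illustrative names
raw <- data.frame(
  doc_id = c("doc1", "doc2"),
  text   = c("First line\nsecond line of the same document",
             "Another\tdocument   with  extra   spaces"),
  stringsAsFactors = FALSE
)
max_words <- 999L  # keeps every text below the 1000-word limit

x <- raw
x$text <- as.character(x$text)
# Newlines are treated as document delimiters, so replace them with spaces
x$text <- gsub("[\r\n]+", " ", x$text)
# Split on space or tab, drop empty tokens, truncate overly long texts
tokens <- strsplit(x$text, "[ \t]+")
tokens <- lapply(tokens, function(w) head(w[nzchar(w)], max_words))
# Rebuild one space-separated string per document
x$text  <- vapply(tokens, paste, character(1), collapse = " ")
n_words <- lengths(tokens)
```

After this step `x` has the doc_id and text columns in the expected shape, and `n_words` holds the word count per document for inspection.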

character string with the type of algorithm to use, one of:<ul>
<li>'PV-DM': Distributed Memory paragraph vectors</li>
<li>'PV-DBOW': Distributed Bag Of Words paragraph vectors</li>
</ul>Defaults to 'PV-DBOW'.

type

dimension of the word and paragraph vectors. Defaults to 50.

maximum skip length between words, i.e. the size of the context window. Defaults to 10 for PV-DM and 5 for PV-DBOW.

window

number of training iterations. Defaults to 20.

iter

initial learning rate, also known as alpha. Defaults to 0.05.

logical indicating whether to use hierarchical softmax instead of negative sampling. Defaults to FALSE, meaning that negative sampling is used.

integer with the number of negative samples. Only used when hs is set to FALSE.

negative

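threshold for the occurrence of words; as in word2vec, words occurring more frequently than this threshold are randomly down-sampled during training. Defaults to 0.001.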

sample

integer indicating the minimum number of times a word should occur to be considered part of the training vocabulary. Defaults to 5.

min_count
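
To illustrate the effect of min_count, here is a small base R sketch of frequency-based vocabulary filtering. It is an approximation for intuition only: the toy `texts` vector is hypothetical, and keeping words with a count greater than or equal to min_count is an assumption about how the underlying C++ code applies the threshold.

```r
# Toy corpus (illustrative only)
texts <- c("the cat sat on the mat",
           "the dog sat on the log",
           "a cat and a dog")
min_count <- 2L

# Split on space or tab and count how often each word occurs
tokens <- unlist(strsplit(texts, "[ \t]+"))
freq   <- table(tokens)

# Assumed rule: keep words occurring at least min_count times
vocab <- names(freq)[freq >= min_count]
vocab
```

Words below the threshold ("mat", "log" and "and" in this toy corpus) would not get a word vector, so raising min_count shrinks the vocabulary.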

number of CPU threads to use. Defaults to 1.

threads

the encoding of <code>x</code> and <code>stopwords</code>. Defaults to 'UTF-8'. 
Training always starts from files on disk, which makes it possible to build a model on large corpora. The encoding argument 
is passed on to <code>file</code> when <code>x</code> is provided as a data.frame and written to disk.

encoding

further arguments passed on to the C++ function <code>paragraph2vec_train</code> - for expert use only

Construct a paragraph2vec model on text. 
The algorithm is explained at <a href="https://arxiv.org/pdf/1405.4053.pdf">https://arxiv.org/pdf/1405.4053.pdf</a>.
People also refer to this model as doc2vec.
The model is an extension to the word2vec algorithm, 
where an additional vector for every paragraph is added directly in the training.
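
A minimal usage sketch, assuming the doc2vec package is installed. Only argument names that appear in this documentation (type, window, iter, min_count, threads) are passed explicitly. The toy data frame and the low min_count are illustrative choices, and the `as.matrix()` accessor with `which = "docs"` is assumed here rather than documented on this page, so check it against the installed version.

```r
library(doc2vec)

# Toy corpus (illustrative only); real corpora should be far larger
x <- data.frame(
  doc_id = c("d1", "d2", "d3", "d4"),
  text   = c("the cat sat on the mat",
             "the dog sat on the log",
             "cats and dogs are pets",
             "the dog chased the cat"),
  stringsAsFactors = FALSE
)

# min_count = 1 keeps every word of this tiny corpus in the vocabulary
model <- paragraph2vec(x, type = "PV-DBOW", window = 5, iter = 20,
                       min_count = 1, threads = 1)

# Assumed accessor: one embedding vector per document, as a numeric matrix
emb <- as.matrix(model, which = "docs")
dim(emb)
```

On a corpus this small the embeddings carry little meaning; the point is only to show the shape of the call and of the result.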

Learn vector representations of sentences, paragraphs or documents by using the 'Paragraph Vector' algorithms,
namely the distributed bag of words ('PV-DBOW') and the distributed memory ('PV-DM') model.
The techniques in the package are detailed in the paper "Distributed Representations of Sentences and Documents" by Le and Mikolov (2014), available at <arXiv:1405.4053>.

Jan Wijffels

doc2vec

Distributed Representations of Sentences and Documents

BNOSAC 

hiyijian 

paragraph2vec function

paragraph2vec: Train a paragraph2vec (also known as doc2vec) model on text
