text2vec (version 0.4.0)

vectorizers: Vocabulary and hash vectorizers

Description

These functions create a text vectorizer function, which is used when constructing a dtm/tcm/corpus.

Usage

vocab_vectorizer(vocabulary, grow_dtm = TRUE, skip_grams_window = 0L)

hash_vectorizer(hash_size = 2^18, ngram = c(1L, 1L), signed_hash = FALSE, grow_dtm = TRUE, skip_grams_window = 0L)

Arguments

vocabulary

text2vec_vocabulary object, see create_vocabulary.

grow_dtm

logical. Whether to grow the document-term matrix during corpus construction.

skip_grams_window

integer. Window size for term-co-occurrence matrix construction. skip_grams_window should be > 0 if you plan to use the vectorizer in the create_tcm function; a value of 0L means the TCM is not constructed.

hash_size

integer. The number of hash buckets for the feature-hashing trick. The number must be greater than 0 and should preferably be a power of 2.

ngram

integer vector. The lower and upper boundaries of the range of n-values for the n-grams to be extracted. All values of n such that ngram_min <= n <= ngram_max will be used.

signed_hash

logical. Whether to use a signed hash function to reduce collisions when hashing.

Value

A vectorizer function.

See Also

create_dtm, create_tcm, create_vocabulary, create_corpus

Examples

data("movie_review")
N = 100
# hash vectorizer with 2^18 buckets, extracting unigrams and bigrams
vectorizer = hash_vectorizer(2 ^ 18, c(1L, 2L))
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer, chunks_number = 10)
corpus = create_corpus(it, vectorizer)
hash_dtm = get_dtm(corpus)

# rebuild the iterator (itoken iterators are consumed after one pass)
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer, chunks_number = 10)
v = create_vocabulary(it, ngram = c(1L, 1L))

vectorizer = vocab_vectorizer(v)

it = itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer, chunks_number = 10)

corpus = create_corpus(it, vectorizer)
vocab_dtm = get_dtm(corpus)
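Setting skip_grams_window > 0 switches the vectorizer to term-co-occurrence counting, as described in the Arguments section. A minimal sketch, assuming the same movie_review data and the same argument names (preprocess_function, chunks_number) used in the examples above; the window size of 5L is an arbitrary illustrative choice:

```r
# sketch: build a term-co-occurrence matrix (TCM) instead of a DTM
library(text2vec)
data("movie_review")
N = 100

it = itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer, chunks_number = 10)
v = create_vocabulary(it)

# skip_grams_window > 0 enables co-occurrence counting within a 5-term window;
# grow_dtm = FALSE skips DTM construction entirely
tcm_vectorizer = vocab_vectorizer(v, grow_dtm = FALSE, skip_grams_window = 5L)

it = itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer, chunks_number = 10)
tcm = create_tcm(it, tcm_vectorizer)  # square matrix, one row/column per term
```

A TCM built this way is the usual input for word-embedding models such as GloVe, whereas the DTM examples above feed document-level models.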
