vectorizers


Vocabulary and hash vectorizers

These functions create a text vectorizer function, which is used when constructing a corpus.

Usage
vocab_vectorizer(vocabulary, grow_dtm = TRUE, skip_grams_window = 0L)

hash_vectorizer(hash_size = 2^18, ngram = c(1L, 1L), signed_hash = FALSE,
  grow_dtm = TRUE, skip_grams_window = 0L)
Arguments
vocabulary
text2vec_vocabulary object, see create_vocabulary.
grow_dtm
logical. Whether to grow the document-term matrix during corpus construction.
skip_grams_window
integer. Window size for term co-occurrence matrix (TCM) construction. A value of 0L means no TCM is constructed (see the sketch after this arguments list).
hash_size
integer. The number of hash buckets for the feature hashing trick. Must be greater than 0, and preferably a power of 2.
ngram
integer vector. The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that ngram_min <= n <= ngram_max will be used.
signed_hash
logical, indicating whether to use a signed hash function to reduce collisions when hashing.
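
A minimal sketch of using a non-zero skip_grams_window to collect term co-occurrence statistics, assuming a get_tcm() corpus accessor analogous to the get_dtm() call used in the Examples (get_tcm() is not documented on this page):

library(text2vec)
data("movie_review")
it <- itoken(movie_review$review[1:100], preprocess_function = tolower,
             tokenizer = word_tokenizer, chunks_number = 10)
v <- create_vocabulary(it, c(1L, 1L))
# grow_dtm = FALSE: skip the DTM; skip_grams_window = 5L: count co-occurrences
# within a 5-token context window
vectorizer <- vocab_vectorizer(v, grow_dtm = FALSE, skip_grams_window = 5L)
# iterators are single-pass, so recreate before building the corpus
it <- itoken(movie_review$review[1:100], preprocess_function = tolower,
             tokenizer = word_tokenizer, chunks_number = 10)
corpus <- create_corpus(it, vectorizer)
tcm <- get_tcm(corpus)  # assumed accessor; returns the term co-occurrence matrix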
Value

A vectorizer function

See Also

create_corpus, create_dtm, create_tcm, create_vocabulary

Aliases
  • hash_vectorizer
  • vectorizers
  • vocab_vectorizer
Examples
data("movie_review")
N <- 100
vectorizer <- hash_vectorizer(hash_size = 2 ^ 18, ngram = c(1L, 2L))
it <- itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer, chunks_number = 10)
corpus <- create_corpus(it, vectorizer)
hash_dtm <- get_dtm(corpus)

# iterators are single-pass, so create a fresh one for the vocabulary
it <- itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer, chunks_number = 10)
v <- create_vocabulary(it, c(1L, 1L))

vectorizer <- vocab_vectorizer(v)

it <- itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer, chunks_number = 10)

corpus <- create_corpus(it, vectorizer)
vocab_dtm <- get_dtm(corpus)
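
A quick sanity check on the two matrices built above (a sketch; the vocabulary size, and hence the second dimension of vocab_dtm, depends on the data):

dim(hash_dtm)   # 100 rows (documents) x 2^18 columns (hash buckets)
dim(vocab_dtm)  # 100 rows (documents) x one column per vocabulary term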
Documentation reproduced from package text2vec, version 0.3.0, License: MIT + file LICENSE
