
mlvocab (version 0.0.1)

vocab: Build and manipulate vocabularies

Description

vocab() creates a vocabulary from a text corpus; vocab_update() and vocab_prune(), respectively, update and prune an existing vocabulary.

vocab_embed() subsets a (commonly large) pre-trained word-vector matrix into a smaller embedding matrix with one vector per vocabulary term.

vocab_embed() is commonly used in conjunction with the sequence generators timat() and tiseq(). When a term in a corpus is not present in the vocabulary (aka unknown), it is hashed into one of nbuckets buckets. Embeddings hashed into the same bucket are averaged to produce the embedding for that bucket. The maximum number of embeddings to average per bucket is controlled with the max_in_bucket parameter.

Similarly, when a term from the vocabulary is not present in the embedding matrix (aka missing), max_in_bucket embeddings are averaged to produce that term's embedding. Different buckets are used for "missing" and "unknown" embeddings because nbuckets can be 0.
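
For intuition, the bucket-averaging step can be sketched in a few lines of base R. This is an illustration only, not mlvocab's internal code; hash_term() below is a hypothetical stand-in for the package's internal hashing function.

## illustration only; hash_term() is a hypothetical stand-in for the
## hash used internally by mlvocab
hash_term <- function(term, nbuckets) nchar(term) %% nbuckets + 1L
unknowns  <- c("apples", "oranges", "pears")
embs      <- matrix(rnorm(3 * 4), nrow = 3, dimnames = list(unknowns, NULL))
buckets   <- vapply(unknowns, hash_term, integer(1), nbuckets = 2L)
buckets   # apples -> bucket 1; oranges and pears -> bucket 2
## the embedding for bucket 2 is the average of the vectors hashed into it
colMeans(embs[buckets == 2L, , drop = FALSE])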

Usage

vocab(corpus, ngram = c(1, 1), ngram_sep = "_")

vocab_update(vocab, corpus)

vocab_prune(vocab, max_terms = Inf, term_count_min = 1L,
  term_count_max = Inf, doc_proportion_min = 0,
  doc_proportion_max = 1, doc_count_min = 1L,
  doc_count_max = Inf, nbuckets = attr(vocab, "nbuckets"))

vocab_embed(vocab, embeddings, nbuckets = attr(vocab, "nbuckets"), max_in_bucket = 30)

Arguments

corpus

list of character vectors

ngram

a vector of length 2 of the form c(min_ngram, max_ngram), or a singleton max_ngram, which is equivalent to c(1L, max_ngram).

ngram_sep

separator to link terms within ngrams.

vocab

data.frame obtained from a call to vocab().

max_terms

maximum number of terms to preserve

term_count_min

keep terms occurring at least this many times over all docs

term_count_max

keep terms occurring at most this many times over all docs

doc_count_min, doc_proportion_min

keep terms appearing in at least this many documents (doc_count_min) or in at least this proportion of all documents (doc_proportion_min)

doc_count_max, doc_proportion_max

keep terms appearing in at most this many documents (doc_count_max) or in at most this proportion of all documents (doc_proportion_max)

nbuckets

How many unknown buckets to create alongside the remaining terms of the pruned vocab. All pruned terms will be hashed into this many buckets, and the corresponding statistics (term_count and doc_count) updated.

embeddings

embeddings matrix. The terms dimension must be named. If both colnames() and rownames() are non-NULL, the dimension with more elements is taken to be the term dimension.

max_in_bucket

At most this many embedding vectors will be averaged into each unknown or missing bucket (see Description). A lower number results in faster processing. For large nbuckets this cap might not be reached because the embeddings vocabulary is finite; some buckets may even receive no embeddings at all, resulting in all-zero [0 0 ...] vectors for those buckets. See the examples below.

Examples

# NOT RUN {
corpus <-
   list(a = c("The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"), 
        b = c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog",
              "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"))

vocab(corpus)
vocab(corpus, ngram = 3)
vocab(corpus, ngram = c(2, 3))
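# a singleton ngram is equivalent to c(1, 2); ngram_sep links terms
# within each ngram
vocab(corpus, ngram = 2, ngram_sep = " ")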

v <- vocab(corpus)

extra_corpus <- list(extras = c("apples", "oranges"))
v <- vocab_update(v, extra_corpus)
v

vocab_prune(v, max_terms = 7)
vocab_prune(v, term_count_min = 2)
vocab_prune(v, max_terms = 7, nbuckets = 2)
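# prune by document proportion: keep terms appearing in at most half
# of the documents
vocab_prune(v, doc_proportion_max = 0.5)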

v2 <- vocab_prune(v, max_terms = 7, nbuckets = 2)
enames <- c("the", "quick", "brown", "fox", "jumps")
emat <- matrix(rnorm(50), nrow = 5,
               dimnames = list(enames, NULL))

vocab_embed(v2, emat)
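# average at most one embedding vector into each unknown or missing bucket
vocab_embed(v2, emat, max_in_bucket = 1)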
vocab_embed(v2, t(emat)) # automatic detection of the orientation

vembs <- vocab_embed(v2, emat)
all(vembs[enames, ] == emat[enames, ])
# }
