term_matrix: Term Frequency Matrix

Description

Tokenize a set of texts and compute a term frequency matrix, with one column for each term.

Usage

term_matrix(x, filter = token_filter(), weights = NULL,
                ngrams = NULL, select = NULL, group = NULL)

Arguments

x

a text vector to tokenize.

filter

a token filter specifying the tokenization rules.

weights

a numeric vector the same length as x assigning weights to each text, or NULL for unit weights.

ngrams

an integer vector of n-gram lengths to include, or NULL to use the select argument to determine the n-gram lengths.

select

a character vector of terms to count, or NULL to count all terms that appear in x.

group

if non-NULL, a factor, character string, or integer vector the same length as x specifying the grouping behavior.

Value

A sparse matrix in "dgCMatrix" format with one column for each term and one row for each input text or (if group is non-NULL) one row for each grouping level.

If select is non-NULL, then the column names will be equal to select. Otherwise, the columns are assigned in arbitrary order.

Details

term_matrix tokenizes a set of texts and computes the occurrence counts for each term. If weights is non-NULL, then each token in text i increments the count for the corresponding terms by weights[i]; otherwise, each appearance increments the count by one.
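
The weighting rule above can be illustrated with a minimal base R sketch. It does not call term_matrix and is not the package's implementation: the helpers simple_tokens and weighted_counts are hypothetical, and the tokenizer (lowercase, split on non-letters) is an assumption that is much cruder than a real token_filter. It also returns a dense matrix rather than a sparse one.

```r
# Hypothetical helper: crude stand-in for the package's tokenizer.
simple_tokens <- function(x) {
    lapply(strsplit(tolower(x), "[^a-z]+"), function(tok) tok[nzchar(tok)])
}

# Hypothetical helper: each token in text i adds weights[i] to its term's count.
weighted_counts <- function(x, weights = NULL) {
    if (is.null(weights)) weights <- rep(1, length(x))  # unit weights
    stopifnot(length(weights) == length(x))
    toks <- simple_tokens(x)
    terms <- sort(unique(unlist(toks)))
    m <- matrix(0, nrow = length(x), ncol = length(terms),
                dimnames = list(NULL, terms))
    for (i in seq_along(toks)) {
        counts <- table(toks[[i]])
        m[i, names(counts)] <- as.numeric(counts) * weights[i]
    }
    m
}

text <- c("A rose is a rose.", "A violet is blue!")
weighted_counts(text)                     # each appearance counts once
weighted_counts(text, weights = c(1, 2))  # counts in row 2 are doubled
```

With unit weights, "rose" is counted twice in the first text; with weights = c(1, 2), every term in the second text contributes 2 per appearance.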

If ngrams is non-NULL, then multi-type n-grams are included in the output for all lengths appearing in the ngrams argument. If ngrams is NULL but select is non-NULL, then all n-grams appearing in the select set are included. If both ngrams and select are NULL, then only unigrams (single-type terms) are included.
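
The precedence between ngrams and select described above can be sketched as a small decision function. The helper ngram_lengths is hypothetical and not part of the package; it also assumes that the types within a selected term are separated by single spaces, as in "a rose".

```r
# Hypothetical helper: which n-gram lengths are counted, following the
# rules stated above. Assumes space-separated types in `select`.
ngram_lengths <- function(ngrams = NULL, select = NULL) {
    if (!is.null(ngrams)) {
        # explicit lengths take precedence
        return(sort(unique(as.integer(ngrams))))
    }
    if (!is.null(select)) {
        # otherwise use the lengths of the selected terms
        n_types <- lengths(strsplit(select, " ", fixed = TRUE))
        return(sort(unique(n_types)))
    }
    1L  # default: unigrams only
}

ngram_lengths(ngrams = 1:3)
ngram_lengths(select = c("a rose", "a violet", "sweet"))
ngram_lengths()
```

The three calls correspond to the three cases in the paragraph: explicit lengths, lengths implied by the selected terms, and the unigram default.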

If group is NULL, then the output has one row for each input text. Otherwise, we convert group to a factor and compute one row for each level. Texts with NA values for group get skipped.
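
As a rough illustration of the grouping step, the following base R sketch collapses the rows of a dense per-text count matrix into one row per factor level and skips texts whose group is NA. The helper collapse_by_group and the toy matrix are hypothetical; the package itself returns a sparse matrix and may compute the result differently.

```r
# Hypothetical helper: sum the rows of a per-text count matrix by group level.
collapse_by_group <- function(m, group) {
    g <- as.factor(group)
    stopifnot(length(g) == nrow(m))
    out <- matrix(0, nrow = nlevels(g), ncol = ncol(m),
                  dimnames = list(levels(g), colnames(m)))
    for (i in which(!is.na(g))) {           # texts with NA group are skipped
        lev <- as.character(g[i])
        out[lev, ] <- out[lev, ] + m[i, ]
    }
    out
}

# Toy counts for four texts and two terms (made-up numbers).
m <- matrix(c(3, 1, 1, 5,
              0, 1, 0, 5),
            nrow = 4, dimnames = list(NULL, c("rose", "red")))
collapse_by_group(m, c("Good", "Bad", "Good", NA))
```

Here the "Good" row is the sum of the first and third texts, the "Bad" row is the second text, and the fourth text contributes nothing because its group is NA.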

Examples

    text <- c("A rose is a rose is a rose.",
              "A Rose is red, a violet is blue!",
              "A rose by any other name would smell as sweet.")
    term_matrix(text)

    # select certain terms
    term_matrix(text, select = c("rose", "red", "violet", "sweet"))

    # specify a grouping factor
    term_matrix(text, group = c("Good", "Bad", "Good"))

    # weight the texts
    term_matrix(text, weights = c(1, 2, 10),
                group = c("Good", "Bad", "Good"))

    # include higher-order n-grams
    term_matrix(text, ngrams = 1:3)

    # select certain multi-type terms
    term_matrix(text, select = c("a rose", "a violet", "sweet", "smell"))