term_matrix: Term Frequency Matrix

Description

Tokenize a set of texts and compute a term frequency matrix, with one column for each term.

Usage

term_matrix(x, filter = text_filter(), weights = NULL, group = NULL)

Arguments

x

a text vector to tokenize.

filter

a text_filter specifying the tokenization rules.

weights

a numeric vector the same length as x assigning a weight to each text, or NULL for unit weights.

group

if non-NULL, a factor, character vector, or integer vector the same length as x specifying the grouping behavior.

Value

A sparse matrix in "dgCMatrix" format with one column for each term and one row for each input text or (if group is non-NULL) one row for each grouping level.

If filter$select is non-NULL, then the column names will be equal to filter$select. Otherwise, the columns are assigned in arbitrary order.

Details

term_matrix tokenizes a set of texts and computes the occurrence counts for each term. If weights is non-NULL, then each token in text i increments the count for the corresponding term by weights[i]; otherwise, each appearance increments the count by one.
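
The weighting rule can be illustrated with a rough base R sketch. This is not how term_matrix is implemented: the tokenizer below is a crude assumption (lowercase, split on non-letters) that ignores the text_filter rules, and the variable names texts, tokens, and counts are made up for the illustration.

```r
# Illustrative only: a crude tokenizer, not the text_filter rules
texts   <- c("a rose is a rose", "a violet is blue")
weights <- c(1, 2)

# Lowercase, then split on runs of non-letters
tokens <- strsplit(tolower(texts), "[^a-z]+")
terms  <- sort(unique(unlist(tokens)))

# One row per text, one column per term
counts <- matrix(0, nrow = length(texts), ncol = length(terms),
                 dimnames = list(NULL, terms))
for (i in seq_along(texts)) {
    for (tok in tokens[[i]]) {
        # each token in text i adds weights[i] to its term's count
        counts[i, tok] <- counts[i, tok] + weights[i]
    }
}
counts
```

With unit weights, every increment would be one, giving plain occurrence counts.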

If group is NULL, then the output has one row for each input text. Otherwise, we convert group to a factor and compute one row for each level. Texts with NA values for group get skipped.
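
The grouping rule can be sketched in base R as well. The per-text counts below are hypothetical values chosen for the illustration, and rowsum stands in for whatever aggregation term_matrix performs internally; the sketch only mirrors the described behavior (one row per factor level, texts with NA group skipped).

```r
# Hypothetical per-text counts (rows = texts, columns = terms)
counts <- matrix(c(3, 0, 0,
                   1, 1, 1,
                   1, 0, 0,
                   0, 2, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(NULL, c("rose", "red", "violet")))
group <- c("Good", "Bad", "Good", NA)

g    <- factor(group)   # NA does not become a level
keep <- !is.na(g)       # texts with an NA group are skipped

# Sum the rows of the remaining texts within each level
grouped <- rowsum(counts[keep, , drop = FALSE], g[keep])
grouped
```

The fourth text contributes to no row because its group is NA.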

Examples

    text <- c("A rose is a rose is a rose.",
              "A Rose is red, a violet is blue!",
              "A rose by any other name would smell as sweet.")
    term_matrix(text)

    # select certain terms
    f <- text_filter(select = c("rose", "red", "violet", "sweet"))
    term_matrix(text, f)

    # specify a grouping factor
    term_matrix(text, f, group = c("Good", "Bad", "Good"))

    # weight the texts
    term_matrix(text, f, weights = c(1, 2, 10),
                group = c("Good", "Bad", "Good"))