corpus (version 0.6.0)

term_counts: Term Frequencies

Description

Tokenize a set of texts and tabulate the term occurrence frequencies.

Usage

term_counts(x, filter = token_filter(), weights = NULL,
                ngrams = NULL, min = NA, max = NA, limit = NA,
                types = FALSE)

Arguments

x

a text vector to tokenize.

filter

a token filter specifying the tokenization rules.

weights

a numeric vector of the same length as x assigning a weight to each text, or NULL for unit weights.

ngrams

an integer vector of n-gram lengths to include, or NULL for length-1 n-grams only.

min

a numeric scalar giving the minimum term count to include in the output, or NA for no minimum count.

max

a numeric scalar giving the maximum term count to include in the output, or NA for no maximum count.

limit

an integer scalar giving the maximum number of terms to include in the output, or NA for no maximum number of terms.

types

a logical value indicating whether to include columns for the types that make up the terms.

Value

A data frame with columns named term and count, with one row for each term that appears. Rows are sorted in descending order of count; ties are broken lexicographically by term, using the character ordering determined by the current locale (see Comparison for details).

If types = TRUE, then the result also includes columns named type1, type2, etc. for the types that make up the term.
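For instance, the extra type columns can be inspected with a short sketch (assuming the corpus package is attached; the exact counts are not shown here):

    # with ngrams = 2 and types = TRUE, each bigram term is split into
    # its component types in columns type1 and type2
    term_counts("A rose is a rose is a rose.", ngrams = 2, types = TRUE)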

Details

term_counts tokenizes a set of texts and computes the occurrence counts for each term. If weights is non-NULL, then each token in text i increments the count for the corresponding term by weights[i]; otherwise, each appearance increments the count by one.

To include multi-type terms, specify the desired term lengths using the ngrams argument.
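The weighting rule can be checked directly against the description above (a sketch assuming the corpus package is attached; the expected count follows from the stated rule, not from a verified run):

    # "rose" occurs twice in text 1 and once in text 2; with
    # weights = c(2, 1), its weighted count should be 2*2 + 1*1 = 5
    term_counts(c("a rose is a rose", "a rose"), weights = c(2, 1))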

See Also

tokens, term_matrix.

Examples

    term_counts("A rose is a rose is a rose.")

    # remove punctuation and stop words
    term_counts("A rose is a rose is a rose.",
                token_filter(drop_symbol = TRUE, drop = stopwords("english")))

    # weight the texts
    term_counts(c("A rose is a rose is a rose.",
                  "A Rose is red, a violet is blue!"),
                weights = c(100, 1))

    # unigrams, bigrams, and trigrams
    term_counts("A rose is a rose is a rose.", ngrams = 1:3)

    # also include the type information
    term_counts("A rose is a rose is a rose.", ngrams = 1:3, types = TRUE)