Tokenize a set of texts and tabulate the term occurrence frequencies.
term_counts(x, filter = text_filter(x), weights = NULL,
ngrams = NULL, min_count = NULL, max_count = NULL,
min_support = NULL, max_support = NULL, types = FALSE)
x: a text vector to tokenize.

filter: a token filter specifying the tokenization rules.

weights: a numeric vector the same length as x assigning weights to each text, or NULL for unit weights.

ngrams: an integer vector of n-gram lengths to include, or NULL for length-1 n-grams only.

min_count: a numeric scalar giving the minimum term count to include in the output, or NULL for no minimum count.

max_count: a numeric scalar giving the maximum term count to include in the output, or NULL for no maximum count.

min_support: a numeric scalar giving the minimum term support to include in the output, or NULL for no minimum support.

max_support: a numeric scalar giving the maximum term support to include in the output, or NULL for no maximum support.

types: a logical value indicating whether to include columns for the types that make up the terms.
term_counts returns a data frame with columns named term, count, and support, with one row for each appearing term. Rows are sorted in descending order according to support and then count, with ties broken lexicographically by term, using the character ordering determined by the current locale (see Comparison for details).

If types = TRUE, then the result also includes columns named type1, type2, etc. for the types that make up the term.
term_counts tokenizes a set of texts and computes the occurrence counts and supports for each term. The 'count' is the number of occurrences of the term across all texts; the 'support' is the number of texts containing the term. If weights is non-NULL, then each token in text i increments the count for the corresponding term by weights[i]; otherwise, each appearance increments the count by one. Likewise, for non-NULL weights, an appearance of a term in text i increments its support by weights[i] (once, not for each occurrence in the text).
To include multi-type terms, specify the desired term lengths using the ngrams argument.
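The count/support semantics above can be sketched as follows. This is an illustrative re-implementation in Python, not the corpus package's code; the whitespace tokenizer and the function name term_counts_sketch are stand-ins for illustration only.

```python
from collections import defaultdict

def term_counts_sketch(texts, weights=None):
    """Sketch of the semantics: 'count' adds the text's weight for every
    occurrence of a term; 'support' adds it once per text containing it."""
    if weights is None:
        weights = [1] * len(texts)  # unit weights when weights is NULL
    count = defaultdict(float)
    support = defaultdict(float)
    for text, w in zip(texts, weights):
        tokens = text.lower().split()  # crude tokenizer, for illustration only
        for tok in tokens:
            count[tok] += w            # every occurrence adds the weight
        for tok in set(tokens):
            support[tok] += w          # each containing text adds it once
    return count, support

count, support = term_counts_sketch(
    ["a rose is a rose", "a violet is blue"], weights=[2, 1])
print(count["rose"], support["rose"])  # 4.0 2.0
print(count["a"], support["a"])        # 5.0 3.0
```

Note that "rose" occurs twice in the first text (weight 2), so its count is 4 but its support, which counts each text only once, is 2.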
term_counts("A rose is a rose is a rose.")
# remove punctuation and stop words
term_counts("A rose is a rose is a rose.",
text_filter(drop_symbol = TRUE, drop = stopwords("english")))
# weight the texts
term_counts(c("A rose is a rose is a rose.",
"A Rose is red, a violet is blue!"),
weights = c(100, 1))
# unigrams, bigrams, and trigrams
term_counts("A rose is a rose is a rose.", ngrams = 1:3)
# also include the type information
term_counts("A rose is a rose is a rose.", ngrams = 1:3, types = TRUE)