Tokenize a set of texts and compute a term frequency matrix or data frame.
term_matrix(x, filter = text_filter(x), weights = NULL,
            ngrams = NULL, select = NULL, group = NULL,
            transpose = FALSE)

term_frame(x, filter = text_filter(x), weights = NULL,
           ngrams = NULL, select = NULL, group = NULL)
x: a text vector to tokenize.
filter: a token filter specifying the tokenization rules.
weights: a numeric vector the same length as x assigning weights to each text, or NULL for unit weights.
ngrams: an integer vector of n-gram lengths to include, or NULL to use the select argument to determine the n-gram lengths.
select: a character vector of terms to count, or NULL to count all terms that appear in x.
group: if non-NULL, a factor, character string, or integer vector the same length as x specifying the grouping behavior.
transpose: a logical value indicating whether to transpose the result, putting terms as rows instead of columns.
term_matrix with transpose = FALSE returns a sparse matrix in "dgCMatrix" format with one column for each term and one row for each input text or (if group is non-NULL) for each grouping level. If select is non-NULL, then the column names will be equal to select. Otherwise, the columns are assigned in arbitrary order.
term_matrix with transpose = TRUE returns the transpose of the term matrix, in "dgCMatrix" format.
term_frame with group = NULL returns a data frame with one row for each entry of the term matrix, and columns "text", "term", and "count" giving the text ID, term, and count. The "term" column is a character vector. The "text" column is a factor with levels equal to names(as_text(x)); calling as.integer on the "text" column converts from the factor values to the integer row index in the term matrix.
term_frame with group non-NULL behaves similarly, but the result instead has columns named "group", "term", and "count", with "group" giving the grouping level, as a factor.
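The relationship between the "text" factor and the row index of the term matrix can be illustrated with a hand-built data frame. This is a minimal sketch in base R: the frame object below is constructed manually to mimic the shape of a term_frame result (the names text_names and frame are hypothetical), and it is not actual output from the function.

```r
# Hypothetical text names, standing in for names(as_text(x)).
text_names <- c("a", "b", "c")

# Hand-built data frame mimicking the columns described above.
# Text "b" contributes no rows here, but it keeps its factor level.
frame <- data.frame(
  text  = factor(c("a", "a", "c"), levels = text_names),
  term  = c("one", "sentence", "!!"),
  count = c(1, 1, 1),
  stringsAsFactors = FALSE
)

# Converting the factor to integer gives positions in text_names,
# which is what the row index of the term matrix would be.
row_index <- as.integer(frame$text)
row_index
```

Because the factor levels follow the order of the input names, the integer codes stay aligned with the matrix rows even when some texts contribute no rows to the data frame.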
term_matrix tokenizes a set of texts and computes the occurrence counts for each term. If weights is non-NULL, then each token in text i increments the count for the corresponding terms by weights[i]; otherwise, each appearance increments the count by one.
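The weighting rule above can be sketched in base R. This is only an illustration of the counting rule, not the package's implementation: toy_tokens is a hypothetical, simplified stand-in for the tokenizer (it lowercases, strips punctuation, and splits on whitespace), whereas the real tokenization follows the rules in filter.

```r
# Hypothetical simplified tokenizer (an assumption for illustration only):
# lowercase, remove punctuation, split on whitespace.
toy_tokens <- function(x) {
  strsplit(gsub("[[:punct:]]", "", tolower(x)), "[[:space:]]+")
}

text <- c("A rose is a rose is a rose.",
          "A Rose is red, a violet is blue!")
weights <- c(1, 2)

tokens <- toy_tokens(text)
vocab <- sort(unique(unlist(tokens)))

# One row per text, one column per term.
counts <- matrix(0, nrow = length(text), ncol = length(vocab),
                 dimnames = list(NULL, vocab))

# Each token in text i increments its term's count by weights[i].
for (i in seq_along(tokens)) {
  for (tok in tokens[[i]]) {
    counts[i, tok] <- counts[i, tok] + weights[i]
  }
}
counts
```

With unit weights the same loop reduces to plain occurrence counting, which corresponds to the weights = NULL case.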
If ngrams is non-NULL, then multi-type n-grams are included in the output for all lengths appearing in the ngrams argument. If ngrams is NULL but select is non-NULL, then all n-grams appearing in the select set are included. If both ngrams and select are NULL, then only unigrams (single type terms) are included.
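To make the notion of n-gram lengths concrete, here is a small base R sketch that enumerates n-grams from an already tokenized text. The helper toy_ngrams is hypothetical and is not part of the package; it assumes that a multi-type term is written as its types joined by single spaces, as in the select example "a rose".

```r
# Hypothetical helper: list all n-grams of the requested lengths.
toy_ngrams <- function(tokens, ngrams = 1L) {
  out <- character()
  for (n in ngrams) {
    if (length(tokens) < n) next  # text too short for this length
    starts <- seq_len(length(tokens) - n + 1L)
    grams <- vapply(
      starts,
      function(s) paste(tokens[s:(s + n - 1L)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

tokens <- c("a", "rose", "is", "a", "rose")

# Unigrams and bigrams together, as with ngrams = 1:2.
grams <- toy_ngrams(tokens, ngrams = 1:2)
table(grams)
```

Calling the helper with ngrams = 1 yields only the unigrams, which matches the default described above when both ngrams and select are NULL.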
If group is NULL, then the output has one set of term counts for each input text. Otherwise, we convert group to a factor and compute one set of term counts for each level. Texts with NA values for group get skipped.
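The grouping step can be pictured as summing per-text rows within each factor level. The sketch below uses base R's rowsum on a small hand-written count matrix; the counts are made up for illustration, and this is an assumption about how to mimic the behavior, not a description of the package's internals.

```r
# Made-up per-text term counts (3 texts, 2 terms), filled column by column.
counts <- matrix(c(3, 1, 1,
                   0, 1, 0),
                 nrow = 3,
                 dimnames = list(NULL, c("rose", "red")))

# The third text has no group, so it is skipped.
group <- c("Good", "Bad", NA)
keep <- !is.na(group)

# Convert to a factor and sum the rows within each level.
grouped <- rowsum(counts[keep, , drop = FALSE],
                  group = factor(group[keep]))
grouped
```

The result has one row per grouping level, and the text with an NA group contributes to neither row.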
# NOT RUN {
text <- c("A rose is a rose is a rose.",
          "A Rose is red, a violet is blue!",
          "A rose by any other name would smell as sweet.")
term_matrix(text)
# select certain terms
term_matrix(text, select = c("rose", "red", "violet", "sweet"))
# specify a grouping factor
term_matrix(text, group = c("Good", "Bad", "Good"))
# weight the texts
term_matrix(text, weights = c(1, 2, 10),
            group = c("Good", "Bad", "Good"))
# include higher-order n-grams
term_matrix(text, ngrams = 1:3)
# select certain multi-type terms
term_matrix(text, select = c("a rose", "a violet", "sweet", "smell"))
# transpose the result
term_matrix(text, ngrams = 1:2, transpose = TRUE)[1:10, ] # first 10 rows
# data frame
head(term_frame(text), n = 10) # first 10 rows
# with grouping
term_frame(text, group = c("Good", "Bad", "Good"))
# taking names from the input
term_frame(c(a = "One sentence.", b = "Another", c = "!!"))
# }