Generate a term frequency vector from a text document.
termFreq(doc, control = list())
A table of class c("term_frequency", "integer") with term frequencies as values and tokens as names.
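As a minimal base-R sketch of the returned shape (using a plain table rather than tm itself), the result behaves like a named integer table:

```r
# A term-frequency result is essentially a named integer table:
# token names mapped to counts, accessible by name.
tokens <- c("oil", "prices", "rose", "oil")
tf <- table(tokens)
tf[["oil"]]   # count of the token "oil"
```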
doc
An object inheriting from TextDocument or a character vector.
control
A list of control options which override default settings.
First, the following two options are processed.
tokenize
A function tokenizing a TextDocument into single tokens, a Span_Tokenizer, a Token_Tokenizer, or a string matching one of the predefined tokenization functions: "Boost" for Boost_tokenizer, "MC" for MC_tokenizer, "scan" for scan_tokenizer, or "words" for words. Defaults to words.
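As an illustration of what a tokenizer does (a base-R sketch, similar in spirit to scan_tokenizer; termFreq accepts such a function via the tokenize option):

```r
# A tokenizer maps a character string to a vector of single tokens.
# This sketch splits on runs of whitespace.
simple_tokenizer <- function(x)
  unlist(strsplit(as.character(x), "[[:space:]]+"))

simple_tokenizer("Crude oil prices rose sharply")
```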
tolower
Either a logical value indicating whether characters should be translated to lower case or a custom function converting characters to lower case. Defaults to tolower.
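Case folding matters because it merges tokens that differ only in capitalization before counting. A base-R sketch of the effect:

```r
# Without folding, "Oil" and "oil" would be counted separately;
# lower-casing first merges them into one term.
tokens <- c("Oil", "oil", "OPEC")
tf <- table(tolower(tokens))
```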
Next, a set of options which are sensitive to their order of occurrence in the control list is processed, in the same order as specified. User-specified options take precedence over the default ordering: first all user-specified options are processed, and then all remaining options (with their default settings, in the order listed below).
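The order-sensitive options can be pictured as a pipeline of token transformations applied in sequence. A base-R sketch (not tm's internals, just the idea):

```r
# Each step is a function from a token vector to a token vector,
# applied in the order given (user-specified steps first).
apply_pipeline <- function(tokens, steps)
  Reduce(function(toks, step) step(toks), steps, init = tokens)

steps <- list(
  function(toks) tolower(toks),               # e.g. lower-casing first
  function(toks) toks[!toks %in% c("the")]    # then stopword removal
)
apply_pipeline(c("The", "oil", "price"), steps)
```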
language
A character string giving the language (preferably as an IETF language tag; see language in package NLP) to be used for stopwords and stemming if not provided by doc.
removePunctuation
A logical value indicating whether punctuation characters should be removed from doc, a custom function which performs punctuation removal, or a list of arguments for removePunctuation. Defaults to FALSE.
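A custom punctuation-removal function could, for instance, strip POSIX punctuation characters (a base-R sketch; tm's removePunctuation offers more options, such as preserving intra-word dashes):

```r
# Strip runs of punctuation characters from each token.
strip_punct <- function(toks) gsub("[[:punct:]]+", "", toks)
strip_punct(c("prices,", "rose."))
```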
removeNumbers
A logical value indicating whether numbers should be removed from doc or a custom function for number removal. Defaults to FALSE.
stopwords
Either a logical value indicating stopword removal using default language-specific stopword lists shipped with this package, a character vector holding custom stopwords, or a custom function for stopword removal. Defaults to FALSE.
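Stopword removal with a custom word list amounts to dropping every token found in that list. A base-R sketch:

```r
# Drop tokens that appear in the supplied stopword vector.
drop_stopwords <- function(toks, sw) toks[!toks %in% sw]
drop_stopwords(c("oil", "that", "rose"), c("that", "the"))
```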
stemming
Either a logical value indicating whether tokens should be stemmed or a custom stemming function. Defaults to FALSE.
Finally, the following options are processed in the given order.
dictionary
A character vector to be tabulated against. No other terms will be listed in the result. Defaults to NULL, which means that all terms in doc are listed.
bounds
A list with a tag local whose value must be an integer vector of length 2. Terms that appear less often in doc than the lower bound bounds$local[1] or more often than the upper bound bounds$local[2] are discarded. Defaults to list(local = c(1, Inf)) (i.e., every token will be used).
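Applying local bounds amounts to keeping only the terms whose count lies within the interval. A base-R sketch on a plain frequency table:

```r
# Keep terms whose frequency lies within [lower, upper].
tf <- table(c("oil", "oil", "prices", "rose", "oil"))
local <- c(2, Inf)
kept <- tf[tf >= local[1] & tf <= local[2]]
```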
wordLengths
An integer vector of length 2. Words shorter than the minimum word length wordLengths[1] or longer than the maximum word length wordLengths[2] are discarded. Defaults to c(3, Inf), i.e., a minimum word length of 3 characters.
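Word-length filtering can be sketched in base R by comparing the character count of each term against the bounds:

```r
# Discard terms shorter than the minimum or longer than the maximum length.
tf <- table(c("an", "oil", "up", "prices"))
wl <- c(3, Inf)
kept <- tf[nchar(names(tf)) >= wl[1] & nchar(names(tf)) <= wl[2]]
```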
See also getTokenizers.
data("crude")
# Term frequencies of the 14th document with default control options.
termFreq(crude[[14]])

# A custom tokenizer that splits on runs of whitespace.
strsplit_space_tokenizer <- function(x)
    unlist(strsplit(as.character(x), "[[:space:]]+"))

ctrl <- list(tokenize = strsplit_space_tokenizer,
             removePunctuation = list(preserve_intra_word_dashes = TRUE),
             stopwords = c("reuter", "that"),
             stemming = TRUE,
             wordLengths = c(4, Inf))
termFreq(crude[[14]], control = ctrl)