Compute total and cumulative corpus coverage fraction of a dictionary.
Usage
word_coverage(object, corpus, ...)
# S3 method for sbo_dictionary
word_coverage(object, corpus, ...)
# S3 method for character
word_coverage(object, corpus, .preprocess = identity, EOS = "", ...)
# S3 method for sbo_kgram_freqs
word_coverage(object, corpus, ...)
# S3 method for sbo_predictions
word_coverage(object, corpus, ...)
Arguments
object
either a character vector, or an object inheriting from one of
the classes sbo_dictionary, sbo_kgram_freqs,
sbo_predtable or sbo_predictor.
The object storing the dictionary for which corpus coverage is to be
computed.
corpus
a character vector.
...
further arguments passed to or from other methods.
.preprocess
a function used to preprocess the corpus before coverage is computed.
EOS
a length one character vector. String containing End-Of-Sentence
characters, see kgram_freqs and
sbo_dictionary for further details.
Value
a word_coverage object.
Details
This function computes the corpus coverage fraction of a dictionary,
that is, the fraction of words appearing in corpus that are contained in the
dictionary.
This function is a generic, accepting as object argument any object
storing a dictionary, along with a preprocessing function and a list
of End-Of-Sentence characters. This includes all sbo main classes:
sbo_dictionary, sbo_kgram_freqs, sbo_predtable and
sbo_predictor. When object is a character vector, the preprocessing
function and the End-Of-Sentence characters must be specified explicitly.
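As a hedged illustration of the character method, the sketch below passes a hand-made dictionary together with an explicit preprocessing function and End-Of-Sentence string. The dictionary words, the choice of tolower for .preprocess and the ".?!" value for EOS are assumptions made for illustration only; twitter_train is the dataset used in the examples of this page, and it is assumed here to be reachable as sbo::twitter_train. The call is guarded so that the snippet does nothing if sbo is not installed.

```r
# Hypothetical hand-made dictionary; the words are arbitrary illustrative choices.
my_dict <- c("the", "to", "i", "a", "you")

if (requireNamespace("sbo", quietly = TRUE)) {
  cov <- sbo::word_coverage(
    my_dict,
    sbo::twitter_train,
    .preprocess = tolower,  # assumption: lower-casing is sufficient preprocessing here
    EOS = ".?!"             # assumption: treat these characters as sentence ends
  )
  summary(cov)
}
```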
The coverage fraction is computed cumulatively, that is, as a function of
the number of top-ranked dictionary words retained. The dependence of
coverage on this maximal rank can be explored through plot()
(see examples below).
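To make the idea of cumulative coverage concrete, here is a minimal base-R sketch. It is not the implementation used by sbo: it assumes plain whitespace tokenization, ignores preprocessing and End-Of-Sentence tokens, and assumes the dictionary is already sorted by rank. The function name coverage_sketch and the toy inputs are hypothetical.

```r
# Illustrative only: a base-R sketch of cumulative coverage, not sbo's code.
# Assumes whitespace tokenization and a dictionary sorted by rank.
coverage_sketch <- function(dict, corpus) {
  # Split each corpus element on runs of whitespace and drop empty tokens.
  tokens <- unlist(strsplit(corpus, "\\s+"))
  tokens <- tokens[nzchar(tokens)]
  # Count how many corpus tokens match each dictionary word, in dictionary order.
  counts <- vapply(dict, function(w) sum(tokens == w), numeric(1))
  # Cumulative fraction of tokens covered by the first k dictionary words.
  cumsum(unname(counts)) / length(tokens)
}

dict <- c("the", "cat", "sat")
corpus <- c("the cat sat on the mat", "the dog")
cov <- coverage_sketch(dict, corpus)
cov               # coverage when keeping the first 1, 2, 3 dictionary words
cov[length(cov)]  # total coverage of the full dictionary
```

The last element of the returned vector corresponds to the total coverage, while the full vector corresponds to the curve one would inspect when varying the maximal rank.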
Examples
# NOT RUN {
c <- word_coverage(twitter_dict, twitter_train)
print(c)
summary(c)
# Plot coverage fraction, including the End-Of-Sentence in word counts.
plot(c, include_EOS = TRUE)
# }