The text documents are split according to the given categories, and
\(n\)-gram profiles are computed using the specified method. If
profiles is not NULL, the options used for creating profiles are
reused. Otherwise, the options given in ... and options are
combined and merged with the default profile options specified
by the textcat option profile_options, using exact
name matching. The method and options employed for building the db
are stored in the db as attributes "method" and
"options", respectively.
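The merging of user-supplied options with defaults by exact name can be pictured with the following base R sketch. It is only an illustration: merge_options and the defaults list are hypothetical stand-ins, not the code or values used inside textcat.

```r
# Hypothetical defaults standing in for the textcat option profile_options.
defaults <- list(n = 1:5, tolower = TRUE, size = 1000L)

# Hypothetical helper: options passed via ... and via options are
# combined, then override the defaults by exact name (modifyList does
# no partial matching of names).
merge_options <- function(defaults, ..., options = list()) {
  user <- c(list(...), options)
  modifyList(defaults, user)
}

opts <- merge_options(defaults, n = 3L, options = list(size = 400L))
str(opts)
```

Because matching is exact, a misspelled or abbreviated name such as siz would not override size in this sketch; it would simply be added as a separate entry.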
There is a c method for combining profile dbs, provided
that these have identical options. There are also a [ method
for subscripting, and as.matrix and
as.simple_triplet_matrix methods to
“export” the profiles to a dense matrix or to the sparse simple
triplet matrix representation provided by package slam,
respectively.
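To make the difference between the two export formats concrete, here is a base R sketch that builds both from toy data. It does not call textcat or slam: the profiles list is made up, and the layout (one row per profile, one column per n-gram) is an assumption made for illustration.

```r
# Toy profiles: named integer vectors of n-gram counts (made-up data).
profiles <- list(en = c(th = 5L, he = 4L), de = c(ch = 6L, he = 2L))

# All n-grams occurring in any profile.
grams <- unique(unlist(lapply(profiles, names)))

# Dense form: one row per profile, zeros for absent n-grams.
dense <- t(vapply(profiles, function(p) {
  row <- setNames(integer(length(grams)), grams)
  row[names(p)] <- p
  row
}, integer(length(grams))))

# Triplet form: keep only the non-zero cells as (i, j, v).
nz <- which(dense != 0, arr.ind = TRUE)
triplet <- list(i = nz[, "row"], j = nz[, "col"], v = dense[nz])
```

The triplet form stores only the non-zero counts, which is why a sparse representation tends to be preferable when profiles share few n-grams.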
Currently, the only available built-in method is "textcnt",
which has the following options:
n: A numeric vector giving the numbers of characters or bytes in the
\(n\)-gram profiles.
Default: 1:5.
split: The regular expression pattern to be used in word splitting.
Default: "[[:space:][:punct:][:digit:]]+".
perl: A logical indicating whether to use Perl-compatible regular
expressions in word splitting.
Default: FALSE.
tolower: A logical indicating whether to transform texts to lower case
(after word splitting).
Default: TRUE.
reduce: A logical indicating whether a representation of \(n\)-grams
more efficient than the one used by Cavnar and Trenkle should be
employed.
Default: TRUE.
useBytes: A logical indicating whether to use byte \(n\)-grams rather than
character \(n\)-grams.
Default: FALSE.
ignore: A character vector of \(n\)-grams to be ignored when computing
\(n\)-gram profiles.
Default: "_" (corresponding to a word boundary).
size: The maximal number of \(n\)-grams used for a profile.
Default: 1000L.
This method uses textcnt in package tau for
computing \(n\)-gram profiles, with n, split,
perl and useBytes corresponding to the respective
textcnt arguments, and option reduce setting argument
marker as needed. \(N\)-grams listed in option ignore
are removed, and only the most frequent remaining ones are retained,
up to the maximal number given by option size.
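The overall procedure (word splitting, lower-casing, counting \(n\)-grams, dropping ignored ones, keeping the most frequent) can be approximated in base R as follows. This is a simplified sketch, not the code used by textcat or tau: ngram_profile is a hypothetical helper, it assumes a single "_" marker on each side of every word, it does not implement the reduce, perl or useBytes options, and the order among equally frequent \(n\)-grams is arbitrary.

```r
# Hypothetical helper, not part of textcat or tau.
ngram_profile <- function(text, n = 1:5,
                          split = "[[:space:][:punct:][:digit:]]+",
                          ignore = "_", size = 1000L) {
  # Split into words, drop empty strings, then lower-case.
  words <- unlist(strsplit(text, split))
  words <- tolower(words[nzchar(words)])
  # Assumption: mark word boundaries with "_" on both sides.
  words <- paste0("_", words, "_")
  # Collect all character n-grams of the requested lengths.
  grams <- unlist(lapply(words, function(w) {
    unlist(lapply(n, function(k) {
      if (nchar(w) < k) return(character(0))
      starts <- seq_len(nchar(w) - k + 1L)
      substring(w, starts, starts + k - 1L)
    }))
  }))
  # Drop ignored n-grams, count, and keep the most frequent ones.
  tab <- table(grams[!grams %in% ignore])
  counts <- setNames(as.integer(tab), names(tab))
  head(sort(counts, decreasing = TRUE), size)
}

profile <- ngram_profile("Hello, hello world", n = 1:2, size = 5L)
print(profile)
```

With the default ignore = "_", the bare boundary marker never appears in the result, while \(n\)-grams that merely contain it (such as "_h") are still counted.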
Unless the profile db uses bytes rather than characters (i.e., option
useBytes is TRUE), text documents in x containing
non-ASCII characters must declare their encoding (see
Encoding), and will be re-encoded to UTF-8.
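As an illustration of declaring an encoding, the following base R sketch marks a string whose bytes are known to be Latin-1 and converts it to UTF-8. The byte values are a made-up example (the word "Köln" as it might arrive from a legacy file); whether a given document needs this depends on how it was read.

```r
# Bytes of "Köln" in Latin-1; the string starts out with no declared encoding.
x <- rawToChar(as.raw(c(0x4b, 0xf6, 0x6c, 0x6e)))
Encoding(x)

# Declare the encoding, so that re-encoding to UTF-8 is possible.
Encoding(x) <- "latin1"
utf8 <- enc2utf8(x)
Encoding(utf8)
```

Without the declaration, the byte 0xf6 cannot be interpreted reliably, which is why undeclared non-ASCII input is a problem for character-based profiles.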
Note that option n specifies all numbers of characters
or bytes to be used in the profiles, and not just the maximal number:
e.g., taking n = 3 will create profiles only containing
tri-grams.
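The difference between n = 3 and n = 1:3 can be checked with a small base R sketch. The helper char_ngrams is hypothetical and only mimics the counting of character \(n\)-grams for a single, already marked word; it is not a textcat or tau function.

```r
# Hypothetical helper: character n-grams of one word for the lengths in n.
char_ngrams <- function(word, n) {
  unlist(lapply(n, function(k) {
    if (nchar(word) < k) return(character(0))
    starts <- seq_len(nchar(word) - k + 1L)
    substring(word, starts, starts + k - 1L)
  }))
}

only_tri  <- char_ngrams("_text_", n = 3)
up_to_tri <- char_ngrams("_text_", n = 1:3)

unique(nchar(only_tri))    # n-gram lengths present with n = 3
unique(nchar(up_to_tri))   # n-gram lengths present with n = 1:3
```

To obtain profiles with all lengths up to three, n therefore has to be given as 1:3 rather than 3.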