textcat (version 0.0-1)

textcat_profile_db: Textcat Profile Dbs

Description

Create n-gram profile dbs for text categorization.

Usage

textcat_profile_db(x, id, ...)

Arguments

x
a character vector of text documents, or an R object of text documents extractable via as.character.
id
a character vector giving the categories of the texts. Recycled to the length of x.
...
further arguments specifying the options used for creating the n-gram profiles, see textcat_options for the (current) default options. The names of the arguments are partially matched a

Details

The text documents are split according to the given categories, and n-gram profiles are computed via textcnt in package tau, with options n, split and useBytes corresponding to the respective arguments, and option reduce setting argument marker as needed. N-grams listed in option ignore are removed, and only the most frequent remaining ones retained, with the maximal number given by option size. The options employed for building the db are stored in the db.
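As a minimal sketch of the steps just described, the following builds a small profile db from a toy corpus. The texts, category labels, and the particular values chosen for n and size are hypothetical and serve only as illustration; the sketch assumes the textcat package is installed.

```r
library(textcat)

# Toy corpus (hypothetical data, for illustration only).
texts <- c("the quick brown fox jumps over the lazy dog",
           "a stitch in time saves nine",
           "der schnelle braune Fuchs springt ueber den faulen Hund",
           "Morgenstund hat Gold im Mund")
ids <- c("english", "english", "german", "german")

# Options such as n (maximal n-gram length) and size (number of
# n-grams retained per profile) are passed through '...'.
db <- textcat_profile_db(texts, ids, n = 3L, size = 200L)

names(db)              # one profile per category
lapply(db, head, 5L)   # first few n-grams of each profile
```

Profiles built from so little text are too sparse for real categorization; in practice each category would be given far more material.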

There is a c method for combining profile dbs provided that these have identical options.
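A short sketch of combining dbs, assuming both are built with the same (default) options so that the c method applies. The example texts and category labels are hypothetical.

```r
library(textcat)

# Two dbs built separately, both with the default options.
# A single id is recycled to the length of the text vector.
db_en <- textcat_profile_db(c("the cat sat on the mat",
                              "rain is expected later today"),
                            "english")
db_de <- textcat_profile_db(c("die Katze sitzt auf der Matte",
                              "heute wird Regen erwartet"),
                            "german")

# Combine into a single profile db.
db <- c(db_en, db_de)
names(db)

# The combined db can then be used for categorization.
textcat("the dog sat on the mat", db)
```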

Unless the profile db uses bytes rather than characters (i.e., option useBytes is TRUE), the text documents in x should be encoded in UTF-8.
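One way to meet the UTF-8 requirement, sketched here with base R only, is to declare the known source encoding and convert before building the db. The Latin-1 input string is a hypothetical example; real data may need iconv with an explicit source encoding instead.

```r
# A string whose bytes are Latin-1 (hypothetical input).
x_latin1 <- "Gr\xfc\xdfe aus K\xf6ln"
Encoding(x_latin1) <- "latin1"   # declare the actual encoding

# Convert to UTF-8 before passing the text to textcat_profile_db().
x_utf8 <- enc2utf8(x_latin1)

Encoding(x_utf8)    # declared encoding after conversion
validUTF8(x_utf8)   # checks that the bytes form valid UTF-8
```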