textcat (version 1.0-2)

textcat_profile_db: Textcat Profile Dbs

Description

Create $n$-gram profile dbs for text categorization.

Usage

textcat_profile_db(x, id = NULL, method = NULL, ...,
                   options = list(), profiles = NULL)

Arguments

x
a character vector of text documents, or an R object of text documents extractable via as.character.
id
a character vector giving the categories of the text documents, recycled to the length of x, or NULL (default), indicating to treat each text document separately.
method
a character string specifying a built-in method, or a user-defined function for computing $n$-gram profiles, or NULL (default), corresponding to using the method and options employed for creating profiles.
...
options to be passed to the method for creating profiles.
options
a list of such options.
profiles
a textcat profile db object.

Details

The text documents are split according to the given categories, and $n$-gram profiles are computed using the specified method. The options employed are either those used for creating profiles (if this is not NULL), or obtained by combining the options given in ... and options and merging these with the default profile options specified by the textcat option profile_options, using exact name matching. The method and options employed for building the db are stored in the db as attributes "method" and "options", respectively.
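For illustration, here is a minimal sketch of building a db from a toy corpus; the texts, category labels and option values below are invented for this example:

library("textcat")
## Toy corpus: two documents per category (hypothetical texts).
txts <- c("good morning", "good evening", "bonjour", "bonsoir")
ids  <- c("en", "en", "fr", "fr")
## Options given in ... and 'options' are merged with the package's
## default profile options by exact name matching.
db <- textcat_profile_db(txts, id = ids, size = 100,
                         options = list(n = 1:2))
attr(db, "method")   ## method employed for building the db
attr(db, "options")  ## options employed for building the db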

There is a c method for combining profile dbs, provided that these have identical options. There are also a [ method for subscripting, and as.matrix and as.simple.triplet.matrix methods for exporting the profiles to a dense matrix or to the sparse simple triplet matrix representation provided by package slam, respectively.
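A short sketch of these methods, assuming two single-category dbs built with identical (default) options; the texts are made up:

db_en <- textcat_profile_db("The quick brown fox jumps over the lazy dog.",
                            id = "en")
db_fr <- textcat_profile_db("Le renard brun saute par-dessus le chien.",
                            id = "fr")
db <- c(db_en, db_fr)   ## combining requires identical options
db["en"]                ## subscripting a single category
as.matrix(db)           ## dense matrix with one row of n-gram counts per category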

Currently, the only available built-in method is "textcnt", which has options n, split, perl, useBytes, reduce, ignore and size. This method uses textcnt in package tau for computing $n$-gram profiles, with n, split, perl and useBytes corresponding to the respective textcnt arguments, and option reduce setting argument marker as needed. $N$-grams listed in option ignore are removed, and only the most frequent remaining ones are retained, with the maximal number given by option size.
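As a hedged sketch, these options can also be passed directly via ...; the texts and option values here are invented:

## Byte 1- and 2-gram profiles, keeping at most 500 n-grams per category.
db <- textcat_profile_db(c("aa bb cc", "dd ee ff"), id = c("A", "B"),
                         method = "textcnt",
                         n = 1:2, useBytes = TRUE, size = 500)
lapply(db, head, 5L)   ## most frequent n-grams in each profile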

Unless the profile db uses bytes rather than characters (i.e., option useBytes is TRUE), text documents in x containing non-ASCII characters must declare their encoding (see Encoding), and will be re-encoded to UTF-8.
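For example, a sketch of declaring a Latin-1 encoding before profiling; the string is invented:

x <- "caf\xe9 au lait"      ## Latin-1 bytes containing a non-ASCII character
Encoding(x) <- "latin1"     ## declare the encoding (see ?Encoding)
## With the default useBytes = FALSE, x is re-encoded to UTF-8 internally.
db <- textcat_profile_db(x, id = "fr")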

Note that option n specifies all numbers of characters or bytes to be used in the profiles, and not just the maximal number: e.g., taking n = 3 will create profiles only containing tri-grams.
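A small sketch contrasting the two settings; the text is invented:

txt <- "some example text"
tri_only  <- textcat_profile_db(txt, id = "x", n = 3)     ## tri-grams only
up_to_tri <- textcat_profile_db(txt, id = "x", n = 1:3)   ## uni-, bi- and tri-grams
c(tri_only = length(tri_only[[1L]]), up_to_tri = length(up_to_tri[[1L]]))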

Examples

## Obtain the texts of the standard licenses shipped with R.
files <- dir(file.path(R.home("share"), "licenses"), "^[A-Z]",
             full.names = TRUE)
texts <- sapply(files,
                function(f) paste(readLines(f), collapse = "\n"))
names(texts) <- basename(files)
## Build a profile db using the same method and options as for building
## the ECIMCI character profiles.
profiles <- textcat_profile_db(texts, profiles = ECIMCI_profiles)
## Inspect the 10 most frequent n-grams in each profile.
lapply(profiles, head, 10L)
## Combine into one frequency table.
tab <- as.matrix(profiles)
tab[, 1 : 10]
## Determine languages.
textcat(profiles, ECIMCI_profiles)