textcat: N-Gram Based Text Categorization

Description

Categorize texts by finding the closest n-gram reference profile.

Usage

textcat(x, p = ECIMCI_profiles, method = "CT")

Arguments

x

a character vector, or an object coercible to one using as.character.

p

a textcat profile db (see textcat_profile_db).

method

a character string specifying a built-in method, or a user-defined function for computing distances between n-gram profiles. See Details for available built-in methods.

Details

Several built-in distance methods are available; the default, "CT", is the out-of-place measure of Cavnar and Trenkle (see References). For the measures based on distances between frequency distributions, the n-grams of the text and of the reference profile are combined, and n-grams missing from either one are given a small positive absolute frequency (currently, 1e-6).
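The handling of missing n-grams can be sketched in base R as follows. This is only an illustration under stated assumptions: kl_divergence() is a hypothetical helper, not a textcat function, the Kullback-Leibler divergence stands in for any frequency-distribution measure, and the counts are made up.

```r
# Hypothetical helper (not part of textcat): a Kullback-Leibler divergence
# between two named vectors of absolute n-gram frequencies.
kl_divergence <- function(x, y, eps = 1e-6) {
  grams <- union(names(x), names(y))   # combine the n-grams of both profiles
  fill <- function(v) {
    out <- v[grams]
    out[is.na(out)] <- eps             # missing n-grams get a small positive count
    names(out) <- grams
    out / sum(out)                     # turn counts into relative frequencies
  }
  p <- fill(x)
  q <- fill(y)
  sum(p * log(p / q))
}

# Made-up counts for illustration only
text_counts <- c(th = 5, he = 3, an = 2)
ref_counts  <- c(th = 4, he = 4, de = 2)
kl_divergence(text_counts, ref_counts)
```

Without the small positive fill value, any n-gram present in only one profile would produce a division by zero or a logarithm of zero.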

For each given text, its n-gram profile is computed using the options stored in the reference profile db. Then the distance between this profile and each reference profile is computed, and the text is assigned the category of the closest reference profile (if the closest profile is not unique, NA is returned).
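The procedure can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: ngram_profile(), out_of_place() and categorize() are hypothetical helpers, the tokenization is a plain character split, and the fixed penalty of 300 for unseen n-grams is an assumption.

```r
# Hypothetical helper: ranked character n-grams, most frequent first.
ngram_profile <- function(text, n_max = 3, size = 300) {
  chars <- strsplit(tolower(text), "")[[1]]
  grams <- character(0)
  for (n in seq_len(n_max)) {
    if (length(chars) >= n) {
      starts <- seq_len(length(chars) - n + 1)
      grams <- c(grams, vapply(starts, function(i)
        paste(chars[i:(i + n - 1)], collapse = ""), character(1)))
    }
  }
  freq <- sort(table(grams), decreasing = TRUE)
  names(freq)[seq_len(min(size, length(freq)))]
}

# Hypothetical helper: sum of rank differences; n-grams absent from the
# reference receive a fixed penalty (assumed value).
out_of_place <- function(doc, ref, penalty = 300) {
  pos <- match(doc, ref)
  as.numeric(sum(ifelse(is.na(pos), penalty, abs(pos - seq_along(doc)))))
}

# Pick the category of the closest reference profile; NA if not unique.
categorize <- function(text, refs) {
  doc <- ngram_profile(text)
  d <- vapply(refs, function(ref) out_of_place(doc, ref), numeric(1))
  best <- names(d)[d == min(d)]
  if (length(best) == 1) best else NA_character_
}

en_text <- "this is an english sentence and another sentence"
de_text <- "das ist ein deutscher satz und noch ein satz"
refs <- list(english = ngram_profile(en_text),
             german  = ngram_profile(de_text))

categorize("this is a sentence", refs)
categorize("das ist ein satz", refs)
```

Returning NA when several reference profiles share the minimum distance mirrors the tie behavior described above.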

Unless the profile db uses bytes rather than characters, the texts in x should be encoded in UTF-8.
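A minimal sketch of converting input to UTF-8 before categorization, using base R only. The sample string and its Latin-1 source encoding are assumptions made for illustration.

```r
# A string whose bytes are Latin-1 (assumed source encoding for this example)
x <- "Stra\xdfe"
Encoding(x) <- "latin1"

# Convert to UTF-8; enc2utf8() uses the declared encoding of each element
x_utf8 <- enc2utf8(x)

Encoding(x_utf8)   # declared encoding after conversion
validUTF8(x_utf8)  # check that the bytes form valid UTF-8
```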

References

W. B. Cavnar and J. M. Trenkle (1994), N-Gram-Based Text Categorization. In "Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval", 161-175.

Examples


library(textcat)
# Categorize two short texts against the default profile db
textcat(c("This is an english sentence.",
          "Das ist ein deutscher satz."))
