TextCat is a Perl implementation, written by Gertjan van Noord
(http://odur.let.rug.nl/~vannoord/TextCat/), of the Cavnar and
Trenkle N-Gram-Based Text Categorization technique. It was
subsequently integrated into SpamAssassin. It provides byte N-gram
profiles for 75 languages (more precisely, language/encoding
combinations). TC_byte_profiles provides these byte profiles.
TC_char_profiles provides a subset of 56 character profiles
obtained by converting the byte sequences to UTF-8 strings where
possible.
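
The difference between byte and character profiles can be illustrated
with a minimal Python sketch. This is not TextCat's own code: the
function names ngram_profile and out_of_place are hypothetical, the
limits (N-grams up to length 5, top 400 entries) are illustrative
choices, and the word tokenization and padding of the original
technique are omitted. It only shows how the same text yields
different N-grams depending on whether it is treated as bytes in some
encoding or as decoded characters, and how two ranked profiles can be
compared with an out-of-place rank distance.

```python
from collections import Counter


def ngram_profile(seq, max_n=5, top=400):
    """Return the most frequent n-grams (n = 1..max_n) of a str or bytes
    sequence, ranked from most to least frequent."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    # Descending frequency; ties are broken by the n-gram itself so the
    # ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top]


def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences between a document profile and a category
    profile; n-grams missing from the category get a fixed maximum penalty."""
    rank = {g: r for r, g in enumerate(cat_profile)}
    max_penalty = len(cat_profile)
    return sum(
        abs(r - rank[g]) if g in rank else max_penalty
        for r, g in enumerate(doc_profile)
    )


text = "naïve"
char_profile = ngram_profile(text)                      # character n-grams
utf8_profile = ngram_profile(text.encode("utf-8"))      # byte n-grams, UTF-8
latin1_profile = ngram_profile(text.encode("latin-1"))  # byte n-grams, Latin-1

# "ï" is one character, one byte in Latin-1, and two bytes in UTF-8, so
# the three profiles contain different n-grams for the same text.
print("ï" in char_profile)
print(b"\xef" in latin1_profile)
print(b"\xc3\xaf" in utf8_profile)
```

Because a byte profile depends on the encoding of the text, each
language/encoding combination needs its own profile, whereas a
character profile can be compared against any decoded string.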

The category ids are unchanged from the original, and give the full
(English) name of the language, optionally combined with the name of the
encoding script. Note that scots indicates Scots, the
Germanic language variety historically spoken in Lowland Scotland and
parts of Ulster, to be distinguished from Scottish Gaelic (named
scots_gaelic in the profiles), the Celtic language variety
spoken in most of the western Highlands and in the Hebrides (see
http://en.wikipedia.org/wiki/Scots_language).