koRpus (version 0.13-8)

guess.lang: Guess language a text is written in

Description

This function tries to guess the language a text is written in.

Usage

guess.lang(
  txt.file,
  udhr.path,
  comp.length = 300,
  keep.udhr = FALSE,
  quiet = TRUE,
  in.mem = TRUE,
  format = "file"
)

Arguments

txt.file

A character vector pointing to the file with the text to be analyzed.

udhr.path

A character string, either pointing to the directory where you unzipped the translations of the Universal Declaration of Human Rights, or to the ZIP file containing them.

comp.length

Numeric value, giving the number of characters of the text in txt.file to be used to estimate the language.

keep.udhr

Logical, whether all the UDHR translations should be kept in the resulting object.

quiet

Logical. If FALSE, short status messages will be shown.

in.mem

Logical. If TRUE, the gzip compression will remain in memory (using memCompress), which is probably the faster method. Otherwise temporary files are created and automatically removed on exit.

format

Either "file" or "obj". If the latter, txt.file is not interpreted as a file path but the text to analyze itself.

Value

An object of class kRp.lang.

Details

To accomplish the task, the method described by Benedetto, Caglioti & Loreto (2002) is used, which combines gzip compression with translations of the Universal Declaration of Human Rights[1]. The declaration holds the world record for being translated into the most languages, and its translations are publicly available.
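
The idea behind the compression approach can be illustrated with a minimal sketch. This is not the koRpus implementation, only an approximation of the principle under stated assumptions: the helper names gz_size, gz_diff and guess_by_compression are hypothetical, and the two short reference strings stand in for full UDHR translations. For each reference text, the sketch measures how many additional bytes gzip needs once the first comp_length characters of the unknown text are appended; the reference with the smallest increase is taken as the best guess.

```r
# Size in bytes of a string after in-memory gzip compression (base R only)
gz_size <- function(txt) {
  length(memCompress(charToRaw(txt), type = "gzip"))
}

# Additional bytes needed to compress `snippet` when appended to `reference`
gz_diff <- function(reference, snippet) {
  gz_size(paste0(reference, snippet)) - gz_size(reference)
}

# Hypothetical helper: rank reference texts by compression increase,
# using only the first `comp_length` characters of the unknown text
guess_by_compression <- function(txt, references, comp_length = 300) {
  snippet <- substr(txt, 1, comp_length)
  diffs <- vapply(references, gz_diff, integer(1), snippet = snippet)
  sort(diffs)  # smallest increase first
}

# Toy reference texts (assumption: real use would need complete translations)
refs <- c(
  english = "All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.",
  dutch = "Alle mensen worden vrij en gelijk in waardigheid en rechten geboren. Zij zijn begiftigd met verstand en geweten, en behoren zich jegens elkander in een geest van broederschap te gedragen."
)

result <- guess_by_compression(
  "Everyone has the right to life, liberty and security of person.",
  refs
)
print(result)
```

With reference texts this short the ranking can be noisy; the method relies on longer references, which is why complete UDHR translations are used.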

References

Benedetto, D., Caglioti, E. & Loreto, V. (2002). Language trees and zipping. Physical Review Letters, 88(4), 048702.

[1] https://www.ohchr.org/EN/UDHR/Pages/UDHRIndex.aspx

[2] https://unicode.org/udhr/

Examples

# NOT RUN {
  # using the still zipped bulk file
  guess.lang(
    file.path("~","data","some.txt"),
    udhr.path=file.path("~","data","udhr_txt.zip")
  )
  # using the unzipped UDHR archive
  guess.lang(
    file.path("~","data","some.txt"),
    udhr.path=file.path("~","data","udhr_txt")
  )
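  # passing the text itself instead of a file path (format="obj");
  # the sample string below is only a placeholder for illustration,
  # and the UDHR path is assumed to be the same as in the calls above
  guess.lang(
    "This is the text whose language should be guessed.",
    udhr.path=file.path("~","data","udhr_txt.zip"),
    format="obj"
  )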
# }
