Generic tokenization function for splitting a given input text into single words (chains of characters delimited by spaces or punctuation marks).
Usage

txt.to.words(input.text, splitting.rule = NULL, preserve.case = FALSE)
Value

The function returns a vector of tokenized words (or other units) as elements.
Arguments

input.text: a string of characters, usually a text.
splitting.rule: an optional argument indicating an alternative splitting
regexp. E.g., if your corpus contains no punctuation, you can use a very
simple splitting rule such as "[ \t\n]+" or "[[:space:]]+" (in either
case, any whitespace is assumed to be a word delimiter). If you deal
with non-Latin scripts, especially those not yet supported by the stylo
package (e.g. Chinese, Japanese, Vietnamese), you can indicate your
letter characters explicitly: for most Cyrillic scripts, try
"[^\u0400-\u0482\u048A-\u04FF]+" (a short sketch follows the argument
list). Remember, however, that your texts need to be properly loaded
into R (which is quite tricky on Windows; see below).
preserve.case: whether to preserve the original case of the characters;
if FALSE (the default), the entire corpus is lowercased.
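For instance, both an alternative splitting rule and case preservation
can be tried directly (the sample strings below are made up for
illustration):

# split on whitespace only, using one of the simple rules above:
txt.to.words("a sample without punctuation", splitting.rule = "[ \t\n]+")
# keep the original case instead of lowercasing:
txt.to.words("And now, Laertes, what's the news?", preserve.case = TRUE)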
Author(s)

Maciej Eder, Mike Kestemont
Details

The generic tokenization function for splitting a given input text into
single words (chains of characters delimited by spaces or punctuation
marks).

In obsolete versions of the package stylo, the default splitting
sequence of characters was "[^[:alpha:]]+" on Mac/Linux and "\\W+_" on
Windows. Two different splitting rules were used because regular
expressions are not entirely platform-independent; type help(regexp)
for more details. For the sake of compatibility, since version 0.5.6 a
lengthy list of dozens of letters from several alphabets (Latin,
Cyrillic, Greek, Hebrew, Arabic, Coptic, and Georgian so far) has been
indicated explicitly:
paste("[^A-Za-z",
# Latin supplement (Western):
"\U00C0-\U00FF",
# Latin supplement (Eastern):
"\U0100-\U01BF",
# Latin extended (phonetic):
"\U01C4-\U02AF",
# modern Greek:
"\U0386\U0388-\U03FF",
# Cyrillic:
"\U0400-\U0481\U048A-\U0527",
# Hebrew:
"\U05D0-\U05EA\U05F0-\U05F4",
# Arabic:
"\U0620-\U065F\U066E-\U06D3\U06D5\U06DC",
# extended Latin:
"\U1E00-\U1EFF",
# ancient Greek:
"\U1F00-\U1FBC\U1FC2-\U1FCC\U1FD0-\U1FDB\U1FE0-\U1FEC\U1FF2-\U1FFC",
# Coptic:
"\U03E2-\U03EF\U2C80-\U2CF3",
# Georgian:
"\U10A0-\U10FF",
"]+",
sep="")
Alternatively, different tokenization rules can be applied through the
option splitting.rule (see above). ATTENTION: this is the only piece of
code in the library stylo that might depend on the operating system
used. While on Mac/Linux the native encoding is Unicode, on Windows you
never know if your text will be loaded properly. A practical solution
for Windows users is to convert your texts into Unicode (a variety of
freeware converters are available on the internet) and to use an
appropriate encoding option when loading the files:
read.table("file.txt", encoding = "UTF-8") or
scan("file.txt", what = "char", encoding = "UTF-8"). If you use the
functions provided by the library stylo, you should pass this option as
an argument to your chosen function: stylo(encoding = "UTF-8"),
classify(encoding = "UTF-8"), oppose(encoding = "UTF-8").
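For instance, a minimal Windows-safe workflow might look as follows
(the file name my_text.txt is hypothetical):

# read a UTF-8 encoded file line by line, then tokenize it:
raw.text = scan("my_text.txt", what = "char", sep = "\n", encoding = "UTF-8")
txt.to.words(paste(raw.text, collapse = " "))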
See Also

txt.to.words.ext, txt.to.features, make.ngrams, load.corpus
txt.to.words("And now, Laertes, what's the news with you?")
# retrieving grammatical codes (POS tags) from a tagged text:
tagged.text = "The_DT family_NN of_IN Dashwood_NNP had_VBD long_RB
been_VBN settled_VBN in_IN Sussex_NNP ._."
txt.to.words(tagged.text, splitting.rule = "([A-Za-z,.;!]+_)|[ \n\t]")
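# with the default preserve.case = FALSE, the rule above should strip
# the "word_" prefixes and return the (lowercased) POS tags only:
# "dt" "nn" "in" "nnp" "vbd" "rb" "vbn" "vbn" "in" "nnp" "."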