Generic tokenization function for splitting a given input text into single words (chains of characters delimited by spaces or punctuation marks).
Usage

txt.to.words(input.text, splitting.rule = NULL, preserve.case = FALSE)
Value

The function returns a vector of tokenized words (or other units) as elements.
Arguments

input.text: a string of characters, usually a text.
splitting.rule: an optional argument indicating an alternative splitting
regexp. E.g., if your corpus contains no punctuation, you can use a very
simple splitting rule such as "[ \t\n]+" or "[[:space:]]+" (in either
case, any whitespace is assumed to be a word delimiter). If you deal
with non-Latin scripts, especially those not yet supported by the stylo
package (e.g. Chinese, Japanese, Vietnamese), you can indicate your
letter characters explicitly: for most Cyrillic scripts, try
"[^\u0400-\u0482\u048A-\u04FF]+" (a short sketch follows the argument
list). Remember, however, that your texts need to be properly loaded
into R (which is quite tricky on Windows; see below).
preserve.case: whether to preserve the original case of the characters;
if FALSE (the default), the entire corpus is lowercased.
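For instance, both an alternative splitting rule and case preservation
can be tried directly (the sample strings below are made up for
illustration):

# split on whitespace only, using one of the simple rules above:
txt.to.words("a sample without punctuation", splitting.rule = "[ \t\n]+")
# keep the original case instead of lowercasing:
txt.to.words("And now, Laertes, what's the news?", preserve.case = TRUE)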
Author(s)

Maciej Eder, Mike Kestemont
Details

The generic tokenization function for splitting a given input text into
single words (chains of characters delimited by spaces or punctuation
marks).

In obsolete versions of the package stylo, the default splitting
sequence of characters was "[^[:alpha:]]+" on Mac/Linux and "\\W+_" on
Windows. Two different splitting rules were used because regular
expressions are not entirely platform-independent; type help(regexp)
for more details. For the sake of compatibility, since version 0.5.6 a
lengthy list of dozens of letters from several alphabets (Latin,
Cyrillic, Greek, Hebrew, Arabic, Coptic, and Georgian so far) has been
indicated explicitly:
paste("[^A-Za-z",
# Latin supplement (Western):
"\U00C0-\U00FF",
# Latin supplement (Eastern):
"\U0100-\U01BF",
# Latin extended (phonetic):
"\U01C4-\U02AF",
# modern Greek:
"\U0386\U0388-\U03FF",
# Cyrillic:
"\U0400-\U0481\U048A-\U0527",
# Hebrew:
"\U05D0-\U05EA\U05F0-\U05F4",
# Arabic:
"\U0620-\U065F\U066E-\U06D3\U06D5\U06DC",
# extended Latin:
"\U1E00-\U1EFF",
# ancient Greek:
"\U1F00-\U1FBC\U1FC2-\U1FCC\U1FD0-\U1FDB\U1FE0-\U1FEC\U1FF2-\U1FFC",
# Coptic:
"\U03E2-\U03EF\U2C80-\U2CF3",
# Georgian:
"\U10A0-\U10FF",
"]+",
sep="")
Alternatively, different tokenization rules can be applied through the
option splitting.rule (see above). ATTENTION: this is the only piece of
code in the library stylo that might depend on the operating system
used. While on Mac/Linux the native encoding is Unicode, on Windows you
never know if your text will be loaded properly. A practical solution
for Windows users is to convert your texts into Unicode (a variety of
freeware converters are available on the internet) and to use an
appropriate encoding option when loading the files:
read.table("file.txt", encoding = "UTF-8") or
scan("file.txt", what = "char", encoding = "UTF-8"). If you use the
functions provided by the library stylo, you should pass this option as
an argument to your chosen function: stylo(encoding = "UTF-8"),
classify(encoding = "UTF-8"), oppose(encoding = "UTF-8").
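For instance, a minimal Windows-safe workflow might look as follows
(the file name my_text.txt is hypothetical):

# read a UTF-8 encoded file line by line, then tokenize it:
raw.text = scan("my_text.txt", what = "char", sep = "\n", encoding = "UTF-8")
txt.to.words(paste(raw.text, collapse = " "))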
See Also

txt.to.words.ext, txt.to.features, make.ngrams, load.corpus
txt.to.words("And now, Laertes, what's the news with you?")
# retrieving grammatical codes (POS tags) from a tagged text:
tagged.text = "The_DT family_NN of_IN Dashwood_NNP had_VBD long_RB
been_VBN settled_VBN in_IN Sussex_NNP ._."
txt.to.words(tagged.text, splitting.rule = "([A-Za-z,.;!]+_)|[ \n\t]")
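# with the default preserve.case = FALSE, the rule above should strip
# the "word_" prefixes and return the (lowercased) POS tags only:
# "dt" "nn" "in" "nnp" "vbd" "rb" "vbn" "vbn" "in" "nnp" "."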