The generic tokenization function for splitting a given input text into
single words (sequences of characters delimited by spaces or punctuation marks).
In earlier versions of the package stylo, the default splitting
sequence of characters was "[^[:alpha:]]+" on Mac/Linux and "\\W+_"
on Windows. Two different splitting rules were used because regular
expressions are not entirely platform-independent; type
help(regexp)
for more details. For the sake of compatibility, then, since version
0.5.6 a lengthy list of dozens of letters from several alphabets
(Latin, Cyrillic, Greek, and Hebrew so far) has been specified explicitly:

paste("[^A-Za-z",
# Latin supplement (Western):
"\U00C0-\U00FF",
# Latin supplement (Eastern):
"\U0100-\U01BF",
# Latin extended (phonetic):
"\U01C4-\U02AF",
# modern Greek:
"\U0386\U0388-\U03FF",
# Cyrillic:
"\U0400-\U0481\U048A-\U0527",
# Hebrew:
"\U05D0-\U05EA\U05F0-\U05F4",
# extended Latin:
"\U1E00-\U1EFF",
# ancient Greek:
"\U1F00-\U1FBC\U1FC2-\U1FCC\U1FD0-\U1FDB\U1FE0-\U1FEC\U1FF2-\U1FFC",
"]+",
sep="")
Alternatively, different tokenization rules can be applied through
the option splitting.rule
(see above), as shown in the sketch below.
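For instance, to split on whitespace only, so that punctuation stays
attached to the neighbouring word (a minimal sketch: txt.to.words() is
assumed here to be the tokenizer being documented, and the sample string
is invented):

txt.to.words("Qui scribit, bis legit.", splitting.rule = "[ \t\n]+")
# punctuation now stays attached to the words: "scribit," "legit." etc.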
ATTENTION: this is the only
piece of code in the package stylo
that might depend on the
operating system used. While on Mac/Linux the native encoding is Unicode,
on Windows there is no guarantee that your text will be loaded properly. A reliable
solution for Windows users is to convert your texts into Unicode (a variety
of freeware converters are available on the internet), and to use an
appropriate encoding option when loading the files:
read.table("file.txt", encoding = "UTF-8")
or
scan("file.txt", what = "char", encoding = "UTF-8").
If you use the functions provided by the package stylo, you should
pass this option as an argument to your chosen function:
stylo(encoding = "UTF-8"),
classify(encoding = "UTF-8"), or
oppose(encoding = "UTF-8").
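Putting these pieces together, a minimal end-to-end sketch for Windows
users might look as follows (the file name is a placeholder, and
txt.to.words() is again assumed to be the tokenizer documented here):

library(stylo)
# load a UTF-8 text file, declaring the encoding explicitly:
raw.text = scan("file.txt", what = "char", sep = "\n", encoding = "UTF-8")
# glue the lines back together and tokenize with the default rule:
tokenized.text = txt.to.words(paste(raw.text, collapse = " "))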