corpus (version 0.9.1)

text_tokens: Text Tokenization

Description

Segment text into tokens, each of which is an instance of a particular ‘type’.

Usage

text_tokens(x, filter = text_filter(x))

text_ntoken(x, filter = text_filter(x))

text_length(x, filter = text_filter(x))

Arguments

x

object to be tokenized.

filter

filter specifying the transformation from text to token sequence.

Value

text_tokens returns a list of the same length as x, with the same names. Each list item is a character vector with the tokens for the corresponding element of x.

text_ntoken returns a numeric vector of the same length as x, with each element giving the number of non-dropped tokens in the corresponding text. text_length is similar, but includes dropped tokens in the count.
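
For instance (a minimal sketch, assuming the corpus package is attached and default filter settings; the exact token values depend on the filter):

    library(corpus)
    x <- c(a = "Able was I.", b = "Ere I saw Elba.")
    text_tokens(x)                  # named list of character vectors, one per text
    f <- text_filter(drop = stopwords("english"))
    text_ntoken(x, f)               # per-text counts, excluding dropped tokens
    text_length(x, f)               # per-text counts, including dropped tokens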

Stemming

We use the stemming algorithms provided by the Snowball library. These algorithms are also available in the SnowballC R package. Unlike that package, we provide the ability to exempt certain words from stemming, using the stem_except argument; see the examples below. If you do not specify the stem_except argument, then we set its value equal to the drop argument.
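
As a small sketch of that default (assuming the corpus package is attached), a term listed in drop gets matched in its unstemmed form, because stem_except inherits the drop list when left unspecified:

    library(corpus)
    # stem_except is unspecified, so it defaults to drop = "is"; the
    # word "is" is exempted from stemming and then dropped (replaced by NA)
    text_tokens("Mary is running",
                text_filter(stemmer = "english", drop = "is"))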

We also exempt from stemming any case where stemming would turn internal punctuation, like the full stop in "u.s", into boundary punctuation, like the full stop at the end of "u."; otherwise, in examples like this, the stemming procedure would turn single-word tokens into multi-word tokens (compare text_tokens("u.s") with text_tokens("u.")). For English, this likely only affects words ending in ".s".
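
The comparison, as a concrete sketch (combining is disabled here so that the default abbreviation list does not interfere):

    library(corpus)
    # "u.s" has an internal full stop, so UAX #29 keeps it as one word
    text_tokens("u.s", text_filter(combine = NULL))
    # "u." ends in boundary punctuation and splits into "u" and "."
    text_tokens("u.", text_filter(combine = NULL))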

Combining words

The combine property of a text_filter enables transformations that combine two or more words into a single token. For example, specifying combine = "new york" will cause consecutive instances of the words new and york to get replaced by a single token, new york.

By default, we set combine = abbreviations("english"), so that abbreviations like "Ms." get treated as single tokens; with combine = NULL, trailing punctuation gets split off, and "Ms." gets tokenized into the two tokens "Ms" and ".".
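
A short sketch of a custom combination (assuming the corpus package is attached; note that combine terms get normalized, so "new york" matches "New York" in the text):

    library(corpus)
    # consecutive "new" and "york" collapse into the single token "new york"
    text_tokens("I love New York",
                text_filter(combine = "new york"))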

Details

text_tokens splits texts into token sequences. Each token is an instance of a particular type. This operation proceeds in a series of stages, controlled by the filter argument:

  1. First, we segment the text into words using the boundaries defined by Unicode Standard Annex #29, Section 4. We categorize each word as "letter", "number", "punct", or "symbol" according to the first character in the word. For words with two or more characters that start with extenders like underscore (_), we use the second character in the word to categorize it, treating a second extender as a letter.

  2. Next, we normalize the words by applying the character mappings indicated by the map_case, map_quote, and remove_ignorable properties. At the end of the second stage, we have segmented the text into a sequence of normalized words, in Unicode composed normal form (NFC).

  3. In the third stage, if the stemmer property is non-NULL, we apply the indicated stemming algorithm to each word that does not match one of the elements of the stem_except character vector. See the ‘Stemming’ section below for more information.

  4. Next, if the combine property is non-NULL, we scan the word sequence from left to right, searching for the longest possible match in the combine list. If a match exists, we replace the word sequence with a single token for that type; otherwise, we create a single-word token. See the ‘Combining words’ section below for more details. After this stage, the sequence elements are ‘tokens’, not ‘words’.

  5. If any of drop_letter, drop_number, drop_punct, or drop_symbol are TRUE, then we replace the tokens with values in the corresponding categories by NA. (For multi-word types created by the combine step, we take the category of the first word in the phrase; see the sketch after this list.) Then, if the drop property is non-NULL, we replace tokens that match elements of this character vector with NA. We can add exceptions to the drop rules by specifying a non-NULL value for the drop_except property: drop_except is a character vector, and we restore tokens that match its elements to their values prior to dropping.
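
The following sketch (assuming the corpus package is attached) illustrates how steps 4 and 5 interact, with the drop category of a combined token taken from its first word:

    library(corpus)
    # "new york" combines first; its category comes from "new", a letter
    # word, so drop_number = TRUE drops "10012" but keeps "new york"
    text_tokens("new york 10012",
                text_filter(combine = "new york", drop_number = TRUE))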

When filter = NULL, we treat all logical properties as FALSE and all other properties as NA or NULL.
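
For example (a sketch under those defaults):

    library(corpus)
    # with filter = NULL there is no case mapping, combining, or stemming,
    # so we expect the tokens "Ms", ".", "Jones"
    text_tokens("Ms. Jones", filter = NULL)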

Terms specified by the stem_except, combine, drop, and drop_except properties need to be given in stemmed form (unless the stemmer property is NULL), but they do not need to be normalized. We normalize the argument values in the manner specified by map_case, map_quote, and remove_ignorable. Thus, for example, if map_case = TRUE, then a token filter with combine = "Mx." produces the same results as a token filter with combine = "mx.".
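
Continuing that example as a minimal sketch:

    library(corpus)
    # with map_case = TRUE (the default), combine terms get case-folded,
    # so these two filters should produce identical tokenizations
    text_tokens("Mx. Jones", text_filter(combine = "Mx."))
    text_tokens("Mx. Jones", text_filter(combine = "mx."))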

See Also

text_split, text_types, abbreviations, stopwords, term_matrix.

Examples

    text_tokens("The quick ('brown') fox can't jump 32.3 feet, right?")

    # count non-dropped tokens:
    text_ntoken("The quick ('brown') fox can't jump 32.3 feet, right?")

    # count dropped and non-dropped tokens:
    text_length("The quick ('brown') fox can't jump 32.3 feet, right?")

    # don't change case or quotes:
    f <- text_filter(map_case = FALSE, map_quote = FALSE)
    text_tokens("The quick ('brown') fox can't jump 32.3 feet, right?", f)

    # drop common function words ('stop' words):
    text_tokens("Able was I ere I saw Elba.",
                text_filter(drop = stopwords("english")))

    # drop numbers, with some exceptions:
    text_tokens("0, 1, 2, 3, 4, 5",
                text_filter(drop_number = TRUE,
                            drop_except = c("0", "2", "4")))

    # apply stemming...
    text_tokens("Mary is running", text_filter(stemmer = "english"))

    # ...except for certain words
    text_tokens("Mary is running",
                text_filter(stemmer = "english", stem_except = "mary"))

    # combine abbreviations by default
    text_tokens("Ms. Jones")

    # disable default combinations
    text_tokens("Ms. Jones", text_filter(combine = NULL))

    # add new combinations
    text_tokens("Ms. Jones is from New York City, New York.",
                text_filter(combine = c(abbreviations("english"),
                                        "new york", "new york city")))