
corpus (version 0.5.1)

tokens: Text Tokenization

Description

Segment text into tokens, each of which is an instance of a particular ‘term’ (formally, a type).

Usage

text_filter(map_case = TRUE, map_compat = TRUE,
                map_dash = TRUE, map_quote = TRUE, 
                remove_control = TRUE, remove_ignorable = TRUE,
                remove_space = TRUE, ignore_empty = TRUE,
                stemmer = NULL, stem_except = drop, combine = NULL,
                drop_symbol = FALSE, drop_number = FALSE,
                drop_letter = FALSE, drop_kana = FALSE,
                drop_ideo = FALSE, drop = NULL, drop_except = select,
                select = NULL)

tokens(x, filter = text_filter())

Arguments

x

object to be tokenized.

filter

a filter specifying the transformation from text to token sequence; either a list or a text_filter object.

map_case

a logical value indicating whether to apply Unicode case mapping to the text. For most languages, this transformation changes uppercase characters to their lowercase equivalents.

map_compat

a logical value indicating whether to apply Unicode compatibility mappings to the characters, the mappings required for the NFKC and NFKD normal forms.

map_dash

a logical value indicating whether to replace Unicode dash characters like em dash and en dash with an ASCII dash (-).

map_quote

a logical value indicating whether to replace Unicode quote characters like single quote, double quote, and apostrophe with an ASCII single quote (').

remove_control

a logical value indicating whether to remove non-white-space control characters (from the C0 and C1 character classes, and the delete character).

remove_ignorable

a logical value indicating whether to remove Unicode "default ignorable" characters like zero-width spaces and soft hyphens.

remove_space

a logical value indicating whether to remove white-space characters like space and new line.

ignore_empty

a logical value indicating whether to ignore tokens which, after applying all other normalizations, are empty (containing no characters). A token can become empty if, for example, it starts as white-space.

stemmer

a character value giving the name of the stemming algorithm, or NULL to leave words unchanged. The stemming algorithms are provided by the Snowball stemming library; the following stemming algorithms are available: "arabic", "danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian", "norwegian", "porter", "portuguese", "romanian", "russian", "spanish", "swedish", "tamil", and "turkish".

stem_except

a character vector of exception words to exempt from stemming, or NULL. If left unspecified, stem_except is set equal to the drop argument.

combine

a character vector of multi-word phrases to combine, or NULL; see ‘Combining words’.

drop_symbol

a logical value indicating whether to replace "symbol" terms (punctuation, emoji, and other words that are not classified as "number", "letter", "kana", or "ideo") with NA.

drop_number

a logical value indicating whether to replace "number" terms (starting with numerals) with NA.

drop_letter

a logical value indicating whether to replace "letter" terms (starting with letters excluding kana and ideographic characters) with NA.

drop_kana

a logical value indicating whether to replace "kana" terms (starting with kana characters) with NA.

drop_ideo

a logical value indicating whether to replace "ideo" terms (starting with ideographic characters) with NA.

drop

a character vector of terms to replace with NA, or NULL.

drop_except

a character vector of terms to exempt from the drop rules specified by the drop_symbol, drop_number, drop_letter, drop_kana, drop_ideo, and drop arguments, or NULL. If left unspecified, drop_except is set equal to the select argument.

select

a character vector of terms to keep, or NULL; if non-NULL, tokens that are not on this list get replaced with NA.

Value

A list of the same length as x, with the same names. Each list item is a character vector with the tokens for the corresponding element of x.
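
For example, names carry over from the input to the result (a minimal sketch; the exact tokens depend on the filter in effect):

    # tokenize a named character vector; the result is a named list
    txt <- c(first = "Hello, world!", second = "Goodbye.")
    tokens(txt)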

Combining words

The combine property of a text_filter enables transformations that combine two or more words into a single token. For example, specifying combine = "new york" will cause consecutive instances of the words new and york to get replaced by a single token, new york.
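
For instance (a sketch; matching happens after the normalization stage, so give the phrase in its normalized, lowercase form):

    # consecutive instances of "new" and "york" become one token
    tokens("We flew to New York last week",
           text_filter(combine = "new york"))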

Details

tokens splits texts into token sequences. Each token is an instance of a particular term (formally, a type). This operation proceeds in a series of stages, controlled by the filter argument:

  1. First, we segment the text into words using the boundaries defined by Unicode Standard Annex #29, Section 4. We categorize each word as "number", "letter", "kana", "ideo", or "symbol" according to whether the first character is a numeral, letter, kana, ideographic, or other character, respectively. For words with two or more characters that start with extenders like underscore (_), we use the second character in the word to categorize it, treating a second extender as a letter.

  2. Next, we normalize the words by applying the character mappings indicated by the map_case, map_compat, map_dash, map_quote, remove_control, remove_ignorable, and remove_space properties. If, after normalization, a word is empty (for example, if it started out as all white-space and remove_space is TRUE), and if ignore_empty is TRUE, we delete the word from the sequence. At the end of the second stage, we have segmented the text into a sequence of normalized words, in Unicode composed normal form (NFC, or if map_compat is TRUE, NFKC).

  3. In the third stage, if the stemmer property is non-NULL, we apply the indicated stemming algorithm to each word that does not match one of the elements of the stem_except character vector.

  4. Next, if the combine property is non-NULL, we scan the word sequence from left to right, searching for the longest possible match in the combine list. If a match exists, we replace the matched words with a single token for that term; otherwise, we create a single-word token. See the ‘Combining words’ section below for more details. After this stage, the sequence elements are ‘tokens’, not ‘words’.

  5. If any of drop_symbol, drop_number, drop_letter, drop_kana, or drop_ideo are TRUE, we replace the terms in the corresponding categories with NA. (For multi-word terms, we take the category of the first word in the phrase.) Then, if the drop property is non-NULL, we replace terms that match elements of this character vector with NA. We can add exceptions to the drop rules by specifying a non-NULL value for the drop_except property: if drop_except is a character vector, then we restore terms that match elements of this vector to their values prior to dropping (see the sketch after this list).

  6. Finally, if select is non-NULL, we replace terms that do not match elements of this character vector with NA.
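
As a rough illustration of stages 2, 3, and 5 (a sketch assuming the defaults shown in Usage), the call below case-folds the words, stems them with the English Snowball stemmer, and then drops a stop word:

    # stage 2 maps "The" to "the"; stage 3 stems "runners" to "runner"
    # and "running" to "run"; stage 5 replaces "the" with NA
    tokens("The runners were running",
           text_filter(stemmer = "english", drop = "the"))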

When filter = NULL, we treat all logical properties as FALSE and all other properties as NULL.

See Also

sentences, term_counts, term_matrix.

Examples

    tokens("The quick ('brown') fox can't jump 32.3 feet, right?")

    # don't normalize:
    tokens("The quick ('brown') fox can't jump 32.3 feet, right?", NULL)

    # drop common function words ('stop' words):
    tokens("Able was I ere I saw Elba.",
           text_filter(drop = stopwords("english")))

    # drop numbers, with some exceptions:
    tokens("0, 1, 2, 3, 4, 5",
           text_filter(drop_number = TRUE, drop_except = c("0", "2", "4")))
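
    # a further sketch: keep only terms on a select list,
    # replacing all other tokens with NA:
    tokens("one two three two one",
           text_filter(select = c("one", "two")))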
