
corpus (version 0.6.0)

tokens: Text Tokenization

Description

Segment text into tokens, each of which is an instance of a particular ‘type’.

Usage

token_filter(map_case = TRUE, map_compat = TRUE,
                 map_quote = TRUE, remove_ignorable = TRUE,
                 stemmer = NULL, stem_except = drop,
                 combine = NULL,
                 drop_letter = FALSE, drop_mark = FALSE,
                 drop_number = FALSE, drop_punct = FALSE,
                 drop_symbol = FALSE, drop_other = FALSE,
                 drop = NULL, drop_except = NULL)

tokens(x, filter = token_filter())

Arguments

x

object to be tokenized.

filter

a list or token filter object specifying the transformation from text to token sequence.

map_case

a logical value indicating whether to apply Unicode case mapping to the text. For most languages, this transformation changes uppercase characters to their lowercase equivalents.
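
For example, a minimal sketch of the effect (assuming the corpus package is attached; the exact tokens also depend on the other filter defaults):

    library("corpus")
    # default map_case = TRUE should lowercase "The" to "the";
    # disabling it should preserve the original capitalization
    tokens("The Fox")
    tokens("The Fox", token_filter(map_case = FALSE))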

map_compat

a logical value indicating whether to apply Unicode compatibility mappings to the characters, those required for NFKC and NFKD normal forms.
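
As a sketch, the 'fi' ligature (U+FB01) is a compatibility character, so under the default it should decompose into two ordinary letters:

    # with map_compat = TRUE, "\ufb01t" should normalize to "fit";
    # with map_compat = FALSE, the ligature is retained
    tokens("\ufb01t")
    tokens("\ufb01t", token_filter(map_compat = FALSE))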

map_quote

a logical value indicating whether to replace Unicode quote characters like single quote, double quote, and apostrophe, with an ASCII single quote (').
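
A sketch using a right single quotation mark (U+2019), which the default should map to an ASCII quote:

    # "don\u2019t" uses a curly apostrophe; map_quote = TRUE should
    # replace it with the ASCII single quote (')
    tokens("don\u2019t")
    tokens("don\u2019t", token_filter(map_quote = FALSE))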

remove_ignorable

a logical value indicating whether to remove Unicode "default ignorable" characters like zero-width spaces and soft hyphens.
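
For instance, a soft hyphen inside a word should disappear under the default (a sketch; it assumes the soft hyphen does not itself introduce a word boundary):

    # "co\u00adoperate" contains a soft hyphen (U+00AD), a default
    # ignorable character; removing it should yield one token, "cooperate"
    tokens("co\u00adoperate")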

stemmer

a character value giving the name of the stemming algorithm, or NULL to leave words unchanged. The stemming algorithms are provided by the Snowball stemming library; the following stemming algorithms are available: "arabic", "danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian", "norwegian", "porter", "portuguese", "romanian", "russian", "spanish", "swedish", "tamil", and "turkish".
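
A sketch using the "english" Snowball stemmer (note that Snowball stemming is rule-based, not a dictionary lemmatizer):

    # "running" and "jumps" should stem to "run" and "jump";
    # irregular forms like "ran" are left unchanged
    tokens("running jumps ran", token_filter(stemmer = "english"))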

stem_except

a character vector of exception words to exempt from stemming, or NULL. If left unspecified, stem_except is set equal to the drop argument.
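
For example, to keep selected words intact while stemming the rest (a sketch):

    # "running" is exempted and kept as-is; "jumps" should still
    # stem to "jump"
    tokens("running jumps",
           token_filter(stemmer = "english", stem_except = "running"))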

combine

a character vector of multi-word phrases to combine, or NULL; see ‘Combining words’.

drop_letter

a logical value indicating whether to replace "letter" tokens (cased letters, kana, ideographic, letter-like numeric characters, and other letters) with NA.

drop_mark

a logical value indicating whether to replace "mark" tokens (subscripts, superscripts, modifier letters, modifier symbols, and other marks) with NA.

drop_number

a logical value indicating whether to replace "number" tokens (decimal digits, words appearing to be numbers, and other numeric characters) with NA.

drop_punct

a logical value indicating whether to replace "punct" tokens (punctuation) with NA.

drop_symbol

a logical value indicating whether to replace "symbol" tokens (emoji, math, currency, and other symbols) with NA.

drop_other

a logical value indicating whether to replace "other" tokens (types that do not fall into any other categories) with NA.

drop

a character vector of types to replace with NA, or NULL.
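
A sketch dropping a custom set of types (compare the stop-word example under 'Examples' below):

    # tokens of type "the" and "a" should be replaced by NA
    tokens("the cat saw a dog", token_filter(drop = c("the", "a")))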

drop_except

a character vector of types to exempt from the drop rules specified by the drop_letter, drop_mark, drop_number, drop_punct, drop_symbol, drop_other, and drop arguments, or NULL.

Value

A list of the same length as x, with the same names. Each list item is a character vector with the tokens for the corresponding element of x.
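
A sketch showing that length and names carry over from the input:

    x <- c(first = "One fish.", second = "Two fish!")
    toks <- tokens(x)
    length(toks)  # should be 2, the same as length(x)
    names(toks)   # should be c("first", "second")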

Combining words

The combine property of a token_filter enables transformations that combine two or more words into a single token. For example, specifying combine = "new york" will cause consecutive instances of the words new and york to get replaced by a single token, new york.
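
A sketch of this behavior (case mapping runs before combining, so the lowercase phrase should match the capitalized input):

    # "New" and "York" first normalize to "new" and "york"; the
    # consecutive pair should then combine into one token, "new york"
    tokens("I love New York", token_filter(combine = "new york"))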

Details

tokens splits texts into token sequences. Each token is an instance of a particular type. This operation proceeds in a series of stages, controlled by the filter argument:

  1. First, we segment the text into words using the boundaries defined by Unicode Standard Annex #29, Section 4. We categorize each word as "letter", "mark", "number", "punct", "symbol", or "other" according to the first character in the word. For words with two or more characters that start with extenders like underscore (_), we use the second character in the word to categorize it, treating a second extender as a letter.

  2. Next, we normalize the words by applying the character mappings indicated by the map_case, map_compat, map_quote, and remove_ignorable properties. At the end of the second stage, we have segmented the text into a sequence of normalized words, in Unicode composed normal form (NFC, or if map_compat is TRUE, NFKC).

  3. In the third stage, if the stemmer property is non-NULL, we apply the indicated stemming algorithm to each word that does not match one of the elements of the stem_except character vector.

  4. Next, if the combine property is non-NULL, we scan the word sequence from left to right, searching for the longest possible match in the combine list. If a match exists, we replace the word sequence with a single token for that type; otherwise, we create a single-word token. See the ‘Combining words’ section below for more details. After this stage, the sequence elements are ‘tokens’, not ‘words’.

  5. If any of drop_letter, drop_mark, drop_number, drop_punct, drop_symbol, or drop_other are TRUE, we replace the tokens with values in the corresponding categories by NA. (For multi-word types created by the combine step, we take the category of the first word in the phrase.) Then, if the drop property is non-NULL, we replace tokens that match elements of this character vector with NA. We can add exceptions to the drop rules by specifying a non-NULL value for the drop_except property: if drop_except is a character vector, then we restore tokens that match elements of this vector to their values prior to dropping. A sketch combining these stages follows below.

When filter = NULL, we treat all logical properties as FALSE and all other properties as NULL.
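
As a sketch combining several of the stages above (normalization, stemming, phrase combining, and punctuation dropping), contrasted with the filter = NULL behavior:

    f <- token_filter(stemmer = "english",
                      combine = "new york",
                      drop_punct = TRUE)
    # words should normalize and stem ("visiting" -> "visit",
    # "jumped" -> "jump"), "new york" should combine into one token,
    # and the punctuation tokens should be replaced by NA
    tokens("Visiting New York, she jumped.", f)
    # with filter = NULL, the segmented words should come back
    # without normalization, stemming, combining, or dropping
    tokens("Visiting New York, she jumped.", filter = NULL)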

See Also

text_split, stopwords, term_counts, term_matrix.

Examples

    tokens("The quick ('brown') fox can't jump 32.3 feet, right?")

    # don't normalize:
    tokens("The quick ('brown') fox can't jump 32.3 feet, right?", NULL)

    # drop common function words ('stop' words):
    tokens("Able was I ere I saw Elba.",
           token_filter(drop = stopwords("english")))

    # drop numbers, with some exceptions:
    tokens("0, 1, 2, 3, 4, 5",
           token_filter(drop_number = TRUE, drop_except = c("0", "2", "4")))
