corpus (version 0.3.1)

tokens: Text Tokenization

Description

Segment text into tokens, each of which is an instance of a particular word type (normalized token).

Usage

text_filter(fold_case = TRUE, fold_dash = TRUE,
                fold_quote = TRUE, map_compatible = TRUE,
                remove_control = TRUE, remove_ignorable = TRUE,
                remove_whitespace = TRUE, drop_empty = TRUE,
                stemmer = NULL)

tokens(x, filter = text_filter())

Arguments

x
object to be tokenized.
filter
filter to apply to the token sequence, or NULL.
fold_case
a logical value indicating whether to apply Unicode case folding to the text. For most languages, this transformation changes uppercase characters to their lowercase equivalents.
fold_dash
a logical value indicating whether to replace Unicode dash characters like em dash and en dash with an ASCII dash (-).
fold_quote
a logical value indicating whether to replace Unicode quote characters like single quote, double quote, and apostrophe, with an ASCII single quote (').
map_compatible
a logical value indicating whether to apply the Unicode compatibility mappings to the characters, i.e., the mappings required for the NFKC and NFKD normal forms.
remove_control
a logical value indicating whether to remove non-whitespace control characters (from the C0 and C1 character classes, and the delete character).
remove_ignorable
a logical value indicating whether to remove Unicode "default ignorable" characters like zero-width spaces and soft hyphens.
remove_whitespace
a logical value indicating whether to remove white space characters like space and new line.
drop_empty
a logical value indicating whether to remove tokens which, after applying all other normalizations, are empty (containing no characters). A token can become empty if, for example, it starts as white space.
stemmer
a character value giving the name of the stemming algorithm, or NULL to disable stemming. The stemming algorithms are provided by the Snowball stemming library (http://snowballstem.org/algorithms/); the following stemming algorithms are available: arabic, danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, porter, portuguese, romanian, russian, spanish, swedish, tamil, and turkish.
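
As an illustration of the stemmer argument, the following sketch (assuming the corpus package is installed) builds a filter that applies the English Snowball stemmer on top of the default normalizations:

```r
library(corpus)

# Build a filter that applies the English Snowball stemmer in addition
# to the default normalizations:
f <- text_filter(stemmer = "english")

# Inflected forms in the input are reduced to their stems:
tokens("The quick foxes were running", filter = f)
```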

Value

A list of the same length as x, with the same names. Each list item is a character vector with the tokens for the corresponding element of x.
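
The sketch below (assuming the corpus package is installed) shows how the names of the input carry over to the result:

```r
library(corpus)

# A named character vector of two texts:
x <- c(doc1 = "One fish, two fish.", doc2 = "Red fish, blue fish.")

# The result is a list of length 2 with the same names as x;
# each element is a character vector of tokens for that text:
result <- tokens(x)
names(result)   # "doc1" "doc2"
```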

Details

tokens splits text at the word boundaries defined by Unicode Standard Annex #29 (http://unicode.org/reports/tr29/#Word_Boundaries), normalizes the text to Unicode NFC normal form, and then applies a series of further transformations to the resulting tokens as specified by the filter argument. To skip the additional transformation step, specify filter = NULL.
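
To see the effect of toggling an individual normalization, the following sketch (assuming the corpus package is installed) disables case folding while leaving the other filter defaults in place:

```r
library(corpus)

# The default filter folds case, so "The" becomes "the":
tokens("The Quick Fox")

# Disabling fold_case preserves the original case of each token,
# while the other default normalizations still apply:
tokens("The Quick Fox", filter = text_filter(fold_case = FALSE))
```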

See Also

sentences.

Examples

    tokens("The quick ('brown') fox can't jump 32.3 feet, right?")

    # don't normalize:
    tokens("The quick ('brown') fox can't jump 32.3 feet, right?", NULL)