corpus (version 0.3.1)

tokens: Text Tokenization

Description

Segment text into tokens, each of which is an instance of a particular word type (normalized token).

Usage

text_filter(fold_case = TRUE, fold_dash = TRUE,
                fold_quote = TRUE, map_compatible = TRUE,
                remove_control = TRUE, remove_ignorable = TRUE,
                remove_whitespace = TRUE, drop_empty = TRUE,
                stemmer = NULL)

tokens(x, filter = text_filter())

Arguments

x
object to be tokenized.
filter
filter to apply to the token sequence, or NULL.
fold_case
a logical value indicating whether to apply Unicode case folding to the text. For most languages, this transformation changes uppercase characters to their lowercase equivalents.
fold_dash
a logical value indicating whether to replace Unicode dash characters like em dash and en dash with an ASCII dash (-).
fold_quote
a logical value indicating whether to replace Unicode quote characters like single quote, double quote, and apostrophe, with an ASCII single quote (').
map_compatible
a logical value indicating whether to apply the Unicode compatibility mappings to the characters, i.e., the mappings required for the NFKC and NFKD normal forms.
remove_control
a logical value indicating whether to remove non-whitespace control characters (from the C0 and C1 character classes, and the delete character).
remove_ignorable
a logical value indicating whether to remove Unicode "default ignorable" characters like zero-width spaces and soft hyphens.
remove_whitespace
a logical value indicating whether to remove white space characters like space and new line.
drop_empty
a logical value indicating whether to remove tokens which, after applying all other normalizations, are empty (containing no characters). A token can become empty if, for example, it starts as white space.
stemmer
a character value giving the name of the stemming algorithm, or NULL to disable stemming. The stemming algorithms are provided by the Snowball stemming library (http://snowballstem.org/algorithms/); the following stemming algorithms are available: arabic, danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, porter, portuguese, romanian, russian, spanish, swedish, tamil, and turkish.
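
As an illustration of the stemmer argument, the following sketch (assuming the corpus package is installed) builds a filter that applies the English Snowball stemmer on top of the default normalizations:

```r
library(corpus)

# Build a filter that applies the English Snowball stemmer in addition
# to the default normalizations:
f <- text_filter(stemmer = "english")

# Inflected forms in the input are reduced to their stems:
tokens("The quick foxes were running", filter = f)
```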

Value

A list of the same length as x, with the same names. Each list item is a character vector with the tokens for the corresponding element of x.
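
The sketch below (assuming the corpus package is installed) shows how the names of the input carry over to the result:

```r
library(corpus)

# A named character vector of two texts:
x <- c(doc1 = "One fish, two fish.", doc2 = "Red fish, blue fish.")

# The result is a list of length 2 with the same names as x;
# each element is a character vector of tokens for that text:
result <- tokens(x)
names(result)   # "doc1" "doc2"
```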

Details

tokens splits text at the word boundaries defined by Unicode Standard Annex #29 (http://unicode.org/reports/tr29/#Word_Boundaries), normalizes the text to Unicode NFC normal form, and then applies a series of further transformations to the resulting tokens as specified by the filter argument. To skip the additional transformation step, specify filter = NULL.
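
To see the effect of toggling an individual normalization, the following sketch (assuming the corpus package is installed) disables case folding while leaving the other filter defaults in place:

```r
library(corpus)

# The default filter folds case, so "The" becomes "the":
tokens("The Quick Fox")

# Disabling fold_case preserves the original case of each token,
# while the other default normalizations still apply:
tokens("The Quick Fox", filter = text_filter(fold_case = FALSE))
```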

See Also

sentences.

Examples

    tokens("The quick ('brown') fox can't jump 32.3 feet, right?")

    # don't normalize:
    tokens("The quick ('brown') fox can't jump 32.3 feet, right?", NULL)