corpus (version 0.9.1)

text_filter: Text Filters

Description

Get or specify the process by which text gets transformed into a sequence of tokens or sentences.

Usage

text_filter(x = NULL, ...)
text_filter(x) <- value

# S3 method for corpus_text
text_filter(x = NULL, ...)

# S3 method for data.frame
text_filter(x = NULL, ...)

# S3 method for default
text_filter(x = NULL, ...,
    map_case = TRUE, map_quote = TRUE, remove_ignorable = TRUE,
    stemmer = NULL, stem_dropped = FALSE, stem_except = NULL,
    combine = abbreviations("english"),
    drop_letter = FALSE, drop_number = FALSE, drop_punct = FALSE,
    drop_symbol = FALSE, drop = NULL, drop_except = NULL,
    sent_crlf = FALSE, sent_suppress = abbreviations("english"))

Arguments

x

text or corpus object.

value

text filter object, or NULL for the default.

...

further arguments passed to or from other methods.

map_case

a logical value indicating whether to apply Unicode case mapping to the text. For most languages, this transformation changes uppercase characters to their lowercase equivalents.

map_quote

a logical value indicating whether to replace curly single and double quotes with their straight (ASCII) equivalents.

remove_ignorable

a logical value indicating whether to remove Unicode "default ignorable" characters like zero-width spaces and soft hyphens.

stemmer

a character value giving the name of the stemming algorithm, or NULL to leave words unchanged. The stemming algorithms are provided by the Snowball stemming library; the following stemming algorithms are available: "arabic", "danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian", "norwegian", "porter", "portuguese", "romanian", "russian", "spanish", "swedish", "tamil", and "turkish".

stem_dropped

a logical value indicating whether to stem words in the "drop" list.

stem_except

a character vector of exception words to exempt from stemming, or NULL. If left unspecified, stem_except is set equal to the drop argument.

combine

a character vector of multi-word phrases to combine, or NULL; see ‘Combining words’.

drop_letter

a logical value indicating whether to replace "letter" tokens (cased letters, kana, idoegraphic, letter-like numeric characters and other letters) with NA.

drop_number

a logical value indicating whether to replace "number" tokens (decimal digits, words appearing to be numbers, and other numeric characters) with NA.

drop_punct

a logical value indicating whether to replace "punct" tokens (punctuation) with NA.

drop_symbol

a logical value indicating whether to replace "symbol" tokens (emoji, math, currency, URLs, and other symbols) with NA.

drop

a character vector of types to replace with NA, or NULL.

drop_except

a character vector of types to exempt from the drop rules specified by the drop_letter, drop_number, drop_punct, drop_symbol, and drop arguments, or NULL (see the sketch following this argument list).

sent_crlf

a logical value indicating whether to break sentences on carriage returns or line feeds.

sent_suppress

a character vector of sentence break suppressions, typically abbreviations after which a period should not be treated as a sentence break.
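
The stemming, dropping, and combining properties above interact during tokenization. The sketch below is a minimal illustration rather than part of the package's own examples: the sample sentence is invented, and the exact tokens returned (including whether "united states" is matched by combine after case mapping) may differ by version.

    library("corpus")

    x <- as_text("The quick foxes are running across the United States border.")

    # default filter: case mapping only; nothing stemmed, dropped, or combined
    text_tokens(x)

    # stem with the Snowball English stemmer and drop English stopwords,
    # but exempt "the" from the drop rules via drop_except
    f <- text_filter(stemmer = "english",
                     drop = stopwords("english"),
                     drop_except = "the")
    text_tokens(x, f)

    # combine a multi-word phrase into a single token type
    f2 <- text_filter(combine = c(abbreviations("english"), "united states"))
    text_tokens(x, f2)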

Value

text_filter retrieves an object's text filter, optionally with modifications to some of its properties.

text_filter<- sets an object's text filter. Setting the text filter on a character object is not allowed; the object must have type "corpus_text" or be a data frame with a "text" column of type "corpus_text".

Details

The set of properties in a text filter determines the tokenization and sentence-breaking rules. See the documentation for text_tokens and text_split for details on tokenization and sentence splitting.
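
For the sentence-breaking side, the following sketch (again illustrative, and assuming text_split splits into sentences by default; see the text_split documentation for this version) shows the effect of sent_crlf:

    library("corpus")

    x <- as_text("Dr. Smith arrived\nafter the meeting had started. Everyone noticed.")

    # default filter: the line feed does not end the first sentence, and the
    # "Dr." abbreviation does not trigger a break
    text_split(x)

    # treat carriage returns and line feeds as sentence boundaries instead
    text_filter(x) <- text_filter(sent_crlf = TRUE)
    text_split(x)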

See Also

as_text, text_tokens, text_split, abbreviations, stopwords.

Examples

# NOT RUN {
    # text filter with default options set
    text_filter()

    # specify some options but leave others unchanged
    f <- text_filter(map_case = FALSE, drop = stopwords("english"))

    # set the text filter property
    x <- as_text(c("Marnie the Dog is #1 on the internet."))
    text_filter(x) <- f
    text_tokens(x) # by default, uses x's text_filter to tokenize

    # change a filter property
    f2 <- text_filter(x, map_case = TRUE)
    # equivalent to:
    # f2 <- text_filter(x)
    # f2$map_case <- TRUE

    text_tokens(x, f2) # override text_filter(x)

    # setting text_filter on a data frame is allowed if it has a
    # column named "text" of type "corpus_text"
    d <- data.frame(text = x)
    text_filter(d) <- f2
    text_tokens(d)

    # but you can't set text filters on plain character objects, or on
    # data frames whose "text" column is a plain character vector
    y <- "hello world"
    d2 <- data.frame(text = "hello world", stringsAsFactors = FALSE)
# }
# NOT RUN {
    text_filter(y) <- f2  # gives an error
    text_filter(d2) <- f2 # gives an error
# }
