corpus (version 0.10.1)

stem_snowball: Snowball Stemmer

Description

Stem a set of terms using one of the algorithms provided by the Snowball stemming library.

Usage

stem_snowball(x, algorithm = "en")

Arguments

x

character vector of terms to stem.

algorithm

stemming algorithm; see ‘Details’ for the valid choices.

Value

A character vector with the same length and names as the input, x, whose entries are the corresponding stems.

Details

Apply a Snowball stemming algorithm to a vector of input terms, x, returning the result in a character vector of the same length with the same names.

The algorithm argument specifies the stemming algorithm. Valid choices include the following: "ar" ("arabic"), "da" ("danish"), "de" ("german"), "en" ("english"), "es" ("spanish"), "fi" ("finnish"), "fr" ("french"), "hu" ("hungarian"), "it" ("italian"), "nl" ("dutch"), "no" ("norwegian"), "pt" ("portuguese"), "ro" ("romanian"), "ru" ("russian"), "sv" ("swedish"), "ta" ("tamil"), "tr" ("turkish"), and "porter". Setting algorithm = NULL gives a stemmer that returns its input unchanged.
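As a minimal sketch of how the algorithm argument might be used, the following compares a two-letter code with its full language name and with algorithm = NULL. It assumes the corpus package is installed; the variable names (terms, by_code, by_name, unstemmed) are illustrative only.

```r
library(corpus)

# Named input; the names are expected to carry over to the result.
terms <- c(first = "winning", second = "winner")

# The short code and the full language name should select the same stemmer.
by_code <- stem_snowball(terms, algorithm = "en")
by_name <- stem_snowball(terms, algorithm = "english")

# algorithm = NULL is documented to return the input unchanged.
unstemmed <- stem_snowball(terms, algorithm = NULL)

print(by_code)
print(identical(by_code, by_name))
print(unstemmed)
```

Passing NULL can be convenient when the same code path should run both with and without stemming, since the call itself does not need to change.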

The function stems only single-word terms of kind "letter"; it leaves all other inputs (multi-word terms, and terms of kind "number", "punct", and "symbol") unchanged.
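The rule above can be checked with a short sketch that mixes a letter term with inputs that should pass through untouched. The sample terms are chosen for illustration, and the code assumes the corpus package is installed.

```r
library(corpus)

# One single-word letter term, followed by a non-letter term (as in the
# Examples section), a number, and a multi-word term.
terms <- c("winning", "#winning", "2010", "new york")
stems <- stem_snowball(terms, algorithm = "en")

# TRUE marks the entries that the stemmer left unchanged.
unchanged <- stems == terms

print(data.frame(term = terms, stem = stems, unchanged = unchanged))
```

Comparing the output with the input in this way is a quick check for which terms a given stemmer actually modifies.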

The Snowball stemming library provides the underlying implementation. The wordStem function from the SnowballC package provides a similar interface, but that function applies the algorithm to all input terms, regardless of the kind of the term.

See Also

new_stemmer, text_filter.

Examples

library(corpus)

# apply English stemming algorithm; don't stem non-letter terms
stem_snowball(c("win", "winning", "winner", "#winning"))

# compare with SnowballC, which stems all kinds, not just letter
# (requires the SnowballC package to be installed)
SnowballC::wordStem(c("win", "winning", "winner", "#winning"), "en")