stopwords: access built-in stopwords

Description

This function retrieves the stopword list of the type specified in the kind argument and returns it as a character vector. The default is English.

Usage

stopwords(kind = "english", verbose = FALSE)

Arguments

kind

The pre-set kind of stopwords (as a character string). Allowed values include english, SMART, danish, french, hungarian, norwegian, russian, and swedish; see Details for the full list of supported languages.

verbose

Logical; if FALSE (the default), suppress the annoying warning note.

Value

a character vector of stopwords

A note of caution

Stopwords are an arbitrary choice imposed by the user, and accessing a pre-defined list of words to ignore does not mean that it will perfectly fit your needs. You are strongly encouraged to inspect the list and to make sure it fits your particular requirements. The built-in English stopword list does not contain "will", for instance, because of its multiple meanings, but you might want to include this word for your own application.
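
The check recommended above can be sketched in a few lines. This is a minimal illustration, assuming the quanteda version documented here (where stopwords() takes a kind argument) is installed; the names sw and my_stopwords are introduced only for this example.

```r
library(quanteda)  # assumption: a version exporting stopwords(kind = ...)

# retrieve the built-in English list and look at it before relying on it
sw <- stopwords("english")
length(sw)        # number of words in the list
head(sw, 20)      # first few entries

# check whether a word that matters for your application is present
"will" %in% sw

# build an application-specific list; unique() drops any duplicates
my_stopwords <- unique(c(sw, "will", "mr", "nine"))
```

The extended vector can then be passed wherever the built-in list would be used, as in the Examples section.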

Details

The stopword lists are the SMART English stopwords from the SMART information retrieval system (obtained from http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop) and a set of stopword lists in different languages from the Snowball stemmer project (obtained from http://svn.tartarus.org/snowball/trunk/website/algorithms/; see the stop.txt files in each subdirectory). Supported languages are arabic, danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish, and swedish. Language names are case sensitive.
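
Because the kind names are case sensitive, the SMART list is requested in upper case and the Snowball lists in lower case. The following is a rough sketch comparing the two English lists, again assuming the quanteda version documented here; smart and snowball_en are illustrative names only.

```r
library(quanteda)  # assumption: a version exporting stopwords(kind = ...)

smart       <- stopwords("SMART")    # upper case, as listed under Arguments
snowball_en <- stopwords("english")  # lower case

# compare the sizes of the two lists
c(SMART = length(smart), english = length(snowball_en))

# a few words that appear in the SMART list but not in the Snowball one
head(setdiff(smart, snowball_en))
```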

Examples

stopwords("english")[1:5]
stopwords("italian")[1:5]
stopwords("arabic")[1:5]

# adding to the built-in stopword list
toks <- tokenize("The judge will sentence Mr. Adams to nine years in prison", removePunct = TRUE)
removeFeatures(toks, c(stopwords("english"), "will", "mr", "nine"))
