stopwords: the R package

R package providing “one-stop shopping” (or should that be “one-shop stopping”?) for stopword lists in R, for multiple languages and sources. No longer should text analysis or NLP packages bake in their own stopword lists or functions, since this package can accommodate them all, and is easily extended.

Created by David Muhr, and extended in cooperation with Kenneth Benoit and Kohei Watanabe.

Installation

# from CRAN
install.packages("stopwords")

# Or get the development version from GitHub:
# install.packages("devtools")
devtools::install_github("quanteda/stopwords")

Usage

head(stopwords::stopwords("de", source = "snowball"), 20)
##  [1] "aber"    "alle"    "allem"   "allen"   "aller"   "alles"   "als"    
##  [8] "also"    "am"      "an"      "ander"   "andere"  "anderem" "anderen"
## [15] "anderer" "anderes" "anderm"  "andern"  "anderr"  "anders"

head(stopwords::stopwords("ja", source = "marimo"), 20)
##  [1] "私"       "僕"       "自分"     "自身"     "我々"     "私達"    
##  [7] "あなた"   "彼"       "彼女"     "彼ら"     "彼女ら"   "あれ"    
## [13] "それ"     "これ"     "あれら"   "あれらの" "それら"   "それらの"
## [19] "これら"   "これらの"
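
A common use of these vectors is to filter tokens. The following is a minimal base-R sketch, assuming a naive whitespace tokenizer and an illustrative sentence (neither comes from the package); a real pipeline would normally rely on a dedicated tokenizer.

```r
# Illustrative input; not taken from the package documentation
txt <- "Das ist aber ein sehr altes Haus"

# Naive tokenization: lower-case, then split on whitespace
tokens <- strsplit(tolower(txt), "\\s+")[[1]]

# German Snowball stopwords, as in the example above
sw <- stopwords::stopwords("de", source = "snowball")

# Keep only the tokens that are not stopwords
content_tokens <- tokens[!tokens %in% sw]
content_tokens
```

Because stopwords() returns a plain character vector, the usual vector operators such as %in% and setdiff() apply directly.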

For compatibility with the former quanteda::stopwords():

head(stopwords::stopwords("german"), 20)
##  [1] "aber"    "alle"    "allem"   "allen"   "aller"   "alles"   "als"    
##  [8] "also"    "am"      "an"      "ander"   "andere"  "anderem" "anderen"
## [15] "anderer" "anderes" "anderm"  "andern"  "anderr"  "anders"

Explore sources and languages:

# list all sources
stopwords::stopwords_getsources()
## [1] "snowball"      "stopwords-iso" "misc"          "smart"        
## [5] "marimo"        "ancient"       "nltk"          "perseus"

# list languages for a specific source
stopwords::stopwords_getlanguages("snowball")
##  [1] "da" "de" "en" "es" "fi" "fr" "hu" "ir" "it" "nl" "no" "pt" "ro" "ru" "sv"
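
The two listing functions can be combined to check availability before requesting a list. This is a small sketch; has_stopwords() is a hypothetical helper written for illustration and is not part of the package.

```r
# Hypothetical helper (not part of the package): TRUE if `language`
# is offered by `source` according to the package's own listings.
has_stopwords <- function(language, source = "snowball") {
  # && short-circuits, so an unknown source is never passed on
  source %in% stopwords::stopwords_getsources() &&
    language %in% stopwords::stopwords_getlanguages(source)
}

has_stopwords("de", "snowball")        # TRUE
has_stopwords("de", "no-such-source")  # FALSE
```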

Languages available

The following coverage of languages is currently available, by source. Note that the inclusiveness of the stopword lists varies by source, and that covering more languages does not necessarily make a source better than one with more limited coverage. (There may be many reasons to prefer the default “snowball” source over the “stopwords-iso” source, for instance.)

The following languages are currently available. The “Other” column names sources beyond snowball, marimo, nltk, and stopwords-iso; coverage for any single source can be listed with stopwords_getlanguages().

Language         Code  Other
Afrikaans        af
Arabic           ar    misc
Armenian         hy
Azerbaijani      az
Basque           eu
Bengali          bn
Breton           br
Bulgarian        bg
Catalan          ca    misc
Chinese          zh    misc
Croatian         hr
Czech            cs
Danish           da
Dutch            nl
English          en    smart
Esperanto        eo
Estonian         et
Finnish          fi
French           fr
Galician         gl
German           de
Greek            el    misc
Greek (ancient)  grc   ancient, perseus
Gujarati         gu    misc
Hausa            ha
Hebrew           he
Hindi            hi
Hungarian        hu
Indonesian       id
Irish            ga
Italian          it
Japanese         ja
Kazakh           kk
Korean           ko
Kurdish          ku
Latin            la    ancient, perseus
Lithuanian       lt
Latvian          lv
Malay            ms
Marathi          mr
Nepali           ne
Norwegian        no
Persian          fa
Polish           pl
Portuguese       pt
Romanian         ro
Russian          ru
Slovak           sk
Slovenian        sl
Somali           so
Southern Sotho   st
Spanish          es
Swahili          sw
Swedish          sv
Thai             th
Tagalog          tl
Tajik            tg
Turkish          tr
Ukrainian        uk
Urdu             ur
Vietnamese       vi
Yoruba           yo
Zulu             zu

Basic usage

head(stopwords::stopwords("de", source = "snowball"), 20)
##  [1] "aber"    "alle"    "allem"   "allen"   "aller"   "alles"   "als"    
##  [8] "also"    "am"      "an"      "ander"   "andere"  "anderem" "anderen"
## [15] "anderer" "anderes" "anderm"  "andern"  "anderr"  "anders"

head(stopwords::stopwords("de", source = "stopwords-iso"), 20)
##  [1] "a"           "ab"          "aber"        "ach"         "acht"       
##  [6] "achte"       "achten"      "achter"      "achtes"      "ag"         
## [11] "alle"        "allein"      "allem"       "allen"       "aller"      
## [16] "allerdings"  "alles"       "allgemeinen" "als"         "also"
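
To see how much two sources differ for the same language, the vectors can be compared with base-R set operations. This is a sketch only; the variable names are illustrative and no particular counts are implied.

```r
snow <- stopwords::stopwords("de", source = "snowball")
iso  <- stopwords::stopwords("de", source = "stopwords-iso")

# Size of each list
c(snowball = length(snow), iso = length(iso))

# Words found in only one of the two lists
head(setdiff(iso, snow))
head(setdiff(snow, iso))

# Words shared by both lists
head(intersect(snow, iso))
```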

Modifying stopword lists

It is now possible to edit your own stopword lists in an interactive editor, using functions from the quanteda package (>= v2.02). For instance, to edit the English stopword list for the Snowball source:

# edit the English stopwords
my_stopwords <- quanteda::char_edit(stopwords::stopwords("en", source = "snowball"))

To edit stopwords whose underlying structure is a list, such as the “marimo” source, we can use the list_edit() function:

# edit the English stopwords from the marimo source, kept as a list
my_stopwordlist <- quanteda::list_edit(stopwords::stopwords("en", source = "marimo", simplify = FALSE))
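
The nested structure of a list-based source can also be inspected and flattened without the interactive editor. The sketch below assumes only the data_stopwords_marimo object exported by the package; the variable names are illustrative.

```r
# English entries of the marimo source, kept as a nested list
marimo_en <- stopwords::data_stopwords_marimo$en

# Top-level categories of the list
names(marimo_en)

# One nested element, addressed by name
marimo_en$pronoun$possessive

# Flatten every level into a single character vector
flat <- unlist(marimo_en, use.names = FALSE)
head(flat)
```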

Finally, it’s possible to remove stopwords using pattern matching. The default is the easy-to-use “glob” style matching, which is equivalent to fixed matching when no wildcard characters are used. So to remove possessive pronouns from the English Snowball word list, for instance, this would work:

library("quanteda", warn.conflicts = FALSE)
## Package version: 2.9.9000
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
posspronouns <- stopwords::data_stopwords_marimo$en$pronoun$possessive
posspronouns
## [1] "my"    "our"   "your"  "his"   "her"   "its"   "their"

stopwords("en", source = "snowball") %>%
  head(n = 10)
##  [1] "i"         "me"        "my"        "myself"    "we"        "our"      
##  [7] "ours"      "ourselves" "you"       "your"

See the difference when we remove them (“my”, “our”, and “your” are gone):

stopwords("en", source = "snowball") %>%
  head(n = 10) %>%
  char_remove(pattern = posspronouns)
## [1] "i"         "me"        "myself"    "we"        "ours"      "ourselves"
## [7] "you"

There is no char_add(), since it’s just as easy to use c() for this, but there is a char_keep() for positive selection rather than removal.
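
For comparison, the same three operations can be written in base R. This is a sketch that uses the ten stopwords and the possessive pronouns printed above; the added words "etc" and "via" are arbitrary examples.

```r
# First ten English Snowball stopwords, copied from the output above
sw <- c("i", "me", "my", "myself", "we", "our",
        "ours", "ourselves", "you", "your")
posspronouns <- c("my", "our", "your", "his", "her", "its", "their")

# Add custom entries with c(); unique() guards against duplicates
sw_extended <- unique(c(sw, "etc", "via"))

# Base-R analogue of removal by fixed matching
sw_removed <- sw[!sw %in% posspronouns]
sw_removed

# Base-R analogue of positive selection
sw_kept <- sw[sw %in% posspronouns]
sw_kept
```

Unlike char_remove() and char_keep(), this exact matching does not interpret wildcard characters.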

Adding stopwords to your own package

As of version 1.1, we’ve made it a one-step process to add stopwords() to your package through a re-export. Simply call use_stopwords() like this:

> stopwords::use_stopwords()
✔ Setting active project to '/Users/me/GitHub/mypackage'
✔ Adding 'stopwords' to Imports field in DESCRIPTION
✔ Writing 'R/use-stopwords.R'
● Run `devtools::document()` to update 'NAMESPACE'

> devtools::document()
Updating mypackage documentation
Updating collate directive in  /Users/me/GitHub/mypackage/DESCRIPTION 
Writing NAMESPACE
Loading mypackage
Writing NAMESPACE
Writing stopwords.Rd

Contributing

Additional sources can be defined and contributed by adding new data objects, as follows:

  1. Data object. Create a named list of character vectors, in UTF-8 format, consisting of the stopwords for each language. The ISO-639-1 language code will form the name of each list element, and the value of each element will be the character vector of stopwords for literal matches. The data object should follow the package naming convention and be called data_stopwords_newsource, where newsource is replaced by the name of the new source.

  2. Documentation. The new source should be clearly documented, especially the source from which it was taken.
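
The shape of such a data object might look like the following sketch. The source name "newsource" and the words are placeholders, and the two-letter name check only tests the format of the codes, not whether they are registered ISO-639-1 codes.

```r
# Hypothetical source "newsource"; the words are placeholders
data_stopwords_newsource <- list(
  en = c("a", "an", "the"),
  de = c("der", "die", "das")
)

# Ensure every element is a UTF-8 encoded character vector
data_stopwords_newsource <- lapply(data_stopwords_newsource, enc2utf8)

# Sanity checks on structure, names, and encoding
stopifnot(
  is.list(data_stopwords_newsource),
  !is.null(names(data_stopwords_newsource)),
  all(grepl("^[a-z]{2}$", names(data_stopwords_newsource))),
  all(vapply(data_stopwords_newsource, is.character, logical(1))),
  all(vapply(data_stopwords_newsource,
             function(x) all(validUTF8(x)), logical(1)))
)
```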

License

This package, as well as the source repositories, is licensed under the MIT license.

Version: 2.1
License: MIT + file LICENSE
Maintainer: Kenneth Benoit
Last published: December 8th, 2020
Monthly downloads: 13,169

Functions in stopwords (2.1)

use_stopwords: Use stopwords in your package
stopwords_options: set package options for stopwords
stopwords_getsources: list available stopwords sources
data_stopwords_stopwordsiso: multilingual stopwords from https://github.com/stopwords-iso/stopwords-iso
stopwords_getlanguages: list available stopwords country codes
stopwords: Collection of stopwords in multiple languages
data_stopwords_snowball: snowball stopword list
stopwords-package: stopwords: one-stop shopping for stopwords in R
data_stopwords_marimo: stopword lists including parts-of-speech
data_stopwords_perseus: stopword lists for ancient languages - Perseus Digital Library
lookup_iso_639_1: return ISO-639-1 code for a given language name
data_stopwords_smart: stopword lists from the SMART system
data_stopwords_misc: miscellaneous stopword lists
data_stopwords_ancient: stopword lists for ancient languages
data_stopwords_nltk: stopword lists from the Python NLTK library