generate_stoplist: Listing of stop words in different languages.

Description

Generate a vector of stop words in one or several languages.

Usage

generate_stoplist(language = NULL, output_form = 1)

Arguments

language

A single string or a character vector; NULL by default. The strings can be language names or ISO-639 language codes as listed by list_supported_languages(). Names and codes can be freely combined; both are case-sensitive. When no language is recognized, the function stops with the following error message: "The language name or language id you have selected is not supported. (Or you didn't specify a language at all). Check out the supported languages by calling `list_supported_languages`."
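The case-sensitive matching of mixed names and codes can be illustrated with a minimal sketch. The `supported` table below is a hypothetical two-row stand-in for what list_supported_languages() reports, and the matching logic is an assumption made for illustration, not the package's actual implementation.

```r
# Hypothetical stand-in for the table of supported languages.
supported <- data.frame(
  lang_name = c("English", "German"),
  lang_id   = c("en", "de"),
  stringsAsFactors = FALSE
)

# A selection that freely mixes a name, a code, and a wrongly cased name.
selection <- c("English", "de", "german")

# Exact (case-sensitive) matching against both names and codes.
recognized <- selection %in% c(supported$lang_name, supported$lang_id)
selection[recognized]   # "English" "de"; "german" is not matched

# Mirror the documented behaviour: stop when nothing is recognized.
if (!any(recognized)) {
  stop("No supported language in the selection.")
}
```

In this sketch "german" is silently dropped because its capitalization differs from the listed name.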

output_form

Default 1, alternatively 2 or 3. Option 1 returns a character vector of unique stopword forms. Option 2 returns a named vector whose elements are the stopword forms and whose names are the associated stop classes. One word form can occur with different stop classes; hence the word forms in this vector are not unique, unlike in Option 1. Option 3 returns a data frame filtered according to the language selection.

Value

The function comes with three output options.

Option `1` outputs a character vector of unique word forms.
Option `2` outputs a named character vector of word forms. The names denote `stop classes`, which roughly correspond to parts of speech. Note that, in this output, the word forms are not unique. For instance, among the English stopwords, *that* occurs both as a subordinating conjunction and as a pronoun.
Option `3` outputs a data frame, where each row represents a combination of language (columns `lang_name` and `lang_id`), word form and word lemma (columns `form` and `lemma`), and several other columns.

All outputs are encoded in UTF-8.
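How the three output forms relate to each other can be sketched on a toy data frame. Everything below is a hypothetical stand-in: the rows, the `stop_class` column name, and the class labels are assumptions made for illustration, not the contents of `multilingual_stoplist` or the package's own code.

```r
# Toy analogue of the filtered data frame (output_form = 3).
# Rows, the `stop_class` column name and its labels are hypothetical.
toy <- data.frame(
  lang_name  = "English",
  lang_id    = "en",
  form       = c("that", "that", "and"),
  lemma      = c("that", "that", "and"),
  stop_class = c("subordinating_conjunction", "pronoun",
                 "coordinating_conjunction"),
  stringsAsFactors = FALSE
)

# Analogue of output_form = 2: word forms named by their stop class.
# "that" appears twice because it belongs to two classes.
named_forms <- setNames(toy$form, toy$stop_class)

# Analogue of output_form = 1: unique word forms only.
unique_forms <- unique(toy$form)

print(named_forms)
print(unique_forms)
```

The named vector keeps one element per form and class combination, while the plain vector collapses repeated forms.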

Warning

The function stops when no language is selected.
The stop classes (pre-defined linguistic filters) are not mutually exclusive. Their overlap varies among languages.
The stoplists are fully data-driven. We have set a threshold of 3 occurrences of a combination of language, form, lemma, and upos to remove obvious noise, but some noise is bound to have come through anyway. It consists mainly of foreign words that were given a regular upos tag (e.g. the English "and" has slipped in among the German coordinating conjunctions). Another known case is the contraction stop class in English, which, alongside well-suited instances such as *ain't*, includes uses of the so-called Saxon genitive (e.g. *world's*). Many languages are represented by large, balanced corpora of standard written texts, but some are not; their data are based mainly on, e.g., a Bible translation or Wikipedia. Hence their stopwords can be biased as well.
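Known noise of this kind can be removed after the stoplist has been generated. The sketch below uses hand-made toy vectors rather than real package output; the word forms and the class label are assumptions chosen to mirror the German example above.

```r
# Toy stand-in for a German stoplist (output_form = 1); not real output.
german_forms <- c("und", "oder", "and", "aber")

# Forms the user has identified as noise.
known_noise <- c("and")

# Drop the noise from a plain character vector.
cleaned <- setdiff(german_forms, known_noise)

# For a named vector (output_form = 2), logical indexing keeps the names.
# The class label here is hypothetical.
named_forms <- c(coordinating_conjunction = "und",
                 coordinating_conjunction = "and")
cleaned_named <- named_forms[!named_forms %in% known_noise]

print(cleaned)
print(cleaned_named)
```

Note that setdiff() also removes duplicates and drops names, which is why the named vector is filtered with `%in%` instead.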

References

The underlying data frame `multilingual_stoplist` is based on the official release of Version 2.8 of Universal Dependencies.

https://universaldependencies.org

Zeman, Daniel; et al., 2021, Universal Dependencies 2.8.1, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-3687.

Examples

# NOT RUN {
# Option 1: character vector of unique word forms
generate_stoplist(language = "English", output_form = 1)
# %>% sample(10)

# Option 2: named character vector (names are stop classes)
generate_stoplist(language = "English", output_form = 2)
# %>% sample(20)

# Option 3: data frame filtered by the language selection
generate_stoplist(language = "English", output_form = 3)
# %>% sample_n(10) %>% glimpse()
# }