Retrieve information and links to dictionaries (lexicons/word lists) available at osf.io/y6g5b.
select.dict(query = NULL, dir = getOption("lingmatch.dict.dir"),
  check.md5 = TRUE, mode = "wb")
A list with varying entries:
info: The version of osf.io/kjqb8 stored internally; a
data.frame with dictionary names as row names, and information about each dictionary in columns.
The columns are also described at
osf.io/y6g5b/wiki/dict_variables;
here, short (corresponding to the file name [{short}.(csv|dic)] and
wiki urls [https://osf.io/y6g5b/wiki/{short}]) is used as the row names and removed as a column:
name: Full name of the dictionary.
description: Description of the dictionary, relating to its purpose and
development.
note: Notes about processing decisions that additionally alter the original.
constructor: How the dictionary was constructed:
  algorithm: Terms were selected by some automated process, potentially
    learned from data or other resources.
  crowd: Several individuals rated the terms, and in aggregate those ratings
    translate to categories and weights.
  mixed: Some combination of the other methods, usually in some iterative
    process.
  team: One or more individuals make decisions about term inclusions,
    categories, and weights.
subject: Broad, rough subject or purpose of the dictionary:
  emotion: Terms relate to emotions, potentially exemplifying or expressing
    them.
  general: A large range of categories, aiming to capture the content of the
    text.
  impression: Terms are categorized and weighted based on the impression they
    might give.
  language: Terms are categorized or weighted based on their linguistic
    features, such as part of speech, specificity, or area of use.
  social: Terms relate to social phenomena, such as characteristics or concerns
    of social entities.
terms: Number of unique terms across categories.
term_type: Format of the terms:
  glob: Include asterisks which denote inclusion of any characters until a
    word boundary.
  glob+: Glob-style asterisks with regular expressions within terms.
  ngram: Includes any number of words as a term, separated by spaces.
  pattern: A string of characters, potentially within or between words, or
    spanning words.
  regex: Regular expressions.
  stem: Unigrams with common endings removed.
  unigram: Complete single words.
weighted: Indicates whether weights are associated with terms. This
determines the file type of the dictionary: dictionaries with weights are stored
as .csv, and those without are stored as .dic files.
regex_characters: Logical indicating whether special regular expression
characters are present in any term, which might need to be escaped if the terms are used
in regular expressions. Glob-type terms allow complete parens (at least one open and one
closed, indicating preceding or following words), and initial and terminal asterisks. For
all other terms, [](){}*.^$+?\| are counted as regex characters. These could be
escaped in R with gsub('([][)(}{*.^$+?\\\\|])', '\\\\\\1', terms) if terms
is a character vector, and in Python (after importing re) with
[re.sub(r'([][(){}*.^$+?\\|])', r'\\\1', term) for term in terms] if terms
is a list.
categories: Category names in the order in which they appear in the dictionary
file, separated by commas.
ncategories: Number of categories.
original_max: Maximum value of the original dictionary before standardization:
original values / max(original values) * 100. Dictionaries with no weights are
considered to have a max of 1.
osf: ID of the file on OSF, translating to the file's URL:
https://osf.io/{osf}.
wiki: URL of the dictionary's wiki.
downloaded: Path to the file if downloaded, and '' otherwise.
selected: A subset of info selected by query.
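The escaping described under regex_characters can be tried on a small character vector. This is a minimal sketch in base R; the terms are made up for illustration and do not come from any dictionary in the repository.

```r
# Hypothetical terms, some containing special regex characters.
terms <- c("a.m.", "c++", "(laugh)", "why?", "plain")

# Prefix every special character with a backslash.
escaped <- gsub("([][)(}{*.^$+?\\\\|])", "\\\\\\1", terms)
print(escaped)

# An escaped term now matches only its literal form.
matches_literal <- grepl(escaped[2], c("c++", "ccc"))
print(matches_literal)
```

Passing fixed = TRUE to grepl or gsub is an alternative when terms should always be treated literally.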
query: A character matching a dictionary name, or a set of keywords to search for in dictionary information.
dir: Path to a folder containing dictionaries, or where you want them to be saved. Will look in getOption('lingmatch.dict.dir') and '~/Dictionaries' by default.
check.md5: Logical; if TRUE (default), retrieves the MD5 checksum from OSF,
and compares it with that calculated from the downloaded file to check its integrity.
mode: Passed to download.file when downloading files.
Other Dictionary functions:
dictionary_meta(),
download.dict(),
lma_patcat(),
lma_termcat(),
read.dic(),
report_term_matches()
# just retrieve information about available dictionaries
dicts <- select.dict()$info
dicts[1:10, 4:9]
# select all dictionaries mentioning sentiment or emotion
sentiment_dicts <- select.dict("sentiment emotion")$selected
sentiment_dicts[1:10, 4:9]
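The standardization described under original_max can be illustrated with a few made-up weights. This sketch assumes non-negative weights and uses hypothetical term names; it only mirrors the formula original values / max(original values) * 100, and is not the package's internal code.

```r
# Hypothetical original weights (not taken from any real dictionary).
original <- c(happy = 4, glad = 3, calm = 1)

# Standardize so that the largest weight becomes 100.
original_max <- max(original)
standardized <- original / original_max * 100
print(standardized)

# The original scale can be recovered from original_max.
recovered <- standardized / 100 * original_max
print(recovered)
```

Keeping original_max alongside the standardized weights is what makes this round trip possible.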