lma_dict: English Function Word Category and Special Character Lists

Description

Returns a list of function words based on the Linguistic Inquiry and Word Count 2015 dictionary (in terms of category names -- words were selected independently), or a list of special characters and patterns.

Usage

lma_dict(..., as.regex = TRUE, as.function = FALSE)

Value

A list with a vector of terms for each category, or (when as.function = TRUE) a function which accepts an initial "terms" argument (a character vector), and any additional arguments determined by function entered as as.function (grepl by default).

Arguments

...: Numbers or letters corresponding to category names: ppron, ipron, article, adverb, conj, prep, auxverb, negate, quant, interrog, number, interjection, or special.
as.regex: Logical: if FALSE, lists are returned without regular expression.
as.function: Logical or a function: if specified and as.regex is TRUE, the selected dictionary will be collapsed to a regex string (terms separated by |), and a function for matching characters to that string will be returned. The regex string is passed to the matching function (grepl by default) as a 'pattern' argument, with the first argument of the returned function being passed as an 'x' argument. See examples.

Examples

Run this code

# return the full dictionary (excluding special)
lma_dict()

# return the standard 7 category lsm categories
lma_dict(1:7)

# return just a few categories without regular expression
lma_dict(neg, ppron, aux, as.regex = FALSE)

# return special specifically
lma_dict(special)

# returning a function
is.ppron <- lma_dict(ppron, as.function = TRUE)
is.ppron(c("i", "am", "you", "were"))

in.lsmcat <- lma_dict(1:7, as.function = TRUE)
in.lsmcat(c("a", "frog", "for", "me"))

## use as a stopword filter
is.stopword <- lma_dict(as.function = TRUE)
dtm <- lma_dtm("Most of these words might not be all that relevant.")
dtm[, !is.stopword(colnames(dtm))]

## use to replace special characters
clean <- lma_dict(special, as.function = gsub)
clean(c(
  "\u201Ccurly quotes\u201D", "na\u00EFve", "typographer\u2019s apostrophe",
  "en\u2013dash", "em\u2014dash"
))

Run the code above in your browser using DataLab

Description

Usage

Value

Arguments

See Also

Examples