selectFeatures: select features from an object

Description

This function selects or discards features from a dfm.variety of objects, such as tokenized texts, a dfm, or a list of collocations. The most common usage for removeFeatures will be to eliminate stop words from a text or text-based object, or to select only features from a list of regular expression.

Usage

selectFeatures(x, features, ...)
## S3 method for class 'dfm':
selectFeatures(x, features, selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  verbose = TRUE, ...)
## S3 method for class 'tokenizedTexts':
selectFeatures(x, features, selection = c("keep",
  "remove"), valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE, verbose = TRUE, ...)
## S3 method for class 'collocations':
selectFeatures(x, features, selection = c("keep",
  "remove"), valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE, verbose = TRUE, pos = 1:3, ...)

Arguments

object whose features will be selected

features

one of: a character vector of features to be selected, a dfm whose features will be used for selection, or a dictionary class object whose values (not keys) will provide the features to be selected. For

...

supplementary arguments passed to the underlying functions in stri_detect_regex. (This is how case_insensitive is passed, but you may wish to pass others.)

selection

whether to keep or remove the features

valuetype

how to interpret feature vector: fixed for words as is; "regex" for regular expressions; or "glob" for "glob"-style wildcard

case_insensitive

ignore the case of dictionary values if TRUE

verbose

if TRUE print message about how many features were removed

pos

indexes of word position if called on collocations: remove if word pos is a stopword

Value

A dfm after the feature selection has been applied. When features is a dfm-class object, then the returned object will be identical in its feature set to the dfm supplied as the features argument. This means that any features in x not in features will be discarded, and that any features in found in the dfm supplied as features but not found in x will be added with all zero counts. This is useful when you have trained a model on one dfm, and need to project this onto a test set whose features must be identical.

Examples

Run this code

myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.", 
               "Does the United_States or Sweden have more progressive taxation?"),
             toLower = FALSE, verbose = FALSE)
mydict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
                          wordsEndingInY = c("by", "my"),
                          notintext = "blahblah"))
selectFeatures(myDfm, mydict)
selectFeatures(myDfm, mydict, case_insensitive = FALSE)
selectFeatures(myDfm, c("s$", ".y"), "keep")
selectFeatures(myDfm, c("s$", ".y"), "keep", valuetype = "regex")
selectFeatures(myDfm, c("s$", ".y"), "remove", valuetype = "regex")
selectFeatures(myDfm, stopwords("english"), "keep", valuetype = "fixed")
selectFeatures(myDfm, stopwords("english"), "remove", valuetype = "fixed")

# selecting on a dfm
textVec1 <- c("This is text one.", "This, the second text.", "Here: the third text.")
textVec2 <- c("Here are new words.", "New words in this text.")
features(dfm1 <- dfm(textVec1))
features(dfm2a <- dfm(textVec2))
(dfm2b <- selectFeatures(dfm2a, dfm1))
identical(features(dfm1), features(dfm2b))
toks <- tokenize(c("This is some example text from me.", "More of the example text."), 
                 removePunct = TRUE)
selectFeatures(toks, stopwords("english"), "remove")
selectFeatures(toks, "ex", "keep", valuetype = "regex")
 

## example for collocations
(myCollocs <- collocations(inaugTexts[1:3], n=20))
selectFeatures(myCollocs, stopwords("english"), "remove")

Run the code above in your browser using DataLab

Description

Usage

Arguments

Value

See Also

Examples