txt.to.features: Split string of words or other countable features

Description

Converts a vector of tokenized words into the features to be counted, either words or characters, and optionally combines them into n-grams.

Usage

txt.to.features(tokenized.text, features = "w", ngram.size = 1)

Arguments

tokenized.text

a vector of tokenized words.

features

an option specifying the desired type of feature: "w" for words, "c" for characters (default: "w").

ngram.size

an optional integer giving the value of n, i.e. the size of the n-grams to be created. If this argument is missing, the default value of 1 is used.

Details

This function carries out the preprocessing steps necessary for feature selection: it converts an input text into the required type of sequence (words, characters, or n-grams of either) and returns a new vector of items. It invokes make.ngrams to combine single units into pairs, triplets or longer n-grams. See help(make.ngrams) for details.
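
The conversion described above can be illustrated with a minimal base-R sketch. It is not the package's own code: sketch.ngrams and sketch.features are hypothetical helpers written for this illustration. The single-space separator inside n-grams, and the re-joining of words with spaces before splitting into characters, are assumptions rather than documented behaviour.

```r
# Hypothetical helper (not part of the package): joins every run of n
# consecutive units into one n-gram. The space separator is an assumption.
sketch.ngrams <- function(units, n = 1) {
  if (n <= 1) return(units)
  if (length(units) < n) return(character(0))
  starts <- seq_len(length(units) - n + 1)
  vapply(starts,
         function(i) paste(units[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical stand-in for txt.to.features, using the same argument names.
sketch.features <- function(tokenized.text, features = "w", ngram.size = 1) {
  units <- tokenized.text
  if (features == "c") {
    # Assumed: words are re-joined with spaces, then split into characters.
    units <- unlist(strsplit(paste(tokenized.text, collapse = " "), ""))
  }
  sketch.ngrams(units, ngram.size)
}

words <- c("quousque", "tandem", "abutere")

# word 2-grams: each item pairs a word with its successor
word.bigrams <- sketch.features(words, ngram.size = 2)

# character 2-grams over the characters "a", "b", " ", "c"
char.bigrams <- sketch.features(c("ab", "c"), features = "c", ngram.size = 2)
```

In this sketch a vector of k units yields k - n + 1 n-grams, so an input shorter than n produces an empty vector; with ngram.size = 1 the word vector is returned unchanged.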

Examples

# load the package that provides txt.to.words and txt.to.features:
library(stylo)

# consider the string my.text:
my.text = "Quousque tandem abutere, Catilina, patientia nostra?"

# split it into a vector of consecutive words:
my.vector.of.words = txt.to.words(my.text)

# build a vector of word 2-grams:
txt.to.features(my.vector.of.words, ngram.size = 2)
 
# or produce character n-grams (in this case, character tetragrams):
txt.to.features(my.vector.of.words, features = "c", ngram.size = 4)
