featForge (version 0.1.2)

extract_basic_description_features: Extract Basic Description Features

Description

This function processes a vector of text descriptions (such as transaction descriptions) and computes a set of basic text features. These features include counts of digits, special characters, punctuation, words, characters, unique characters, and letter cases, as well as word length statistics and the Shannon entropy of the text.

Usage

extract_basic_description_features(descriptions)

Value

A data frame where each row corresponds to an element in descriptions and each column represents a computed feature.

Arguments

descriptions

A character vector of text descriptions to be processed.

Details

The extracted features are:

has_digits

A binary indicator (0/1) showing whether the description contains any digit.

n_digits

The total count of digit characters in the description.

n_special

The number of special characters (non-alphanumeric and non-whitespace) present.

n_punct

The count of punctuation marks found in the description.

n_words

The number of words in the description.

n_chars

The total number of characters in the description.

n_unique_chars

The count of unique characters in the description.

n_upper

The count of uppercase letters in the description.

n_letters

The total count of alphabetic characters (both uppercase and lowercase) in the description.

prop_caps

The proportion of letters in the description that are uppercase.

n_whitespace

The number of whitespace characters (spaces) in the description.

avg_word_length

The average word length within the description.

min_word_length

The length of the shortest word in the description.

max_word_length

The length of the longest word in the description.

entropy

The Shannon entropy of the description, indicating its character diversity.
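
To make the entropy feature concrete, here is a minimal base-R sketch of character-level Shannon entropy. It assumes the entropy is computed over the characters' relative frequencies with a base-2 logarithm (bits); the source states neither the base nor the exact unit, so values from the package may differ by a constant factor. The helper name char_entropy is hypothetical and is not part of featForge.

```r
# Hypothetical helper (not a featForge function): character-level
# Shannon entropy in bits, assuming a base-2 logarithm.
char_entropy <- function(x) {
  vapply(x, function(s) {
    chars <- strsplit(s, "")[[1]]        # split the string into single characters
    if (length(chars) == 0) return(0)    # treat an empty string as zero entropy
    p <- table(chars) / length(chars)    # relative frequency of each character
    -sum(p * log2(p))                    # H = -sum(p * log2(p))
  }, numeric(1), USE.NAMES = FALSE)
}

entropy_demo <- char_entropy(c("aaaa", "abab", "abcd"))
entropy_demo
```

A string made of one repeated character has entropy 0, and the value grows as the characters become more varied and more evenly distributed.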

The function relies on vectorized string operations (e.g., grepl, gregexpr, and nchar), so it processes the whole input vector without explicit per-element string loops and scales to large datasets. Every output column is numeric, so the features can be passed directly to statistical models or machine learning pipelines, or aggregated to a higher level such as the application level (see Example 2).
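
The vectorized approach described above can be sketched in base R as follows. This is an illustration of the technique, not the package's actual implementation: the helper count_matches and the character classes used here are assumptions, and the package may define digits or uppercase letters differently.

```r
descs <- c("KappaCredit#101", "Transferred funds for service fee 990")

# Hypothetical helper: count regex matches per element.
# gregexpr() returns -1 for an element with no match, so only
# positive match positions are counted.
count_matches <- function(pattern, x) {
  vapply(gregexpr(pattern, x), function(m) sum(m > 0), integer(1))
}

sketch <- data.frame(
  has_digits = as.integer(grepl("[[:digit:]]", descs)),  # 0/1 indicator
  n_digits   = count_matches("[[:digit:]]", descs),      # digit characters
  n_upper    = count_matches("[[:upper:]]", descs),      # uppercase letters
  n_chars    = nchar(descs)                              # total characters
)
sketch
```

Each column is computed for the whole vector in one call, which is what keeps this style of feature extraction fast on long inputs.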

Examples

# Example 1: Extract features from a vector of sample descriptions.
descs <- c("KappaCredit#101",
           "Transferred funds for service fee 990",
           "Mighty remittance code 99816 casino")
extract_basic_description_features(descs)

# Example 2: Aggregate the maximum word length per application.
# Load the sample transactions data.
data(featForge_transactions)

# Combine the transactions data with extracted basic description features.
trans <- cbind(featForge_transactions,
               extract_basic_description_features(featForge_transactions$description))

# Aggregate the maximum word length on the application level.
aggregated <- aggregate_applications(
  trans,
  id_col = "application_id",
  amount_col = "max_word_length",
  ops = list(max_description_word_length = max),
  period = "all"
)

# Display the aggregated results.
aggregated