extract_basic_description_features: Extract Basic Description Features

Description

This function processes a vector of text descriptions (such as transaction descriptions) and computes a set of basic text features. These features include counts of digits, special characters, punctuation, words, characters, unique characters, and letter cases, as well as word length statistics and the Shannon entropy of the text.

Usage

extract_basic_description_features(descriptions)

Value

A data frame with one row per element of descriptions and one column per computed feature.

Arguments

descriptions: A character vector of text descriptions to be processed.

Details

The extracted features are:

has_digits: A binary indicator (0/1) showing whether the description contains any digit.
n_digits: The total count of digit characters in the description.
n_special: The number of special characters (non-alphanumeric and non-whitespace) present.
n_punct: The count of punctuation marks found in the description.
n_words: The number of words in the description.
n_chars: The total number of characters in the description.
n_unique_chars: The count of unique characters in the description.
n_upper: The count of uppercase letters in the description.
n_letters: The total count of alphabetic characters (both uppercase and lowercase) in the description.
prop_caps: The proportion of letters in the description that are uppercase.
n_whitespace: The number of whitespace characters (spaces) in the description.
avg_word_length: The average word length within the description.
min_word_length: The length of the shortest word in the description.
max_word_length: The length of the longest word in the description.
entropy: The Shannon entropy of the description, indicating its character diversity.
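
To make the entropy feature concrete, here is a minimal sketch of character-level Shannon entropy for a single string. The helper name char_entropy is hypothetical and is not part of the package. The sketch assumes log base 2 and defines the entropy of an empty string as 0; the package may use a different base or edge-case convention.

```r
# Hypothetical helper (not part of the package): Shannon entropy, in bits,
# of the character distribution of a single string.
char_entropy <- function(x) {
  # Split the string into individual characters.
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  # Assumption: an empty string has zero entropy.
  if (length(chars) == 0) return(0)
  # Relative frequency of each distinct character.
  p <- as.numeric(table(chars)) / length(chars)
  -sum(p * log2(p))
}

char_entropy("aaaa")  # one distinct character: 0 bits
char_entropy("abab")  # two equally frequent characters: 1 bit
char_entropy("abcd")  # four equally frequent characters: 2 bits
```

Higher values indicate that the characters are spread more evenly over a larger alphabet, which is why the feature is described as a measure of character diversity.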

The function relies on vectorized string operations (e.g., grepl, gregexpr, and nchar), so it scales to large description vectors without an explicit loop over elements. All returned features are numeric, so they can be used directly in statistical analysis or machine learning, or aggregated to a higher level such as the application (see Example 2).
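
As an illustration of the vectorized approach, the sketch below computes a few of the listed features with base R only. It is not the package's implementation: the helper count_matches and the POSIX character classes used as patterns are assumptions chosen for this example.

```r
descs <- c("KappaCredit#101",
           "Transferred funds for service fee 990")

# Hypothetical helper: number of regex matches in each element of x.
# gregexpr() returns -1 for an element with no match, so only positive
# match positions are counted.
count_matches <- function(pattern, x) {
  vapply(gregexpr(pattern, x), function(m) sum(m > 0), integer(1))
}

sketch <- data.frame(
  has_digits = as.integer(grepl("[[:digit:]]", descs)),  # 0/1 indicator
  n_digits   = count_matches("[[:digit:]]", descs),      # digit count
  n_upper    = count_matches("[[:upper:]]", descs),      # uppercase letters
  n_chars    = nchar(descs)                              # total characters
)
sketch
```

Each column is produced by a single call over the whole vector, which is what keeps this style of feature extraction practical for large inputs.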

Examples

# Example 1: Extract features from a vector of sample descriptions.
descs <- c("KappaCredit#101",
           "Transferred funds for service fee 990",
           "Mighty remittance code 99816 casino")
extract_basic_description_features(descs)

# Example 2: Aggregate the maximum word length per application.
# Load the sample transactions data.
data(featForge_transactions)

# Combine the transactions data with extracted basic description features.
trans <- cbind(featForge_transactions,
               extract_basic_description_features(featForge_transactions$description))

# Aggregate the maximum word length on the application level.
aggregated <- aggregate_applications(
  trans,
  id_col = "application_id",
  amount_col = "max_word_length",
  ops = list(max_description_word_length = max),
  period = "all"
)

# Display the aggregated results.
aggregated