aggregate_psd2_keyword_features: Aggregate PSD2 Keyword Features at the Application Level with Time Window Filtering

Description

This function extracts keyword features from a transaction description column using the extract_keyword_features function and then aggregates these features at the application level using the aggregate_applications function. In addition, when the aggregation period is provided as a numeric vector (e.g., c(30, 3)), the function drops transactions that fall outside the observation window, defined as the period from scrape_date - (period[1] * period[2]) to scrape_date. This avoids spending time extracting keywords from transactions that would contribute only zeros to the aggregates.
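The window arithmetic described above can be illustrated with a minimal base R sketch. The toy data frame and its column names are hypothetical, and treating both window bounds as inclusive is an assumption of this sketch, not something this page states.

```r
# Hypothetical toy data; column names are illustrative only.
tx <- data.frame(
  application_id   = c(1, 1, 1, 2),
  transaction_date = as.Date(c("2024-01-05", "2024-03-20",
                               "2024-04-01", "2023-11-30")),
  scrape_date      = as.Date(rep("2024-04-10", 4))
)

period <- c(30, 3)  # 3 consecutive cycles of 30 days = a 90-day window

# Window start: scrape_date - (period[1] * period[2]) days
window_start <- tx$scrape_date - period[1] * period[2]

# Keep only transactions dated within [window_start, scrape_date].
# Inclusive bounds are an assumption of this sketch.
in_window <- tx$transaction_date >= window_start &
  tx$transaction_date <= tx$scrape_date
tx_window <- tx[in_window, ]
```

Here application 2 has no transactions left after filtering, which is the case the function later handles by assigning zeros.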

Usage

aggregate_psd2_keyword_features(
  data,
  id_col,
  description_col,
  amount_col = NULL,
  time_col = NULL,
  observation_window_start_col = NULL,
  scrape_date_col = NULL,
  ops = NULL,
  period = "all",
  separate_direction = if (!is.null(amount_col)) TRUE else FALSE,
  group_cols = NULL,
  min_freq = 1,
  use_matrix = TRUE,
  convert_to_df = TRUE,
  period_agg = sum,
  period_missing_inputs = 0
)

Value

A data frame with one row per application and aggregated keyword features.

Arguments

data: A data frame containing transaction records.
id_col: A character string specifying the column name that identifies each application (e.g., "application_id").
description_col: A character string specifying the column name that contains the transaction descriptions. Note that this column may contain NA values.
amount_col: Optional. A character string specifying the column name that contains transaction amounts. If provided, the function aggregates the amount-weighted value of each keyword (default ops = list(amount = sum)). If omitted (NULL), the function aggregates counts of keyword occurrences (default ops = list(count = sum)).
time_col: Optional. A character string specifying the column name that contains the transaction date (or timestamp). When period is a numeric vector, this is required to filter the data by observation window.
observation_window_start_col: Optional. A character string indicating the column name with the observation window start date. If period is not "all" and is not numeric, this column is used in aggregate_applications.
scrape_date_col: Optional. A character string indicating the column name with the scrape date. If period is not "all" and is not numeric, this column is used in aggregate_applications.
ops: A named list of functions used to compute summary features on the aggregated values. If amount_col is provided and ops is NULL, the default is list(amount = sum). If amount_col is NULL and ops is NULL, the default is list(count = sum).
period: Either a character string or a numeric vector controlling time aggregation. The default is "all", meaning no time segmentation. If a numeric vector is provided (e.g., c(30, 3)), it defines a cycle length in days (first element) and a number of consecutive cycles (second element). In that case, only transactions with a transaction date between scrape_date - (period[1] * period[2]) and scrape_date are considered.
separate_direction: Logical. If TRUE (the default when amount_col is provided), a new column "direction" is added to automatically separate incoming and outgoing transactions based on the sign of the amount.
group_cols: Optional. A character vector of additional grouping columns to use during aggregation. If separate_direction is TRUE, the "direction" grouping is added automatically.
min_freq: Numeric. The minimum frequency a token must have to be included in the keyword extraction. Default is 1.
use_matrix: Logical. Passed to extract_keyword_features; if TRUE (the default) a sparse matrix is used.
convert_to_df: Logical. Passed to extract_keyword_features; if TRUE (the default) the sparse matrix is converted to a data.frame, facilitating binding with other data.
period_agg: A function used to aggregate values within each period (see aggregate_applications). Default is sum.
period_missing_inputs: A numeric value to replace missing aggregated values. Default is 0.

Details

The function supports two modes:

If amount_col is not provided (i.e., NULL), the function aggregates keyword counts (i.e., the number of transactions in which a keyword appears) for each application.
If amount_col is provided, then for each transaction the keyword indicator is multiplied by the transaction amount. In this mode, the default aggregation operation is to sum these values (using ops = list(amount = sum)), yielding the total amount associated with transactions that mention each keyword.
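The difference between the two modes can be sketched in base R as follows. This is only an illustration: the toy data is hypothetical, and grepl() stands in for extract_keyword_features, whose internals are not described on this page.

```r
# Hypothetical toy data.
tx <- data.frame(
  application_id = c(1, 1, 1, 2),
  description    = c("casino royal", "utilities bill",
                     "online casino", "casino night"),
  amount         = c(-50, -120, -25, -30),
  stringsAsFactors = FALSE
)

# Binary indicator: 1 if the description mentions the keyword, else 0.
# grepl() is a stand-in for extract_keyword_features in this sketch.
casino <- as.numeric(grepl("casino", tx$description, fixed = TRUE))

# Mode 1 (no amount_col): count the transactions mentioning the keyword.
casino_count <- tapply(casino, tx$application_id, sum)

# Mode 2 (amount_col given): weight the indicator by the amount, then sum.
casino_amount <- tapply(casino * tx$amount, tx$application_id, sum)
```

In mode 2, transactions that do not mention the keyword contribute zero to the sum, so the result is the total amount of the keyword's transactions.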

Additionally, if amount_col is provided and separate_direction is TRUE (the default), a new column named "direction" is created to separate incoming ("in") and outgoing ("out") transactions based on the sign of the amount. Any additional grouping columns can be provided via group_cols.
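A minimal sketch of the direction split, using hypothetical toy data. This page does not state which sign maps to which label or how zero amounts are handled, so labelling non-negative amounts as "in" is an assumption here.

```r
# Hypothetical toy data.
tx <- data.frame(
  application_id = c(1, 1, 2),
  amount         = c(250, -40, -15)
)

# Label each transaction by the sign of its amount.
# Assumption: non-negative amounts are incoming, negative are outgoing.
tx$direction <- ifelse(tx$amount >= 0, "in", "out")

# Sum amounts per application and direction.
agg <- aggregate(amount ~ application_id + direction, data = tx, FUN = sum)
```

Grouping by direction in this way is what would yield separate "in" and "out" features per keyword.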

The function performs the following steps:

1. Basic input checks are performed to ensure the required columns exist.
2. The full list of application IDs is stored from the original data.
3. If amount_col is provided and separate_direction is TRUE, a "direction" column is added to label transactions as incoming ("in") or outgoing ("out") based on the sign of the amount.
4. When period is provided as a numeric vector, the function computes the observation window as scrape_date - (period[1] * period[2]) to scrape_date and filters the dataset to include only transactions within this window. Applications with no records in the window will later be assigned zeros.
5. Keyword features are extracted from the description_col using extract_keyword_features. If an amount_col is provided, the binary indicators are weighted by the transaction amount.
6. The extracted keyword features are combined with the (possibly filtered) original data.
7. For each keyword, the function calls aggregate_applications to aggregate the feature by application. The aggregation is performed over time periods defined by period (if applicable) and, if requested, further split by direction.
8. Aggregated results for each keyword are merged by application identifier.
9. Finally, the aggregated results are merged with the full list of application IDs so that applications with no transactions in the observation window appear with zeros.
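The way the stored ID list and the final merge interact can be sketched in base R. The data, the precomputed in_window flag, and the single "casino" feature are hypothetical, and aggregate() and merge() stand in for the package's own aggregation logic.

```r
# Hypothetical toy data: in_window marks transactions inside the window,
# casino is a precomputed keyword indicator.
tx <- data.frame(
  application_id = c(1, 1, 2, 3),
  in_window      = c(TRUE, TRUE, FALSE, TRUE),
  casino         = c(1, 0, 1, 0)
)

# Store every application ID before any filtering takes place.
all_ids <- data.frame(application_id = unique(tx$application_id))

# Drop transactions outside the window (application 2 disappears here).
tx_window <- tx[tx$in_window, ]

# Aggregate the keyword feature per application.
agg <- aggregate(casino ~ application_id, data = tx_window, FUN = sum)

# Left-join onto the full ID list, then replace missing values with zero.
result <- merge(all_ids, agg, by = "application_id", all.x = TRUE)
result$casino[is.na(result$casino)] <- 0
```

Storing the IDs up front is what lets application 2 reappear in the output with a zero, even though none of its transactions survive the filter.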

Examples

# Example: Aggregate keyword features for PSD2 transactions.

data(featForge_transactions)

# In this example, the 'description' field is parsed for keywords.
# Since the 'amount' column is provided, each keyword indicator is
# weighted by the transaction amount, and transactions are
# automatically split into incoming and outgoing via the 'direction' column.
# Additionally, the period is specified as c(30, 1), meaning only
# transactions occurring within the last 30 days
# (scrape_date - 30 to scrape_date) are considered.
result <- aggregate_psd2_keyword_features(
  data = featForge_transactions,
  id_col = "application_id",
  description_col = "description",
  amount_col = "amount",
  time_col = "transaction_date",
  scrape_date_col = "scrape_date",
  observation_window_start_col = "obs_start",
  period = c(30, 1),
  ops = list(amount = sum),
  min_freq = 1,
  use_matrix = TRUE,
  convert_to_df = TRUE
)

# The resulting data frame 'result' contains one
# row per application with aggregated keyword features.
# For example, if keywords "casino" and "utilities" were detected,
# aggregated columns might be named:
# "casino_amount_direction_in",
# "casino_amount_direction_out",
# "utilities_amount_direction_in", etc.
result
