aggregate_applications: Aggregate Numeric Data by Periods

Description

Aggregates a numeric variable in a dataset over defined time periods and returns summary features computed by user-supplied functions. Typical uses include building features from transactional data, previous loan repayment behavior, or credit bureau inquiries. Aggregation is performed per grouping identifier (e.g., application, client, or agreement level) and is based on time periods.

Usage

aggregate_applications(
  data,
  id_col,
  amount_col,
  time_col = NULL,
  group_cols = NULL,
  ops,
  period,
  observation_window_start_col = NULL,
  scrape_date_col = NULL,
  period_agg = sum,
  period_missing_inputs = 0
)

Value

A data frame where each row corresponds to a unique identifier (e.g., application, client, or agreement). The output includes aggregated summary features for each period and, if applicable, additional columns for each group defined in group_cols.

Arguments

data: A data frame containing the data to be aggregated. It must include at least the columns named in id_col and amount_col, and the column named in time_col whenever time-based periods are used.
id_col: A character string specifying the column name used to define the aggregation level (e.g., "application_id", "client_id", or "agreement_id").
amount_col: A character string specifying the column in data that contains the numeric variable to be aggregated. This variable can represent transaction amounts, loan repayment values, credit bureau inquiry counts, or any other numeric measure.
time_col: A character string indicating the column name that contains the date (or timestamp) when the event occurred. This column must be of class Date or POSIXct.
group_cols: An optional character vector of column names by which to further subdivide the aggregation. For each unique value in these columns, separate summary features will be generated and appended as new columns.
ops: A named list of functions used to compute summary features on the aggregated period values. Each function must accept a single numeric vector as input. The names of the list elements are used to label the output columns.
period: Either a character string specifying the time period grouping ("daily", "weekly", "monthly", or "all") or a numeric vector of length 2 (e.g., c(7, 8)) where the first element is the cycle length in days and the second is the number of consecutive cycles. When set to "all", the entire set of observations is aggregated as a single period, effectively disabling time aggregation.
observation_window_start_col: A character string indicating the column name that contains the observation window start date. This argument is required when period is specified as a character string other than "all".
scrape_date_col: A character string indicating the column name that contains the scrape date (i.e., the end date of the observation window). This argument is required whenever period is not "all", that is, for "daily", "weekly", "monthly", or a numeric vector.
period_agg: A function used to aggregate the numeric values within each period. The default is sum. The argument is ignored if period is "all".
period_missing_inputs: A numeric constant used to replace missing values in periods with no observed data. The default value is 0.
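
As a rough illustration of the shape ops is expected to take, the sketch below defines a named list of functions and applies each one to a toy vector of period values. The vector and the list names are made up for illustration, and having each function return a single number is an assumption rather than something stated above.

```r
# Hypothetical period totals for one identifier, oldest period first.
period_values <- c(120, 80, 0, 150)

# A named list of functions; the names would become output column labels.
ops <- list(
  avg_period_amount  = mean,
  max_period_amount  = max,
  last_period_amount = function(x) x[length(x)]
)

# Apply every function to the same numeric vector (assumes one number per function).
features <- vapply(ops, function(f) f(period_values), numeric(1))
print(features)
```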

Details

When period is provided as a character string (one of "daily", "weekly", or "monthly"), data is grouped into complete calendar periods. For example, if the scrape date falls mid-month, the incomplete last month is excluded. Alternatively, period may be specified as a numeric vector of length 2, in which the first element defines the cycle length in days and the second the number of consecutive cycles. For example, with period = c(7, 8) and a scrape date of "2024-12-31", the periods span the last 56 days (8 consecutive 7-day cycles), with the first period starting on "2024-11-05".
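
The arithmetic behind a numeric period can be sketched in base R. This is only an illustration of the date calculation, not the package's internal code; it assumes the window covers the 56 days immediately before the scrape date, which is the reading that reproduces the "2024-11-05" start date given above.

```r
scrape_date <- as.Date("2024-12-31")
period <- c(7, 8)  # cycle length in days, number of consecutive cycles

cycle_length <- period[1]
n_cycles <- period[2]

# Start date of each cycle, oldest first.
# Assumption: the cycles cover the cycle_length * n_cycles days before the scrape date.
cycle_starts <- scrape_date - cycle_length * (n_cycles:1)
print(cycle_starts)
```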

aggregate_applications aggregates numeric data either by defined time periods or over the full observation window. Data is first grouped by the identifier specified in id_col (e.g., at the application, client, or agreement level).

When period is set to "daily", "weekly", or "monthly", transaction dates in time_col are partitioned into complete calendar periods (incomplete periods are excluded).
When period is set to a numeric vector of length 2 (e.g., c(7, 8)), consecutive cycles of fixed length are defined.
When period is set to "all", time aggregation is disabled. All observations for an identifier (or group) are aggregated together.
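
To make "complete calendar periods" concrete, here is a minimal base R sketch that lists the complete months inside an observation window. The dates are hypothetical, and the rule used (a month counts only if it lies entirely between the window start and the scrape date) is an assumption about how incomplete periods are identified.

```r
obs_start   <- as.Date("2024-01-15")  # hypothetical observation window start
scrape_date <- as.Date("2024-04-10")  # hypothetical scrape date

# First day of the month containing a date.
month_floor <- function(d) as.Date(format(d, "%Y-%m-01"))

# Candidate month starts spanning the window, plus one extra start to derive month ends.
starts <- seq(month_floor(obs_start), month_floor(scrape_date), by = "month")
next_starts <- seq(starts[1], by = "month", length.out = length(starts) + 1)[-1]
ends <- next_starts - 1

# Keep only months that lie entirely inside the window.
complete <- starts >= obs_start & ends <= scrape_date
complete_months <- format(starts[complete], "%Y-%m")
print(complete_months)
```

Under these assumptions, January (which starts before the window) and April (which is cut off by the scrape date) are dropped, leaving February and March.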

For each period, the numeric values in amount_col (or any other numeric measure) are aggregated using the function specified by period_agg. Then, for each unique group (if any group_cols are provided) and for each application (or other identifier), the summary functions specified in ops are applied to the vector of aggregated period values. When grouping is used, the resulting summary features are appended as new columns with names constructed in the format: <operation>_<group_column>_<group_value>. Missing aggregated values in periods with no observations are replaced by period_missing_inputs.
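
The sequence described above (aggregate within periods, fill empty periods, summarise, name the column) can be sketched in base R for a single identifier and a single group value. The data, the period labels, and the "groceries" category value are hypothetical, and the code illustrates the described steps rather than reproducing the package's implementation.

```r
# Hypothetical transactions for one identifier and one group value,
# already labelled with the period each one falls into.
tx <- data.frame(
  period = c("2024-01", "2024-01", "2024-03"),
  amount = c(-10, -5, -20)
)
all_periods <- c("2024-01", "2024-02", "2024-03")

# Step 1: aggregate within each period (period_agg = sum); empty periods come back as NA.
per_period <- tapply(tx$amount, factor(tx$period, levels = all_periods), sum)

# Step 2: replace periods without observations (period_missing_inputs = 0).
per_period[is.na(per_period)] <- 0

# Step 3: apply a summary function from ops to the vector of period values.
avg_value <- mean(as.numeric(per_period))

# Step 4: build a column name in the <operation>_<group_column>_<group_value> format.
col_name <- paste("avg", "category", "groceries", sep = "_")
print(setNames(avg_value, col_name))
```

Note that the filled value takes part in the summary: with period_missing_inputs = 0, the empty February period pulls the mean toward zero.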

Examples

data(featForge_transactions)

# Example 1: Aggregate outgoing transactions (amount < 0) on a monthly basis.
aggregate_applications(featForge_transactions[featForge_transactions$amount < 0, ],
                       id_col = 'application_id',
                       amount_col = 'amount',
                       time_col = 'transaction_date',
                       ops = list(
                         avg_monthly_outgoing_transactions = mean,
                         # In the aggregated numeric vector, the last observation
                         # represents the most recent period.
                         last_month_transactions_amount = function(x) x[length(x)],
                         last_month_transaction_amount_vs_mean = function(x) x[length(x)] / mean(x)
                       ),
                       period = 'monthly',
                       observation_window_start_col = 'obs_start',
                       scrape_date_col = 'scrape_date'
)

# Example 2: Aggregate transactions by category and direction.
featForge_transactions$direction <- ifelse(featForge_transactions$amount > 0, 'in', 'out')
aggregate_applications(featForge_transactions,
                       id_col = 'application_id',
                       amount_col = 'amount',
                       time_col = 'transaction_date',
                       group_cols = c('category', 'direction'),
                       ops = list(
                         avg_monthly_transactions = mean,
                         highest_monthly_transactions_count = max
                       ),
                       period = 'monthly',
                       period_agg = length,
                       observation_window_start_col = 'obs_start',
                       scrape_date_col = 'scrape_date'
)

# Example 3: Aggregate using a custom numeric period:
# 30-day cycles for 3 consecutive cycles (i.e., the last 90 days).
aggregate_applications(featForge_transactions,
                       id_col = 'application_id',
                       amount_col = 'amount',
                       time_col = 'transaction_date',
                       ops = list(
                         avg_30_day_transaction_count_last_90_days = mean
                       ),
                       period = c(30, 3),
                       period_agg = length,
                       observation_window_start_col = 'obs_start',
                       scrape_date_col = 'scrape_date'
)

# Example 4: Aggregate transactions without time segmentation.
aggregate_applications(featForge_transactions,
                       id_col = 'application_id',
                       amount_col = 'amount',
                       ops = list(
                         total_transactions_counted = length,
                         total_outgoing_transactions_counted = function(x) sum(x < 0),
                         total_incoming_transactions_counted = function(x) sum(x > 0)
                       ),
                       period = 'all'
)