create_lagged_df: Create model training and forecasting datasets with lagged, grouped, dynamic, and static features

Description

Create a list of datasets with lagged, grouped, dynamic, and static features to (a) train forecasting models for specified forecast horizons and (b) forecast into the future with a trained ML model.

Usage

create_lagged_df(data, type = c("train", "forecast"), outcome_col = 1L,
  horizons, lookback = NULL, lookback_control = NULL, dates = NULL,
  frequency = NULL, dynamic_features = NULL, groups = NULL,
  static_features = NULL, use_future = FALSE)

Arguments

data

A data.frame with the (a) target to be forecasted and (b) features/predictors. An optional date column can be given in the dates argument (required for grouped time-series). Note that forecastML only works with regularly spaced time/date intervals and that missing rows--usually due to periods when no data was collected--will result in poorly trained models due to incorrect feature lags. Use fill_gaps to fill in any missing rows/data prior to running this function.
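
To illustrate what a missing row means for a regularly spaced series, here is a minimal base-R sketch. The data.frame and its column names (date, sales) are hypothetical, and this is not the package's fill_gaps implementation; it only shows the idea of expanding to a complete date index so that row-based lags line up with calendar time.

```r
# Hypothetical daily series with 2020-01-03 missing.
data <- data.frame(
  date  = as.Date(c("2020-01-01", "2020-01-02", "2020-01-04")),
  sales = c(10, 12, 15)
)

# Build the complete, regularly spaced date index.
all_dates <- data.frame(date = seq(min(data$date), max(data$date), by = "1 day"))

# Left join onto the full index; periods with no data become NA rows.
data_filled <- merge(all_dates, data, by = "date", all.x = TRUE)

nrow(data_filled)                 # one row per day, including the gap
which(is.na(data_filled$sales))   # position of the filled-in row
```

Without the extra row, a 1-row lag at 2020-01-04 would silently point at 2020-01-02 instead of 2020-01-03.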

type

The type of dataset to return--(a) model training or (b) forecast prediction. The default is train.

outcome_col

The column index--an integer--of the target to be forecasted. Only one outcome column can be forecasted at present; however, groups of time-series can be forecasted if they are stacked vertically in a long dataset and the groups, dates, and frequency arguments are specified.

horizons

A numeric vector of one or more forecast horizons, h, measured in input dataset rows. For each horizon, 1:h forecasts are returned (e.g., horizons = 12 trains a model to minimize 1 to 12-step-ahead error and returns forecasts for 1:12 steps into the future). If dates are given, a horizon of 1, for example, would equal 1 * frequency in calendar time.

lookback

A numeric vector giving the lags--in dataset rows--for creating the lagged features. All non-grouping, non-static, and non-dynamic features in the input dataset, data, are lagged by the same values. Lags that don't support direct forecasting for a given horizon are silently dropped. Either lookback or lookback_control must be specified.
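
As a rough illustration of row-based lags, the base-R sketch below builds lagged copies of a hypothetical outcome vector. The helper lag_rows and the column names are made up for this example and are not part of forecastML. The last step assumes that "supporting direct forecasting" means a lag is at least as long as the horizon, which is an interpretation rather than something stated above.

```r
# Hypothetical outcome series; base R only.
y <- c(5, 7, 9, 11, 13, 15)

# Lag a vector by k rows: each row gets the value from k rows earlier,
# with NA padding at the start.
lag_rows <- function(x, k) c(rep(NA, k), head(x, length(x) - k))

lookback <- 1:3
lagged <- sapply(lookback, function(k) lag_rows(y, k))
colnames(lagged) <- paste0("y_lag_", lookback)  # illustrative names only
lagged

# Assumption: for a direct h-step-ahead forecast, only lags >= h are
# observed at prediction time, so shorter lags would be the ones dropped.
horizon <- 2
lookback[lookback >= horizon]
```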

lookback_control

A list of numeric vectors, specifying potentially unique lags for each feature. The length of the list should equal ncol(data) and be ordered the same as the columns in data. For grouped time-series, lags for the grouping column(s) and static feature columns should have a lookback_control value of 0. list(NULL) lookback_control values drop columns from the input dataset. Lags that don't support direct forecasting for a given horizon are silently dropped. Either lookback or lookback_control must be specified.
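
A minimal sketch of assembling such a list for a grouped dataset follows. The data.frame and its columns (sales, price, store, lat) are hypothetical; the sketch only builds and checks the list and does not call create_lagged_df.

```r
# Hypothetical grouped dataset: outcome, one predictor, a grouping column,
# and a static feature.
data <- data.frame(
  sales = c(1, 2, 3, 4),
  price = c(9, 9, 8, 8),
  store = c("a", "a", "b", "b"),
  lat   = c(40, 40, 41, 41)
)

# One list element per column, in the same order as the columns of data.
lookback_control <- list(
  c(3, 6, 12),  # sales: lags unique to the outcome
  3:6,          # price: a different set of lags
  0,            # store: grouping column, not lagged
  0             # lat: static feature, not lagged
)

# The documented requirement: one entry per column.
stopifnot(length(lookback_control) == ncol(data))
```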

dates

A vector or 1-column data.frame of dates with class 'Date'. The length of dates should equal nrow(data). Required if groups are given.

frequency

Date frequency. A string taking the same input as base::seq.Date(..., by = "frequency") e.g., '1 month', '7 days', '10 years' etc. The highest frequency supported at present is '1 day'. Required if dates are given.
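
The sketch below uses only base R to show how a frequency string of this form maps row-based horizons to calendar dates. The dates and the 3-step horizon are hypothetical, and the code does not reproduce the package's internals.

```r
# Hypothetical monthly date index for a 4-row dataset.
frequency <- "1 month"
dates <- seq(as.Date("2020-01-01"), by = frequency, length.out = 4)

# A horizon of h rows corresponds to h * frequency in calendar time,
# counted from the last observed date.
horizon <- 3
forecast_dates <- seq(max(dates), by = frequency, length.out = horizon + 1)[-1]
forecast_dates
```

Here the first forecast date is the last observed date plus one frequency step, and the third is three steps ahead.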

dynamic_features

A character vector of column names that identify features that change through time but which are not lagged (e.g., weekday or year). If type = "forecast", these features will receive NA values; though, they can be filled in by the user after running this function.
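
As a sketch of filling in a dynamic feature afterwards, the example below uses a hypothetical stand-in for one forecast dataset rather than real create_lagged_df output. It assumes the dataset has a date-valued 'index' column and a dynamic feature named weekday that arrived as NA.

```r
# Hypothetical stand-in for one horizon-specific forecast dataset.
data_forecast_h <- data.frame(
  index   = as.Date(c("2020-01-06", "2020-01-07", "2020-01-08")),
  horizon = 1:3,
  weekday = NA_integer_
)

# Recompute the dynamic feature from the forecast dates.
# "%u" gives the ISO weekday number (1 = Monday, ..., 7 = Sunday).
data_forecast_h$weekday <- as.integer(format(data_forecast_h$index, "%u"))
data_forecast_h
```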

groups

A character vector of column names that identify the groups/hierarchies when multiple time-series are present. These columns are used as model features but are not lagged. Note that combining feature lags with grouped time series will result in NA values throughout the data.

static_features

For grouped time series only. A character vector of column names that identify features that do not change through time. These columns are not lagged. If type = "forecast", these features will be filled forward using the most recent value for the group.

use_future

Boolean. If TRUE, the future package is used for creating lagged data.frames. multisession or multicore futures are especially useful for (a) grouped time series with many groups and (b) high-dimensional datasets with many lags per feature. Run future::plan(future::multiprocess) prior to this function to set up multisession or multicore parallel dataset creation.

Value

An S3 object of class 'lagged_df' or 'grouped_lagged_df': A list of data.frames with new columns for the lagged/non-lagged features. The length of the returned list is equal to the number of forecast horizons and is in the order of horizons supplied to the horizons argument. Horizon-specific datasets can be accessed with my_lagged_df$horizon_h where 'h' gives the forecast horizon.

The contents of the returned data.frames are as follows:

type = 'train', non-grouped: A data.frame with the outcome and lagged features with the first 1:max(lookback) rows removed.
type = 'train', grouped: A data.frame with the outcome and unlagged grouping columns followed by lagged, dynamic, and static features.
type = 'forecast', non-grouped: (1) An 'index' column giving the row index or date of the forecast periods (e.g., a 100 row non-date-based training dataset would start with an index of 101). (2) A 'horizon' column that indicates the forecast period from 1:max(horizons). (3) Lagged features identical to the 'train', non-grouped dataset.
type = 'forecast', grouped: (1) An 'index' column giving the date of the forecast periods. The first forecast date for each group is the maximum date from the dates argument + 1 * frequency, where frequency is the user-supplied date frequency. (2) A 'horizon' column that indicates the forecast period from 1:max(horizons). (3) Lagged, static, and dynamic features identical to the 'train', grouped dataset.

Attributes

names: The horizon-specific datasets, which can be accessed with my_lagged_df$horizon_h where 'h' gives the forecast horizon.
type: The dataset type, either 'train' (training) or 'forecast' (forecasting).
horizons: Forecast horizons measured in dataset rows.
outcome_col: The column index of the target being forecasted.
outcome_names: The name of the target being forecasted.
predictor_names: The predictor or feature names from the input dataset.
row_indices: The row.names() of the output dataset. For non-grouped datasets, the first lookback + 1 rows are removed from the beginning of the dataset to remove NA values in the lagged features.
date_indices: If dates are given, the vector of dates.
frequency: If dates are given, the date/time frequency.
data_start: min(row_indices) or min(date_indices).
data_stop: max(row_indices) or max(date_indices).
groups: If groups are given, a vector of group names.
class: grouped_lagged_df, lagged_df, list
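
A short sketch of inspecting the returned object, assuming forecastML is installed and using the data_seatbelts dataset from the examples on this page. It relies only on what is described above (the class, the list length, the horizons attribute, and $horizon_h access); printed output is not shown because it depends on the installed version.

```r
library(forecastML)
data("data_seatbelts", package = "forecastML")

data_train <- create_lagged_df(data_seatbelts, type = "train", outcome_col = 1,
                               horizons = c(1, 12), lookback = 1:15)

class(data_train)                 # S3 class of the returned list
names(data_train)                 # names of the horizon-specific datasets
length(data_train)                # documented to equal the number of horizons
attributes(data_train)$horizons   # forecast horizons in dataset rows

# Access the 12-step-ahead dataset by name.
head(data_train$horizon_12)
```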

Methods and related functions

The output of create_lagged_df() is passed into

create_windows

and has the following generic S3 methods

summary
plot

Examples

# NOT RUN {
# Sampled Seatbelts data from the R package datasets.
data("data_seatbelts", package = "forecastML")
#------------------------------------------------------------------------------
# Example 1 - Training data for 2 horizon-specific models w/ common lags per predictor.
horizons <- c(1, 12)
lookback <- 1:15

data_train <- create_lagged_df(data_seatbelts, type = "train", outcome_col = 1,
                               horizons = horizons, lookback = lookback)
head(data_train[[length(horizons)]])

# Example 1 - Forecasting dataset
# The last 'nrow(data_seatbelts) - horizon' rows are automatically used from data_seatbelts.
data_forecast <- create_lagged_df(data_seatbelts, type = "forecast", outcome_col = 1,
                                  horizons = horizons, lookback = lookback)
head(data_forecast[[length(horizons)]])

#------------------------------------------------------------------------------
# Example 2 - Training data for one 3-month horizon model w/ unique lags per predictor.
horizons <- 3
# One list element per column of data_seatbelts, in column order.
lookback_control <- list(c(3, 6, 9, 12), c(4:12), c(6:15), c(8))

data_train <- create_lagged_df(data_seatbelts, type = "train", outcome_col = 1,
                               horizons = horizons, lookback_control = lookback_control)
head(data_train[[length(horizons)]])
# }
