prediction: Perform predictive inference in a Gaussian mixture dynamic Bayesian network

Description

This function performs predictive inference in a Gaussian mixture dynamic Bayesian network. For a sequence of \(T\) time slices, this task consists of choosing a time horizon \(h\) and, at each time slice \(t\) (for \(0 \le t \le T - h\)), estimating the state of the system at \(t + h\) given all the data (the evidence) collected up to \(t\). Although the states at \(t + 1, \dots , t + h\) are only observed in the future, some information about them may be known a priori (such as contextual information or features controlled by the user). This "predicted" evidence can be taken into account when propagating the particles from \(t\) to \(t + h\) in order to improve the predictions. Predictive inference is performed by sequential importance resampling, a particle-based approximate method (Koller and Friedman, 2009).
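The idea of predicting \(h\) slices ahead with particles can be illustrated on a toy problem. The following is a minimal base R sketch, not the package's implementation: it assumes a hypothetical one-dimensional linear Gaussian model (the coefficients, noise levels and observations are made up for illustration) and resamples at every time slice.

```r
set.seed(1)
n_part  <- 500                   # number of particles
horizon <- 2                     # time horizon h
y <- c(0.3, 1.1, 0.8, 1.5)       # illustrative observations (assumption)

# Toy model (assumption): x_t = 0.9 * x_{t-1} + N(0, 1), y_t = x_t + N(0, 0.5^2)
part <- rnorm(n_part)            # particles for the first time slice
pred <- numeric(length(y))       # pred[t] estimates the state at t + horizon

for (t in seq_along(y)) {
  # Weight each particle by the likelihood of the evidence at t
  w <- dnorm(y[t], mean = part, sd = 0.5)
  w <- w / sum(w)
  # Resample the particles in proportion to their weights
  part <- part[sample.int(n_part, n_part, replace = TRUE, prob = w)]
  # Propagate a copy of the particles from t to t + horizon
  fut <- part
  for (k in seq_len(horizon)) fut <- 0.9 * fut + rnorm(n_part)
  pred[t] <- mean(fut)
  # Propagate the filtering particles to the next time slice
  part <- 0.9 * part + rnorm(n_part)
}
pred
```

In this sketch no future information is used during propagation. The "predicted" evidence described above would additionally constrain or weight the particles between \(t\) and \(t + h\).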

Usage

prediction(
  gmdbn,
  evid,
  evid_pred = NULL,
  nodes = names(gmdbn$b_1),
  col_seq = NULL,
  horizon = 1,
  n_part = 1000,
  max_part_sim = 1e+06,
  min_ess = 1,
  verbose = FALSE
)

Value

If horizon has one element, a data frame with a structure similar to evid containing the predicted values of the inferred nodes and their observation sequences (if col_seq is not NULL). If horizon has two or more elements, a list of data frames (tibbles) containing these values for each time horizon.
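Because the shape of the returned value depends on the length of horizon, downstream code may want to handle both cases uniformly. A minimal sketch follows, where as_pred_list is a hypothetical helper (not part of the package) and the data frames are stand-ins with made-up values rather than real predictions.

```r
# Hypothetical helper: wrap a single data frame in a list so that both
# return shapes can be iterated over in the same way. The data frame check
# comes first because a data frame is itself a list.
as_pred_list <- function(pred) {
  if (is.data.frame(pred)) list(pred) else pred
}

# Stand-in objects mimicking the two return shapes (not real predictions)
single <- data.frame(NO2 = c(10, 12), O3 = c(40, 38))
multi  <- list(single, single)

length(as_pred_list(single))  # 1
length(as_pred_list(multi))   # 2
```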

Arguments

gmdbn: An object of class gmdbn.
evid: A data frame containing the evidence. Its columns must explicitly be named after nodes of gmdbn and can contain missing values (columns with no value can be removed).
evid_pred: A data frame containing the "predicted" evidence. Its columns must explicitly be named after nodes of gmdbn and can contain missing values (columns with no value can be removed). If NULL (the default), no "predicted" evidence is supplied.
nodes: A character vector containing the inferred nodes (by default all the nodes of gmdbn).
col_seq: A character vector containing the column names of evid and evid_pred that describe the observation sequence. If NULL (the default), all the observations belong to a single sequence. The observations of the same sequence must be ordered such that the \(t\)th one is related to time slice \(t\) (note that the sequences can have different lengths).
horizon: A positive integer vector containing the time horizons for which predictive inference is performed.
n_part: A positive integer corresponding to the number of particles generated for each observation sequence.
max_part_sim: An integer greater than or equal to n_part corresponding to the maximum number of particles that can be processed simultaneously. This argument is used to prevent memory overflow by dividing evid into smaller subsets that are handled sequentially.
min_ess: A numeric value in [0, 1] corresponding to the minimum effective sample size (ESS), expressed as a proportion of n_part, under which the renewal step of sequential importance resampling is performed. If 1 (the default), this step is performed at each time slice.
verbose: A logical value indicating whether subsets of evid and time slices in progress are displayed.
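To make the role of min_ess more concrete, the sketch below computes the effective sample size of a set of particle weights as a proportion of the number of particles, using the common estimate \(1 / \sum_i w_i^2\) for normalized weights. The function names and the exact comparison rule are assumptions for illustration; the package may compute or compare the ESS differently.

```r
# Effective sample size as a proportion of the number of particles
# (hypothetical helper, not a package function)
ess_proportion <- function(w) {
  w <- w / sum(w)                # normalize the weights
  (1 / sum(w^2)) / length(w)
}

# Assumed decision rule: renew the particles when the ESS proportion
# is not above min_ess
needs_renewal <- function(w, min_ess) ess_proportion(w) <= min_ess

w_uniform <- rep(1, 1000)            # all particles equally weighted
w_skewed  <- c(1000, rep(1, 999))    # one particle dominates

ess_proportion(w_uniform)            # close to 1
ess_proportion(w_skewed)             # about 0.004
needs_renewal(w_skewed, min_ess = 0.5)
```

With evenly spread weights the proportion stays near 1, so a threshold below 1 skips the renewal step. When a few particles dominate, the proportion collapses toward 0 and the step is triggered.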

References

Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.

Examples

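set.seed(0)
data(gmdbn_air, data_air)

# Use the air quality data as evidence and set 1536 randomly chosen
# entries (indices drawn from 1 to 7680) of NO2 and of O3 to NA
evid <- data_air
evid$NO2[sample.int(7680, 1536)] <- NA
evid$O3[sample.int(7680, 1536)] <- NA

# Predict NO2 and O3 one and two time slices ahead. TEMP and WIND are
# passed as "predicted" evidence, and DATE identifies the sequences.
pred <- prediction(gmdbn_air, evid, evid[, c("DATE", "TEMP", "WIND")],
                   nodes = c("NO2", "O3"), col_seq = "DATE",
                   horizon = c(1, 2), verbose = TRUE)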