bn.cv(data, bn, loss = NULL, k = 10, m, runs = 1, algorithm.args = list(),
      loss.args = list(), fit = "mle", fit.args = list(), method = "k-fold",
      cluster = NULL, debug = FALSE)

## S3 method for class 'bn.kcv':
plot(x, ..., main, xlab, ylab, connect = FALSE)

## S3 method for class 'bn.kcv.list':
plot(x, ..., main, xlab, ylab, connect = FALSE)
Arguments:

data: a data frame containing the variables in the model.
bn: either a character string (the label of the learning algorithm to be
  applied to the training data) or an object of class bn (a fixed
  network structure).
loss: a character string, the label of a loss function. See below for
  details.
k: a positive integer, the number of groups into which the data are
  split (for k-fold cross-validation) or the number of subsamples (for
  hold-out cross-validation).
m: a positive integer, the size of each test subsample in hold-out
  cross-validation.
runs: a positive integer, the number of cross-validation runs.
algorithm.args: a list of extra arguments to be passed to the learning
  algorithm.
loss.args: a list of extra arguments to be passed to the loss function
  specified by loss. See below for details.
fit: a character string, the label of the method used to fit the
  parameters of the network; see bn.fit for details.
fit.args: additional arguments for the parameter fitting method; see
  bn.fit for details.
method: a character string, either k-fold or hold-out. See below for
  details.
cluster: an optional cluster object from package parallel; see parallel
  integration for details and a simple example.
debug: a logical value. If TRUE a lot of debugging output is printed;
  otherwise the function is completely silent.
x: an object of class bn.kcv or bn.kcv.list returned by bn.cv.
...: additional objects of class bn.kcv or bn.kcv.list to plot
  alongside the first.

Value:

An object of class bn.kcv.list if runs is at least 2, an object of
class bn.kcv if runs is equal to 1.
For k-fold cross-validation, the data are split into k subsets of equal
size. For each subset in turn, bn is fitted (and possibly learned as
well) on the other k - 1 subsets and the loss function is then computed
using that subset. Loss estimates for each of the k subsets are then
combined to give an overall loss for data.

For hold-out cross-validation, k subsamples of size m are sampled
independently without replacement from the data. For each subsample, bn
is fitted (and possibly learned) on the remaining nrow(data) - m samples
and the loss function is computed on the m observations in the
subsample. The overall loss estimate is the average of the k loss
estimates from the subsamples.

If either form of cross-validation is used with multiple runs, the
overall loss is the average of the loss estimates from the different
runs.
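As a sketch of the two schemes above (assuming the bnlearn package is
installed and using its bundled learning.test dataset of 5000
observations):

```r
library(bnlearn)
data(learning.test)

# 10-fold cross-validation: the data are split into 10 folds of equal
# size, and each fold serves once as the test set.
cv.kfold = bn.cv(learning.test, bn = "hc", method = "k-fold", k = 10)

# hold-out cross-validation: 10 independent test subsamples of m = 500
# observations each; the network is fitted on the remaining
# nrow(learning.test) - 500 observations every time.
cv.hold = bn.cv(learning.test, bn = "hc", method = "hold-out",
                k = 10, m = 500)

# multiple runs: the overall loss is the average over the runs.
cv.runs = bn.cv(learning.test, bn = "hc", runs = 5)
```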
The following loss functions are available:

- Log-Likelihood Loss (logl): also known as negative entropy or
  negentropy, the negated expected log-likelihood of the test set for
  the Bayesian network fitted from the training set.
- Gaussian Log-Likelihood Loss (logl-g): the negated expected
  log-likelihood for Gaussian Bayesian networks.
- Classification Error (pred): the prediction error for a single node
  in a discrete network. Frequentist predictions are used, so the
  values of the target node are predicted using only the information
  present in its local distribution (from its parents).
- Posterior Classification Error (pred-lw): similar to the above, but
  predictions are computed from an arbitrary set of nodes using
  likelihood weighting to obtain Bayesian posterior estimates.
- Predictive Correlation (cor): the correlation between the observed
  and the predicted values for a single node in a Gaussian Bayesian
  network.
- Posterior Predictive Correlation (cor-lw): similar to the above, but
  predictions are computed from an arbitrary set of nodes using
  likelihood weighting to obtain Bayesian posterior estimates.
- Mean Squared Error (mse): the mean squared error between the observed
  and the predicted values for a single node in a Gaussian Bayesian
  network.
- Posterior Mean Squared Error (mse-lw): similar to the above, but
  predictions are computed from an arbitrary set of nodes using
  likelihood weighting to obtain Bayesian posterior estimates.

Optional arguments that can be specified in loss.args are:
- target: a character string, the label of the target node for
  prediction in all loss functions but logl, logl-g and logl-cg.
- from: a vector of character strings, the labels of the nodes used to
  predict the target node in pred-lw, cor-lw and mse-lw. The default is
  to use all the other nodes in the network. Loss functions pred, cor
  and mse implicitly predict only from the parents of the target node.
- n: a positive integer, the number of particles used by likelihood
  weighting for pred-lw, cor-lw and mse-lw. The default value is 500.

Note that if bn is a Bayesian network classifier, pred and pred-lw both
give exact posterior predictions computed using the closed-form formulas
for naive Bayes and TAN.
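For instance, a sketch of passing these options through loss.args
(again assuming the bnlearn package and its bundled learning.test data;
the choice of nodes here is purely illustrative):

```r
library(bnlearn)
data(learning.test)

# posterior classification error for node F, predicting from nodes A
# and B only, with 1000 likelihood-weighting particles instead of the
# default 500.
cv = bn.cv(learning.test, bn = "hc", loss = "pred-lw",
           loss.args = list(target = "F", from = c("A", "B"), n = 1000))
```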
The plot methods accept any combination of objects of class bn.kcv or
bn.kcv.list (the first as the x argument, the remaining as the ...
argument) and plot the respective expected loss values side by side.
For a bn.kcv object, this means a single point; for a bn.kcv.list
object this means a boxplot.

See Also: bn.boot, rbn, bn.kcv-class.

Examples:

bn.cv(learning.test, 'hc', loss = "pred", loss.args = list(target = "F"))
bn.cv(gaussian.test, 'mmhc', method = "hold-out", k = 5, m = 50, runs = 2)
gaussian.subset = gaussian.test[1:50, ]
cv.gs = bn.cv(gaussian.subset, 'gs', runs = 10)
cv.iamb = bn.cv(gaussian.subset, 'iamb', runs = 10)
cv.inter = bn.cv(gaussian.subset, 'inter.iamb', runs = 10)
plot(cv.gs, cv.iamb, cv.inter,
     xlab = c("Grow-Shrink", "IAMB", "Inter-IAMB"), connect = TRUE)