subset_aggregation: Subset Aggregation over locations and data streams, naive or fast.

Description

Compute the most likely cluster (MLC) using one of three versions of the Subset Aggregation method by Neill et al. (2013). The methods are:

FF: Fast optimization over both subsets of locations and subsets of data streams.
FN: Fast optimization over subsets of locations and naive optimization over subsets of streams. Can be used if the number of data streams is small.
NF: Fast optimization over subsets of streams and naive optimization over subsets of locations. Can be used if the number of locations is small.
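
To see why the naive variants are only recommended for a small number of streams or locations, the sketch below enumerates every non-empty subset of a small index set in base R. It assumes that "naive" means exhaustive enumeration of subsets; the helper all_nonempty_subsets is hypothetical and not part of the package.

```r
# Hypothetical helper (not part of the package): list every non-empty subset
# of the indices 1..n, as an exhaustive search would have to.
all_nonempty_subsets <- function(n) {
  unlist(
    lapply(seq_len(n), function(k) combn(n, k, simplify = FALSE)),
    recursive = FALSE
  )
}

# With 3 data streams there are 2^3 - 1 = 7 candidate subsets
stream_subsets <- all_nonempty_subsets(3)
length(stream_subsets)

# The number of candidates roughly doubles with each extra stream or location
2^c(2, 5, 10, 20) - 1
```

With 20 locations the exhaustive search already faces more than a million subsets, which is the situation the fast variants are meant to avoid.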

Usage

subset_aggregation(args, score_fun = poisson_score,
  priority_fun = poisson_priority, algorithm = "FF", R = 50,
  rel_tol = 0.01)

Arguments

args

A list of arrays:

counts: Required. An array of counts (integer or numeric). First dimension is time, ordered from most recent to most distant. Second dimension indicates locations, which will be enumerated from 1 and up. Third dimension indicates data streams, which will be enumerated from 1 and up.
baselines: Required. An array of expected counts. Dimensions are as for counts.
penalties: Optional. An array of penalty terms. Dimensions are as for counts.
...: Optional. More arrays with distribution parameters. Dimensions are as for counts.
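
As an illustration of the expected layout, the following sketch builds such a list from toy data that is stored chronologically (oldest period first) and reverses the time dimension so that the first row is the most recent period. The object names and sizes are made up for the example.

```r
# Toy data stored chronologically (oldest period first); sizes are arbitrary
n_time <- 4
n_loc <- 3
n_streams <- 2
counts_chrono <- array(seq_len(n_time * n_loc * n_streams),
                       dim = c(n_time, n_loc, n_streams))
baselines_chrono <- array(5, dim = dim(counts_chrono))

# Reverse the first dimension so that row 1 is the most recent period;
# drop = FALSE keeps all three dimensions even if one has length 1
newest_first <- rev(seq_len(n_time))
sa_args <- list(
  counts    = counts_chrono[newest_first, , , drop = FALSE],
  baselines = baselines_chrono[newest_first, , , drop = FALSE]
)

# Every element should have dimensions time x locations x streams
dim(sa_args$counts)
identical(dim(sa_args$counts), dim(sa_args$baselines))
```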

score_fun

A function taking matrix arguments, all of the same dimension, and returning a matrix or vector of that dimension. Suitable alternatives are poisson_score, gaussian_score.

priority_fun

A function taking matrix arguments, all of the same dimension, and returning a matrix or vector of that dimension. Suitable alternatives are poisson_priority, gaussian_priority.

algorithm

One of "FF", "FN" or "NF":

FF: Fast optimization over both subsets of locations and subsets of data streams.
FN: Fast optimization over subsets of locations and naive optimization over subsets of streams. Can be used if the number of data streams is small.
NF: Fast optimization over subsets of streams and naive optimization over subsets of locations. Can be used if the number of locations is small.

R

The number of random restarts.

rel_tol

The relative tolerance criterion. If the current score divided by the previous score, minus one, is less than this number then the algorithm is deemed to have converged.
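
The criterion can be written out as a small check. This is only a sketch of the rule as described above, assuming positive scores; has_converged is a hypothetical helper, not a function exported by the package.

```r
# Hypothetical helper: the relative improvement in the score is compared
# against rel_tol (assumes previous_score > 0)
has_converged <- function(current_score, previous_score, rel_tol = 0.01) {
  (current_score / previous_score - 1) < rel_tol
}

has_converged(10.05, 10.00)  # relative improvement of about 0.005: TRUE
has_converged(10.50, 10.00)  # relative improvement of about 0.05: FALSE
```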

Value

A list containing the most likely cluster (MLC), having the following elements:

score: A scalar; the score of the MLC.
duration: An integer; the duration of the MLC, i.e. how many time periods from the present into the past the MLC stretches.
locations: An integer vector; the locations contained in the MLC.
streams: An integer vector; the data streams contained in the MLC.
random_restarts: FF only. The number of random restarts performed.
iter_to_conv: FF only. The number of iterations it took to reach convergence for each random restart.
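
Because the first dimension of counts runs from the most recent period backwards, the cells belonging to the MLC can be picked out by indexing. The sketch below uses a made-up result list with the elements documented above, not actual output from subset_aggregation.

```r
# Hypothetical result with the documented elements (values are made up)
mlc <- list(score = 12.3, duration = 2L, locations = c(1L, 3L), streams = 1L)

# Toy counts: 4 time periods (most recent first) x 3 locations x 2 streams
counts <- array(seq_len(24), dim = c(4, 3, 2))

# Cells in the cluster: the `duration` most recent periods, restricted to
# the cluster's locations and streams
mlc_counts <- counts[seq_len(mlc$duration), mlc$locations, mlc$streams,
                     drop = FALSE]
dim(mlc_counts)  # 2 periods x 2 locations x 1 stream
sum(mlc_counts)  # total count inside the cluster
```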

Details

Note: the implemented algorithm differs slightly from that of Neill et al. (2013), since the randomly chosen subset of streams is the same for all time windows.

References

Neill, Daniel B., Edward McFowland, and Huanian Zheng (2013). Fast subset scan for multivariate event detection. Statistics in Medicine 32 (13), pp. 2185-2208.

Examples

# NOT RUN {
# Set simulation parameters (small)
set.seed(1)
n_loc <- 20
n_dur <- 10
n_streams <- 2
n_tot <- n_loc * n_dur * n_streams

# Generate baselines and possibly other distribution parameters
baselines <- rexp(n_tot, 1/5) + rexp(n_tot, 1/5)
sigma2s <- rexp(n_tot)

# Generate counts
counts <- rpois(n_tot, baselines)

# Reshape into arrays
counts <- array(counts, c(n_dur, n_loc, n_streams))
baselines <- array(baselines, c(n_dur, n_loc, n_streams))
sigma2s <- array(sigma2s, c(n_dur, n_loc, n_streams))

# Inject an outbreak/event
ob_loc <- 1:floor(n_loc / 4)
ob_dur <- 1:floor(n_dur / 4)
ob_streams <- 1:floor(n_streams / 2)
counts[ob_dur, ob_loc, ob_streams] <- 4 * counts[ob_dur, ob_loc, ob_streams]

# Run the FN algorithm
FN_res <- subset_aggregation(
  list(counts = counts, baselines = baselines),
  score_fun = poisson_score,
  priority_fun = poisson_priority,
  algorithm = "FN")
  
# Run the FF algorithm (few random restarts)
FF_res <- subset_aggregation(
  list(counts = counts, baselines = baselines),
  score_fun = poisson_score,
  priority_fun = poisson_priority,
  algorithm = "FF",
  R = 10)
# }
