scan_negbin: Calculate the negative binomial scan statistic.

Description

Calculate the expectation-based negative binomial scan statistic by supplying a data.table of observed counts and pre-computed distribution parameters for each location and time. A p-value for the observed scan statistic can be obtained by Monte Carlo simulation.

Usage

scan_negbin(table, zones, n_mcsim = 0, version = "ordinary")

Arguments

table

A data.table with columns location, duration, mu, theta, count. The location column should consist of integers that are unique to each location. The duration column should also consist of integers, starting at 1 for the most recent time period and increasing in reverse chronological order.

A negative binomial distribution parametrized by $\mu$ and $\theta$ (columns mu and theta respectively) has expected value $\mu$ and variance $\mu+\mu^2/\theta$. The parameter $\theta$ is referred to as the size in NegBinomial, and theta in negative.binomial.

zones

A set of zones, each zone itself a set containing one or more locations of those found in table.

n_mcsim

A non-negative integer; the number of replicate scan statistics to generate in order to calculate a p-value.

version

Which version of the negative binomial score scan statistic to calculate: either "ordinary" (default) or "increasing". See details.

Value

An object of class scanstatistics. It has the following fields:

observed: A data.table containing the value of the statistic calculated for each zone-duration combination, for the observed data. The scan statistic is the maximum value of these calculated statistics.
replicated: A numeric vector of length n_mcsim containing the values of the scanstatistics calculated by Monte Carlo simulation.
mlc: A data.table containing the zone, duration, and scanstatistic.
pvalue: The p-value calculated from Monte Carlo replications.
distribution: The assumed distribution of the data; "negative binomial" in this case.
type: The type of scan statistic; "Expectation-based" in this case.
zones: The set of zones that was passed to the function as input.
n_locations: The number of locations in the data.
n_zones: The number of zones.
max_duration: The maximum outbreak/event/anomaly duration considered.

Details

For the expectation-based negative binomial scan statistic (Tango et al., 2011), the null hypothesis of no anomaly holds that the count observed at each location $i$ and duration $t$ (the number of time periods before present) has a negative binomial distribution with expected value $\mu_{it}$ and dispersion parameter $\theta_{it}$: $$ H_0 : Y_{it} \sim \textrm{NegBin}(\mu_{it}, \theta_{it}). $$ This holds for all locations $i = 1, \ldots, m$ and all durations $t = 1, \ldots,T$, with $T$ being the maximum duration considered. The alternative hypothesis depends on the version used: if version == "ordinary", then the alternative hypothesis states that there is a space-time window $W$ consisting of a spatial zone $Z \subset \{1, \ldots, m\}$ and a time window $D \subseteq \{1, \ldots, T\}$ such that the counts in this window have their expected values inflated by a factor $q_W > 1$ compared to the null hypothesis: $$ H_1 : Y_{it} \sim \textrm{NegBin}(q_W \mu_{it}, \theta_{it}), ~~(i,t) \in W. $$ If version == "increasing", $q_W$ is instead increasing over time (decreasing with duration). For locations and durations outside of this window, counts are assumed to be distributed as under the null hypothesis. The sets $Z$ considered are those specified in the argument zones, while the maximum duration $T$ is taken as the maximum value in the column duration of the input table. For each space-time window $W$ considered, a score statistic is computed using the score function and Fisher information under the null hypothesis of no anomaly. The scan statistic is calculated as the maximum of these quantities over all space-time windows. Point estimates of the parameters $\mu_{it}$ and $\theta_{it}$ must be specified in the column mu and theta of the argument table before this function is called.

References

Tango, T., Takahashi, K. & Kohriyama, K. (2011), A space-time scan statistic for detecting emerging outbreaks, Biometrics 67(1), 106<U+2013>115.

Examples

Run this code

# NOT RUN {
# Simple example
set.seed(1)
table <- scanstatistics:::create_table(list(location = 1:4, duration = 1:4),
                                        keys = c("location", "duration"))
table[, mu := 3 * location]
table[, theta := 2]
table[, count := rnbinom(.N, mu = mu, size = theta)]
table[location %in% c(1, 4) & duration < 3, 
      count :=  rnbinom(.N, mu = 2 * mu, size = theta)]
zones <- scanstatistics:::powerset_zones(4)
result1 <- scan_negbin(table, zones, 100, "ordinary")
result2 <- scan_negbin(table, zones, 100, "increasing")
# }

Run the code above in your browser using DataLab