Calculate the expectation-based negative binomial scan statistic by supplying
a data.table
of observed counts and pre-computed distribution
parameters for each location and time. A p-value for the observed scan
statistic can be obtained by Monte Carlo simulation.
scan_negbin(table, zones, n_mcsim = 0, version = "ordinary")
A data.table
with columns location, duration, mu,
theta, count
. The location
column should consist of integers that
are unique to each location. The duration
column should also
consist of integers, starting at 1 for the most recent time period and
increasing in reverse chronological order.
A negative binomial distribution parametrized by \(\mu\) and
\(\theta\) (columns mu
and theta
respectively) has
expected value \(\mu\) and variance \(\mu+\mu^2/\theta\). The
parameter \(\theta\) is referred to as the size
in
NegBinomial
, and theta
in
negative.binomial
.
A set
of zones, each zone itself a
set containing one or more locations of those found in table
.
A non-negative integer; the number of replicate scan statistics to generate in order to calculate a p-value.
Which version of the negative binomial score scan statistic to calculate: either "ordinary" (default) or "increasing". See details.
An object of class scanstatistics
. It has the following
fields:
A data.table
containing the value of the
statistic calculated for each zone-duration combination,
for the observed data. The scan statistic is the maximum
value of these calculated statistics.
A numeric vector of length n_mcsim
containing
the values of the scanstatistics calculated by Monte
Carlo simulation.
A data.table
containing the zone, duration, and
scanstatistic.
The p-value calculated from Monte Carlo replications.
The assumed distribution of the data; "negative binomial" in this case.
The type of scan statistic; "Expectation-based" in this case.
The set of zones that was passed to the function as input.
The number of locations in the data.
The number of zones.
The maximum outbreak/event/anomaly duration considered.
For the expectation-based negative binomial scan statistic (Tango
et al., 2011), the null hypothesis of no anomaly holds that the count
observed at each location \(i\) and duration \(t\) (the number of time
periods before present) has a negative binomial distribution with expected
value \(\mu_{it}\) and dispersion parameter \(\theta_{it}\):
$$
H_0 : Y_{it} \sim \textrm{NegBin}(\mu_{it}, \theta_{it}).
$$
This holds for all locations \(i = 1, \ldots, m\) and all durations
\(t = 1, \ldots,T\), with \(T\) being the maximum duration considered.
The alternative hypothesis depends on the version used: if version
== "ordinary"
, then the alternative hypothesis states that there is a
space-time window \(W\) consisting of a spatial zone \(Z \subset \{1,
\ldots, m\}\) and a time window \(D \subseteq \{1, \ldots, T\}\) such
that the counts in this window have their expected values inflated by a
factor \(q_W > 1\) compared to the null hypothesis:
$$
H_1 : Y_{it} \sim \textrm{NegBin}(q_W \mu_{it}, \theta_{it}),
~~(i,t) \in W.
$$
If version == "increasing"
, \(q_W\) is instead increasing over
time (decreasing with duration
).
For locations and durations outside of this window, counts are assumed to
be distributed as under the null hypothesis. The sets \(Z\) considered
are those specified in the argument zones
, while the maximum
duration \(T\) is taken as the maximum value in the column
duration
of the input table
. For each space-time window
\(W\) considered, a score statistic is computed using the score function
and Fisher information under the null hypothesis of no anomaly.
The scan statistic is calculated as the maximum of these quantities over
all space-time windows. Point estimates of the parameters \(\mu_{it}\)
and \(\theta_{it}\) must be specified in the column mu
and
theta
of the argument table
before this function is called.
Tango, T., Takahashi, K. & Kohriyama, K. (2011), A space-time scan statistic for detecting emerging outbreaks, Biometrics 67(1), 106<U+2013>115.
# NOT RUN {
# Simple example
set.seed(1)
table <- scanstatistics:::create_table(list(location = 1:4, duration = 1:4),
keys = c("location", "duration"))
table[, mu := 3 * location]
table[, theta := 2]
table[, count := rnbinom(.N, mu = mu, size = theta)]
table[location %in% c(1, 4) & duration < 3,
count := rnbinom(.N, mu = 2 * mu, size = theta)]
zones <- scanstatistics:::powerset_zones(4)
result1 <- scan_negbin(table, zones, 100, "ordinary")
result2 <- scan_negbin(table, zones, 100, "increasing")
# }
Run the code above in your browser using DataLab