MGBT: Multiple Grubbs--Beck Test (MGBT) for Low Outliers

Description

Perform the Multiple Grubbs--Beck Test (MGBT; Cohn and others, 2013) for low outliers (LOTs, low-outlier threshold; potentially influential low floods, PILFs) as implemented in the USGS-PeakFQ software (USGS, 2018; Veilleux and others, 2014) in support of Bulletin 17C (B17C) (England and others, 2018). The test internally transforms the data to base-10 logarithms and thus is oriented toward positively distributed data, but it accommodates zeros in the dataset.
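A minimal sketch of the internal log transform, assuming the zero protection described under Arguments (values raised to 1e-8 through pmax). The helper name log10_protected and its floor_value argument are hypothetical and are not part of the package.

```r
# Hypothetical helper (not part of the MGBT package): base-10 logarithms
# with zero and negative values first raised to a small positive floor.
log10_protected <- function(x, floor_value = 1e-8) {
  log10(pmax(x, floor_value))
}

peaks <- c(0, 55, 284, 1000)   # illustrative values, including a zero flow
log10_protected(peaks)         # the zero is treated as 1e-8 before the log
```

Because the floored values sit far below any realistic peak, they fall at the bottom of the ordered sample.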

The essence of the MGBT, given the order statistics $x_{[1:n]} \le x_{[2:n]} \le \cdots \le x_{[(n-1):n]} \le x_{[n:n]}$, is the statistic $$GB_r = \omega_r = \frac{ x_{[r:n]} - \mathrm{mean}\{x_{[(r+1)\rightarrow n:n]}\} } {\sqrt{\mathrm{var}\{x_{[(r+1)\rightarrow n:n]}\}}}\mbox{,} $$ which can be computed by MGBTcohn2011, a port of a function of TAC's that was used in a testing script reproduced in the Examples of RSlo. Variations of this pseudo-standardization scheme are shown for BLlo and RSlo. Also, $GB_r$ is the canonical form of the variable eta in TAC sources and peta=peta will be its associated probability.
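The statistic can be sketched directly in base R. This is an illustration only, not the package's MGBTcohn2011: the name gb_r is hypothetical, the input is used as given (apply log10 first if desired), and the variance is assumed to be the sample variance returned by var().

```r
# Hypothetical helper: GB_r for the r-th smallest value of x.
# The input is not log-transformed here.
gb_r <- function(x, r) {
  x <- sort(x)
  n <- length(x)
  stopifnot(r >= 1, r <= n - 2)   # var() needs at least two larger values
  upper <- x[(r + 1):n]           # order statistics r+1 through n
  (x[r] - mean(upper)) / sqrt(var(upper))
}

logq <- log10(c(1, 10, 100, 1000))  # illustrative data with logs 0, 1, 2, 3
gb_r(logq, r = 1)                   # (0 - mean(1, 2, 3)) / sd(1, 2, 3)
```

Strongly negative values indicate that the r-th smallest observation lies far below the remaining upper portion of the sample.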

Usage

MGBT(...) # A wrapper on MGBT17c()---This is THE function for end users.
     MGBT17c(x, alphaout=0.005, alphain=0, alphazeroin=0.10,
                n2=floor(length(x)/2), napv.zero=TRUE, offset=0, min.obs=0)
MGBT17c.verb(x, alphaout=0.005, alphain=0, alphazeroin=0.10,
                n2=floor(length(x)/2), napv.zero=TRUE, offset=0, min.obs=0)
MGBTcohn2016(x, alphaout=0.005, alphazeroin=0.10, n2=floor(length(x)/2),
                napv.zero=TRUE, offset=0)
MGBTcohn2013(x, alphaout=0.005, alphazeroin=0.10, n2=floor(length(x)/2),
                napv.zero=TRUE, offset=0)
      MGBTnb(x, alphaout=0.005, alphazeroin=0.10, n2=floor(length(x)/2),
                napv.zero=TRUE, offset=0)
MGBTcohn2011(x, r=NULL, n=length(x)) # only computes the GB_r, not a test

Arguments

...

Arguments to pass to the MGBT family of functions;

x

The data values. Base-10 logarithms of these are computed internally, except by the MGBTcohn2011 function, which does not transform its input (see Examples for RSlo). Zero or negative values are protected against by the R function pmax, which replaces them with a “small” value of 1e-8; TAC tacitly assumed that the p-values for these will be small enough that they are truncated away;

alphaout

Literally the $\alpha_\mathrm{out}$ of Bulletin 17C. This is the significance level of the “sweep out” portion of MGBT;

alphain

This is the significance level of the “sweep in” portion of MGBT but starts at one plus the order statistic identified by alphaout;

alphazeroin

Literally the $\alpha_\mathrm{in}$ of Bulletin 17C. This is the significance level of the “sweep in” portion of MGBT;

napv.zero

A logical switch to reset a returned NA from RthOrderPValueOrthoT to zero. This is a unique extension by WHA based on large-scale batch testing of the USGS peak-values database (see Note). That said, the fall-back to Monte Carlo integration when the numerical integration fails seems to make this argument mostly superfluous;

offset

The offset, if not NA, is added to the threshold unless the threshold itself is already zero. In practical application, this offset, if set, would likely be a negative quantity. This argument is a unique extension by WHA;

min.obs

The minimum number of observations. This option is provided to streamline larger applications, but the underlying logic in MGBT17c is robust and, on failures because of small sample sizes, returns a threshold of 0 anyway;

n2

The number of smallest values (the n2-smallest) to be evaluated in the MGBT;

r

The number of truncated observations, which can be thought of as the rth order statistic and below; and

n

The number of observations. It is not clear that TAC intended n to differ from the sample size, but TAC chose not to rely solely on the length of x as determined inside the function and instead made n available as an argument. Functions BLlo and RSlo were designed similarly.
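The roles of alphaout, alphain, and alphazeroin can be sketched as three passes over the p-values of the n2-smallest values. This is one reading of the argument descriptions, not the package's implementation: the function name sweep_indices is hypothetical, and the choice of returning 0 when a sweep identifies nothing is an assumption.

```r
# Hypothetical helper: sweep indices from p-values ordered from the
# smallest observation upward.
sweep_indices <- function(pvalues, alphaout = 0.005, alphain = 0,
                          alphazeroin = 0.10) {
  n2 <- length(pvalues)

  # Sweep out: the largest index whose p-value falls below alphaout.
  hits <- which(pvalues < alphaout)
  ix_out <- if (length(hits) > 0) max(hits) else 0

  # Sweep in: start one above ix_out, advance while p-values stay below alphain.
  ix_in <- 0
  i <- ix_out + 1
  while (i <= n2 && pvalues[i] < alphain) {
    ix_in <- i
    i <- i + 1
  }

  # Sweep in from zero: start at the smallest value and advance likewise.
  ix_zero <- 0
  i <- 1
  while (i <= n2 && pvalues[i] < alphazeroin) {
    ix_zero <- i
    i <- i + 1
  }

  c(ix_alphaout = ix_out, ix_alphain = ix_in, ix_alphazeroin = ix_zero)
}

# First six p-values from the Examples output for USGS 08066300.
p <- c(0.01192184, 0.30337879, 0.08198836, 0.04903091, 0.02949836, 0.02700114)
ix <- sweep_indices(p)
klow <- max(ix)  # assumed: the number of low outliers is the largest index
```

With the default levels, only the sweep in from zero flags the smallest value here, which agrees with the index row shown in the Examples.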

Value

The MGBT results as an R list:

index

The sample size $n$, the value for n2, and the three indices of the “sweep out,” “sweep in,” and “sweep in from zero” processing (only for MGBT17c as this is an extension from TAC);

omegas

The $GB_r = \omega_r$ statistics for which the p-values in pvalues are shown. These are mostly returned for aid in debugging and verification of the algorithms;

x

The n2-smallest values in increasing order (only for MGBT17c as this is an extension from TAC);

pvalues

The p-values of the n2-smallest values of the sample (not available for MGBT17c because of algorithm design for speed);

klow

The number of low outliers detected;

LOThresh

The low-outlier threshold for the klow+1 index of the sample (and possibly adjusted by the offset) or simply zero; and

message

Possibly a message in the event of some internal difficulty.

The inclusion of x in the returned value is to add symmetry because the p-values are present. The inclusion of n and n2 might make percentage computations of inward and outward sweep indices useful in exploratory analyses. Finally, the inclusion of the sweep indices is important as it was through inspection of these that the problems in TAC sources were discovered.
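How klow, the offset, and LOThresh relate can be sketched as follows, together with the percentage computation mentioned above. The helper lot_from_klow is hypothetical, and returning zero when no low outliers are found is an assumption based on the "or simply zero" wording.

```r
# Hypothetical helper: threshold at the klow+1 order statistic, with the
# offset added unless it is NA or the threshold is already zero.
lot_from_klow <- function(x, klow, offset = 0) {
  if (klow == 0) return(0)        # assumed: no low outliers gives zero
  thresh <- sort(x)[klow + 1]
  if (!is.na(offset) && thresh != 0) thresh <- thresh + offset
  thresh
}

x_low <- c(55, 284, 292, 351, 410)  # smallest values from the Examples output
lot_from_klow(x_low, klow = 1)      # the second-smallest value, 284

# Sweep indices as percentages of the sample size, using the index row
# from the Examples output.
index <- c(n = 51, n2 = 25, ix_alphaout = 0, ix_alphain = 0, ix_alphazeroin = 1)
pct <- 100 * index[3:5] / index[["n"]]
```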

References

Asquith, W.H., 2019, lmomco---L-moments, trimmed L-moments, L-comoments, censored L-moments, and many distributions: R package version 2.3.2 (September 20, 2018), accessed March 30, 2019, at https://cran.r-project.org/package=lmomco.

Cohn, T.A., 2013--2016, Personal communication of original R source code: U.S. Geological Survey, Reston, Va.

Cohn, T.A., England, J.F., Berenbrock, C.E., Mason, R.R., Stedinger, J.R., and Lamontagne, J.R., 2013, A generalized Grubbs-Beck test statistic for detecting multiple potentially influential low outliers in flood series: Water Resources Research, v. 49, no. 8, pp. 5047--5058.

England, J.F., Cohn, T.A., Faber, B.A., Stedinger, J.R., Thomas Jr., W.O., Veilleux, A.G., Kiang, J.E., and Mason, R.R., 2018, Guidelines for determining flood flow frequency Bulletin 17C: U.S. Geological Survey Techniques and Methods, book 4, chap. 5.B, 148 p., https://doi.org/10.3133/tm4B5

U.S. Geological Survey (USGS), 2018, PeakFQ---Flood frequency analysis based on Bulletin 17B and recommendations of the Advisory Committee on Water Information (ACWI) Subcommittee on Hydrology (SOH) Hydrologic Frequency Analysis Work Group (HFAWG), version 7.2: Accessed November 29, 2018, at https://water.usgs.gov/software/PeakFQ/.

Veilleux, A.G., Cohn, T.A., Flynn, K.M., Mason, R.R., Jr., and Hummel, P.R., 2014, Estimating magnitude and frequency of floods using the PeakFQ 7.0 program: U.S. Geological Survey Fact Sheet 2013--3108, 2 p., https://dx.doi.org/10.3133/fs20133108.

Examples

# NOT RUN {
# USGS 08066300 (1966--2016) # cubic feet per second (cfs)
#https://nwis.waterdata.usgs.gov/nwis/peak?site_no=08066300&format=hn2
Values <- c(3530, 284, 1810, 9660,  489,  292, 1000,  2640, 2910, 1900,  1120, 1020,
   632, 7160, 1750,  2730,  1630, 8210, 4270, 1730, 13200, 2550,  915, 11000, 2370,
  2230, 4650, 2750,  1860, 13700, 2290, 3390, 5160, 13200,  410, 1890,  4120, 3930,
  4290, 1890, 1480, 10300,  1190, 2320, 2480, 55.0,  7480,  351,  738,  2430, 6700)
MGBT(Values) # Results LOT=284 cfs leaving 55.0 cfs (p-value=0.0119) censored.
#$index
#      n      n2    ix_alphaout     ix_alphain   ix_alphazeroin
#     51      25              0              0                1
#$omegas
# [1] -3.781980 -2.268554 -2.393569 -2.341027 -2.309990 -2.237571
# [7] -2.028614 -1.928391 -1.720404 -1.673523 -1.727138 -1.671534
#[13] -1.661346 -1.391819 -1.293324 -1.246974 -1.276485 -1.272878
#[19] -1.280917 -1.310286 -1.372402 -1.434898 -1.226588 -1.237743
#[25] -1.276794
#$x
# [1]   55  284  292  351  410  489  632  738  915 1000 1020 1120 1190 1480 1630 1730
#[17] 1750 1810 1860 1890 1890 1900 2230 2290 2320
#$pvalues
# [1] 0.01192184 0.30337879 0.08198836 0.04903091 0.02949836 0.02700114 0.07802324
# [8] 0.11185553 0.31531749 0.34257170 0.21560086 0.25950150 0.24113157 0.72747052
#[15] 0.86190920 0.89914152 0.84072131 0.82381908 0.78750571 0.70840262 0.55379730
#[22] 0.40255392 0.79430336 0.75515103 0.66031442
#$LOThresh
#[1] 284

# The USGS-PeakFQ (v7.1) software reports:
#   EMA003I-PILFS (LOS) WERE DETECTED USING MULTIPLE GRUBBS-BECK TEST   1     284.0
#      THE FOLLOWING PEAKS (WITH CORRESPONDING P-VALUES) WERE CENSORED:
#            55.0    (0.0123)
# As a curiosity, see Examples under ASlo().
# }
# NOT RUN {
# MGBTnb() has a sweep in problem.
SweepIn <- c(1, 1, 3200, 5270, 26300, 38400, 8710, 23200, 39300, 27800, 21000,
  21000, 21500, 57000, 53700, 5720, 10700, 4050, 4890, 10500, 26300, 16600, 20900,
  21400, 10800, 8910, 6360)  # sweep in and out both identify index 2.
MGBT17c(SweepIn, alphaout=0)$LOThresh # LOT = 3200 # force no sweep outs
MGBTnb(SweepIn)$LOThresh              # LOT = 3200 # because sweep out is the same!
MGBTnb(SweepIn, alphaout=0) # LOT = 1  # force no sweep outs, it fails.
# }