ctbi.outlier: ctbi.outlier

Description

Please cite the following companion paper if you're using the ctbi package: Ritter, F.: Technical note: A procedure to clean, decompose, and aggregate time series, Hydrol. Earth Syst. Sci., 27, 349–361, https://doi.org/10.5194/hess-27-349-2023, 2023.

Outliers in a univariate dataset y are flagged using an enhanced box plot rule (called Logbox, input: coeff.outlier) that is adapted to non-Gaussian data and keeps the type I error at \(\frac{0.1}{\sqrt{n}}\) % (percentage of erroneously flagged outliers).
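As a quick illustration of how this target rate shrinks with sample size, the short base R sketch below evaluates \(\frac{0.1}{\sqrt{n}}\) % for a few values of n. The variable names are hypothetical, and the expected count of false flags assumes the percentage applies to the n points of an outlier-free sample.

```r
# Target type I error of Logbox for a few illustrative sample sizes.
n <- c(9, 100, 10000)

# Percentage of points erroneously flagged: 0.1 / sqrt(n) %
type1_percent <- 0.1 / sqrt(n)

# Assumption: expected number of erroneously flagged points in a clean sample of size n
expected_false_flags <- n * type1_percent / 100

data.frame(n, type1_percent, expected_false_flags)
```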

The box plot rule flags data points as outliers if they are below \(L\) or above \(U\) using the sample quantile \(q\):

\(L = q(0.25)-\alpha \times (q(0.75)- q(0.25))\)

\(U = q(0.75)+\alpha \times (q(0.75)- q(0.25))\)
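The two thresholds can be sketched in base R as follows. This is a minimal illustration of the classic rule with \(\alpha = 1.5\), not the package's own code: the helper name boxplot_thresholds is hypothetical, and it assumes R's default sample quantile (type 7), which may differ from the quantile definition used inside ctbi.

```r
# Classic box plot rule: flag points below L or above U.
boxplot_thresholds <- function(y, alpha = 1.5) {
  # Lower and upper quartiles (assumption: default quantile type 7)
  q <- quantile(y, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(L = q[1] - alpha * iqr, U = q[2] + alpha * iqr)
}

y <- c(1:20, 100)                        # illustrative data with one large value
th <- boxplot_thresholds(y)              # thresholds L and U
flagged <- y[y < th["L"] | y > th["U"]]  # points outside [L, U]
```

Logbox keeps this structure and only changes how \(\alpha\) is chosen.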

Logbox replaces the original \(\alpha = 1.5\) constant of the box plot rule with \(\alpha = A \times \log(n)+B+\frac{C}{n}\). Here \(n \geq 9\) is the sample size, \(C = 36\) corrects biases emerging in small samples, and \(A\) and \(B\) are calculated automatically from a predictor of the maximum tail weight, defined as \(m_{*} = \max(m_{-},m_{+})-0.6165\).

The two functions (\(m_{-}\),\(m_{+}\)) are defined as:

\(m_{-} = \frac{q(0.875)- q(0.625)}{q(0.75)- q(0.25)}\)

\(m_{+} = \frac{q(0.375)- q(0.125)}{q(0.75)- q(0.25)}\)

Finally, \(A = f_{A}(m_{*})\) and \(B = f_{B}(m_{*})\), with \(m_{*}\) restricted to \([0,2]\). The functions \((f_{A},f_{B})\) are defined as:

\(f_{A}(x) = 0.2294\exp(2.9416x-0.0512x^{2}-0.0684x^{3})\)

\(f_{B}(x) = 1.0585+15.6960x-17.3618x^{2}+28.3511x^{3}-11.4726x^{4}\)

Both functions have been calibrated on the Generalized Extreme Value and Pearson families.
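The formulas above can be combined into a short base R sketch. It is an illustration under stated assumptions rather than the package's implementation: the function name logbox_sketch is hypothetical, \(\log\) is taken as the natural logarithm (R's log()), the restriction of \(m_{*}\) to [0,2] is implemented by clamping, and quantiles use R's default type 7.

```r
# Sketch of the Logbox coefficients and thresholds (not the ctbi source code).
logbox_sketch <- function(y, C = 36) {
  y <- y[!is.na(y)]
  n <- length(y)
  stopifnot(n >= 9)                      # the rule is defined for n >= 9

  q <- quantile(y, probs = c(0.125, 0.25, 0.375, 0.625, 0.75, 0.875),
                names = FALSE)
  iqr <- q[5] - q[2]                     # q(0.75) - q(0.25)

  m_minus <- (q[6] - q[4]) / iqr         # m_- as written in the text
  m_plus  <- (q[3] - q[1]) / iqr         # m_+ as written in the text
  m_star  <- max(m_minus, m_plus) - 0.6165
  m_star  <- min(max(m_star, 0), 2)      # assumption: clamp m_* to [0, 2]

  A <- 0.2294 * exp(2.9416 * m_star - 0.0512 * m_star^2 - 0.0684 * m_star^3)
  B <- 1.0585 + 15.6960 * m_star - 17.3618 * m_star^2 +
    28.3511 * m_star^3 - 11.4726 * m_star^4

  alpha <- A * log(n) + B + C / n        # assumption: natural logarithm
  c(A = A, B = B, C = C, m.star = m_star, n = n, alpha = alpha,
    L = q[2] - alpha * iqr, U = q[5] + alpha * iqr)
}

res <- logbox_sketch(1:9)                # small illustrative sample
```

For a light-tailed sample such as 1:9, \(m_{*}\) is clamped to 0, so A and B reduce to the leading constants 0.2294 and 1.0585.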

Usage

ctbi.outlier(y, coeff.outlier = "auto")

Value

A list that contains:

xy, a two-column data frame that contains the clean data (first column) and the outliers (second column)

summary.outlier, a vector that contains A, B, C, \(m_{*}\), the size of the residuals (n), and the lower and upper outlier thresholds

Arguments

y: univariate data (numeric vector)
coeff.outlier: one of 'auto' (default value), 'gaussian', c(A,B,C), or NA. If coeff.outlier = 'auto', C = 36 and the coefficients A and B are calculated from \(m_{*}\). If coeff.outlier = 'gaussian', the coefficients are set to c(0.08,2,36), which is adapted to the Gaussian distribution. If coeff.outlier = NA, no outliers are flagged.
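To get a feel for a given coefficient vector, the base R sketch below evaluates \(\alpha = A \times \log(n)+B+\frac{C}{n}\) for the 'gaussian' coefficients c(0.08,2,36) at a few sample sizes. The helper alpha_from_coeff is hypothetical and only mirrors the formula in the Description; it assumes the natural logarithm and does not call the package.

```r
# Evaluate alpha = A*log(n) + B + C/n for a coefficient vector c(A, B, C).
alpha_from_coeff <- function(n, coeff = c(0.08, 2, 36)) {
  stopifnot(length(coeff) == 3, all(n >= 9))
  coeff[1] * log(n) + coeff[2] + coeff[3] / n   # assumption: natural logarithm
}

n <- c(9, 100, 1000)
alpha_gaussian <- alpha_from_coeff(n)           # one alpha per sample size
```

The C/n term dominates for small samples, so \(\alpha\) is largest at n = 9 and settles close to B plus a slowly growing logarithmic term as n increases.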

Examples

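library(ctbi)
x <- runif(30)                  # 30 uniform values in [0, 1]
x[c(5,10,20)] <- c(-10,15,30)   # inject three obvious outliers
example1 <- ctbi.outlier(x)     # default: coeff.outlier = "auto"
example1$summary.outlier        # A, B, C, m*, n, and the two thresholds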
