rosnerTest: Rosner's Test for Outliers

Description

Perform Rosner's test for up to $k$ potential outliers in a dataset, assuming that the data without any outliers come from a normal (Gaussian) distribution.

Usage

rosnerTest(x, k = 3, alpha = 0.05, warn = TRUE)

Arguments

x

numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. There must be at least 10 non-missing, finite observations in x.

k

positive integer indicating the number of suspected outliers. The argument k must be between 1 and $n-2$, where $n$ denotes the number of non-missing, finite values in the argument x. The default value is k=3.

alpha

numeric scalar between 0 and 1 indicating the Type I error associated with the test of hypothesis. The default value is alpha=0.05.

warn

logical scalar indicating whether to issue a warning (warn=TRUE; the default) when the number of non-missing, finite values in x is less than 25. See the DETAILS section below.
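
The argument rules above can be sketched in base R as follows. This is only an illustration of the documented constraints, not the package's own code: the helper name check_rosner_input and its messages are hypothetical, and the warning condition (n < 25 and k > 1) is taken from the Details section.

```r
# Hypothetical helper illustrating the documented argument checks.
check_rosner_input <- function(x, k = 3, alpha = 0.05, warn = TRUE) {
  # is.finite() is FALSE for NA, NaN, Inf and -Inf, so this drops all of them
  x <- x[is.finite(x)]
  n <- length(x)

  if (n < 10)
    stop("need at least 10 non-missing, finite observations")
  if (length(k) != 1 || k != round(k) || k < 1 || k > n - 2)
    stop("'k' must be an integer between 1 and n - 2")
  if (alpha <= 0 || alpha >= 1)
    stop("'alpha' must be between 0 and 1")

  # Assumed warning condition, following the Details section
  if (warn && n < 25 && k > 1)
    warning("n < 25 and k > 1: the true Type I error may exceed 'alpha'")

  list(x = x, n = n, k = as.integer(k))
}

# Made-up input: 14 values, 4 of which are not finite
raw <- c(1.2, NA, 3.4, Inf, 2.2, NaN, 2.9, 3.1, 2.4, 2.7, 3.3, 2.8, -Inf, 3.0)
chk <- check_rosner_input(raw, k = 1)
chk$n   # 10 finite observations remain
```

With k = 1 no warning is raised even though n is below 25, because the warning in this sketch applies only when more than one outlier is suspected.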

Value

A list of class "gofOutlier" containing the results of the hypothesis test. See the help file for gofOutlier.object for details.

Details

Let $x_1, x_2, \ldots, x_n$ denote the $n$ observations. We assume that $n-k$ of these observations come from the same normal (Gaussian) distribution, and that the $k$ most extreme observations may or may not come from a different distribution. Let $x^{*}_1, x^{*}_2, \ldots, x^{*}_{n-i}$ denote the $n-i$ observations left after omitting the $i$ most extreme observations, where $i = 0, 1, \ldots, k-1$. Let $\bar{x}^{(i)}$ and $s^{(i)}$ denote the mean and standard deviation, respectively, of these $n-i$ remaining observations. Thus, $\bar{x}^{(0)}$ and $s^{(0)}$ denote the mean and standard deviation for the full sample, and in general

$$\bar{x}^{(i)} = \frac{1}{n-i}\sum_{j=1}^{n-i} x^{*}_j \;\;\;\;\;\; (1)$$

$$s^{(i)} = \sqrt{\frac{1}{n-i-1} \sum_{j=1}^{n-i} (x^{*}_j - \bar{x}^{(i)})^2} \;\;\;\;\;\; (2)$$

For a specified value of $i$, the most extreme observation $x^{(i)}$ is the one that lies farthest from the mean of that reduced data set, i.e.,

$$x^{(i)} = x^{*}_{j^{*}}, \quad j^{*} = \arg\max_{j=1,2,\ldots,n-i} |x^{*}_j - \bar{x}^{(i)}| \;\;\;\;\;\; (3)$$

Thus, an extreme observation may be either the smallest or the largest one in that data set.

Rosner's test is based on the $k$ statistics $R_1, R_2, \ldots, R_k$, which are the extreme Studentized deviates computed from successively reduced samples of size $n, n-1, \ldots, n-k+1$:

$$R_{i+1} = \frac{|x^{(i)} - \bar{x}^{(i)}|}{s^{(i)}} \;\;\;\;\;\; (4)$$

Critical values for $R_{i+1}$ are denoted $\lambda_{i+1}$ and are computed as:

$$\lambda_{i+1} = \frac{t_{p, n-i-2} (n-i-1)}{\sqrt{(n-i-2 + t_{p, n-i-2}^2) (n-i)}} \;\;\;\;\;\; (5)$$

where $t_{p, \nu}$ denotes the $p$'th quantile of Student's t-distribution with $\nu$ degrees of freedom, and in this case

$$p = 1 - \frac{\alpha/2}{n - i} \;\;\;\;\;\; (6)$$

where $\alpha$ denotes the Type I error level.

The algorithm for determining the number of outliers is as follows:

Compare $R_k$ with $\lambda_k$. If $R_k > \lambda_k$ then conclude the $k$ most extreme values are outliers.
If $R_k \le \lambda_k$ then compare $R_{k-1}$ with $\lambda_{k-1}$. If $R_{k-1} > \lambda_{k-1}$ then conclude the $k-1$ most extreme values are outliers.
Continue in this fashion until a certain number of outliers have been identified or Rosner's test finds no outliers at all.
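
The procedure described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the source of rosnerTest: the function name rosner_sketch and its return structure are hypothetical, and the critical value uses the squared t quantile under the square root, as in Rosner (1983). The data are the 25 naphthalene measurements printed in the Examples section, typed in by hand in long-format order.

```r
# Hypothetical helper: a sketch of Rosner's generalized ESD procedure.
rosner_sketch <- function(x, k = 3, alpha = 0.05) {
  x <- x[is.finite(x)]                  # drop NA, NaN, Inf, -Inf
  n <- length(x)
  stopifnot(n >= 10, k >= 1, k <= n - 2)

  obs.num <- seq_len(n)                 # positions within the cleaned vector
  Mean.i <- SD.i <- Value <- R <- lambda <- numeric(k)
  Obs.Num <- integer(k)

  for (i in 0:(k - 1)) {
    m <- mean(x)                        # equation (1)
    s <- sd(x)                          # equation (2)
    j <- which.max(abs(x - m))          # equation (3): most extreme observation
    p  <- 1 - (alpha / 2) / (n - i)     # equation (6)
    tq <- qt(p, df = n - i - 2)

    Mean.i[i + 1]  <- m
    SD.i[i + 1]    <- s
    Value[i + 1]   <- x[j]
    Obs.Num[i + 1] <- obs.num[j]
    R[i + 1]       <- abs(x[j] - m) / s                      # equation (4)
    lambda[i + 1]  <- tq * (n - i - 1) /
      sqrt((n - i - 2 + tq^2) * (n - i))                     # equation (5)

    x <- x[-j]                          # remove it and repeat on the rest
    obs.num <- obs.num[-j]
  }

  # Decision rule: the largest index whose statistic exceeds its critical
  # value gives the number of outliers (0 if no statistic exceeds).
  exceed <- which(R > lambda)
  n.out  <- if (length(exceed) > 0) max(exceed) else 0L

  stats <- data.frame(i = 0:(k - 1), Mean.i, SD.i, Value, Obs.Num,
                      R, lambda, Outlier = seq_len(k) <= n.out)
  list(n.outliers = n.out, stats = stats)
}

# Naphthalene data (ppb), wells BW.1 to BW.5, quarters 1 to 5 within each well
naph <- c(3.34, 5.39, 5.74, 6.88, 5.85,
          5.59, 5.96, 1.47, 2.57, 5.39,
          1.91, 1.74, 23.23, 1.82, 2.02,
          6.12, 6.05, 5.18, 4.43, 1.00,
          8.64, 5.34, 5.53, 4.42, 35.45)

res <- rosner_sketch(naph, k = 2)
res$n.outliers   # 2, as in the naphthalene example output
res$stats
```

Because the decision rule works backward from $R_k$, an observation can be declared an outlier even when its own statistic falls below its critical value, as long as a later statistic exceeds its critical value. The first example in the Examples section shows this for $R_1$.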

Rosner (1983) shows that the true Type I error is larger than assumed for the case when $n < 25$ and $k > 1$. When this is the case and warn=TRUE, a warning is issued.
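
One rough way to look at this behavior is a small Monte Carlo experiment: draw many outlier-free normal samples with $n < 25$, apply the procedure with $k > 1$, and record how often at least one outlier is flagged. The sketch below is an illustration only. The helper n_flagged, the sample size (15), the value of k (3), the number of replicates, and the seed are all arbitrary assumptions, and the estimate it produces is subject to simulation error.

```r
# Hypothetical helper: number of outliers flagged by the generalized ESD rule.
n_flagged <- function(x, k, alpha = 0.05) {
  n <- length(x)
  exceed <- logical(k)
  for (i in 0:(k - 1)) {
    dev <- abs(x - mean(x))
    j   <- which.max(dev)                       # most extreme observation
    tq  <- qt(1 - (alpha / 2) / (n - i), df = n - i - 2)
    lam <- tq * (n - i - 1) / sqrt((n - i - 2 + tq^2) * (n - i))
    exceed[i + 1] <- dev[j] / sd(x) > lam       # is R_{i+1} > lambda_{i+1}?
    x <- x[-j]
  }
  if (any(exceed)) max(which(exceed)) else 0L
}

set.seed(1)      # arbitrary seed, only for reproducibility
n.sim <- 2000    # number of simulated outlier-free samples (assumption)

# Every sample is pure N(0, 1), so any flagged outlier is a false positive
flagged <- replicate(n.sim, n_flagged(rnorm(15), k = 3))
rate <- mean(flagged > 0)   # estimated Type I error, to compare with alpha
rate
```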

References

Barnett, V., and T. Lewis. (1995). Outliers in Statistical Data. Third Edition. John Wiley & Sons, Chichester, UK, pp. 235--236.

Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY, pp. 188--191.

McBean, E.A., and F.A. Rovers. (1992). Estimation of the Probability of Exceedance of Contaminant Concentrations. Ground Water Monitoring Review Winter, pp. 115--119.

McNutt, M. (2014). Raising the Bar. Science 345(6192), p. 9.

Rosner, B. (1983). Percentage Points for a Generalized ESD Many-Outlier Procedure. Technometrics 25, 165--172.

USEPA. (2006). Data Quality Assessment: A Reviewer's Guide. EPA QA/G-9R. EPA/240/B-06/002, February 2006. Office of Environmental Information, U.S. Environmental Protection Agency, Washington, D.C.

USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C., pp. 12-10 to 12-14.

USEPA. (2013). ProUCL Version 5.0.00 Technical Guide. EPA/600/R-07/041, September 2013. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C., pp. 190--195.

Examples

  library(EnvStats)  # provides rosnerTest(), qqPlot(), longToWide() and the example data

  # Combine 30 observations from a normal distribution with mean 3 and 
  # standard deviation 2, with 3 observations from a normal distribution 
  # with mean 10 and standard deviation 1, then run Rosner's Test on these 
  # data, specifying k=4 potential outliers based on looking at the 
  # normal Q-Q plot. 
  # (Note: the call to set.seed simply allows you to reproduce 
  # this example.)

  set.seed(250) 

  dat <- c(rnorm(30, mean = 3, sd = 2), rnorm(3, mean = 10, sd = 1)) 

  dev.new()
  qqPlot(dat)

  rosnerTest(dat, k = 4)

  #Results of Outlier Test
  #-------------------------
  #
  #Test Method:                     Rosner's Test for Outliers
  #
  #Hypothesized Distribution:       Normal
  #
  #Data:                            dat
  #
  #Sample Size:                     33
  #
  #Test Statistics:                 R.1 = 2.848514
  #                                 R.2 = 3.086875
  #                                 R.3 = 3.033044
  #                                 R.4 = 2.380235
  #
  #Test Statistic Parameter:        k = 4
  #
  #Alternative Hypothesis:          Up to 4 observations are not
  #                                 from the same Distribution.
  #
  #Type I Error:                    5%
  #
  #Number of Outliers Detected:     3
  #
  #  i   Mean.i     SD.i      Value Obs.Num    R.i+1 lambda.i+1 Outlier
  #1 0 3.549744 2.531011 10.7593656      33 2.848514   2.951949    TRUE
  #2 1 3.324444 2.209872 10.1460427      31 3.086875   2.938048    TRUE
  #3 2 3.104392 1.856109  8.7340527      32 3.033044   2.923571    TRUE
  #4 3 2.916737 1.560335 -0.7972275      25 2.380235   2.908473   FALSE

  #----------
  # Clean up

  rm(dat)
  graphics.off()

  #--------------------------------------------------------------------

  # Example 12-4 of USEPA (2009, page 12-12) gives an example of 
  # using Rosner's test to test for outliers in naphthalene measurements (ppb)
  # taken at 5 background wells over 5 quarters.  The data for this example 
  # are stored in EPA.09.Ex.12.4.naphthalene.df.

  EPA.09.Ex.12.4.naphthalene.df
  #   Quarter Well Naphthalene.ppb
  #1        1 BW.1            3.34
  #2        2 BW.1            5.39
  #3        3 BW.1            5.74
  # ...
  #23       3 BW.5            5.53
  #24       4 BW.5            4.42
  #25       5 BW.5           35.45

  longToWide(EPA.09.Ex.12.4.naphthalene.df, "Naphthalene.ppb", "Quarter", "Well", 
    paste.row.name = TRUE)
  #          BW.1 BW.2  BW.3 BW.4  BW.5
  #Quarter.1 3.34 5.59  1.91 6.12  8.64
  #Quarter.2 5.39 5.96  1.74 6.05  5.34
  #Quarter.3 5.74 1.47 23.23 5.18  5.53
  #Quarter.4 6.88 2.57  1.82 4.43  4.42
  #Quarter.5 5.85 5.39  2.02 1.00 35.45


  # Look at Q-Q plots for both the raw and log-transformed data
  #------------------------------------------------------------

  dev.new()
  with(EPA.09.Ex.12.4.naphthalene.df, 
    qqPlot(Naphthalene.ppb, add.line = TRUE, 
      main = "Figure 12-6.  Naphthalene Probability Plot"))

  dev.new()
  with(EPA.09.Ex.12.4.naphthalene.df, 
    qqPlot(Naphthalene.ppb, dist = "lnorm", add.line = TRUE, 
      main = "Figure 12-7.  Log Naphthalene Probability Plot"))


  # Test for 2 potential outliers on the original scale:
  #-----------------------------------------------------

  with(EPA.09.Ex.12.4.naphthalene.df, rosnerTest(Naphthalene.ppb, k = 2))

  #Results of Outlier Test
  #-------------------------
  #
  #Test Method:                     Rosner's Test for Outliers
  #
  #Hypothesized Distribution:       Normal
  #
  #Data:                            Naphthalene.ppb
  #
  #Sample Size:                     25
  #
  #Test Statistics:                 R.1 = 3.930957
  #                                 R.2 = 4.160223
  #
  #Test Statistic Parameter:        k = 2
  #
  #Alternative Hypothesis:          Up to 2 observations are not
  #                                 from the same Distribution.
  #
  #Type I Error:                    5%
  #
  #Number of Outliers Detected:     2
  #
  #  i  Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
  #1 0 6.44240 7.379271 35.45      25 3.930957   2.821681    TRUE
  #2 1 5.23375 4.325790 23.23      13 4.160223   2.801551    TRUE

  #----------
  # Clean up

  graphics.off()
