predIntNparSimultaneousTestPower: Probability That at Least One Set of Future Observations Violates the Given Rule Based on a Nonparametric Simultaneous Prediction Interval

Description

Compute the probability that at least one set of future observations violates the given rule based on a nonparametric simultaneous prediction interval for the next $r$ future sampling occasions. The three possible rules are: $k$-of-$m$, California, or Modified California. The probability is based on assuming the true distribution of the observations is normal.

Usage

predIntNparSimultaneousTestPower(n, n.median = 1, k = 1, m = 2, r = 1, 
    rule = "k.of.m", lpl.rank = ifelse(pi.type == "upper", 0, 1), 
    n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), 
    delta.over.sigma = 0, pi.type = "upper", r.shifted = r,  
    method = "approx", NMC = 100, ci = FALSE, ci.conf.level = 0.95, 
    integrate.args.list = NULL)

Arguments

vector of positive integers specifying the sample sizes. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are not allowed.

n.median

vector of positive odd integers specifying the sample size associated with the future medians. The default value is n.median=1 (i.e., individual observations). Note that all future medians must be based on the same sample size.

for the $k$-of-$m$ rule (rule="k.of.m"), a vector of positive integers specifying the minimum number of observations (or medians) out of $m$ observations (or medians) (all obtained on one future sampling occassion)

vector of positive integers specifying the maximum number of future observations (or medians) on one future sampling occasion. The default value is m=2, except when rule="Modified.CA", in which case

vector of positive integers specifying the number of future sampling occasions. The default value is r=1.

rule

character string specifying which rule to use. The possible values are "k.of.m" ($k$-of-$m$ rule; the default), "CA" (California rule), and "Modified.CA" (modified California rule).

lpl.rank

vector of non-negative integers indicating the rank of the order statistic to use for the lower bound of the prediction interval. When pi.type="lower", the default value is lpl.rank=1 (implying the minimum value of

n.plus.one.minus.upl.rank

vector of non-negative integers related to the rank of the order statistic to use for the upper bound of the prediction interval. A value of n.plus.one.minus.upl.rank=1 means use the first largest value, and in general a value of

delta.over.sigma

numeric vector indicating the ratio $\Delta/\sigma$. The quantity $\Delta$ (delta) denotes the difference between the mean of the population that was sampled to construct the prediction interval, and the mean of the population that will be sa

pi.type

character string indicating what kind of prediction interval to compute. The possible values are "two.sided" (the default), "lower", and "upper".

r.shifted

vector of positive integers specifying the number of future sampling occasions for which the scaled mean is shifted by $\Delta/\sigma$. All values must be integeters between 1 and the corresponding element of r. The

method

character string indicating what method to use to compute the power. The possible values are "approx" (approximation based on predIntNormSimultaneousTestPower;

NMC

positive integer indicating the number of Monte Carlo trials to run when method="simulate". The default value is NMC=100.

logical scalar indicating whether to compute a confidence interval for the power when method="simulate". The default value is ci=FALSE.

ci.conf.level

numeric scalar between 0 and 1 indicating the confidence level associated with the confidence interval for the power. The argument is ignored if ci=FALSE or method="approx".

integrate.args.list

list of arguments to supply to the integrate function. The default value is NULL.

Value

vector of values between 0 and 1 equal to the probability that the rule will be violated.

Details

What is a Nonparametric Simultaneous Prediction Interval? A nonparametric prediction interval for some population is an interval on the real line constructed so that it will contain at least $k$ of $m$ future observations from that population with some specified probability $(1-\alpha)100%$, where $0 < \alpha < 1$ and $k$ and $m$ are some pre-specified positive integers and $k \le m$. The quantity $(1-\alpha)100%$ is called the confidence coefficient or confidence level associated with the prediction interval. The function predIntNpar computes a standard nonparametric prediction interval. The function predIntNparSimultaneous computes a nonparametric simultaneous prediction interval that will contain a certain number of future observations with probability $(1-\alpha)100%$ for each of $r$ future sampling occasions, where $r$ is some pre-specified positive integer. The quantity $r$ may refer to $r$ distinct future sampling occasions in time, or it may for example refer to sampling at $r$ distinct locations on one future sampling occasion, assuming that the population standard deviation is the same at all of the $r$ distinct locations. The function predIntNparSimultaneous computes a nonparametric simultaneous prediction interval based on one of three possible rules:

For the$k$-of-$m$rule (rule="k.of.m"), at least$k$of the next$m$future observations will fall in the prediction interval with probability$(1-\alpha)100%$on each of the$r$future sampling occasions. If obserations are being taken sequentially, for a particular sampling occasion, up to$m$observations may be taken, but once$k$of the observations fall within the prediction interval, sampling can stop. Note: For this rule, when$r=1$, the results ofpredIntNparSimultaneousare equivalent to the results ofpredIntNpar.
For the California rule (rule="CA"), with probability$(1-\alpha)100%$, for each of the$r$future sampling occasions, either the first observation will fall in the prediction interval, or else all of the next$m-1$observations will fall in the prediction interval. That is, if the first observation falls in the prediction interval then sampling can stop. Otherwise,$m-1$more observations must be taken.
For the Modified California rule (rule="Modified.CA"), with probability$(1-\alpha)100%$, for each of the$r$future sampling occasions, either the first observation will fall in the prediction interval, or else at least 2 out of the next 3 observations will fall in the prediction interval. That is, if the first observation falls in the prediction interval then sampling can stop. Otherwise, up to 3 more observations must be taken.

Nonparametric simultaneous prediction intervals can be extended to using medians in place of single observations (USEPA, 2009, Chapter 19). That is, you can create a nonparametric simultaneous prediction interval that will contain a specified number of medians (based on which rule you choose) on each of $r$ future sampling occassions, where each each median is based on $b$ individual observations. For the function predIntNparSimultaneous, the argument n.median corresponds to $b$. The Form of a Nonparametric Prediction Interval Let $\underline{x} = x_1, x_2, \ldots, x_n$ denote a vector of $n$ independent observations from some continuous distribution, and let $x_{(i)}$ denote the the $i$'th order statistics in $\underline{x}$. A two-sided nonparametric prediction interval is constructed as: $$[x_{(u)}, x_{(v)}] \;\;\;\;\;\; (1)$$ where $u$ and $v$ are positive integers between 1 and $n$, and $u < v$. That is, $u$ denotes the rank of the lower prediction limit, and $v$ denotes the rank of the upper prediction limit. To make it easier to write some equations later on, we can also write the prediction interval (1) in a slightly different way as: $$[x_{(u)}, x_{(n + 1 - w)}] \;\;\;\;\;\; (2)$$ where $$w = n + 1 - v \;\;\;\;\;\; (3)$$ so that $w$ is a positive integer between 1 and $n-1$, and $u < n+1-w$. In terms of the arguments to the function predIntNparSimultaneous, the argument lpl.rank corresponds to $u$, and the argument n.plus.one.minus.upl.rank corresponds to $w$. If we allow $u=0$ and $w=0$ and define lower and upper bounds as: $$x_{(0)} = lb \;\;\;\;\;\; (4)$$ $$x_{(n+1)} = ub \;\;\;\;\;\; (5)$$ then Equation (2) above can also represent a one-sided lower or one-sided upper prediction interval as well. That is, a one-sided lower nonparametric prediction interval is constructed as: $$[x_{(u)}, x_{(n + 1)}] = [x_{(u)}, ub] \;\;\;\;\;\; (6)$$ and a one-sided upper nonparametric prediction interval is constructed as: $$[x_{(0)}, x_{(n + 1 - w)}] = [lb, x_{(n + 1 - w)}] \;\;\;\;\;\; (7)$$ Usually, $lb = -\infty$ or $lb = 0$ and $ub = \infty$. Note: For nonparametric simultaneous prediction intervals, only lower (pi.type="lower") and upper (pi.type="upper") prediction intervals are available. Computing Power The "power" of the prediction interval is defined as the probability that at least one set of future observations violates the given rule based on a simultaneous prediction interval for the next $r$ future sampling occasions, where the population for the future observations is allowed to differ from the population for the observations used to construct the prediction interval. For the function predIntNparSimultaneousTestPower, power is computed assuming both the background and future the observations come from normal distributions with the same standard deviation, but the means of the distributions are allowed to differ. The quantity $\Delta$ (upper case delta) denotes the difference between the mean of the population that was sampled to construct the prediction interval, and the mean of the population that will be sampled to produce the future observations. The quantity $\sigma$ (sigma) denotes the population standard deviation of both of these populations. The argument delta.over.sigma corresponds to the quantity $\Delta/\sigma$. Approximate Power (method="approx") Based on Gansecki (2009), the power of a nonparametric simultaneous prediction interval when the underlying observations come from a nomral distribution can be approximated by the power of a normal simultaneous prediction interval (see predIntNormSimultaneousTestPower) where the multiplier $K$ is replaced with the expected value of the normal order statistic that corresponds to the rank of the order statistic used for the upper or lower bound of the prediction interval. Gansecki (2009) uses the approximation: $$K = \Phi^{-1}(\frac{i - 0.5}{n}) \;\;\;\;\;\; (8)$$ where $\Phi$ denotes the cumulative distribution function of the standard normal distribution and $i$ denotes the rank of the order statistic used as the prediction limit. The function predIntNparSimultaneousTestPower uses the exact value of the expected value of the normal order statistic by calling the function evNormOrdStatsScalar. Power Based on Monte Carlo Simulation (method="simulate") When method="simulate", the power of the nonparametric simultaneous prediction interval is estimated based on a Monte Carlo simulation. The argument NMC determines the number of Monte Carlo trials. If ci=TRUE, a confidence interval for the power is created based on the NMC Monte Carlo estimates of power.

References

See the help file for predIntNparSimultaneous. Gansecki, M. (2009). Using the Optimal Rank Values Calculator. US Environmental Protection Agency, Region 8, March 10, 2009.

Examples

Run this code

# Example 19-5 of USEPA (2009, p. 19-33) shows how to compute nonparametric upper 
  # simultaneous prediction limits for various rules based on trace mercury data (ppb) 
  # collected in the past year from a site with four background wells and 10 compliance 
  # wells (data for two of the compliance wells  are shown in the guidance document).  
  # The facility must monitor the 10 compliance wells for five constituents 
  # (including mercury) annually.
  
  # Here we will compute the confidence levels and powers associated with  
  # two different sampling plans: 
  # 1) the 1-of-2 retesting plan for a median of order 3 using the 
  #    background maximum and 
  # 2) the 1-of-4 plan on individual observations using the 3rd highest 
  #    background value.
  # Power will be computed assuming a normal distribution and setting 
  # delta.over.sigma equal to 2, 3, and 4.
  # The data for this example are stored in EPA.09.Ex.19.5.mercury.df.

  # We will pool data from 4 background wells that were sampled on 
  # a number of different occasions, giving us a sample size of 
  # n = 20 to use to construct the prediction limit.

  # There are 10 compliance wells and we will monitor 5 different 
  # constituents at each well annually.  For this example, USEPA (2009) 
  # recommends setting r to the product of the number of compliance wells and 
  # the number of evaluations per year.  

  # To determine the minimum confidence level we require for 
  # the simultaneous prediction interval, USEPA (2009) recommends 
  # setting the maximum allowed individual Type I Error level per constituent to:
 
  # 1 - (1 - SWFPR)^(1 / Number of Constituents)
  
  # which translates to setting the confidence limit to 

  # (1 - SWFPR)^(1 / Number of Constituents)

  # where SWFPR = site-wide false positive rate.  For this example, we 
  # will set SWFPR = 0.1.  Thus, the required individual Type I Error level 
  # and confidence level per constituent are given as follows:

  # n  = 20 based on 4 Background Wells
  # nw = 10 Compliance Wells
  # nc =  5 Constituents
  # ne =  1 Evaluation per year

  n  <- 20
  nw <- 10
  nc <-  5
  ne <-  1

  # Set number of future sampling occasions r to 
  # Number Compliance Wells x Number Evaluations per Year
  r  <-  nw * ne

  conf.level <- (1 - 0.1)^(1 / nc)
  conf.level
  #[1] 0.9791484

  # So the required confidence level is 0.98, or 98%.
  # Now determine the confidence level associated with each plan.
  # Note that both plans achieve the required confidence level.
 
  # 1) the 1-of-2 retesting plan for a median of order 3 using the 
  #    background maximum

  predIntNparSimultaneousConfLevel(n = 20, n.median = 3, k = 1, m = 2, r = r)
  #[1] 0.9940354


  # 2) the 1-of-4 plan based on individual observations using the 3rd highest 
  #    background value.

  predIntNparSimultaneousConfLevel(n = 20, k = 1, m = 4, r = r, 
    n.plus.one.minus.upl.rank = 3)
  #[1] 0.9864909

  #------------------------------------------------------------------------------
  # Compute approximate power of each plan to detect contamination at just 1 well 
  # assuming true underying distribution of Hg is Normal at all wells and 
  # using delta.over.sigma equal to 2, 3, and 4.
  #------------------------------------------------------------------------------

  # Computer aproximate power for 
  # 1) the 1-of-2 retesting plan for a median of order 3 using the 
  #    background maximum

  predIntNparSimultaneousTestPower(n = 20, n.median = 3, k = 1, m = 2, r = r, 
    delta.over.sigma = 2:4, r.shifted = 1)
  #[1] 0.3953712 0.9129671 0.9983054


  # Compute approximate power for
  # 2) the 1-of-4 plan based on individual observations using the 3rd highest 
  #    background value.

  predIntNparSimultaneousTestPower(n = 20, k = 1, m = 4, r = r, 
    n.plus.one.minus.upl.rank = 3, delta.over.sigma = 2:4, r.shifted = 1)
  #[1] 0.4367972 0.8694664 0.9888779


  #----------

  # Compare estimated power using approximation method with estimated power
  # using Monte Carlo simulation for the 1-of-4 plan based on individual 
  # observations using the 3rd highest background value.

  predIntNparSimultaneousTestPower(n = 20, k = 1, m = 4, r = r, 
    n.plus.one.minus.upl.rank = 3, delta.over.sigma = 2:4, r.shifted = 1, 
    method = "simulate", ci = TRUE, NMC = 1000)
  #[1] 0.437 0.863 0.989
  #attr(,"conf.int")
  #         [,1]      [,2]      [,3]
  #LCL 0.4111999 0.8451148 0.9835747
  #UCL 0.4628001 0.8808852 0.9944253
 
  #==========

  # Cleanup
  #--------
  rm(n, nw, nc, ne, r, conf.level)

Run the code above in your browser using DataLab