distChoose: Choose Best Fitting Distribution Based on Goodness-of-Fit Tests

Description

Perform a series of goodness-of-fit tests from a (possibly user-specified) set of candidate probability distributions to determine which probability distribution provides the best fit for a data set.

Usage

distChoose(y, ...)
# S3 method for formula
distChoose(y, data = NULL, subset,
  na.action = na.pass, ...)
# S3 method for default
distChoose(y, alpha = 0.05, method = "sw",
    choices = c("norm", "gamma", "lnorm"), est.arg.list = NULL,
    warn = TRUE, keep.data = TRUE, data.name = NULL,
    parent.of.data = NULL, subset.expression = NULL, ...)

Value

a list of class "distChoose" containing the results of the goodness-of-fit tests. Objects of class "distChoose" have a special printing method. See the help files for distChoose.object for details.

Arguments

y

an object containing data for the goodness-of-fit tests. In the default method, the argument y must be numeric vector of observations. In the formula method, y must be a formula of the form y ~ 1. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed.

data

specifies an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which distChoose is called.

subset

specifies an optional vector specifying a subset of observations to be used.

na.action

specifies a function which indicates what should happen when the data contain NAs. The default is na.pass.

alpha

numeric scalar between 0 and 1 specifying the Type I error associated with each goodness-of-fit test. When method="proucl" the only allowed values for alpha are 0.01, 0.05, and 0.1. The default value is alpha=0.05.

method

character string defining which method to use. Possible values are:

"sw". Shapiro-Wilk; the default.
"sf". Shapiro-Francia.
"ppcc". Probability Plot Correlation Coefficient.
"proucl". ProUCL.

See the DETAILS section below.

choices

a character vector denoting the distribution abbreviations of the candidate distributions. See the help file for Distribution.df for a list of distributions and their abbreviations. The default value is choices=c("norm", "gamma", "lnorm"), indicating the Normal, Gamma, and Lognormal distributions.

This argument is ignored when method="proucl".

est.arg.list

a list containing one or more lists of arguments to be passed to the function(s) estimating the distribution parameters. The name(s) of the components of the list must be equal to or a subset of the values of the argument choices. For example, if choices=c("norm", "gamma", "lnorm"), setting
est.arg.list=list(gamma = list(method="bcmle")) indicates using the bias-corrected maximum-likelihood estimators of shape and scale for the gamma distribution (see the help file for egamma). See the help file Estimating Distribution Parameters for a list of estimating functions. The default value is est.arg.list=NULL so that all default values for the estimating functions are used.

When testing for some form of normality (i.e., Normal, Lognormal, Three-Parameter Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta)), the estimated parameters are provided in the output merely for information, and the choice of the method of estimation has no effect on the goodness-of-fit test statistics or p-values.

This argument is ignored when method="proucl".

warn

logical scalar indicating whether to print a warning message when observations with NAs, NaNs, or Infs in y are removed. The default value is TRUE.

keep.data

logical scalar indicating whether to return the original data. The default value is keep.data=TRUE.

data.name

optional character string indicating the name of the data used for argument y.

parent.of.data

character string indicating the source of the data used for the goodness-of-fit test.

subset.expression

character string indicating the expression used to subset the data.

...

additional arguments affecting the goodness-of-fit test.

Author

Steven P. Millard (EnvStats@ProbStatInfo.com)

Details

The function distChoose returns a list with information on the goodness-of-fit tests for various distributions and which distribution appears to best fit the data based on the p-values from the goodness-of-fit tests. This function was written in order to compare ProUCL's way of choosing the best-fitting distribution (USEPA, 2015) with other ways of choosing the best-fitting distribution.

Method Based on Shapiro-Wilk, Shapiro-Francia, or Probability Plot Correlation Test
(method="sw", method="sf", or method="ppcc")

For each value of the argument choices, the function distChoose runs the goodness-of-fit test using the data in y assuming that particular distribution. For example, if
choices=c("norm", "gamma", "lnorm"), indicating the Normal, Gamma, and Lognormal distributions, and method="sw", then the usual Shapiro-Wilk test is performed for the Normal and Lognormal distributions, and the extension of the Shapiro-Wilk test is performed for the Gamma distribution (see the section Testing Goodness-of-Fit for Any Continuous Distribution in the help file for gofTest for an explanation of the latter). The distribution associated with the largest p-value is the chosen distribution. In the case when all p-values are less than the value of the argument alpha, the distribution “Nonparametric” is chosen.

Method Based on ProUCL Algorithm (method="proucl")

When method="proucl", the function distChoose uses the algorithm that ProUCL (USEPA, 2015) uses to determine the best fitting distribution. The candidate distributions are the Normal, Gamma, and Lognormal distributions. The algorithm used by ProUCL is as follows:

Perform the Shapiro-Wilk and Lilliefors goodness-of-fit tests for the Normal distribution, i.e., call the function gofTest with distribution = "norm", test="sw" and
distribution = "norm", test="lillie". If either or both of the associated p-values are greater than or equal to the user-supplied value of alpha, then choose the Normal distribution. Otherwise, proceed to the next step.
Perform the “ProUCL Anderson-Darling” and “ProUCL Kolmogorov-Smirnov” goodness-of-fit tests for the Gamma distribution, i.e., call the function gofTest with
distribution="gamma", test="proucl.ad.gamma" and
distribution="gamma", test="proucl.ks.gamma". If either or both of the associated p-values are greater than or equal to the user-supplied value of alpha, then choose the Gamma distribution. Otherwise, proceed to the next step.
Perform the Shapiro-Wilk and Lilliefors goodness-of-fit tests for the Lognormal distribution, i.e., call the function gofTest with distribution="lnorm", test="sw" and
distribution="lnorm", test="lillie". If either or both of the associated p-values are greater than or equal to the user-supplied value of alpha, then choose the Lognormal distribution. Otherwise, proceed to the next step.
If none of the goodness-of-fit tests above yields a p-value greater than or equal to the user-supplied value of alpha, then choose the “Nonparametric” distribution.

References

Birnbaum, Z.W., and F.H. Tingey. (1951). One-Sided Confidence Contours for Probability Distribution Functions. Annals of Mathematical Statistics 22, 592-596.

Blom, G. (1958). Statistical Estimates and Transformed Beta Variables. John Wiley and Sons, New York.

Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.

Dallal, G.E., and L. Wilkinson. (1986). An Analytic Approximation to the Distribution of Lilliefor's Test for Normality. The American Statistician 40, 294-296.

D'Agostino, R.B. (1970). Transformation to Normality of the Null Distribution of \(g1\). Biometrika 57, 679-681.

D'Agostino, R.B. (1971). An Omnibus Test of Normality for Moderate and Large Size Samples. Biometrika 58, 341-348.

D'Agostino, R.B. (1986b). Tests for the Normal Distribution. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York.

D'Agostino, R.B., and E.S. Pearson (1973). Tests for Departures from Normality. Empirical Results for the Distributions of \(b2\) and \(\sqrt{b1}\). Biometrika 60(3), 613-622.

D'Agostino, R.B., and G.L. Tietjen (1973). Approaches to the Null Distribution of \(\sqrt{b1}\). Biometrika 60(1), 169-173.

Fisher, R.A. (1950). Statistical Methods for Research Workers. 11'th Edition. Hafner Publishing Company, New York, pp.99-100.

Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.

Kendall, M.G., and A. Stuart. (1991). The Advanced Theory of Statistics, Volume 2: Inference and Relationship. Fifth Edition. Oxford University Press, New York.

Kim, P.J., and R.I. Jennrich. (1973). Tables of the Exact Sampling Distribution of the Two Sample Kolmogorov-Smirnov Criterion. In Harter, H.L., and D.B. Owen, eds. Selected Tables in Mathematical Statistics, Vol. 1. American Mathematical Society, Providence, Rhode Island, pp.79-170.

Kolmogorov, A.N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell' Istituto Italiano degle Attuari 4, 83-91.

Marsaglia, G., W.W. Tsang, and J. Wang. (2003). Evaluating Kolmogorov's distribution. Journal of Statistical Software, 8(18). tools:::Rd_expr_doi("10.18637/jss.v008.i18").

Moore, D.S. (1986). Tests of Chi-Squared Type. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, pp.63-95.

Pomeranz, J. (1973). Exact Cumulative Distribution of the Kolmogorov-Smirnov Statistic for Small Samples (Algorithm 487). Collected Algorithms from ACM ??, ???-???.

Royston, J.P. (1992a). Approximating the Shapiro-Wilk W-Test for Non-Normality. Statistics and Computing 2, 117-119.

Royston, J.P. (1992b). Estimation, Reference Ranges and Goodness of Fit for the Three-Parameter Log-Normal Distribution. Statistics in Medicine 11, 897-912.

Royston, J.P. (1992c). A Pocket-Calculator Algorithm for the Shapiro-Francia Test of Non-Normality: An Application to Medicine. Statistics in Medicine 12, 181-184.

Royston, P. (1993). A Toolkit for Testing for Non-Normality in Complete and Censored Samples. The Statistician 42, 37-43.

Ryan, T., and B. Joiner. (1973). Normal Probability Plots and Tests for Normality. Technical Report, Pennsylvannia State University, Department of Statistics.

Shapiro, S.S., and R.S. Francia. (1972). An Approximate Analysis of Variance Test for Normality. Journal of the American Statistical Association 67(337), 215-219.

Shapiro, S.S., and M.B. Wilk. (1965). An Analysis of Variance Test for Normality (Complete Samples). Biometrika 52, 591-611.

Smirnov, N.V. (1939). Estimate of Deviation Between Empirical Distribution Functions in Two Independent Samples. Bulletin Moscow University 2(2), 3-16.

Smirnov, N.V. (1948). Table for Estimating the Goodness of Fit of Empirical Distributions. Annals of Mathematical Statistics 19, 279-281.

Stephens, M.A. (1970). Use of the Kolmogorov-Smirnov, Cramer-von Mises and Related Statistics Without Extensive Tables. Journal of the Royal Statistical Society, Series B, 32, 115-122.

Stephens, M.A. (1986a). Tests Based on EDF Statistics. In D'Agostino, R. B., and M.A. Stevens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York.

USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C.

Verrill, S., and R.A. Johnson. (1987). The Asymptotic Equivalence of Some Modified Shapiro-Wilk Statistics -- Complete and Censored Sample Cases. The Annals of Statistics 15(1), 413-419.

Verrill, S., and R.A. Johnson. (1988). Tables and Large-Sample Distribution Theory for Censored-Data Correlation Statistics for Testing Normality. Journal of the American Statistical Association 83, 1192-1197.

Weisberg, S., and C. Bingham. (1975). An Approximate Analysis of Variance Test for Non-Normality Suitable for Machine Calculation. Technometrics 17, 133-134.

Wilk, M.B., and S.S. Shapiro. (1968). The Joint Assessment of Normality of Several Independent Samples. Technometrics, 10(4), 825-839.

Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.

Examples

Run this code

  # Generate 20 observations from a gamma distribution with
  # parameters shape = 2 and scale = 3 and:
  #
  # 1) Call distChoose using the Shapiro-Wilk method.
  #
  # 2) Call distChoose using the Shapiro-Wilk method and specify
  #    the bias-corrected method of estimating shape for the Gamma
  #    distribution.
  #
  # 3) Compare the results in 2) above with the results using the
  #    ProUCL method.
  #
  # Notes:  The call to set.seed lets you reproduce this example.
  #
  #         The ProUCL method chooses the Normal distribution, whereas the
  #         Shapiro-Wilk method chooses the Gamma distribution.

  set.seed(47)
  dat <- rgamma(20, shape = 2, scale = 3)


  # 1) Call distChoose using the Shapiro-Wilk method.
  #--------------------------------------------------

  distChoose(dat)

  #Results of Choosing Distribution
  #--------------------------------
  #
  #Candidate Distributions:         Normal
  #                                 Gamma
  #                                 Lognormal
  #
  #Choice Method:                   Shapiro-Wilk
  #
  #Type I Error per Test:           0.05
  #
  #Decision:                        Gamma
  #
  #Estimated Parameter(s):          shape = 1.909462
  #                                 scale = 4.056819
  #
  #Estimation Method:               MLE
  #
  #Data:                            dat
  #
  #Sample Size:                     20
  #
  #Test Results:
  #
  #  Normal
  #    Test Statistic:              W = 0.9097488
  #    P-value:                     0.06303695
  #
  #  Gamma
  #    Test Statistic:              W = 0.9834958
  #    P-value:                     0.970903
  #
  #  Lognormal
  #    Test Statistic:              W = 0.9185006
  #    P-value:                     0.09271768

  #--------------------

  # 2) Call distChoose using the Shapiro-Wilk method and specify
  #    the bias-corrected method of estimating shape for the Gamma
  #    distribution.
  #---------------------------------------------------------------

  distChoose(dat, method = "sw",
    est.arg.list = list(gamma = list(method = "bcmle")))

  #Results of Choosing Distribution
  #--------------------------------
  #
  #Candidate Distributions:         Normal
  #                                 Gamma
  #                                 Lognormal
  #
  #Choice Method:                   Shapiro-Wilk
  #
  #Type I Error per Test:           0.05
  #
  #Decision:                        Gamma
  #
  #Estimated Parameter(s):          shape = 1.656376
  #                                 scale = 4.676680
  #
  #Estimation Method:               Bias-Corrected MLE
  #
  #Data:                            dat
  #
  #Sample Size:                     20
  #
  #Test Results:
  #
  #  Normal
  #    Test Statistic:              W = 0.9097488
  #    P-value:                     0.06303695
  #
  #  Gamma
  #    Test Statistic:              W = 0.9834346
  #    P-value:                     0.9704046
  #
  #  Lognormal
  #    Test Statistic:              W = 0.9185006
  #    P-value:                     0.09271768

  #--------------------

  # 3) Compare the results in 2) above with the results using the
  #    ProUCL method.
  #---------------------------------------------------------------

  distChoose(dat, method = "proucl")

  #Results of Choosing Distribution
  #--------------------------------
  #
  #Candidate Distributions:         Normal
  #                                 Gamma
  #                                 Lognormal
  #
  #Choice Method:                   ProUCL
  #
  #Type I Error per Test:           0.05
  #
  #Decision:                        Normal
  #
  #Estimated Parameter(s):          mean = 7.746340
  #                                 sd   = 5.432175
  #
  #Estimation Method:               mvue
  #
  #Data:                            dat
  #
  #Sample Size:                     20
  #
  #Test Results:
  #
  #  Normal
  #    Shapiro-Wilk GOF
  #      Test Statistic:            W = 0.9097488
  #      P-value:                   0.06303695
  #    Lilliefors (Kolmogorov-Smirnov) GOF
  #      Test Statistic:            D = 0.1547851
  #      P-value:                   0.238092
  #
  #  Gamma
  #    ProUCL Anderson-Darling Gamma GOF
  #      Test Statistic:            A = 0.1853826
  #      P-value:                   >= 0.10
  #    ProUCL Kolmogorov-Smirnov Gamma GOF
  #      Test Statistic:            D = 0.0988692
  #      P-value:                   >= 0.10
  #
  #  Lognormal
  #    Shapiro-Wilk GOF
  #      Test Statistic:            W = 0.9185006
  #      P-value:                   0.09271768
  #    Lilliefors (Kolmogorov-Smirnov) GOF
  #      Test Statistic:            D = 0.149317
  #      P-value:                   0.2869177

  #--------------------

  # Clean up
  #---------

  rm(dat)

  #====================================================================

  # Example 10-2 of USEPA (2009, page 10-14) gives an example of
  # using the Shapiro-Wilk test to test the assumption of normality
  # for nickel concentrations (ppb) in groundwater collected over
  # 4 years.  The data for this example are stored in
  # EPA.09.Ex.10.1.nickel.df.

  EPA.09.Ex.10.1.nickel.df
  #   Month   Well Nickel.ppb
  #1      1 Well.1       58.8
  #2      3 Well.1        1.0
  #3      6 Well.1      262.0
  #4      8 Well.1       56.0
  #5     10 Well.1        8.7
  #6      1 Well.2       19.0
  #7      3 Well.2       81.5
  #8      6 Well.2      331.0
  #9      8 Well.2       14.0
  #10    10 Well.2       64.4
  #11     1 Well.3       39.0
  #12     3 Well.3      151.0
  #13     6 Well.3       27.0
  #14     8 Well.3       21.4
  #15    10 Well.3      578.0
  #16     1 Well.4        3.1
  #17     3 Well.4      942.0
  #18     6 Well.4       85.6
  #19     8 Well.4       10.0
  #20    10 Well.4      637.0

  # Use distChoose with the probability plot correlation method,
  # and for the lognormal distribution specify the
  # mean and CV parameterization:
  #------------------------------------------------------------

  distChoose(Nickel.ppb ~ 1, data = EPA.09.Ex.10.1.nickel.df,
    choices = c("norm", "gamma", "lnormAlt"), method = "ppcc")

  #Results of Choosing Distribution
  #--------------------------------
  #
  #Candidate Distributions:         Normal
  #                                 Gamma
  #                                 Lognormal
  #
  #Choice Method:                   PPCC
  #
  #Type I Error per Test:           0.05
  #
  #Decision:                        Lognormal
  #
  #Estimated Parameter(s):          mean = 213.415628
  #                                 cv   =   2.809377
  #
  #Estimation Method:               mvue
  #
  #Data:                            Nickel.ppb
  #
  #Data Source:                     EPA.09.Ex.10.1.nickel.df
  #
  #Sample Size:                     20
  #
  #Test Results:
  #
  #  Normal
  #    Test Statistic:              r = 0.8199825
  #    P-value:                     5.753418e-05
  #
  #  Gamma
  #    Test Statistic:              r = 0.9749044
  #    P-value:                     0.317334
  #
  #  Lognormal
  #    Test Statistic:              r = 0.9912528
  #    P-value:                     0.9187852

  #--------------------

  # Repeat the above example using the ProUCL method.
  #--------------------------------------------------

  distChoose(Nickel.ppb ~ 1, data = EPA.09.Ex.10.1.nickel.df,
    method = "proucl")

  #Results of Choosing Distribution
  #--------------------------------
  #
  #Candidate Distributions:         Normal
  #                                 Gamma
  #                                 Lognormal
  #
  #Choice Method:                   ProUCL
  #
  #Type I Error per Test:           0.05
  #
  #Decision:                        Gamma
  #
  #Estimated Parameter(s):          shape =   0.5198727
  #                                 scale = 326.0894272
  #
  #Estimation Method:               MLE
  #
  #Data:                            Nickel.ppb
  #
  #Data Source:                     EPA.09.Ex.10.1.nickel.df
  #
  #Sample Size:                     20
  #
  #Test Results:
  #
  #  Normal
  #    Shapiro-Wilk GOF
  #      Test Statistic:            W = 0.6788888
  #      P-value:                   2.17927e-05
  #    Lilliefors (Kolmogorov-Smirnov) GOF
  #      Test Statistic:            D = 0.3267052
  #      P-value:                   5.032807e-06
  #
  #  Gamma
  #    ProUCL Anderson-Darling Gamma GOF
  #      Test Statistic:            A = 0.5076725
  #      P-value:                   >= 0.10
  #    ProUCL Kolmogorov-Smirnov Gamma GOF
  #      Test Statistic:            D = 0.1842904
  #      P-value:                   >= 0.10
  #
  #  Lognormal
  #    Shapiro-Wilk GOF
  #      Test Statistic:            W = 0.978946
  #      P-value:                   0.9197735
  #    Lilliefors (Kolmogorov-Smirnov) GOF
  #      Test Statistic:            D = 0.08405167
  #      P-value:                   0.9699648

  #====================================================================

  if (FALSE) {
  # 1) Simulate 1000 trials where for each trial you:
  #    a) Generate 20 observations from a Gamma distribution with
  #       parameters mean = 10 and CV = 1.
  #    b) Use distChoose with the Shapiro-Wilk method.
  #    c) Use distChoose with the ProUCL method.
  #
  #  2) Compare the proportion of times the
  #     Normal vs. Gamma vs. Lognormal vs. Nonparametric distribution
  #     is chosen for b) and c) above.
  #------------------------------------------------------------------

  set.seed(58)
  N <- 1000

  Choose.fac <- factor(rep("", N), levels = c("Normal", "Gamma", "Lognormal", "Nonparametric"))
  Choose.df <- data.frame(SW = Choose.fac, ProUCL = Choose.fac)

  for(i in 1:N) {
    dat <- rgammaAlt(20, mean = 10, cv = 1)
    Choose.df[i, "SW"]     <- distChoose(dat, method = "sw")$decision
    Choose.df[i, "ProUCL"] <- distChoose(dat, method = "proucl")$decision
  }

  summaryStats(Choose.df, digits = 0)

  #              ProUCL(N) ProUCL(Pct) SW(N) SW(Pct)
  #Normal              443          44    41       4
  #Gamma               546          55   733      73
  #Lognormal             9           1   215      22
  #Nonparametric         2           0    11       1
  #Combined           1000         100  1000     100


  #--------------------


  # Repeat above example for the Lognormal Distribution with mean=10 and CV = 1.
  #-----------------------------------------------------------------------------

  set.seed(297)
  N <- 1000

  Choose.fac <- factor(rep("", N), levels = c("Normal", "Gamma", "Lognormal", "Nonparametric"))
  Choose.df <- data.frame(SW = Choose.fac, ProUCL = Choose.fac)

  for(i in 1:N) {
    dat <- rlnormAlt(20, mean = 10, cv = 1)
    Choose.df[i, "SW"]     <- distChoose(dat, method = "sw")$decision
    Choose.df[i, "ProUCL"] <- distChoose(dat, method = "proucl")$decision
  }

  summaryStats(Choose.df, digits = 0)

  #              ProUCL(N) ProUCL(Pct) SW(N) SW(Pct)
  #Normal              313          31    15       2
  #Gamma               556          56   254      25
  #Lognormal           121          12   706      71
  #Nonparametric        10           1    25       2
  #Combined           1000         100  1000     100


  #--------------------


  # Clean up
  #---------

  rm(N, Choose.fac, Choose.df, i, dat)
  }

Run the code above in your browser using DataLab