EnvStats-package: Package for Environmental Statistics, Including US EPA Guidance

Description

A comprehensive R package for environmental statistics and the successor to the S-PLUS module EnvironmentalStats for S-PLUS (first released in April, 1997). EnvStats provides a set of powerful functions for graphical and statistical analyses of environmental data, with a focus on analyzing chemical concentrations and physical parameters, usually in the context of mandated environmental monitoring. It includes major environmental statistical methods found in the literature and regulatory guidance documents, and extensive help that explains what these methods do, how to use them, and where to find them in the literature. It also includes numerous built-in data sets from regulatory guidance documents and environmental statistics literature, and scripts reproducing analyses presented in the user's manual (Millard, 2013).

For a complete list of functions and datasets, you can do any of the following:

See the help file Functions By Category for a listing of functions by category.
If you are in the on-line help, scroll to the bottom of this help page and click on the Index link.
Type library(help="EnvStats") at the command prompt.

Note: The names of all EnvStats functions start with a lowercase letter, and the names of all EnvStats datasets and data objects start an uppercase letter. You can type newsEnvStats() at the R command prompt for the latest news for the EnvStats package.

Arguments

Details

Package:

EnvStats

Type:

Package

Version:

2.1.1

Date:

2016-06-13

License:

GPL (>=3)

LazyLoad:

yes

A companion file EnvStats-manual.pdf containing a listing of all the current help files is located on the R CRAN web site at http://cran.r-project.org/web/packages/EnvStats/EnvStats.pdf and also in the doc subdirectory of the directory where the EnvStats package was installed. For example, if you installed R under Windows, this file might be located in the directory C:\Program Files\R-*.**.*\library\EnvStats\doc, where *.**.* denotes the version of R you are using (e.g., 3.2.5) or in the directory C:\Users\Name\Documents\R\win-library\*.**.*\EnvStats\doc, where Name denotes your user name on the Windows operating system.

EnvStats comes with companion scripts, located in the scripts subdirectory of the directory where the package was installed. One set of scripts lets you reproduce the examples in the User's Manual (currently is still in preparation). There are also scripts that let you reproduce examples from US EPA guidance documents.

See the References section below for documentation for the predecessor to EnvStats, EnvironmentalStats for S-PLUS for Windows.

Features of EnvStats include:

New functions for computing summary statistics and creating summary plots to compare the distributions of groups side-by-side.

New probability distributions have been added to the ones already available in R, including the extreme value distribution and the zero-modified lognormal (delta) distribution. You can compute quantities associated with these probability distributions (probability density functions, cumulative distribution functions, and quantiles), and generate random numbers from these distributions.

Plot probability distributions so you can see how they change with the value of the distribution parameter(s).

Estimate distribution parameters and distribution quantiles, and compute confidence intervals for commonly used probability distributions, including special methods for the lognormal and gamma distributions.

Perform and plot the results of goodness-of-fit tests:

Observed and Fitted Distributions
Quantile-Quantile Plots
Results of Shaprio-Wilk test, Kolmogorov-Smirnov test, etc.

Includes a new generalized goodness-of-fit test for any continuous distribution.

Functions for assessing optimal Box-Cox data transformations.

Compute parametric and non-parametric prediction intervals, simultaneous prediction intervals, and tolerance intervals.

New functions for hypothesis tests, including:

Nonparametric estimation and tests for seasonal trend
Fisher's one-sample randomization (permutation) test for location
Quantile test to detect a shift in the tail of one population relative to another
Two-sample linear rank tests
Test for serial correlation based on von Neumann rank test

Perform calibration based on a machine signal to determine decision and detection limits and report estimated concentrations along with confidence intervals.

Easily perform power and sample size computations and create companion plots for sampling designs based on confidence intervals, hypothesis tests, prediction intervals, and tolerance intervals.

Handle singly and multiply censored (less-than-detection-limit) data:

Empirical CDF and Quantile-Quantile Plots
Parameter/Quantile Estimation and Confidence Intervals
Prediction and Tolerance Intervals
Goodness-of-Fit Tests
Optimal Box-Cox Transformations
Two-Sample Rank Tests

Functions for performing Monte Carlo simulation and probabilistic risk assessement.

Reproduce specific examples in EPA guidance documents by using built-in data sets from these documents and running companion scripts.

References

Millard, S.P. (2013). EnvStats: An R Package for Environmental Statistics. Springer, New York.

Millard, S.P. (2002). EnvironmentalStats for S-PLUS: User's Manual for Version 2.0. Second Edition. Springer-Verlag, New York. Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.

Examples

Run this code

  # Look at plots and summary statistics for the TcCB data given in 
  # USEPA (1994b), (the data are stored in EPA.94b.tccb.df). 
  # Arbitrarily set the one censored observation to the censoring level. 
  # Group by the variable Area.

  EPA.94b.tccb.df
  #    TcCB.orig   TcCB Censored      Area
  #1        0.22   0.22    FALSE Reference
  #2        0.23   0.23    FALSE Reference
  #...
  #46       1.20   1.20    FALSE Reference
  #47       1.33   1.33    FALSE Reference
  #48      <0.09   0.09     TRUE   Cleanup
  #49       0.09   0.09    FALSE   Cleanup
  #...
  #123     51.97  51.97    FALSE   Cleanup
  #124    168.64 168.64    FALSE   Cleanup


  # First plot the data
  #--------------------
  dev.new()
  stripChart(TcCB ~ Area, data = EPA.94b.tccb.df, 
    xlab = "Area", ylab = "TcCB (ppb)")
  mtext("TcCB Concentrations by Area", line = 3, cex = 1.25, font = 2)

  dev.new()
  stripChart(log10(TcCB) ~ Area, data = EPA.94b.tccb.df, 
    p.value = TRUE, 
    xlab = "Area", ylab = expression(paste(log[10], " [ TcCB (ppb) ]")))
  mtext(expression(paste(log[10], "(TcCB) Concentrations by Area")), 
    line = 3, cex = 1.25, font = 2)

  #--------------------------------------------------------------------

  # Now compute summary statistics
  #-------------------------------
  
  sum(EPA.94b.tccb.df$Censored) 
  #[1] 1 

  with(EPA.94b.tccb.df, TcCB[Censored])
  #0.09 

  # Summary statistics will treat the one censored value 
  # as assuming the detection limit.

  summaryFull(TcCB ~ Area, data = EPA.94b.tccb.df)
  #                             Cleanup  Reference
  #N                             77       47      
  #Mean                           3.915    0.5985 
  #Median                         0.43     0.54   
  #10% Trimmed Mean               0.6846   0.5728 
  #Geometric Mean                 0.5784   0.5382 
  #Skew                           7.717    0.9019 
  #Kurtosis                      62.67     0.132  
  #Min                            0.09     0.22   
  #Max                          168.6      1.33   
  #Range                        168.5      1.11   
  #1st Quartile                   0.23     0.39   
  #3rd Quartile                   1.1      0.75   
  #Standard Deviation            20.02     0.2836 
  #Geometric Standard Deviation   3.898    1.597  
  #Interquartile Range            0.87     0.36   
  #Median Absolute Deviation      0.3558   0.2669 
  #Coefficient of Variation       5.112    0.4739

  summaryStats(TcCB ~ Area, data = EPA.94b.tccb.df, digits = 1)
  #           N Mean   SD Median Min   Max
  #Cleanup   77  3.9 20.0    0.4 0.1 168.6
  #Reference 47  0.6  0.3    0.5 0.2   1.3

  #----------------------------------------------------------------

  # Compute Shapiro-Wilk Goodness-of-Fit statistic for the 
  # Reference Area TcCB data assuming a lognormal distribution
  #-----------------------------------------------------------
  
  sw.list <- gofTest(TcCB ~ 1, data = EPA.94b.tccb.df, 
    subset = Area == "Reference", dist = "lnorm")
  sw.list

  # Results of Goodness-of-Fit Test
  # -------------------------------
  #
  # Test Method:                     Shapiro-Wilk GOF
  #
  # Hypothesized Distribution:       Lognormal
  #
  # Estimated Parameter(s):          meanlog = -0.6195712
  #                                  sdlog   =  0.4679530
  #
  # Estimation Method:               mvue
  #
  # Data:                            TcCB
  #
  # Subset With:                     Area == "Reference"
  #
  # Data Source:                     EPA.94b.tccb.df
  #
  # Sample Size:                     47
  #
  # Test Statistic:                  W = 0.978638
  #
  # Test Statistic Parameter:        n = 47
  #
  # P-value:                         0.5371935
  #
  # Alternative Hypothesis:          True cdf does not equal the
  #                                  Lognormal Distribution.

  #----------

  # Plot results of GOF test
  dev.new()
  plot(sw.list)

  #----------------------------------------------------------------

  # Based on the Reference Area data, estimate 90th percentile 
  # and compute a 95% confidence limit for the 90th percentile 
  # assuming a lognormal distribution.
  #------------------------------------------------------------

  with(EPA.94b.tccb.df, 
    eqlnorm(TcCB[Area == "Reference"], p = 0.9, ci = TRUE))

  # Results of Distribution Parameter Estimation
  # --------------------------------------------
  #
  # Assumed Distribution:            Lognormal
  #
  # Estimated Parameter(s):          meanlog = -0.6195712
  #                                  sdlog   =  0.4679530
  #
  # Estimation Method:               mvue
  #
  # Estimated Quantile(s):           90'th %ile = 0.9803307
  #
  # Quantile Estimation Method:      qmle
  #
  # Data:                            TcCB[Area == "Reference"]
  #
  # Sample Size:                     47
  #
  # Confidence Interval for:         90'th %ile
  #
  # Confidence Interval Method:      Exact
  #
  # Confidence Interval Type:        two-sided
  #
  # Confidence Level:                95%
  #
  # Confidence Interval:             LCL = 0.8358791
                                     UCL = 1.2154977
  #----------

  # Cleanup
  rm(TcCB.ref, sw.list)

Run the code above in your browser using DataLab