Apply a Box-Cox power transformation to a set of data to attempt to induce normality and homogeneity of variance.

`boxcoxTransform(x, lambda, eps = .Machine$double.eps)`

x

a numeric vector of positive numbers.

lambda

finite numeric scalar indicating what power to use for the Box-Cox transformation.

eps

finite, positive numeric scalar. When the absolute value of `lambda` is less than `eps`, `lambda` is assumed to be 0 for the Box-Cox transformation. The default value is `eps=.Machine$double.eps`.

numeric vector of transformed observations.

Two common assumptions for several standard parametric hypothesis tests are:

1. The observations all come from a normal distribution.

2. The observations all come from distributions with the same variance.

For example, the standard one-sample t-test assumes all the observations come from the same normal distribution, and the standard two-sample t-test assumes that all the observations come from a normal distribution with the same variance, although the mean may differ between the two groups. For standard linear regression models, these assumptions can be stated as: the error terms all come from a normal distribution with mean 0 and a constant variance.

Often, especially with environmental data, the above assumptions do not hold because the original data are skewed and/or they follow a distribution that is not really shaped like a normal distribution. It is sometimes possible, however, to transform the original data so that the transformed observations in fact come from a normal distribution or close to a normal distribution. The transformation may also induce homogeneity of variance and, for the case of a linear regression model, a linear relationship between the response and predictor variable(s).

Sometimes, theoretical considerations indicate an appropriate transformation. For example, count data often follow a Poisson distribution, and it can be shown that taking the square root of observations from a Poisson distribution tends to make these data look more bell-shaped (Johnson et al., 1992, p.163; Johnson and Wichern, 2007, p.192; Zar, 2010, p.291). A common example in the environmental field is that chemical concentration data often appear to come from a lognormal distribution or some other positively-skewed distribution (e.g., gamma). In this case, taking the logarithm of the observations often appears to yield normally distributed data.

Ideally, a data transformation is chosen based on knowledge of the process generating the data, as well as graphical tools such as quantile-quantile plots and histograms.

Box and Cox (1964) presented a formalized method for deciding on a data transformation. Given a random variable \(X\) from some distribution with only positive values, the Box-Cox family of power transformations is defined as:

$$Y = \begin{cases} \dfrac{X^\lambda - 1}{\lambda}, & \lambda \ne 0 \\ \log(X), & \lambda = 0 \end{cases} \;\;\;\;\;\; (1)$$

where \(Y\) is assumed to come from a normal distribution. This transformation is continuous in \(\lambda\). Note that this transformation also preserves ordering; that is, if \(X_1 < X_2\) then \(Y_1 < Y_2\).
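The definition above, combined with the `eps` rule described for the arguments, can be sketched in base R as follows. The helper name `bc_sketch` is hypothetical and is not the package's source code; it is only meant to illustrate the formula and the order-preserving property.

```r
# Hypothetical helper illustrating the Box-Cox formula; not the EnvStats source.
bc_sketch <- function(x, lambda, eps = .Machine$double.eps) {
  stopifnot(is.numeric(x), all(x > 0, na.rm = TRUE))
  if (abs(lambda) < eps) {
    log(x)                      # lambda is treated as 0
  } else {
    (x^lambda - 1) / lambda     # lambda != 0
  }
}

x <- c(0.5, 1, 2, 10, 50)

y_sqrt <- bc_sketch(x, lambda = 0.5)
y_log  <- bc_sketch(x, lambda = 0)

# Ordering is preserved, even for a negative lambda
order_kept <- all(diff(bc_sketch(x, lambda = -1)) > 0)

# For lambda close to 0 the power form approaches log(x)
max_gap <- max(abs(bc_sketch(x, lambda = 1e-6) - y_log))
```

Here `order_kept` checks that increasing inputs give increasing outputs, and `max_gap` measures how far the transformation with a very small `lambda` is from the log transformation.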

Box and Cox (1964) proposed choosing the appropriate value of \(\lambda\) based on maximizing a likelihood function. See the help file for `boxcox` for details.
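As a rough illustration of the likelihood idea for a single sample with no predictors, the profile log-likelihood of \(\lambda\) under the normality assumption can be maximized numerically. This is a base R sketch only: the function name `bc_loglik` and the search interval `c(-2, 2)` are assumptions made for this example, and the `boxcox` function may use a different objective or search method.

```r
# Profile log-likelihood of lambda for one sample, up to an additive constant:
#   -(n / 2) * log(sigma_hat^2(lambda)) + (lambda - 1) * sum(log(x))
bc_loglik <- function(lambda, x) {
  n <- length(x)
  y <- if (abs(lambda) < .Machine$double.eps) log(x) else (x^lambda - 1) / lambda
  sigma2_hat <- mean((y - mean(y))^2)   # maximum likelihood variance estimate
  -(n / 2) * log(sigma2_hat) + (lambda - 1) * sum(log(x))
}

set.seed(250)
x <- rlnorm(200, meanlog = 1, sdlog = 1)   # lognormal data, so lambda near 0 is plausible

# One-dimensional search for the maximizing lambda over an assumed interval
fit <- optimize(bc_loglik, interval = c(-2, 2), x = x, maximum = TRUE)
lambda_hat <- fit$maximum
```

Because the simulated data are lognormal, `lambda_hat` should land near 0, which corresponds to the log transformation.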

Note that for non-zero values of \(\lambda\), instead of using the formula of Box and Cox in Equation (1), you may simply use the power transformation: $$Y = X^\lambda \;\;\;\;\;\; (2)$$ since these two equations differ only by a scale difference and origin shift, and the essential character of the transformed distribution remains unchanged.
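The claim that the two forms differ only by origin and scale can be checked numerically. The following base R sketch uses simulated data and an arbitrarily chosen positive `lambda`; the variable names are illustrative only.

```r
set.seed(1)
x <- rlnorm(50)
lambda <- 0.5

y_bc  <- (x^lambda - 1) / lambda   # Box-Cox form
y_pow <- x^lambda                  # plain power form, Equation (2)

# The Box-Cox form is a linear function of the plain power
linear_map_ok <- isTRUE(all.equal(y_bc, (y_pow - 1) / lambda))

# Standardizing removes the shift and scale, so the shapes coincide
same_shape <- isTRUE(all.equal(as.numeric(scale(y_bc)),
                               as.numeric(scale(y_pow))))
```

One caveat worth keeping in mind: for negative \(\lambda\), the plain power \(X^\lambda\) reverses the ordering of the observations, whereas the Box-Cox form keeps it because of the division by \(\lambda\).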

The value \(\lambda=1\) corresponds to no transformation. Values of \(\lambda\) less than 1 shrink large values of \(X\), and are therefore useful for transforming positively-skewed (right-skewed) data. Values of \(\lambda\) larger than 1 inflate large values of \(X\), and are therefore useful for transforming negatively-skewed (left-skewed) data (Helsel and Hirsch, 1992, pp.13-14; Johnson and Wichern, 2007, p.193). Commonly used values of \(\lambda\) include 0 (log transformation), 0.5 (square-root transformation), -1 (reciprocal), and -0.5 (reciprocal root).
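The effect of \(\lambda\) on skewness can be seen with a small simulation. This base R sketch uses simulated lognormal data; the helpers `skew` and `bc` are hypothetical names defined only for this example.

```r
# Moment-based sample skewness (helper defined for this example)
skew <- function(y) mean((y - mean(y))^3) / mean((y - mean(y))^2)^1.5

# Box-Cox transformation for a given lambda
bc <- function(x, lambda) if (lambda == 0) log(x) else (x^lambda - 1) / lambda

set.seed(250)
x <- rlnorm(500, meanlog = 2, sdlog = 1)   # positively skewed data

lambdas <- c(1, 0.5, 0, -0.5, -1)
skews <- sapply(lambdas, function(l) skew(bc(x, l)))
names(skews) <- lambdas
skews
```

The skewness shrinks as \(\lambda\) decreases from 1. For these lognormal data it should be close to 0 at \(\lambda = 0\) and turn negative for \(\lambda < 0\), which shows that a transformation can over-correct.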

It is often recommended that, when dealing with several similar data sets, you find a common transformation that works reasonably well for all of them rather than using slightly different transformations for each data set (Helsel and Hirsch, 1992, p.14; Shumway et al., 1989).

Berthouex, P.M., and L.C. Brown. (2002).
*Statistics for Environmental Engineers, Second Edition*.
Lewis Publishers, Boca Raton, FL.

Box, G.E.P., and D.R. Cox. (1964). An Analysis of Transformations
(with Discussion). *Journal of the Royal Statistical Society, Series B*
**26**(2), 211--252.

Draper, N., and H. Smith. (1998). *Applied Regression Analysis*. Third Edition.
John Wiley and Sons, New York, pp.47-53.

Gilbert, R.O. (1987). *Statistical Methods for Environmental Pollution
Monitoring*. Van Nostrand Reinhold, NY.

Helsel, D.R., and R.M. Hirsch. (1992).
*Statistical Methods in Water Resources Research*.
Elsevier, New York, NY.

Hinkley, D.V., and G. Runger. (1984). The Analysis of Transformed Data
(with Discussion). *Journal of the American Statistical Association*
**79**, 302--320.

Hoaglin, D.C., F.M. Mosteller, and J.W. Tukey, eds. (1983).
*Understanding Robust and Exploratory Data Analysis*.
John Wiley and Sons, New York, Chapter 4.

Hoaglin, D.C. (1988). Transformations in Everyday Experience.
*Chance* **1**, 40--45.

Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). *Univariate
Discrete Distributions, Second Edition*. John Wiley and Sons, New York,
p.163.

Johnson, R.A., and D.W. Wichern. (2007).
*Applied Multivariate Statistical Analysis, Sixth Edition*.
Pearson Prentice Hall, Upper Saddle River, NJ, pp.192--195.

Shumway, R.H., A.S. Azari, and P. Johnson. (1989).
Estimating Mean Concentrations Under Transformations for Environmental
Data With Detection Limits. *Technometrics* **31**(3), 347--356.

Stoline, M.R. (1991). An Examination of the Lognormal and Box and Cox
Family of Transformations in Fitting Environmental Data.
*Environmetrics* **2**(1), 85--106.

van Belle, G., L.D. Fisher, P.J. Heagerty, and T. Lumley. (2004).
*Biostatistics: A Methodology for the Health Sciences, 2nd Edition*.
John Wiley & Sons, New York.

Zar, J.H. (2010). *Biostatistical Analysis*.
Fifth Edition. Prentice-Hall, Upper Saddle River, NJ,
Chapter 13.

```
# The functions used below (rlnormAlt, qqPlot, boxcoxTransform) come from
# the EnvStats package.
library(EnvStats)

# Generate 30 observations from a lognormal distribution with
# mean=10 and cv=2, then look at some normal quantile-quantile
# plots for various transformations.
# (Note: the call to set.seed simply allows you to reproduce this example.)
set.seed(250)
x <- rlnormAlt(30, mean = 10, cv = 2)

# Untransformed data
dev.new()
qqPlot(x, add.line = TRUE)

# Square-root-type transformation (lambda = 0.5)
dev.new()
qqPlot(boxcoxTransform(x, lambda = 0.5), add.line = TRUE)

# Log transformation (lambda = 0)
dev.new()
qqPlot(boxcoxTransform(x, lambda = 0), add.line = TRUE)

# Clean up
#---------
rm(x)
```
