f_boxcox: A User-Friendly Box-Cox Transformation

Description

Performs a Box-Cox transformation on a dataset to stabilize variance and make the data more normally distributed. It also provides diagnostic plots and tests for normality. The transformation is based on code from MASS/R/boxcox.R. The function prints $\lambda$ to the console and returns the transformed data set.

Usage

f_boxcox(
  data = data,
  lambda = seq(-2, 2, 1/10),
  plots = FALSE,
  transform.data = TRUE,
  interp = (plots && (length(lambda) < 100)),
  eps = 1/50,
  xlab = expression(lambda),
  ylab = "log-Likelihood",
  alpha = 0.05,
  open_generated_files = TRUE,
  close_generated_files = FALSE,
  output_type = "off",
  output_file = NULL,
  output_dir = NULL,
  save_in_wdir = FALSE,
  ...
)

Value

An object of class 'f_boxcox' containing, among other elements, the results of the Box-Cox transformation, lambda, the input data, the transformed data, and the Shapiro-Wilk test results for the original and transformed data. With the option "output_type", it can also generate output as R Markdown code, 'Word', or 'pdf' files. Print and plot methods are available for 'f_boxcox' objects.

Arguments

data: A numeric vector or a data frame with a single numeric column. The data to be transformed.
lambda: A numeric vector of $\lambda$ values to evaluate for the Box-Cox transformation. Default is seq(-2, 2, 0.1).
plots: Logical. If TRUE, plots the log-likelihood of the Box-Cox transformation, plus histograms and Q-Q plots of the original and transformed data. Default is FALSE.
transform.data: Logical. If TRUE, returns the transformed data. Default is TRUE.
interp: Logical. If TRUE and fewer than 100 $\lambda$ values are provided, interpolates for smooth plotting. The default is TRUE only when plots is TRUE and lambda contains fewer than 100 values; otherwise it is FALSE.
eps: A small positive value used to determine when to switch from the power transformation to the log transformation for numerical stability. Default is 1/50.
xlab: Character string. Label for the x-axis in plots. Default is an expression object representing $\lambda$.
ylab: Character string. Label for the y-axis in plots. Default is "log-Likelihood".
alpha: Numeric. Significance level for the Shapiro-Wilk test of normality. Default is 0.05.
open_generated_files: Logical. If TRUE, opens the generated output files ('pdf', 'Word' or 'Excel', depending on the output format) so the results can be viewed directly after creation. Files are stored in tempdir(). Default is TRUE.
close_generated_files: Logical. If TRUE, closes open 'Word' files, depending on the output format, so that the newly generated files can be saved. 'Pdf' files cannot be closed automatically and should be closed manually before using the function. Default is FALSE.
output_type: Character string specifying the output format: "pdf", "word", "rmd", "off" (no file generated) or "console". The option "console" forces output to be printed. Default is "off".
output_file: A character string specifying the name of the output file (without extension). If NULL, a default name based on the dataset name is generated.
output_dir: Character string specifying the directory of the output file. Default is tempdir(). If output_file already contains a directory name, output_dir can be omitted; if it is supplied, it overrides the directory given in output_file.
save_in_wdir: Logical. If TRUE, saves the file in the working directory. Default is FALSE, to avoid unintended changes to the global environment. If output_dir is specified, it overrides save_in_wdir.
...: Additional arguments passed to plotting functions.
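
The default expressions for lambda and interp can be evaluated on their own in base R to see what they produce. This is only an illustration of the defaults listed under Usage; the variable name interp_with_plots is hypothetical.

```r
# Default lambda grid from the Usage section.
lambda <- seq(-2, 2, 1/10)
n_lambda <- length(lambda)   # 41 candidate values

# Default interp expression with the default plots = FALSE.
plots  <- FALSE
interp <- plots && (length(lambda) < 100)   # FALSE, because plots is FALSE

# The same expression once plots is switched on.
plots  <- TRUE
interp_with_plots <- plots && (length(lambda) < 100)   # TRUE, because 41 < 100
```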

Author

Sander H. van Delden, plantmind@proton.me
Salvatore Mangiafico, mangiafico@njaes.rutgers.edu
W. N. Venables and B. D. Ripley

Details

The function uses the following formula for transformation: $$ y(\lambda) = \begin{cases} \frac{y^\lambda - 1}{\lambda}, & \lambda \neq 0 \\ \log(y), & \lambda = 0 \end{cases} $$

where $y$ is the data being transformed and $\lambda$ is the transformation parameter, which is estimated from the data by maximum likelihood. The function computes the Box-Cox transformation for a range of $\lambda$ values and identifies the $\lambda$ that maximizes the log-likelihood function. A convenient property of this transformation is that it checks the suitability of many common transformations in a single run. The most common transformations and their $\lambda$ values are given below:

$\lambda$-Value	Transformation
-----------------------	-----------------------
-2	$\frac{1}{x^2}$
-1	$\frac{1}{x}$
-0.5	$\frac{1}{\sqrt{x}}$
0	$\log(x)$
0.5	$\sqrt{x}$
1	$x$
2	$x^2$
-----------------------	-----------------------

If the estimated transformation parameter closely aligns with one of the values listed in the previous table, it is generally advisable to select the table value rather than the precise estimated value. This approach simplifies interpretation and practical application.
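
The grid search and the rounding rule described above can be sketched in base R. This is a minimal illustration for a single positive numeric vector (an intercept-only model), not the code used inside f_boxcox. The helper names bc_transform, bc_loglik and snap_lambda are hypothetical, and the eps switch here simply falls back to log(y), which is a simplifying assumption.

```r
# Box-Cox transform of a positive numeric vector y for one lambda.
# Assumption: when |lambda| < eps we simply use log(y).
bc_transform <- function(y, lambda, eps = 1/50) {
  if (abs(lambda) < eps) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood (up to an additive constant) for an
# intercept-only model: -n/2 * log(RSS/n) + (lambda - 1) * sum(log(y)).
bc_loglik <- function(y, lambda, eps = 1/50) {
  yt  <- bc_transform(y, lambda, eps)
  rss <- sum((yt - mean(yt))^2)
  -length(y) / 2 * log(rss / length(y)) + (lambda - 1) * sum(log(y))
}

# Snap an estimate to the nearest conventional value from the table.
snap_lambda <- function(lambda,
                        table_values = c(-2, -1, -0.5, 0, 0.5, 1, 2)) {
  table_values[which.min(abs(table_values - lambda))]
}

set.seed(1)
y    <- rlnorm(200, meanlog = 0, sdlog = 1)  # log-normal data
grid <- seq(-2, 2, 1/10)                     # candidate lambda values
ll   <- vapply(grid, function(l) bc_loglik(y, l), numeric(1))

lambda_hat <- grid[which.max(ll)]   # grid value with the highest log-likelihood
lambda_use <- snap_lambda(lambda_hat)  # nearest value from the table
```

For log-normal data the estimate is expected to land near 0, in which case the snapped value corresponds to the log transformation.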

The function provides diagnostic plots: a plot of log-likelihood against $\lambda$ values and a Q-Q plot of the transformed data. It also performs a Shapiro-Wilk test for normality on the transformed data if the sample size is less than or equal to 5000.

Note: For sample sizes greater than 5000, Shapiro-Wilk test results are not provided due to limitations in its applicability.
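
The sample-size guard can be sketched with base R's shapiro.test(), which accepts between 3 and 5000 non-missing values. The helper check_normality and the fields of its return value are hypothetical and only illustrate the idea; they do not reflect the structure of an 'f_boxcox' object.

```r
# Run the Shapiro-Wilk test only when the sample size allows it.
check_normality <- function(x, alpha = 0.05) {
  x <- x[!is.na(x)]
  n <- length(x)
  if (n < 3 || n > 5000) {
    # Outside the range supported by shapiro.test(): skip the test.
    return(list(n = n, p_value = NA_real_, normal = NA))
  }
  p <- shapiro.test(x)$p.value
  list(n = n, p_value = p, normal = p > alpha)
}

set.seed(1)
raw     <- rlnorm(100)
res_raw <- check_normality(raw)          # original data
res_log <- check_normality(log(raw))     # data after a log transformation
res_big <- check_normality(rnorm(6000))  # too large: the test is skipped
```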

This function requires [Pandoc](https://github.com/jgm/pandoc/releases/tag) (version 1.12.3 or higher), a universal document converter.

Windows: Install Pandoc and ensure the installation folder (e.g., "C:/Users/your_username/AppData/Local/Pandoc") is added to your system PATH.
macOS: If using Homebrew, Pandoc is typically installed in "/usr/local/bin". Alternatively, download the .pkg installer and verify that the binary’s location is in your PATH.
Linux: Install Pandoc through your distribution’s package manager (commonly installed in "/usr/bin" or "/usr/local/bin") or manually, and ensure the directory containing Pandoc is in your PATH.
If Pandoc is not found, this function may not work as intended.
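
Whether a Pandoc executable is visible on the PATH can be checked from R before requesting file output. The sketch below uses only base R and does not check the version; if the rmarkdown package is installed, rmarkdown::pandoc_available("1.12.3") is an alternative that also checks the minimum version. The variable name pandoc_path is hypothetical.

```r
# Look for a pandoc executable on the PATH (base R only).
# Sys.which() returns an empty string when nothing is found.
pandoc_path <- Sys.which("pandoc")

if (nzchar(pandoc_path)) {
  message("Pandoc found at: ", pandoc_path)
} else {
  message("Pandoc not found on PATH; file output may not work.")
}
```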

References

The core of calculating $\lambda$ and the plotting was taken from:
file MASS/R/boxcox.R copyright (C) 1994-2004 W. N. Venables and B. D. Ripley

Some code to present the result was taken and modified from file:
rcompanion/R/transformTukey.r. (Developed by Salvatore Mangiafico)

https://rcompanion.org/handbook/I_12.html

The explanation of the Box-Cox transformation given here is from r-coder:

https://r-coder.com/box-cox-transformation-r/

Examples


# Create non-normal data in a data.frame or vector.
df   <- data.frame(values = rlnorm(100, meanlog = 0, sdlog = 1))

# Store the transformation in object "bc".
bc <- f_boxcox(df$values)

# Print lambda and the Shapiro-Wilk test results.
print(bc)

# Plot the Q-Q plots, histograms and the log-likelihood of lambda.
plot(bc)

# Or directly use the transformed data from the f_boxcox object.
df$values_transformed <- f_boxcox(df$values)$transformed_data
print(df$values_transformed)
