# bestNormalize v1.4.2

## Normalizing Transformation Functions

Estimate a suite of normalizing transformations, including
a new adaptation of a technique based on ranks which can guarantee
normally distributed transformed data if there are no ties: ordered
quantile normalization (ORQ). ORQ normalization combines a rank-mapping
approach with a shifted logit approximation that allows
the transformation to work on data outside the original domain. It is
also able to handle new data within the original domain via linear
interpolation. The package is built to estimate the best normalizing
transformation for a vector consistently and accurately. It implements
the Box-Cox transformation, the Yeo-Johnson transformation, three types
of Lambert WxF transformations, and the ordered quantile normalization
transformation. It also estimates the normalization efficacy of other
commonly used transformations.

## Readme

# bestNormalize: Flexibly calculate the best normalizing transformation for a vector

The `bestNormalize` R package was designed to help find a normalizing
transformation for a vector. Many techniques have been developed for
this purpose; however, each has its own strengths and weaknesses, and it
is unclear which will work best until the data are observed. This
package looks at a range of possible transformations and returns the
best one, i.e. the one that makes the data look the *most* normal.

Note that some authors use the term “normalize” differently than in this package. We define “normalize” as transforming a vector of data in such a way that the transformed values follow a Gaussian distribution (or equivalently, a bell curve). This is in contrast to techniques designed to rescale values to the 0-1 range, or to the -1 to 1 range.
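The distinction can be illustrated with a minimal base-R sketch. It assumes log-normal data (chosen only because a log transform is then exactly normalizing) and uses a hypothetical helper, `skew()`, that is not part of this package:

```
set.seed(1)
x <- rlnorm(500)   # right-skewed data; log(x) is normal by construction

# Hypothetical helper: sample skewness (roughly 0 for symmetric data)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Min-max scaling changes the range but not the shape of the distribution
x_scaled <- (x - min(x)) / (max(x) - min(x))
range(x_scaled)
skew(x)
skew(x_scaled)     # same skewness as the raw data

# A normalizing transformation changes the shape instead
skew(log(x))       # close to 0
```

Rescaling is a linear map, so it cannot remove skewness; a normalizing transformation in the sense used here must be nonlinear.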

This package also introduces a new adaptation of a normalization
technique, which we call Ordered Quantile normalization (`orderNorm()`,
or ORQ). ORQ transforms the data based on a rank mapping to the
normal distribution. This allows us to *guarantee* normally distributed
transformed data (if ties are not present). The adaptation uses a
shifted logit approximation on the ranks transformation to perform the
transformation on newly observed data outside of the original domain. On
new data within the original domain, the transformation uses linear
interpolation of the fitted transformation.
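The rank-mapping idea can be sketched in base R. This is an illustration under assumptions, not the package's internal code: the mapping `qnorm((rank(x) - 0.5) / n)` is one common form of a rank-to-normal transformation, `orq_interp()` is a hypothetical helper name, and the shifted logit extrapolation for values outside the original domain is left out.

```
set.seed(1)
x <- rgamma(200, shape = 1, rate = 1)   # continuous draws, so ties are not expected

# Map the ranks of x to standard normal quantiles
n <- length(x)
x_t <- qnorm((rank(x) - 0.5) / n)

# New values inside the original range: linear interpolation between
# the fitted (x, x_t) pairs
orq_interp <- function(new_x) approx(x, x_t, xout = new_x)$y

# A value halfway between two neighbouring observations is mapped
# between their transformed values
mid <- mean(sort(x)[c(100, 101)])
orq_interp(mid)
```

Because the transformed values are fixed normal quantiles assigned by rank, their distribution does not depend on the shape of `x`, which is why normality can be guaranteed when there are no ties.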

To evaluate the efficacy of the normalization technique, the
`bestNormalize()` function implements repeated cross-validation to
estimate Pearson's P statistic divided by its degrees of freedom.
This is called the “Normality statistic”, and if it is close to 1 (or
less), then the transformation can be thought of as working well. The
function is designed to select the transformation that produces the
lowest P / df value, when estimated on out-of-sample data (estimating
this on in-sample data will always choose the orderNorm technique, and
is generally not the main goal of these procedures).
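To make the statistic concrete, here is a base-R sketch of a Pearson chi-square normality statistic divided by its degrees of freedom. The helper name `pearson_p_df()`, the binning rule, and the degrees-of-freedom adjustment are assumptions made for illustration; the sketch is computed in-sample and does not reproduce the package's cross-validated estimates.

```
# Hypothetical helper: Pearson P / df using bins that are equiprobable
# under a normal distribution with the sample mean and sd
pearson_p_df <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  n_classes <- ceiling(2 * n^(2 / 5))    # assumed rule for the number of bins
  bin <- floor(1 + n_classes * pnorm(x, mean(x), sd(x)))
  observed <- tabulate(bin, n_classes)   # counts per bin
  expected <- n / n_classes              # equal expected count in every bin
  P <- sum((observed - expected)^2 / expected)
  df <- n_classes - 3                    # bins - 1, minus 2 estimated parameters
  P / df
}

set.seed(100)
stat_gamma  <- pearson_p_df(rgamma(1000, 1, 1))  # skewed data: far above 1
stat_normal <- pearson_p_df(rnorm(1000))         # normal data: near 1
c(gamma = stat_gamma, normal = stat_normal)
```

Under normality, P is approximately chi-square distributed with `df` degrees of freedom, so the ratio is expected to be near 1; this is the reason values close to 1 are read as a good fit.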

## Installation

You can install the most recent (development) version of bestNormalize from GitHub with:

```
# install.packages("devtools")
devtools::install_github("petersonR/bestNormalize")
```

Or, you can download it from CRAN with:

```
install.packages("bestNormalize")
```

## Example

In this example, we generate 1000 draws from a gamma distribution, and normalize them:

```
library(bestNormalize)
```

```
set.seed(100)
x <- rgamma(1000, 1, 1)
# Estimate best transformation with repeated cross-validation
BN_obj <- bestNormalize(x, allow_lambert_s = TRUE)
BN_obj
#> Best Normalizing transformation with 1000 Observations
#> Estimated Normality Statistics (Pearson P / df, lower => more normal):
#> - No transform: 6.966
#> - Box-Cox: 1.1176
#> - Lambert's W (type s): 1.1004
#> - Log_b(x+a): 2.0489
#> - sqrt(x+a): 1.6444
#> - exp(x): 50.7939
#> - arcsinh(x): 3.6245
#> - Yeo-Johnson: 1.933
#> - orderNorm: 1.2694
#> Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
#>
#> Based off these, bestNormalize chose:
#> Standardized Lambert WxF Transformation of type s with 1000 nonmissing obs.:
#> Estimated statistics:
#> - gamma = 0.4129
#> - mean (before standardization) = 0.667563
#> - sd (before standardization) = 0.7488649
# Perform transformation
gx <- predict(BN_obj)
# Perform reverse transformation
x2 <- predict(BN_obj, newdata = gx, inverse = TRUE)
# Prove the transformation is 1:1
all.equal(x2, x)
#> [1] TRUE
```
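The fitted object can also be applied to values that were not part of the original vector. The following is a small sketch using the same `predict()` interface as above; the values in `new_x` are made up for illustration, and the default options are used so that only the transformations enabled by default are considered.

```
library(bestNormalize)

set.seed(100)
x <- rgamma(1000, 1, 1)
BN_obj <- bestNormalize(x)

# Transform values that were not in the fitting data
new_x <- c(0.5, 1, 2)
new_gx <- predict(BN_obj, newdata = new_x)

# Map them back to the original scale
back <- predict(BN_obj, newdata = new_gx, inverse = TRUE)
all.equal(back, new_x, tolerance = 1e-6)
```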

As of version 1.3, the package supports leave-one-out cross-validation
as well. ORQ normalization works very well when the test dataset is
small relative to the training dataset, so it will often be selected
via leave-one-out cross-validation (which is why we set
`allow_orderNorm = FALSE` here).

```
(BN_loo <- bestNormalize(x, allow_orderNorm = FALSE, allow_lambert_s = TRUE, loo = TRUE))
#> Note: passing a cluster (?makeCluster) to bestNormalize can speed up CV process
#> Best Normalizing transformation with 1000 Observations
#> Estimated Normality Statistics (Pearson P / df, lower => more normal):
#> - No transform: 26.624
#> - Box-Cox: 0.8077
#> - Lambert's W (type s): 1.269
#> - Log_b(x+a): 4.5374
#> - sqrt(x+a): 3.3655
#> - exp(x): 451.435
#> - arcsinh(x): 14.0712
#> - Yeo-Johnson: 5.7997
#> Estimation method: Out-of-sample via leave-one-out CV
#>
#> Based off these, bestNormalize chose:
#> Standardized Box Cox Transformation with 1000 nonmissing obs.:
#> Estimated statistics:
#> - lambda = 0.2739638
#> - mean (before standardization) = -0.3870903
#> - sd (before standardization) = 1.045498
```
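The note in the output above mentions passing a cluster to speed up cross-validation. A minimal sketch follows, assuming the argument is named `cluster` and accepts an object created by `parallel::makeCluster()`; the worker count and the smaller sample size are arbitrary choices to keep the run short.

```
library(bestNormalize)
library(parallel)

set.seed(100)
x <- rgamma(250, 1, 1)

cl <- makeCluster(2)                      # two local worker processes
BN_par <- bestNormalize(x, cluster = cl)  # `cluster` argument name is an assumption
stopCluster(cl)                           # always release the workers

BN_par
```

Whether this is faster than the serial run depends on the sample size and on the number of folds and repeats, since starting workers has its own overhead.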

It is also possible to visualize these transformations:

```
plot(BN_obj, leg_loc = "bottomright")
```

For a more in-depth tutorial, please consult the package vignette.

## Functions in bestNormalize

| Name | Description |
| --- | --- |
| binarize | Binarize |
| boxcox | Box-Cox Normalization |
| sqrt_x | sqrt(x + a) Normalization |
| bestNormalize | Calculate and perform best normalizing transformation |
| yeojohnson | Yeo-Johnson Normalization |
| autotrader | Prices of 6,283 cars listed on Autotrader |
| no_transform | Identity transformation |
| bestNormalize-package | bestNormalize: Flexibly calculate the best normalizing transformation for a vector |
| exp_x | exp(x) Transformation |
| log_x | Log(x + a) Transformation |
| arcsinh_x | arcsinh(x) Transformation |
| lambert | Lambert W x F Normalization |
| plot.bestNormalize | Transformation plotting |
| orderNorm | Calculate and perform Ordered Quantile normalizing transformation |

## Vignettes of bestNormalize

| Name |
| --- |
| bestNormalize.Rmd |
| parallel_timings.jpg |

## Details

| Field | Value |
| --- | --- |
| Type | Package |
| Date | 2019-08-20 |
| URL | https://github.com/petersonR/bestNormalize |
| License | GPL-3 |
| VignetteBuilder | knitr |
| LazyData | true |
| RoxygenNote | 6.1.1 |
| Encoding | UTF-8 |
| NeedsCompilation | no |
| Packaged | 2019-08-20 19:31:10 UTC; ryanpeterson |
| Repository | CRAN |
| Date/Publication | 2019-08-20 21:20:13 UTC |
