bestNormalize: Calculate and perform best normalizing transformation

Description

Performs a suite of normalizing transformations, and selects the best one on the basis of the Pearson P test statistic for normality. The transformation that has the lowest P (calculated on the transformed data) is selected. See details for more information.
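The selection criterion can be illustrated in base R. The sketch below computes a Pearson chi-square statistic for normality divided by its degrees of freedom. The helper name pearson_p_df is hypothetical, and the class count and degrees-of-freedom convention are assumptions about a common construction, not taken from the package source.

```r
# Hypothetical helper: Pearson chi-square normality statistic divided by its
# degrees of freedom. The number of classes and the df adjustment follow a
# common convention (an assumption, not confirmed by this help page).
pearson_p_df <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  n_classes <- ceiling(2 * n^(2 / 5))
  # Assign each value to one of n_classes bins that are equiprobable under a
  # normal distribution with the sample mean and standard deviation
  bin <- floor(1 + n_classes * pnorm(x, mean(x), sd(x)))
  bin <- pmin(bin, n_classes)  # guard in case pnorm() returns exactly 1
  observed <- tabulate(bin, nbins = n_classes)
  expected <- n / n_classes
  P <- sum((observed - expected)^2 / expected)
  df <- n_classes - 3  # classes minus two estimated parameters minus one
  P / df
}

set.seed(1)
skewed <- rgamma(200, 1, 1)
# Lower values indicate a closer fit to normality
c(raw = pearson_p_df(skewed), logged = pearson_p_df(log(skewed)))
```

Under this criterion, the candidate transformation with the smallest value would be the one selected.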

Usage

bestNormalize(x, standardize = TRUE, allow_orderNorm = TRUE,
  allow_lambert_s = FALSE, allow_lambert_h = FALSE,
  out_of_sample = TRUE, cluster = NULL, k = 10, r = 5,
  loo = FALSE, warn = TRUE, quiet = FALSE)
# S3 method for bestNormalize
predict(object, newdata = NULL,
  inverse = FALSE, ...)
# S3 method for bestNormalize
print(x, ...)

Arguments

x

A vector to normalize

standardize

If TRUE, the transformed values are also centered and scaled, such that the transformation attempts a standard normal. This will not change the normality statistic.

allow_orderNorm

Set to FALSE if orderNorm should not be applied

allow_lambert_s

Set to TRUE if the lambertW of type "s" should be applied (see details)

allow_lambert_h

Set to TRUE if the lambertW of type "h" should be applied (see details)

out_of_sample

If FALSE, quickly estimates in-sample performance instead of out-of-sample performance

cluster

name of cluster set using makeCluster

k

number of folds

r

number of repeats

loo

should leave-one-out CV be used instead of repeated CV? (see details)

warn

Should bestNormalize warn when a method doesn't work?

quiet

Should the progress bar for cross-validation be suppressed?

object

an object of class 'bestNormalize'

newdata

a vector of data to be (reverse) transformed

inverse

if TRUE, performs reverse transformation

...

additional arguments

Value

A list of class bestNormalize with elements

x.t

transformed original data

x

original data

norm_stats

Pearson's P / degrees of freedom

method

out-of-sample or in-sample, number of folds + repeats

chosen_transform

the chosen transformation (of appropriate class)

other_transforms

the other transformations (of appropriate class)

oos_preds

Out-of-sample predictions (if loo == TRUE) or normalization stats

The predict function returns the numeric value of the transformation performed on new data, and allows for the inverse transformation as well.

Details

bestNormalize estimates the optimal normalizing transformation. This transformation can be performed on new data, and inverted, via the predict function.

This function currently estimates the Yeo-Johnson transformation, the Box-Cox transformation (if the data are positive), the log_10(x+a) transformation, the square-root (x+a) transformation, and the arcsinh transformation. The shift a is set to max(0, -min(x) + eps) by default. If allow_orderNorm == TRUE and out_of_sample == FALSE, then the ordered quantile normalization technique will likely be chosen, since it essentially forces the data to follow a normal distribution. More information on the orderNorm technique can be found in the package vignette, or using ?orderNorm.
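As a small illustration of the default shift, the base R sketch below computes a = max(0, -min(x) + eps) and applies the shifted transformations. The helper name shift_for and the value of eps are assumptions made for the example; this help page does not state the value of eps.

```r
# Hypothetical helper; eps = 0.001 is an assumed value for illustration only
shift_for <- function(x, eps = 0.001) max(0, -min(x) + eps)

x_neg <- c(-2.5, -1, 0, 3, 10)
a <- shift_for(x_neg)   # -min(x) + eps, so every x + a is strictly positive

log10(x_neg + a)        # log_10(x + a)
sqrt(x_neg + a)         # square-root (x + a)
asinh(x_neg)            # arcsinh is defined for all real values

# For strictly positive data (minimum above eps) the shift is zero
x_pos <- c(0.5, 1, 4)
shift_for(x_pos)
```

The shift only moves the data far enough for the logarithm and square root to be defined, and it leaves data that are already strictly positive (with a minimum above eps) unchanged.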

Repeated cross-validation is used by default to estimate the out-of-sample performance of each transformation if out_of_sample = TRUE. While this can take some time, users can speed it up by creating a cluster via the parallel package's makeCluster function, and passing the name of this cluster to bestNormalize via the cluster argument. For best performance, we recommend setting the number of cluster nodes to the number of repeats r. Care should be taken to account for the number of observations per fold; with too small a number, the estimated normality statistic could be inaccurate, or at least suffer from high variability.
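The idea of repeated cross-validation for a single candidate transformation can be sketched in base R as follows. This is not the package's implementation: the helper names (pearson_p_df, fit_log10, repeated_cv_stat), the value of eps, the binning convention, and the use of a plain mean to aggregate over folds and repeats are all assumptions made for illustration.

```r
# Hypothetical helper: Pearson P / df under an assumed binning convention
pearson_p_df <- function(x) {
  n <- length(x)
  n_classes <- ceiling(2 * n^(2 / 5))
  bin <- pmin(floor(1 + n_classes * pnorm(x, mean(x), sd(x))), n_classes)
  observed <- tabulate(bin, nbins = n_classes)
  expected <- n / n_classes
  sum((observed - expected)^2 / expected) / (n_classes - 3)
}

# Hypothetical "fit" step: estimate the shift on the training data and
# return a function that applies log_10(x + a) to new data
fit_log10 <- function(train, eps = 0.001) {
  a <- max(0, -min(train) + eps)
  function(newdata) log10(newdata + a)
}

# Repeated k-fold CV: fit on k - 1 folds, evaluate normality on the held-out fold
repeated_cv_stat <- function(x, fit, k = 10, r = 5) {
  stats <- matrix(NA_real_, nrow = r, ncol = k)
  for (i in seq_len(r)) {
    fold <- sample(rep(seq_len(k), length.out = length(x)))  # random fold labels
    for (f in seq_len(k)) {
      apply_fit <- fit(x[fold != f])
      stats[i, f] <- pearson_p_df(apply_fit(x[fold == f]))
    }
  }
  mean(stats)  # assumed aggregation: average over all folds and repeats
}

set.seed(42)
x <- rgamma(100, 1, 1)
repeated_cv_stat(x, fit_log10, k = 10, r = 5)
```

With 100 observations and k = 10, each held-out fold contains only 10 values, so the statistic rests on very few bins. This is the high-variability situation that the caution about observations per fold refers to.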

As of version 1.3, users can also use leave-one-out cross-validation for each method by setting loo to TRUE. This will take a lot of time for bigger vectors, but it gives the most accurate estimate of normalization efficacy. Note that if this method is selected, the arguments k and r are ignored. This method will still work in parallel with the cluster argument.

NOTE: Only the Lambert technique of type = "s" (skew) ensures that the transformation is consistently 1-1. Types "h" and "hh" do not guarantee a 1-1 transformation, so use them with that risk in mind; they are effective when the data have exceptionally heavy tails, e.g. the Cauchy distribution. Additionally, as of v. 1.2.0, Lambert of type "s" is not used by default in bestNormalize() since it uses multiple threads on some Linux systems, which is not allowed on CRAN checks. Set allow_lambert_s = TRUE in order to test this transformation as well. Lambert of type "h" can also be tested by setting allow_lambert_h = TRUE; however, this can take significantly longer to run.

Examples

library(bestNormalize)

# Simulate right-skewed data
x <- rgamma(100, 1, 1)

# With repeated CV
BN_obj <- bestNormalize(x)
BN_obj
p <- predict(BN_obj)
x2 <- predict(BN_obj, newdata = p, inverse = TRUE)

# The inverse transformation should recover the original data
all.equal(x2, x)

# With leave-one-out CV
BN_obj <- bestNormalize(x, loo = TRUE)
BN_obj
p <- predict(BN_obj)
x2 <- predict(BN_obj, newdata = p, inverse = TRUE)

all.equal(x2, x)

# Without CV
BN_obj <- bestNormalize(x, allow_orderNorm = FALSE, out_of_sample = FALSE)
BN_obj
p <- predict(BN_obj)
x2 <- predict(BN_obj, newdata = p, inverse = TRUE)

all.equal(x2, x)