nroPreprocess: Data cleaning and standardization

Description

Convert data to numerical values, remove unusable rows and columns, and standardize the scale of each variable.

Usage

nroPreprocess(data, method = "standard", clip = 3.0, resolution = 100)

Arguments

data

A matrix or a data frame.

method

Method for standardizing scale and location; see Details.

clip

Range for clipping extreme values in multiples of standard deviations.

resolution

Maximum number of sampling points to capture distribution shape.

Value

A matrix (or data frame) of numerical values. A value mapping model is stored in the attribute "mapping". The names of binary columns are stored in the attribute "binary".

Details

Standardization methods include an empty string for no action, "standard" for centering by the mean and dividing by the standard deviation, "uniform" for normalized ranks between -1 and 1, and "tapered" for a version of the rank-based method that puts more samples around zero.
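As a rough illustration of the "standard" and "uniform" ideas, the base R sketch below applies both to a single numeric vector. The helper names standardize_sketch and uniform_sketch are hypothetical and are not part of Numero. The exact rank normalization used by the package is not stated here, so the linear mapping of ranks onto [-1, 1] is an assumption. The "tapered" variant is left out because its formula is not given.

```r
# Hypothetical helpers for illustration only; not Numero's implementation.

# "standard": center by the mean and divide by the standard deviation.
standardize_sketch <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# "uniform" (assumed form): map ranks linearly onto [-1, 1].
# Needs at least two usable values; missing values stay missing.
uniform_sketch <- function(x) {
  r <- rank(x, na.last = "keep")
  n <- sum(!is.na(x))
  2 * (r - 1) / (n - 1) - 1
}

x <- c(2.1, 3.5, 1.0, 10.2, 4.4)
z <- standardize_sketch(x)
u <- uniform_sketch(x)
print(round(z, 3))
print(u)
```

The rank-based output depends only on the ordering of the values, so the outlier 10.2 lands at 1 no matter how far it sits from the rest.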

Clipping is not applied if the method is rank-based ("uniform" or "tapered").
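A minimal sketch of what clipping in multiples of standard deviations could look like, assuming the values are already expressed in standard deviation units. The helper clip_sketch is hypothetical; whether Numero truncates values in exactly this way is an assumption, not something this page states.

```r
# Hypothetical helper for illustration only; not Numero's implementation.
# Limits values to the interval [-clip, clip].
clip_sketch <- function(z, clip = 3.0) {
  pmin(pmax(z, -clip), clip)
}

z_raw <- c(-4.2, -1.0, 0.3, 2.9, 5.7)
z_clipped <- clip_sketch(z_raw)
print(z_clipped)
```

Rank-based output is already bounded between -1 and 1, which fits with clipping being skipped for those methods.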

Examples

# Import data.
fname <- system.file("extdata", "finndiane.txt", package = "Numero")
dataset <- read.delim(file = fname)

# Show original data characteristics.
print(summary(dataset))

# Detect binary columns.
ds <- nroPreprocess(dataset, method = "")
print(attr(ds, "binary"))

# Centering and scaling cholesterol.
ds <- nroPreprocess(dataset$CHOL)
print(summary(ds))

# Centering and scaling.
ds <- nroPreprocess(dataset)
print(summary(ds))

# Tapered ranks.
ds <- nroPreprocess(dataset, method = "tapered")
print(summary(ds))