prepareSet: Preparation pipeline

Description

Full pipeline for preparing your data set (dataSet).

Usage

prepareSet(dataSet, finalForm = "data.table", verbose = TRUE, ...)

Arguments

dataSet

Matrix, data.frame or data.table

finalForm

"data.table" or "numerical_matrix" (defaults to "data.table")

verbose

Should the function print progress messages? (logical, defaults to TRUE)

...

Additional parameters to tune the pipeline (see Details)

Value

A data.table or a numerical matrix (according to finalForm). The function performs the following steps:

Correct set: unfactor factors with many values, identify dates and numerics that are hidden in character columns
Transform set: compute differences between every date, transform dates into factors, generate features from character columns, and so on. If key is provided, aggregate according to this key
Filter set: filter out constant, duplicated ("in double") or bijection variables. If `digits` is provided, round numeric columns
Handle NA: perform fastHandleNa
Shape set: put the result in the requested shape (finalForm) with acceptable column formats.
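
To make the "Filter set" step more concrete, here is a minimal base-R sketch of the idea of dropping constant and duplicated columns. It is only an illustration under assumptions: the toy data and variable names are invented, bijection detection is left out, and this is not the code that prepareSet itself runs.

```r
# Illustrative only: a base-R stand-in for the "Filter set" idea,
# not the implementation used by prepareSet.
toy <- data.frame(
  id       = 1:4,
  constant = rep("a", 4),              # same value on every row
  height   = c(1.5, 1.7, 1.6, 1.8),
  height2  = c(1.5, 1.7, 1.6, 1.8),    # exact copy of height
  stringsAsFactors = FALSE
)

# Columns holding at most one distinct value
is_constant <- vapply(toy, function(col) length(unique(col)) <= 1, logical(1))

# Columns whose content repeats an earlier column
is_double <- duplicated(as.list(toy))

# Keep only columns that are neither constant nor duplicated
filtered <- toy[, !(is_constant | is_double), drop = FALSE]
print(names(filtered))
```

On real data the package's own filtering helpers are the better choice; the sketch only shows what "constant" and "in double" mean for a column.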

Details

Additional arguments are available to tune the pipeline:

key Name of a column of dataSet according to which dataSet should be aggregated (character)
analysisDate A date at which the dataSet should be aggregated (differences between every date and analysisDate will be computed) (Date)
n_unfactor Maximum number of values in a factor; set it to -1 to disable the unFactor function. (numeric, default to 53)
digits The number of digits after the decimal point (optional, numeric; if set, fastRound is performed)
dateFormats List of date formats used in dataSet (list of characters)
name_separator Character used to separate parts of new column names (character, default to ".")
functions Aggregation functions for numeric columns, see aggregateByKey (list of functions names (character))
factor_date_type Aggregation level to factorize date (see generateFactorFromDate) (character, default to "yearmonth")
target_col A target column to perform target encoding, see target_encode (character)
target_encoding_functions Functions to perform target encoding, see build_target_encoding; if target_col is not given, nothing is done (list, default to "mean")
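
As a rough illustration of what analysisDate, factor_date_type, key and functions describe, the base-R sketch below computes date differences against a reference date, reduces a date to a year-month factor, and aggregates a numeric column by key using functions given by name. Everything here is an assumption made for illustration: the toy data, the new column names, the unit (days) and the sign of the differences may not match what the package produces.

```r
# Illustrative base-R sketch; column names, units and signs are assumptions,
# not the package's actual output.
toy <- data.frame(
  country = c("FR", "FR", "US"),
  signup  = as.Date(c("2016-12-25", "2016-01-01", "2016-12-31")),
  amount  = c(10, 20, 5),
  stringsAsFactors = FALSE
)
analysis_date <- as.Date("2017-01-01")

# Difference between the reference date and each date, in days
toy$signup.diff <- as.numeric(analysis_date - toy$signup)

# Date reduced to a year-month factor
toy$signup.yearmonth <- factor(format(toy$signup, "%Y-%m"))

# Aggregate a numeric column by key, with functions passed by name
power <- function(x) sum(x^2)
function_names <- c("min", "max", "power")
agg <- sapply(function_names, function(f) {
  tapply(toy$amount, toy$country, get(f, mode = "function"))
})
print(agg)
```

Passing function names as character strings is why a user-defined function such as power has to exist under that name before the aggregation is run.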

Examples

# Load ugly set
data(messy_adult)

# Have a look at the set
head(messy_adult)

# Compute full pipeline
clean_adult <- prepareSet(messy_adult)

# With a reference date
adult_agg <- prepareSet(messy_adult, analysisDate = as.Date("2017-01-01"))

# Add aggregation by country
adult_agg <- prepareSet(messy_adult, analysisDate = as.Date("2017-01-01"), key = "country")

# With some new aggregation functions
power <- function(x){sum(x^2)}
adult_agg <- prepareSet(messy_adult, analysisDate = as.Date("2017-01-01"), key = "country",
                        functions = c("min", "max", "mean", "power"))

# "##NOT RUN:" means that this example hasn't been run on CRAN since it is long. But you can run it!