
Full pipeline for preparing your data set.
prepareSet(dataSet, finalForm = "data.table", verbose = TRUE, ...)
dataSet
Matrix, data.frame or data.table
finalForm
"data.table" or "numerical_matrix" (defaults to "data.table")
verbose
Should the algorithm talk? (logical, defaults to TRUE)
...
Additional parameters to tune the pipeline (see details)
A data.table or a numerical matrix (according to finalForm).
It will perform the following steps:
Correct set: unfactor factors with many values, identify dates and numerics that are hidden in character columns
Transform set: compute differences between every date, transform dates into factors, generate features from character...; if key is provided, aggregate according to this key
Filter set: filter out constant, duplicated, or bijection variables. If `digits` is provided, numeric values are rounded
Handle NA: will perform fastHandleNa
Shape set: put the result in the requested shape (finalForm) with acceptable column formats.
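To make the filter step concrete, here is a minimal base R sketch of what dropping constant, duplicated, and bijection columns means. It runs on a toy data.frame and is only an illustration, not the package's implementation; the helper is_bijection_of and all column names are hypothetical.

```r
# Toy data: one constant column, one duplicated column, one bijection.
df <- data.frame(
  id       = 1:4,
  constant = rep(1, 4),
  letter   = c("a", "b", "a", "c"),
  letter2  = c("a", "b", "a", "c"),  # exact copy of `letter`
  code     = c("x", "y", "x", "z"),  # one-to-one mapping with `letter`
  stringsAsFactors = FALSE
)

# A constant column holds a single distinct value.
is_constant <- vapply(df, function(col) length(unique(col)) <= 1, logical(1))

# A duplicated column repeats an earlier column exactly.
is_duplicate <- duplicated(as.list(df))
names(is_duplicate) <- names(df)

# Hypothetical helper: two columns are in bijection when every value of one
# maps to exactly one value of the other, and vice versa.
is_bijection_of <- function(a, b) {
  n_pairs <- nrow(unique(data.frame(a, b)))
  n_pairs == length(unique(a)) && n_pairs == length(unique(b))
}

to_drop <- names(df)[is_constant | is_duplicate]
if (is_bijection_of(df$letter, df$code)) {
  to_drop <- c(to_drop, "code")
}

filtered <- df[, setdiff(names(df), to_drop), drop = FALSE]
names(filtered)  # "id" "letter"
```

Such columns carry no extra information for most models, which is presumably why the pipeline removes them before shaping the result.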
Additional arguments are available to tune pipeline:
key
Name of a column of dataSet according to which dataSet should be aggregated (character)
analysisDate
A date at which the dataSet should be aggregated (differences between every date and analysisDate will be computed) (Date)
n_unfactor
Maximum number of values in a factor; set it to -1 to disable the unFactor function. (numeric, defaults to 53)
digits
The number of digits after the decimal separator (optional, numeric; if set, fastRound will be performed)
dateFormats
List of date formats used in dataSet (list of characters)
name_separator
Character used to separate parts of new column names (character, defaults to ".")
functions
Aggregation functions for numeric columns, see aggregateByKey (list of function names (character))
factor_date_type
Aggregation level used to factorize dates (see generateFactorFromDate) (character, defaults to "yearmonth")
target_col
A target column used to perform target encoding, see target_encode (character)
target_encoding_functions
Functions used to perform target encoding, see build_target_encoding; if target_col is not given, nothing is done (list, defaults to "mean")
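As a rough illustration of what target encoding with "mean" refers to, the base R sketch below replaces each level of a categorical column with the mean of the target over that level. It is an assumption-laden toy example, not the output of target_encode or build_target_encoding; the data and the new column name are hypothetical.

```r
# Toy data: a categorical column and a binary target.
df <- data.frame(
  country = c("FR", "FR", "US", "US", "US"),
  target  = c(1, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Mean of the target for each level of `country`.
level_means <- tapply(df$target, df$country, mean)

# Map every row to the mean of its level (hypothetical column name).
df$target_mean_by_country <- as.numeric(level_means[as.character(df$country)])

df
```

Here every "FR" row receives 0.5 and every "US" row receives 2/3, so the categorical column becomes a numeric feature that summarises the target per level.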
library(dataPreparation)

# Load the messy example set
data(messy_adult)
# Have a look at the set
head(messy_adult)
# Compute the full pipeline
clean_adult <- prepareSet(messy_adult)
# With a reference date
adult_agg <- prepareSet(messy_adult, analysisDate = as.Date("2017-01-01"))
# Add aggregation by country
adult_agg <- prepareSet(messy_adult, analysisDate = as.Date("2017-01-01"), key = "country")
# With some new aggregation functions
power <- function(x) {sum(x^2)}
adult_agg <- prepareSet(messy_adult, analysisDate = as.Date("2017-01-01"), key = "country",
                        functions = c("min", "max", "mean", "power"))
# Note: these examples are not run on CRAN because they take a long time, but you can run them yourself.