dataPreparation

Data preparation accounts for about 80% of the work in a data science project. Let's take that number down. dataPreparation lets you do most of the painful data preparation for a data science project with a minimal amount of code.

This package is:

fast (uses data.table and exponential search)
RAM efficient (performs operations by reference and column-wise to avoid copying data)
stable (most exceptions are handled)
verbose (logs a lot)

Main preparation steps

Before using any machine learning (ML) algorithm, one needs to prepare the data. Preparing a data set for a data science project can be long and tricky. The main steps are as follows:

Read: load the data set (this package doesn't cover this step; for CSV files we recommend data.table::fread)
Correct: most of the time, there are some mistakes after reading (wrong formats, ...) that have to be corrected
Transform: aggregate according to a key, compute differences between dates, ... in order to get information usable by an ML algorithm (i.e. numeric or categorical)
Filter: get rid of useless information in order to speed up computation
Handle NA: replace missing values
Shape: put your data set in a nice shape usable by an ML algorithm
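To make the steps above concrete, here is a minimal sketch in base R on a tiny made-up data frame. It deliberately does not use this package's functions: the column names, the values, and the choice of replacing missing values with 0 are all assumptions made for illustration, and the package's own functions may behave differently.

```r
# Hypothetical toy data: every name and value here is made up for illustration.
raw <- data.frame(
  id         = c(1, 2, 3, 4),
  start_date = c("2017-01-01", "2017-02-15", "2017-03-01", "2017-04-10"),
  end_date   = c("2017-01-31", "2017-03-01", "2017-03-15", "2017-05-10"),
  amount     = c("10.5", "20", NA, "7.25"),
  country    = c("FR", "FR", "FR", "FR"),
  stringsAsFactors = FALSE
)

# Correct: parse columns that were read as character.
clean <- raw
clean$start_date <- as.Date(clean$start_date, format = "%Y-%m-%d")
clean$end_date   <- as.Date(clean$end_date, format = "%Y-%m-%d")
clean$amount     <- as.numeric(clean$amount)

# Transform: turn the two dates into a numeric duration in days,
# then drop the date columns, which an ML algorithm cannot use directly.
clean$duration_days <- as.numeric(clean$end_date - clean$start_date)
clean$start_date <- NULL
clean$end_date   <- NULL

# Filter: drop columns that hold a single distinct value.
is_constant <- vapply(clean, function(col) length(unique(col)) <= 1, logical(1))
clean <- clean[, !is_constant, drop = FALSE]

# Handle NA: replace missing numeric values (0 is an arbitrary choice).
clean$amount[is.na(clean$amount)] <- 0

# Shape: convert to a numeric matrix, a form many ML algorithms accept.
model_input <- as.matrix(clean)
```

Each commented block corresponds to one step of the list; the Read step is skipped because the data is built inline.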

Here are the functions available in this package to tackle those issues:

Correct	Transform	Filter	Handle NA	Shape
findAndTransformDates	diffDates	fastFilterVariables	fastHandleNa	shapeSet
findAndTransformNumerics	aggregateByKey	whichAreConstant		setAsNumericMatrix
setColAsCharacter	setColAsFactorOrLogical	whichAreInDouble
setColAsNumeric		whichAreBijection
setColAsDate		fastRound

All of those functions are integrated in the full pipeline function prepareSet.

For more details on how it works, check out our tutorial.

Getting started: 30 seconds to dataPreparation

Installation

Install the package from CRAN:

install.packages("dataPreparation")

Install the package from GitHub:

# devtools must be installed first: install.packages("devtools")
library(devtools)
install_github("ELToulemonde/dataPreparation")

Test it

Load a toy data set

library(dataPreparation)
data(messy_adult)
head(messy_adult)

Run the full pipeline function

clean_adult <- prepareSet(messy_adult)
head(clean_adult)

That's it. For all functions, you can check out the documentation and/or the tutorial vignette.

How to Contribute

dataPreparation has been developed and used by many active community members. Your help is very valuable to make it better for everyone.

Check out the call for contributions to see what can be improved, or open an issue if you want something.
Contribute new useful features.
Contribute to the tests to make the package more reliable.
Contribute to the documentation to make it clearer for everyone.
Contribute to the examples to share your experience with other users.
Open an issue if you run into problems during development.

For more details, please refer to CONTRIBUTING.

Functions in dataPreparation (0.1)