Note: a newer version (1.1.1) of this package is available.

dataPreparation

Data preparation accounts for about 80% of the work during a data science project. Let's take that number down. dataPreparation lets you do most of the painful data preparation for a data science project with a minimal amount of code.

This package is

  • fast (uses data.table and exponential search)
  • RAM efficient (performs operations by reference and column-wise to avoid copying data)
  • stable (most exceptions are handled)
  • verbose (logs a lot)
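
To give an intuition for the "exponential search" mentioned above, here is a minimal base R sketch of one plausible reading: check a column in chunks of doubling size and stop at the first difference. The function name is_constant_exponential and its first_chunk parameter are hypothetical, and this illustrates the general idea only, not the package's actual implementation.

```r
# Hypothetical helper (not part of dataPreparation): returns TRUE if every
# element of x is the same value. Elements are checked in chunks of doubling
# size (10, 20, 40, ...), so a non-constant column is usually rejected after
# looking at only a few elements.
is_constant_exponential <- function(x, first_chunk = 10) {
  n <- length(x)
  if (n <= 1) {
    return(TRUE)
  }
  start <- 1
  size <- first_chunk
  while (start <= n) {
    end <- min(start + size - 1, n)
    # Compare the chunk with the first element; NA counts as its own value.
    if (length(unique(c(x[1], x[start:end]))) > 1) {
      return(FALSE)
    }
    start <- end + 1
    size <- size * 2
  }
  TRUE
}

is_constant_exponential(rep(3, 1000))        # constant column
is_constant_exponential(c(1, 2, rep(1, 1000)))  # rejected within the first chunk
```

The early exit is what makes this cheap on wide data sets: most columns are not constant and differ within the first few rows.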

Main preparation steps

Before using any machine learning (ML) algorithm, one needs to prepare the data. Preparing a data set for a data science project can be long and tricky. The main steps are as follows:

  • Read: load the data set (this package doesn't cover this step; for csv files we recommend data.table::fread)
  • Correct: most of the time there are some mistakes after reading (wrong formats, for example) that have to be corrected
  • Transform: aggregate according to a key, compute differences between dates, and so on, to obtain information usable by an ML algorithm (that is, numeric or categorical)
  • Filter: get rid of useless information in order to speed up computation
  • Handle NA: replace missing values
  • Shape: put your data set in a shape usable by an ML algorithm
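
The steps above (after reading) can be sketched by hand in base R on a toy data set. This is only an illustration of what each step means: the data, the column names, and the choice to replace missing numbers with 0 are all assumptions made for the example, and the package's own functions are not used here.

```r
# Toy data set with typical problems: dates and numbers read as text,
# a constant column, and missing values. All names are illustrative.
raw <- data.frame(
  id         = c("a", "b", "c", "d"),
  start_date = c("2017-01-01", "2017-02-15", "2017-03-01", "2017-04-10"),
  end_date   = c("2017-01-31", "2017-03-15", NA, "2017-04-20"),
  amount     = c("10.5", "20", NA, "7.25"),
  country    = c("FR", "FR", "FR", "FR"),
  stringsAsFactors = FALSE
)

# Correct: convert text columns to their real types.
clean <- raw
clean$start_date <- as.Date(clean$start_date, format = "%Y-%m-%d")
clean$end_date   <- as.Date(clean$end_date, format = "%Y-%m-%d")
clean$amount     <- as.numeric(clean$amount)

# Transform: turn the two dates into a numeric duration in days.
clean$duration_days <- as.numeric(
  difftime(clean$end_date, clean$start_date, units = "days")
)

# Filter: drop columns that hold a single distinct value.
is_constant <- vapply(clean, function(col) length(unique(col)) <= 1, logical(1))
clean <- clean[, !is_constant, drop = FALSE]

# Handle NA: replace missing numeric values with 0 (one possible policy).
num_cols <- names(clean)[vapply(clean, is.numeric, logical(1))]
for (col in num_cols) {
  clean[[col]][is.na(clean[[col]])] <- 0
}

# Shape: keep the numeric columns as a matrix, a common input for ML algorithms.
model_matrix <- as.matrix(clean[, num_cols, drop = FALSE])
```

Each block in this sketch corresponds to one bullet, which is the amount of hand-written code the package aims to replace.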

Here are the functions available in this package to tackle those issues:

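| Correct                  | Transform               | Filter              | Handle NA    | Shape              |
|--------------------------|-------------------------|---------------------|--------------|--------------------|
| findAndTransformDates    | diffDates               | fastFilterVariables | fastHandleNa | shapeSet           |
| findAndTransformNumerics | aggregateByKey          | whichAreConstant    |              | setAsNumericMatrix |
| setColAsCharacter        | setColAsFactorOrLogical | whichAreInDouble    |              |                    |
| setColAsNumeric          |                         | whichAreBijection   |              |                    |
| setColAsDate             |                         | fastRound           |              |                    |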
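
To make the Filter functions listed above more concrete, the following base R sketch shows the three kinds of redundant columns their names suggest: constant columns, columns that duplicate another column, and columns that are a one-to-one recoding (bijection) of another column. The helpers constant_cols, duplicated_cols, is_bijection and bijection_cols are hypothetical names written for this example; they are not the package's functions and make no claim about how the package implements these checks.

```r
toy <- data.frame(
  x      = c(1, 2, 3, 1),
  x_copy = c(1, 2, 3, 1),          # exact duplicate of x
  x_code = c("a", "b", "c", "a"),  # one-to-one recoding of x
  y      = c(5, 5, 6, 6),
  k      = c(0, 0, 0, 0),          # constant
  stringsAsFactors = FALSE
)

# Names of columns holding a single distinct value.
constant_cols <- function(df) {
  names(df)[vapply(df, function(col) length(unique(col)) <= 1, logical(1))]
}

# Names of columns whose values exactly repeat an earlier column.
duplicated_cols <- function(df) {
  names(df)[duplicated(as.list(df))]
}

# TRUE if each value of a maps to exactly one value of b and vice versa.
is_bijection <- function(a, b) {
  pairs <- unique(data.frame(a = a, b = b, stringsAsFactors = FALSE))
  nrow(pairs) == length(unique(a)) && nrow(pairs) == length(unique(b))
}

# Names of columns that are a bijection of some earlier column.
# Exact duplicates are reported too, since a copy is a trivial bijection.
bijection_cols <- function(df) {
  found <- character(0)
  cols <- names(df)
  for (j in seq_along(cols)[-1]) {
    for (i in seq_len(j - 1)) {
      if (is_bijection(df[[i]], df[[j]])) {
        found <- c(found, cols[j])
        break
      }
    }
  }
  found
}

constant_cols(toy)
duplicated_cols(toy)
bijection_cols(toy)
```

Dropping such columns removes no information from the data set, which is why filtering them is a safe way to speed up later computation.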

All of those functions are integrated in the full pipeline function prepareSet.

For more details on how it works, see our tutorial.

Getting started: 30 seconds to dataPreparation

Installation

Install the package from CRAN:

install.packages("dataPreparation")

Install the package from GitHub:

library(devtools)
install_github("ELToulemonde/dataPreparation")

Test it

Load a toy data set

library(dataPreparation)
data(messy_adult)
head(messy_adult)

Run the full pipeline function

clean_adult <- prepareSet(messy_adult)
head(clean_adult)

That's it. For all functions, you can check out the documentation and/or the tutorial vignette.

How to Contribute

dataPreparation has been developed and used by many active community members. Your help is very valuable to make it better for everyone.

  • Check out the call for contributions to see what can be improved, or open an issue if you want something.
  • Contribute new useful features.
  • Contribute to the tests to make the package more reliable.
  • Contribute to the documentation to make it clearer for everyone.
  • Contribute to the examples to share your experience with other users.
  • Open an issue if you run into problems during development.

For more details, please refer to CONTRIBUTING.

Install: install.packages('dataPreparation')
Monthly downloads: 898
Version: 0.1
License: GPL-3 | file LICENSE
Maintainer: Emmanuel-Lin Toulemonde
Last published: July 7th, 2017

Functions in dataPreparation (0.1)

  • fastFilterVariables: Filtering useless variables
  • fastHandleNa: Handle NA values
  • dateFormatUnifier: Unify dates format
  • diffDates: Date difference
  • messy_adult: adult with some ugly columns added
  • prepareSet: Preparation pipeline
  • adult: adult for UCI repository
  • aggregateByKey: Automatic dataSet aggregation by key
  • setAsNumericMatrix: Numeric matrix preparation for Machine Learning.
  • setColAsCharacter: Set columns as character
  • fastIsEqual: Fast checks of equality
  • fastRound: Fast round
  • setColAsNumeric: Set columns as numeric
  • shapeSet: Final preparation before ML algorithm
  • findAndTransformDates: Identify date columns in a dataSet set
  • findAndTransformNumerics: Identify numeric columns in a dataSet set
  • whichAreBijection: Identify bijections
  • whichAreConstant: Identify constant columns
  • setColAsDate: Set columns as POSIXct
  • setColAsFactorOrLogical: Set columns as factor
  • whichAreInDouble: Identify double columns
  • whichAreIncluded: Identify columns that are included in others