Overview

Tools for assessing data quality, performing exploratory analysis, and semi-automatically preprocessing messy data, with change tracking for comprehensive dataset cleaning.

Installation

# The stable version of clickR can be installed from CRAN with:
install.packages("clickR")

# Alternatively, the development version can be installed from GitHub:
remotes::install_github("David-hervas/clickR")

Usage

clickR functions are divided into three groups:

  1. Functions for assessing data quality and performing exploratory analyses:

peek(), descriptive(), cluster_var(), mine.plot(), outliers(), bivariate_outliers()

  2. Functions for detecting and correcting errors:

nice_names(), fix_factors(), fix_dates(), fix_numerics(), fix_levels(), fix_NA(), fix_concat(), remove_empty()

  3. Functions for reviewing and, potentially, restoring the changes applied to the data:

track_changes(), restore_changes(), manual_fix()

Each function has its own help page, which can be accessed in the standard R way: by typing ?name_of_function in the console or, in RStudio, by clicking on the function name and pressing F1.

A simplified usage procedure would be:

# Explore the data with some of the exploratory analysis functions:
descriptive(mtcars_messy)

# Data frame with 32 observations and 13 variables.

# Numeric variables (7)
#           Min  1st Q. Median 3rd Q.   Max    Mean      SD Kurtosis Skewness Modes NAs                   Distribution
# cyl       4.0   4.000   6.00    8.0   8.0   6.188   1.786   -1.762   -0.175    2    0 [#############:##############]
# disp     71.1 120.825 196.30  326.0 472.0 230.722 123.939   -1.207    0.382    2    0 |---[####:########]----------|
# hp       52.0  96.500 123.00  180.0 335.0 146.688  68.563   -0.136    0.726    1    0 |----[#:#####]---------------|
# qsec     14.5  16.892  17.71   18.9  22.9  17.849   1.787    0.335    0.369    1    0 |-------[##:###]-------------|
# am        0.0   0.000   0.00    1.0   1.0   0.406   0.499   -1.925    0.364    2    0 :############################]
# nº Gears  3.0   3.000   4.00    4.0   5.0   3.688   0.738   -1.070    0.529    3    0 [#############:--------------|
# carb      1.0   2.000   2.00    4.0   8.0   2.812   1.615    1.257    1.051    2    0 |---:#######]----------------|
#
# Categorical variables (6)
#       N. Classes              Classes       Mode Prop. mode  Anti-mode Prop. Anti-mode NAs
# Mpg           28 10.4/15.2/21/30.4/19       10.4      0.062       19.2           0.031   0
# drat          22 3.07/3.92/2.76/3.08/       3.07      0.094          -           0.031   0
# wt            29 3.44/3.57/1.513/1.61       3.44      0.094      1.513           0.031   0
# vs             4           0/1/?/NULL          0        0.5          ?           0.031   0
# date          32 06/25/1974/12/22/73/ 06/25/1974      0.031 06/25/1974           0.031   0
# maker         26 Mrc/Ft/AMC/Cdllc/Cmr       Merc      0.188        AMC           0.031   0
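The categorical part of the summary reports each variable's most and least frequent values and their proportions. As a rough illustration of what those columns mean, here is a base R sketch that computes the same quantities by hand. It is not clickR's implementation: summarise_categories() and the vs_like vector are hypothetical, and "anti-mode" is read as the least repeated value, as the package's function index suggests.

```r
# Hypothetical helper (not a clickR function): summarise a categorical vector.
summarise_categories <- function(x) {
  counts <- table(x)               # frequency of each distinct value
  props  <- counts / sum(counts)   # relative frequencies
  list(
    n_classes      = length(counts),
    mode           = names(counts)[which.max(counts)],  # most repeated value
    prop_mode      = max(props),
    anti_mode      = names(counts)[which.min(counts)],  # least repeated value
    prop_anti_mode = min(props)
  )
}

# Made-up values that mimic a column mixing real codes with missing-value markers
vs_like <- c("0", "1", "0", "?", "0", "1", "NULL", "NULL", "0")
res <- summarise_categories(vs_like)
str(res)
```

A very small anti-mode proportion next to values such as "?" or "NULL" is often a hint that the column contains disguised missing values or typing errors.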

# Use some of the fix functions to correct errors:
mtcars_messy <- fix_NA(mtcars_messy)
mtcars_messy <- fix_dates(mtcars_messy)
mtcars_messy <- fix_numerics(mtcars_messy)

# Review the performed changes
track_changes(mtcars_messy)

# variable         observation  original     new          fun
#      drat          Duster 360              <NA>       fix_NA
#      drat         Honda Civic         -    <NA>       fix_NA
#        vs  Cadillac Fleetwood         ?    <NA>       fix_NA
#        vs       Porsche 914-2      NULL    <NA>       fix_NA
#       Mpg                 all character numeric fix_numerics
#      drat                 all character numeric fix_numerics
#        wt                 all character numeric fix_numerics
#       Mpg          Datsun 710      22,8    22.8 fix_numerics
#       Mpg      Hornet 4 Drive     21.,4    21.4 fix_numerics
#       Mpg          Duster 360  14.3 mpg    14.3 fix_numerics
#       Mpg            Merc 280      19.2    19.2 fix_numerics
#       Mpg           Merc 280C   1.78e01    17.8 fix_numerics
#        wt Lincoln Continental     5,424   5.424 fix_numerics
#        wt   Chrysler Imperial     5,345   5.345 fix_numerics

# If needed, restore the unwanted changes:
mtcars_messy <- restore_changes(track_changes(mtcars_messy, fun == "fix_numerics" & variable == "wt"))
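The fix_numerics entries in the change log hint at the kinds of strings that get normalised: decimal commas, doubled separators, trailing units, and scientific notation. A minimal base R sketch of that sort of coercion follows. messy_to_numeric() is a hypothetical helper written for illustration only; it is not how clickR works internally and it does no change tracking.

```r
# Hypothetical helper (not part of clickR): coerce messy strings to numeric.
messy_to_numeric <- function(x) {
  x <- gsub(",", ".", trimws(x), fixed = TRUE)  # decimal comma -> decimal point
  x <- gsub("\\.+", ".", x)                     # collapse runs of points ("21..4")
  # First number-like token, with an optional exponent
  pattern <- "[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?"
  hit <- regexpr(pattern, x)
  ok  <- !is.na(hit) & hit > -1                 # elements with a match
  out <- rep(NA_real_, length(x))
  out[ok] <- as.numeric(regmatches(x, hit))     # regmatches() returns matches only
  out
}

messy <- c("22,8", "21.,4", "14.3 mpg", "1.78e01", "5,424", "abc")
cleaned <- messy_to_numeric(messy)
cleaned
# expected values: 22.8, 21.4, 14.3, 17.8, 5.424, NA
```

Treating every comma as a decimal separator is an assumption that matches the example data; it would misread thousands separators such as "1,234,567".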

Parallelization of data-cleaning tasks

New versions of clickR provide parallelization for some of the data-cleaning functions via the future package.

library(future)
plan(multisession, workers = 2)  # Set number of workers
fix_dates(mtcars_messy)

This functionality still needs some optimization, so be aware that in some (rare) cases parallelized tasks might take longer than their non-parallelized equivalents.
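Given that caveat, it may be worth timing both modes on your own data before committing to a parallel plan. The sketch below is one way to do that, under two assumptions: both packages are installed, and the mtcars_messy dataset can be loaded with data(). It is guarded so that it does nothing otherwise, and it switches back to a sequential plan at the end so the background workers are released.

```r
# Sketch only: compare sequential and parallel timings for fix_dates().
# Runs only when both packages are installed; otherwise `timings` stays NULL.
timings <- NULL
if (requireNamespace("future", quietly = TRUE) &&
    requireNamespace("clickR", quietly = TRUE)) {
  data("mtcars_messy", package = "clickR")

  # Baseline: no parallelization
  future::plan(future::sequential)
  t_seq <- system.time(clickR::fix_dates(mtcars_messy))[["elapsed"]]

  # Two background R sessions
  future::plan(future::multisession, workers = 2)
  t_par <- system.time(clickR::fix_dates(mtcars_messy))[["elapsed"]]

  future::plan(future::sequential)  # release the background workers
  timings <- c(sequential = t_seq, multisession = t_par)
}
timings
```

On a dataset as small as mtcars_messy, the cost of starting the workers is likely to dominate, so any benefit would be expected mainly on larger data.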

Version: 0.9.45

License: GPL (>= 2)

Maintainer: David Hervas

Last Published: December 5th, 2024

Functions in clickR (0.9.45)

%<NA%: less & NA
fxd: Internal function to fix_dates
remove_empty: remove_empty
restore_changes: Restore changes
nice_names: Nice names
%<=NA%: leq & not NA
outliers: outliers
prop_min: Gets proportion of least repeated value
scale_01: Scales data between 0 and 1
numeros: Brute numeric coercion
%>NA%: greater & NA
unforge: Un-Forge
ttrue: True TRUE
%>=NA%: geq & not NA
v_df_changes: Internal function to track_changes
workspace: Explores global environment workspace
kurtosis: Computes kurtosis
mtapply: Multiple tapply
ipboxplot: Improved boxplot
mtcars_messy: Messy Motor Trend Car Road Tests Dataset
kill.factors: Kill factors
good2go: Good to go
may.numeric: Checks if each value might be numeric
manual_fix: Tracked manual fixes to data
mine.plot: Mine plot
moda: Get mode
moda_cont: Estimates number of modes
prop_may: Gets proportion of most repeated value
peek: Peek
skewness: Computes skewness
search_scripts: Search scripts
workspace_sapply: Applies a function over objects of a specific class
text_date: Internal function for dates with text
track_changes: track_changes
xscores: Estimate sample scores
GK_assoc: Computes Goodman and Kruskal's tau
cluster_var: Clustering of variables
antimoda: Get anti-mode
fix_NA: fix_NA
forge: Forge
bivariate_outliers: Check for bivariate outliers
extreme_values: Extreme values from a numeric vector
fix_factors: Fix factors imported as numerics
check_quality: Checks data quality of a variable
fix_levels: Fix levels
f_replace: Find and replace
%between%: between operator
descriptive: Detailed summary of the data
fix_all: fix_all
fix_concat: fix_concat
fix_dates: Fix dates
nearest: Internal function for descriptive()
fix_numerics: Fix numeric data
%betweenNA%: between operator & not NA