DataExplorer

master v0.8.0

develop v0.8.0.9000


Background

Exploratory Data Analysis (EDA) is the initial and an important phase of data analysis and predictive modeling. During this phase, analysts and modelers take a first look at the data, generate relevant hypotheses, and decide on next steps. However, EDA can be tedious. This R package aims to automate most of the data handling and visualization so that users can focus on studying the data and extracting insights.

Installation

The package can be installed directly from CRAN.

install.packages("DataExplorer")

However, the latest stable version (if any) can be found on GitHub and installed using the devtools package.

if (!require(devtools)) install.packages("devtools")
devtools::install_github("boxuancui/DataExplorer")

To install the latest development version, install from the develop branch.

if (!require(devtools)) install.packages("devtools")
devtools::install_github("boxuancui/DataExplorer", ref = "develop")
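Because CRAN, the GitHub master branch, and the develop branch can each carry a different version, it may help to confirm which one R actually loads. A minimal sketch using `utils::packageVersion()`; the variable names are illustrative, and the four-component check simply mirrors the develop version string shown above (0.8.0.9000).

```r
# Report the installed DataExplorer version.
installed_version <- utils::packageVersion("DataExplorer")
print(installed_version)

# Development builds carry a fourth version component (e.g. 0.8.0.9000).
is_dev_build <- length(unlist(installed_version)) > 3
print(is_dev_build)
```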

Examples

The package is easy to use: almost everything can be done in one line of code. Please refer to the package manual for more information. The package vignettes provide further examples.

Report

To get a report for the airquality dataset:

library(DataExplorer)
create_report(airquality)

To get a report for the diamonds dataset (from the ggplot2 package) with response variable price:

library(ggplot2)
create_report(diamonds, y = "price")
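The function index lists `configure_report` ("Configure report template"), which the examples above do not use. The sketch below shows one way it might be combined with `create_report`. It assumes the `add_plot_qq` and `add_plot_prcomp` switches and the `config` argument as documented for version 0.8.0, so check `?configure_report` for your installed version. Rendering needs rmarkdown and pandoc, so the call is guarded to run only in an interactive session.

```r
library(DataExplorer)

# Build a report configuration that skips the QQ-plot and PCA sections.
# Argument names are assumed from the 0.8.0 manual.
config <- configure_report(
  add_plot_qq = FALSE,
  add_plot_prcomp = FALSE
)

# Render the report with the custom configuration.
# Guarded so that scripted (non-interactive) runs skip the rendering step.
if (interactive()) {
  create_report(airquality, config = config)
}
```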

Visualization

You may also run all the plotting functions individually for your analysis, e.g.,

## View basic description for airquality data
introduce(airquality)
plot_intro(airquality)

## View missing value distribution for airquality data
plot_missing(airquality)

## View distribution of all discrete variables
plot_bar(diamonds)
plot_bar(diamonds, with = "price")

## View distribution of all continuous variables
plot_histogram(diamonds)
plot_density(diamonds)

## View quantile-quantile plot of all continuous variables
plot_qq(diamonds)
plot_qq(diamonds, by = "cut")

## View overall correlation heatmap
plot_correlation(diamonds)

## View bivariate continuous distribution based on `cut`
plot_boxplot(diamonds, by = "cut")

## Scatterplot `price` with all other continuous features
plot_scatterplot(split_columns(diamonds)$continuous, by = "price", sampled_rows = 1000L)

## Visualize principal component analysis
plot_prcomp(diamonds, maxcat = 5L)
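Two of the helpers used above return plain data rather than plots, which can be handy for checking what a plot is based on. A short sketch, assuming the return shapes described in the package manual: `profile_missing` gives one row per column, and `split_columns` gives a list with `discrete` and `continuous` parts. The variable names are illustrative.

```r
library(DataExplorer)
library(ggplot2)  # provides the diamonds dataset

# Tabular counterpart of plot_missing(): one row per column of airquality.
missing_profile <- profile_missing(airquality)
print(missing_profile)

# split_columns() returns a list; the scatterplot example uses `$continuous`.
parts <- split_columns(diamonds)
print(names(parts$discrete))    # non-numeric columns
print(names(parts$continuous))  # numeric columns
```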

Feature Engineering

To make quick updates to your data:

## Group bottom 20% `clarity` by frequency
group_category(diamonds, feature = "clarity", threshold = 0.2, update = TRUE)

## Group bottom 20% `clarity` by `price`
group_category(diamonds, feature = "clarity", threshold = 0.2, measure = "price", update = TRUE)

## Dummify diamonds dataset
dummify(diamonds)
dummify(diamonds, select = "cut")

## Set values for missing observations
df <- data.frame("a" = rnorm(260), "b" = rep(letters, 10))
df[sample.int(260, 50), ] <- NA
set_missing(df, list(0L, "unknown"))

## Update columns
update_columns(airquality, c("Month", "Day"), as.factor)
update_columns(airquality, 1L, function(x) x^2)

## Drop columns
drop_columns(diamonds, 8:10)
drop_columns(diamonds, "clarity")
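The calls above print their result but do not store it. As far as the package manual describes, only data.table inputs are modified in place; for a plain data.frame the functions return a modified copy, so the result has to be assigned. The sketch below illustrates that assumption. The names `filled` and `trimmed` are illustrative, and `set.seed` is added only to make the random placement of missing values reproducible.

```r
library(DataExplorer)

set.seed(1)  # make the random NA placement reproducible
df <- data.frame(a = rnorm(260), b = rep(letters, 10), stringsAsFactors = FALSE)
df[sample.int(260, 50), ] <- NA

# Assign the result: for a data.frame input, `df` itself is left unchanged.
filled <- set_missing(df, list(0L, "unknown"))
print(anyNA(filled))

# The same pattern applies when dropping columns.
trimmed <- drop_columns(airquality, c("Month", "Day"))
print(names(trimmed))
```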

Articles

See the articles wiki page.

Monthly Downloads: 7,250
Version: 0.8.0
License: MIT + file LICENSE
Maintainer: Boxuan Cui
Last Published: March 17th, 2019

Functions in DataExplorer (0.8.0)

configure_report: Configure report template
group_category: Group categories for discrete features
plot_histogram: Plot histogram
plot_boxplot: Create boxplot for continuous features
plotDataExplorer.grid: Plot objects with gridExtra
.lapply: Parallelization
create_report: Create report
.getPageLayout: Calculate page layout index
plot_correlation: Create correlation heatmap for discrete features
plot_intro: Plot introduction
plot_missing: Plot missing value profile
plotDataExplorer.multiple: Plot multiple objects
.ignoreCat: Truncate category
.getAllMissing: Get all missing columns
profile_missing: Profile missing values
plot_prcomp: Visualize principal component analysis
drop_columns: Drop selected variables
plot_qq: Plot QQ plot
set_missing: Set all missing values to indicated value
introduce: Describe basic information
plotDataExplorer.single: Plot single object
plot_bar: Plot bar chart
plotDataExplorer: Default DataExplorer plotting function
split_columns: Split data into discrete and continuous parts
update_columns: Update variable types or values
plot_scatterplot: Create scatterplot for all features
plot_str: Visualize data structure
DataExplorer-package: Data Explorer
dummify: Dummify discrete features to binary columns
plot_density: Plot density estimates