Exploratory Data Analysis (EDA) is the initial and an important phase of data analysis and predictive modeling. During this phase, analysts and modelers take a first look at the data, generate relevant hypotheses, and decide on next steps. However, the EDA process can be a hassle at times. This R package aims to automate most of the data handling and visualization, so that users can focus on studying the data and extracting insights.
The package can be installed directly from CRAN:

    install.packages("DataExplorer")

The latest stable version (if any) can be found on GitHub and installed using:
    if (!require(devtools)) install.packages("devtools")
    devtools::install_github("boxuancui/DataExplorer")
If you would like to install the latest development version, you may install the develop branch:
    if (!require(devtools)) install.packages("devtools")
    devtools::install_github("boxuancui/DataExplorer", ref = "develop")
The package is easy to use: almost everything can be done in one line of code. Please refer to the package manuals for more information. The package vignettes provide further examples.
To get a report for the airquality dataset:

    library(DataExplorer)
    create_report(airquality)
To get a report for the diamonds dataset with response variable price:
    library(ggplot2)
    create_report(diamonds, y = "price")
You may also run all the plotting functions individually for your analysis, e.g.,
    ## View basic description for airquality data
    introduce(airquality)
    plot_intro(airquality)

    ## View missing value distribution for airquality data
    plot_missing(airquality)

    ## View distribution of all discrete variables
    plot_bar(diamonds)
    plot_bar(diamonds, with = "price")

    ## View distribution of all continuous variables
    plot_histogram(diamonds)
    plot_density(diamonds)

    ## View quantile-quantile plot of all continuous variables
    plot_qq(diamonds)
    plot_qq(diamonds, by = "cut")

    ## View overall correlation heatmap
    plot_correlation(diamonds)

    ## View bivariate continuous distribution based on `cut`
    plot_boxplot(diamonds, by = "cut")

    ## Scatterplot `price` with all other continuous features
    plot_scatterplot(split_columns(diamonds)$continuous, by = "price", sampled_rows = 1000L)

    ## Visualize principal component analysis
    plot_prcomp(diamonds, maxcat = 5L)
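Several of the plots above summarize numbers that can also be inspected as plain tables. The following is a minimal sketch, assuming that `introduce()` and `profile_missing()` return data frames with the column names used here (`rows`, `columns`, `total_missing_values`, `complete_rows`, `feature`, `num_missing`) and that `split_columns()` returns a list with a `continuous` element, as the scatterplot example suggests. The variable names `intro`, `missing_profile`, and `parts` are illustrative only.

```r
library(DataExplorer)

# Dataset-level metrics: the table that plot_intro() visualizes
intro <- introduce(airquality)
intro$rows
intro$columns
intro$total_missing_values
intro$complete_rows

# Per-feature missing counts: the table that plot_missing() visualizes
# (column names `feature` and `num_missing` are assumed here)
missing_profile <- profile_missing(airquality)
missing_profile[order(-missing_profile$num_missing), ]

# Separate discrete from continuous features before plotting
parts <- split_columns(airquality)
names(parts$continuous)
```

Working with the tables directly can be handy when the counts need to feed into further code rather than a figure.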
To make quick updates to your data:
    ## Group bottom 20% `clarity` by frequency
    group_category(diamonds, feature = "clarity", threshold = 0.2, update = TRUE)

    ## Group bottom 20% `clarity` by `price`
    group_category(diamonds, feature = "clarity", threshold = 0.2, measure = "price", update = TRUE)

    ## Dummify diamonds dataset
    dummify(diamonds)
    dummify(diamonds, select = "cut")

    ## Set values for missing observations
    df <- data.frame("a" = rnorm(260), "b" = rep(letters, 10))
    df[sample.int(260, 50), ] <- NA
    set_missing(df, list(0L, "unknown"))

    ## Update columns
    update_columns(airquality, c("Month", "Day"), as.factor)
    update_columns(airquality, 1L, function(x) x^2)

    ## Drop columns
    drop_columns(diamonds, 8:10)
    drop_columns(diamonds, "clarity")
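To check that an update did what was intended, the result can be captured and compared with the input. The sketch below repeats the `set_missing()` example and counts the remaining missing values. It assumes that, for a plain data frame, `set_missing()` returns the updated copy rather than modifying `df` in place; the seed and the name `filled` are illustrative choices, not part of the package.

```r
library(DataExplorer)

set.seed(1)  # illustrative seed, only to make the sketch reproducible
df <- data.frame(a = rnorm(260), b = rep(letters, 10))
df[sample.int(260, 50), ] <- NA  # blank out 50 distinct rows

# Continuous columns get 0, discrete columns get "unknown"
# (assumes the updated data frame is returned for data.frame input)
filled <- set_missing(df, list(0L, "unknown"))

# Missing values per column, before and after
colSums(is.na(df))
colSums(is.na(filled))
```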
See the article wiki page.