Note: a newer version (1.3.5) of this package is available.

explore

Simplifies Exploratory Data Analysis.

Why this package?

  • Faster insights with less code for experienced R users. Exploring a fresh new dataset is exciting. Instead of searching for syntax on Stack Overflow, you can focus your attention on finding interesting patterns in your data, using just a handful of easy-to-remember functions. Your code is easy to understand, even for non-R users.

  • Instant success for new R users. It is said that R has a steep learning curve, especially if you come from a GUI for your statistical analysis. Instead of learning a lot of R syntax before you can explore data, the explore package enables you to have instant success. You can start with just one function - explore() - and learn other R syntax later step by step.

How to use it

There are three ways to use the package:

  • Interactive data exploration (univariate, bivariate, multivariate). A target can be defined (binary / categorical / numerical).

  • Generate an Automated Report with one line of code. The target can be binary, categorical or numeric.

  • Manual exploration using an easy-to-remember set of tidy functions. There are basically four "verbs" to remember:

    • explore - if you want to explore a table, a variable or the relationship between a variable and a target (binary, categorical or numeric). The output of these functions is a plot.

    • describe - if you want to describe a dataset or a variable (number of NA values, unique values, ...). The output of these functions is text.

    • explain - to create a simple model that explains a target. explain_tree() for a decision tree, explain_logreg() for a logistic regression.

    • report - to generate an automated report of all variables. A target can be defined (binary, categorical or numeric).
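To make concrete the kind of text summary the describe verb refers to (number of NA values, unique values), here is a minimal base-R sketch. The helper name describe_sketch is hypothetical and is not part of the explore package; the package's own describe() output may contain more columns and a different layout.

```r
# Hypothetical helper (not part of explore): per-variable overview
describe_sketch <- function(data) {
  data.frame(
    variable = names(data),
    type     = vapply(data, function(x) class(x)[1], character(1)),
    na       = vapply(data, function(x) sum(is.na(x)), integer(1)),
    # note: unique() counts NA as its own value
    unique   = vapply(data, function(x) length(unique(x)), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# airquality ships with base R and contains missing values
overview <- describe_sketch(airquality)
print(overview)
```

The result has one row per variable, which is roughly the shape of overview a "describe" step gives you before any plotting.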

The explore package automatically checks whether an attribute is categorical or numerical, chooses the best plot type and handles outliers (autoscaling).
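One way such an automatic check could work is a simple heuristic on the type and the number of distinct values. The sketch below is an assumption for illustration only: the function name guess_type_sketch and the threshold of 10 distinct values are hypothetical and are not taken from the package.

```r
# Hypothetical heuristic (not the package's actual rule):
# treat a variable as numerical only if it is numeric AND has
# more than `max_cat` distinct non-missing values.
guess_type_sketch <- function(x, max_cat = 10) {
  n_distinct <- length(unique(x[!is.na(x)]))
  if (is.numeric(x) && n_distinct > max_cat) "num" else "cat"
}

# Apply the heuristic to every column of iris
types <- vapply(iris, guess_type_sketch, character(1))
print(types)
```

With a rule like this, a 0/1 flag stored as a number is treated as categorical, which is usually what you want when choosing between a bar chart and a density plot.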

You can use {explore} with tidy data (each row is an observation) or with count data (each row is a group of observations with the same attributes, and one variable stores the number of observations). To use count data, you need to add the n parameter (the variable containing the number of observations). Not all functions support count data.
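The difference between the two layouts can be illustrated in base R alone. The sketch below takes the built-in Titanic table as count data and expands it into tidy data. Note that base R's as.data.frame() names the count column Freq, whereas the tibble conversion used in the count-data example of this README names it n. Nothing here calls the explore package.

```r
# Titanic (base R) as count data: one row per group, Freq = group size
counts <- as.data.frame(Titanic)

# Expand to tidy data: repeat each group row Freq times
tidy <- counts[rep(seq_len(nrow(counts)), times = counts$Freq),
               c("Class", "Sex", "Age", "Survived")]
rownames(tidy) <- NULL

# Both representations give the same frequencies per Class
class_from_counts <- xtabs(Freq ~ Class, data = counts)
class_from_tidy   <- table(tidy$Class)
print(class_from_counts)
print(class_from_tidy)
```

Count data is far more compact (one row per group rather than one per observation), which is why a function needs to be told which column holds the weights.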

Installation

CRAN

install.packages("explore")

DEV version (GitHub)

# install from GitHub
if (!require(devtools)) install.packages("devtools")
devtools::install_github("rolkra/explore")

If you are behind a firewall, you may want to:

  • Download and unzip the explore package
  • Then install it with devtools::install_local

# install local (replace the placeholder with the path of the unzipped package)
if (!require(devtools)) install.packages("devtools")
devtools::install_local(path = "<path of local package>", force = TRUE)

Examples

Interactive data exploration

An example of how to use the explore package to explore the iris dataset:

# load package
library(explore)

# explore interactive
explore(iris)

[Screenshots of the interactive app: Explore variables; Explore variables with target; Explain target (Decision Tree)]

Automated Report

Create a report by clicking the "report all" button or by using the report() function. If no target is defined, the report shows all variables. If a target is defined, the report shows the relation between all variables and the target.

Report of all variables

iris |> report(output_dir = tempdir())

To create a report that shows all variables in relation to a target, just add the target parameter:

iris |> report(output_dir = tempdir(), target = Species)

To create a report with a binary target, you can use the parameter targetpct = TRUE (or split = FALSE):

# define a target (is Species versicolor?)
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$Species <- NULL

# create report
iris |> report(output_dir = tempdir(),
                target = is_versicolor,
                targetpct = TRUE)
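For a binary 0/1 target, a natural summary is the share of observations with target = 1 within each value range of a variable. The following base-R sketch computes such percentages by hand. It only illustrates the idea and makes no claim about how report() or the targetpct parameter are implemented; the choice of four equal-width bins is an arbitrary assumption.

```r
# Work on a fresh copy of iris so the example is self-contained
df <- datasets::iris
df$is_versicolor <- ifelse(df$Species == "versicolor", 1, 0)

# Cut Petal.Length into 4 equal-width bins (arbitrary choice)
df$bin <- cut(df$Petal.Length, breaks = 4)

# Share of target = 1 per bin, in percent
target_pct <- tapply(df$is_versicolor, df$bin, mean) * 100
print(round(target_pct, 1))
```

Bins whose percentage lies far from the overall rate (one third of the iris rows are versicolor) point to value ranges that separate the target well.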

Manual exploration

An example of how to use the functions of the explore package to explore tidy data (each row is an observation) like the iris dataset:

# load packages
library(explore)

# use iris dataset
data(iris)

# explore Species
iris |> explore(Species)

# explore Sepal.Length
iris |> explore(Sepal.Length)

# define a target (is Species versicolor?)
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)

# explore relationship between Sepal.Length and the target
iris |> explore(Sepal.Length, target = is_versicolor)

# explore relationship between all variables and the target
iris |> explore_all(target = is_versicolor)

# explore correlation between Sepal.Length and Petal.Length
iris |> explore(Sepal.Length, Petal.Length)

# explore correlation between Sepal.Length, Petal.Length and a target
iris |> explore(Sepal.Length, Petal.Length, target = is_versicolor)

# describe dataset
describe(iris)

# describe Species
iris |> describe(Species)

# explain target using a decision tree
iris$Species <- NULL
iris |> explain_tree(target = is_versicolor)

# explain target using a logistic regression
iris |> explain_logreg(target = is_versicolor)
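As background for explain_logreg(), the sketch below fits a logistic regression with base R's glm() and then simplifies it by AIC-based stepwise selection. The function list of this package describes explain_logreg() as using glm with MASS::stepAIC; here stats::step() is swapped in as a stand-in so that only base R is needed, and the result is not guaranteed to match the package's output.

```r
# Fresh copy of iris with a binary 0/1 target
df <- datasets::iris
df$is_versicolor <- ifelse(df$Species == "versicolor", 1, 0)
df$Species <- NULL

# Full logistic regression on all remaining variables
full_model <- glm(is_versicolor ~ ., data = df, family = binomial())

# Stepwise selection by AIC (stats::step as a stand-in for MASS::stepAIC)
step_model <- step(full_model, trace = 0)

print(summary(step_model))
```

Stepwise selection only drops or adds a term when that lowers the AIC, so the selected model never has a higher AIC than the full model it started from.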

An example of how to use the functions of the explore package to explore count data (each row is a group of observations):

# load packages
library(tibble)
library(explore)

# use titanic dataset
# n = number of observations
titanic <- as_tibble(Titanic)

# describe data
describe(titanic)

# describe Class
titanic |> describe(Class, n = n)

# explore Class
titanic |> explore(Class, n = n)

# explore relationship between Class and the target
titanic |> explore(Class, n = n, target = Survived)

# explore relationship between all variables and the target
titanic |> explore_all(n = n, target = Survived)

# explain target using a decision tree
titanic |> explain_tree(n = n, target = Survived)

Some other useful functions:

# create dataset and explore it
data <- create_data_app(obs = 1000)
explore(data)

data <- create_data_buy(obs = 1000)
explore(data)

data <- create_data_churn(obs = 1000)
explore(data)

data <- create_data_person(obs = 1000)
explore(data)

data <- create_data_unfair(obs = 1000)
explore(data)

# create random dataset with 100 observations and 5 random variables
# and explore it
data <- create_data_random(obs = 100, vars = 5)
explore(data)

# create your own random dataset and explore it
data <- create_data_empty(obs = 1000) |> 
  add_var_random_01("target") |> 
  add_var_random_dbl("age", min_val = 18, max_val = 80) |> 
  add_var_random_cat("gender", 
                     cat = c("male", "female", "other"), 
                     prob = c(0.4, 0.4, 0.2)) |> 
  add_var_random_starsign() |> 
  add_var_random_moon()
  
explore(data)
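For readers who want to see what such a pipeline amounts to, here is a rough base-R equivalent built from sample() and runif(). It is a sketch of the idea only: the distributions behind add_var_random_01(), add_var_random_dbl() and add_var_random_cat() are assumed here (uniform draws and the stated probabilities) rather than taken from the package source, and the name data_sketch is hypothetical.

```r
set.seed(42)  # make the random draws reproducible
obs <- 1000

data_sketch <- data.frame(
  # random 0/1 target
  target = sample(c(0L, 1L), obs, replace = TRUE),
  # random double between 18 and 80 (uniform, by assumption)
  age    = runif(obs, min = 18, max = 80),
  # random category with unequal probabilities
  gender = sample(c("male", "female", "other"), obs,
                  replace = TRUE, prob = c(0.4, 0.4, 0.2)),
  stringsAsFactors = FALSE
)

str(data_sketch)
```

Synthetic data like this is handy for trying out exploration code before pointing it at real data.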

# create an RMarkdown template to explore your own data
# set output_dir (existing file may be overwritten)
create_notebook_explore(
  output_dir = tempdir(),
  output_file = "notebook-explore.Rmd")

Monthly Downloads: 1,409
Version: 1.0.1
License: GPL-3

Maintainer: Roland Krasser
Last Published: December 20th, 2022

Functions in explore (1.0.1)

  • decrypt - Decrypt text
  • create_data_person - Create data person
  • dwh_disconnect - Disconnect from DWH
  • create_data_random - Create data random
  • dwh_fastload - Write data to a DWH table
  • describe_num - Describe numerical variable
  • describe_cat - Describe categorical variable
  • explore_cor - Explore the correlation between two variables
  • add_var_random_moon - Add a random moon variable to dataset
  • format_num_space - Format number as character string (space as big.mark)
  • explore_count - Explore count data (categories + frequency)
  • explore_targetpct - Explore variable + binary target (values 0/1)
  • create_data_unfair - Create data unfair
  • explore_tbl - Explore table
  • create_data_app - Create data app
  • add_var_random_01 - Add a random 0/1 variable to dataset
  • add_var_random_starsign - Add a random starsign variable to dataset
  • create_notebook_explore - Generate a notebook
  • balance_target - Balance target variable
  • explain_tree - Explain a target using a simple decision tree (classification or regression)
  • describe_tbl - Describe table
  • create_data_churn - Create data churn
  • create_data_empty - Create an empty dataset
  • explore - Explore a dataset or variable
  • guess_cat_num - Return if variable is categorical or numerical
  • format_target - Format target
  • describe - Describe a dataset or variable
  • create_data_buy - Create data buy
  • plot_legend_targetpct - Plots a legend that can be used for explore_all with a binary target
  • dwh_read_data - Read data from DWH
  • dwh_read_table - Read a table from DWH
  • describe_all - Describe all variables of a dataset
  • explore_shiny - Explore dataset interactively
  • explore_density - Explore density of variable
  • explore_all - Explore all variables
  • dwh_connect - Connect to DWH
  • format_num_auto - Format number as character string (auto)
  • explain_logreg - Explain a binary target using a logistic regression (glm). Model chosen by AIC in a Stepwise Algorithm (MASS::stepAIC).
  • get_type - Return type of variable
  • format_num_kMB - Format number as character string (kMB)
  • encrypt - Encrypt text
  • format_type - Format type description
  • get_nrow - Get number of rows for a grid plot (deprecated, use total_fig_height() instead)
  • replace_na_with - Replace NA
  • explore_bar - Explore categorical variable using bar charts
  • get_var_buckets - Put variables into "buckets" to create a set of plots instead of one large plot
  • target_explore_cat - Explore categorical variable + target
  • rescale01 - Rescales a numeric variable into values between 0 and 1
  • plot_var_info - Plot a variable info
  • plot_text - Plot a text
  • report - Generate a report of all variables
  • simplify_text - Simplifies a text string
  • total_fig_height - Get fig.height for RMarkdown chunk using explore_all()
  • weight_target - Weight target variable
  • target_explore_num - Explore numerical variable + target
  • add_var_id - Add a variable id at first column in dataset
  • count_pct - Adds percentage to dplyr::count()
  • clean_var - Clean variable
  • add_var_random_cat - Add a random categorical variable to dataset
  • data_dict_md - Create a data dictionary Markdown file
  • add_var_random_dbl - Add a random double variable to dataset
  • add_var_random_int - Add a random integer variable to dataset
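The rescale01 entry in the list above describes rescaling a numeric variable into values between 0 and 1. A common way to do this is min-max scaling, (x - min) / (max - min). The function below is a hypothetical stand-in named rescale01_sketch; whether the package's rescale01() treats NA values or constant vectors the same way is an assumption.

```r
# Hypothetical stand-in for illustration (not the package's code)
rescale01_sketch <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # constant input: avoid division by zero (arbitrary convention)
    return(ifelse(is.na(x), NA_real_, 0))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

scaled <- rescale01_sketch(c(10, 15, 20, NA))
print(scaled)
```

The smallest value maps to 0, the largest to 1, and missing values stay missing.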