fastexplore: Explore and Summarize a Dataset Quickly

Description

fastexplore runs a fast, comprehensive exploratory data analysis (EDA) workflow. It detects variable types, checks for missing and duplicated data, suggests potential ID columns, and produces a variety of plots (histograms, boxplots, scatterplots, correlation heatmaps, etc.). Optional steps include outlier detection, normality testing, and feature engineering.

Usage

fastexplore(
  data,
  label = NULL,
  visualize = c("histogram", "boxplot", "barplot", "heatmap", "scatterplot"),
  save_results = TRUE,
  output_dir = NULL,
  sample_size = NULL,
  interactive = FALSE,
  corr_threshold = 0.9,
  auto_convert_numeric = TRUE,
  visualize_missing = TRUE,
  imputation_suggestions = FALSE,
  report_duplicate_details = TRUE,
  detect_near_duplicates = TRUE,
  auto_convert_dates = FALSE,
  feature_engineering = FALSE,
  outlier_method = c("iqr", "zscore", "dbscan", "lof"),
  run_distribution_checks = TRUE,
  normality_tests = c("shapiro"),
  pairwise_matrix = TRUE,
  max_scatter_cols = 5,
  grouped_plots = TRUE,
  use_upset_missing = TRUE
)
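
The signature above could be exercised as in the following minimal sketch. It assumes a function named fastexplore is already available in the session (the package that provides it is not named here, so the call is guarded), and it uses the built-in iris data with save_results = FALSE so nothing is written to disk.

```r
# Small built-in dataset; "Species" is its categorical label column.
df <- iris

# Guarded call: runs only when a function named fastexplore is available
# in the session (assumption: its package has already been attached).
if (exists("fastexplore", mode = "function")) {
  res <- fastexplore(
    data = df,
    label = "Species",
    visualize = c("histogram", "heatmap"),
    save_results = FALSE,   # avoid writing plots and a report to disk
    outlier_method = "iqr",
    normality_tests = "shapiro"
  )
  # The result is returned silently, so assign it in order to inspect it.
  str(res, max.level = 1)
}
```

All arguments left out of the call keep the defaults shown in the signature.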

Value

A list, returned invisibly, containing:

data_overview - A basic overview (head, unique values, skim summary).
summary_stats - Summary statistics for numeric columns.
freq_tables - Frequency tables for factor columns.
missing_data - Missing data overview (count, percentage).
duplicated_rows - Count of duplicated rows.
class_imbalance - Class distribution if label is provided and is categorical.
correlation_matrix - The correlation matrix for numeric variables.
zero_variance_cols - Columns with near-zero variance.
potential_id_cols - Columns whose values are unique across all rows.
date_time_cols - Columns recognized as date/time.
high_corr_pairs - Pairs of variables with correlation above corr_threshold.
outlier_method - The chosen method for outlier detection.
outlier_summary - Outlier proportions or metrics (if computed).

If save_results = TRUE, the function also saves figures (including the correlation heatmap) and an R Markdown report in the specified directory.
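
To illustrate how correlation_matrix and high_corr_pairs relate, here is a minimal base R sketch. It is not the function's own implementation: the helper high_corr() is hypothetical, and the use of Pearson correlation with pairwise-complete observations is an assumption.

```r
# Numeric columns of a built-in dataset.
num <- iris[, sapply(iris, is.numeric)]

# Pairwise Pearson correlations, ignoring incomplete pairs.
cm <- cor(num, use = "pairwise.complete.obs")

# Hypothetical helper: list each pair once (upper triangle only)
# when its absolute correlation exceeds the threshold.
high_corr <- function(cm, threshold = 0.9) {
  idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

hc <- high_corr(cm, threshold = 0.9)
hc
```

Restricting the search to the upper triangle keeps the diagonal (always 1) and mirrored duplicates out of the result.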

Arguments

data: A data.frame. The dataset to analyze.
label: A character string specifying the name of the target or label column (optional). If provided, certain grouped plots and class imbalance checks will be performed.
visualize: A character vector specifying which visualizations to produce. Possible values: c("histogram", "boxplot", "barplot", "heatmap", "scatterplot").
save_results: Logical. If TRUE, saves plots and a rendered report (HTML) into a timestamped EDA_Results_ folder inside output_dir.
output_dir: A character string specifying the output directory for saving results (if save_results = TRUE). If NULL (the default), the current working directory is used.
sample_size: An integer specifying a random sample size for the data to be used in visualizations. If NULL, uses the entire dataset.
interactive: Logical. If TRUE, attempts to produce interactive Plotly heatmaps and other interactive elements. If required packages are not installed, falls back to static plots.
corr_threshold: Numeric. Threshold above which correlations (in absolute value) are flagged as high. Defaults to 0.9.
auto_convert_numeric: Logical. If TRUE, automatically converts factor/character columns that look numeric (only digits, minus sign, or decimal point) to numeric.
visualize_missing: Logical. If TRUE, attempts to visualize missingness patterns (e.g., with an UpSet plot if UpSetR is available, or with VIM or naniar).
imputation_suggestions: Logical. If TRUE, prints simple text suggestions for imputation strategies.
report_duplicate_details: Logical. If TRUE, shows top duplicated rows and their frequency.
detect_near_duplicates: Logical. Placeholder for near-duplicate (fuzzy) detection. Currently not implemented.
auto_convert_dates: Logical. If TRUE, attempts to detect and convert date-like strings (YYYY-MM-DD) to Date format.
feature_engineering: Logical. If TRUE, automatically engineers derived features (day, month, year) from any date/time columns, and identifies potential ID columns.
outlier_method: A character vector naming the outlier detection method to apply. Possible values: c("iqr", "zscore", "dbscan", "lof"). Only the first matching value is currently used, although the function is designed to handle multiple methods.
run_distribution_checks: Logical. If TRUE, runs normality tests (e.g., Shapiro-Wilk) on numeric columns.
normality_tests: A character vector specifying which normality tests to run. Possible values include "shapiro" or "ks" (Kolmogorov-Smirnov). Only used if run_distribution_checks = TRUE.
pairwise_matrix: Logical. If TRUE, produces a scatterplot matrix (using GGally) for numeric columns.
max_scatter_cols: Integer. Maximum number of numeric columns to include in the pairwise matrix.
grouped_plots: Logical. If TRUE, produces grouped histograms, violin plots, and density plots by label (if the label is a factor).
use_upset_missing: Logical. If TRUE, attempts to produce an UpSet plot for missing data if UpSetR is available.
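
The kind of check that auto_convert_numeric describes can be sketched in base R as follows. The helper looks_numeric() and its regular expression are hypothetical and somewhat stricter than "only digits, minus sign, or decimal point"; the function's actual rule may differ.

```r
# Hypothetical helper: TRUE when every non-missing, non-empty value is made
# of digits with an optional leading minus sign and an optional decimal part.
looks_numeric <- function(x) {
  x <- trimws(as.character(x))
  x <- x[!is.na(x) & nzchar(x)]
  length(x) > 0 && all(grepl("^-?[0-9]*\\.?[0-9]+$", x))
}

df <- data.frame(
  price = c("10.5", "-3", "7"),
  code  = c("A1", "22", "3"),
  stringsAsFactors = FALSE
)

# Convert only the columns that pass the check; "code" stays character
# because "A1" contains a letter.
to_convert <- names(df)[vapply(df, looks_numeric, logical(1))]
df[to_convert] <- lapply(df[to_convert], as.numeric)
str(df)
```

Checking every value before converting avoids the silent NA values that as.numeric() would otherwise introduce for entries such as "A1".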

Details

This function automates many steps of EDA:

Automatically detects numeric vs. categorical variables.
Auto-converts columns that look numeric (and optionally date-like).
Summarizes data structure, missingness, duplication, and potential ID columns.
Computes correlation matrix and flags highly correlated pairs.
(Optional) Outlier detection using IQR, Z-score, DBSCAN, or LOF methods.
(Optional) Normality tests on numeric columns.
Saves all results and an R Markdown report if save_results = TRUE.
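
As a rough illustration of the optional outlier and normality steps, the sketch below flags outliers with the conventional 1.5 * IQR rule and runs a Shapiro-Wilk test using base R only. The helper iqr_outliers() and the multiplier k = 1.5 are assumptions made for the example, not details taken from the function itself.

```r
# Hypothetical helper: flag values more than k * IQR beyond the quartiles.
iqr_outliers <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  spread <- q[2] - q[1]
  !is.na(x) & (x < q[1] - k * spread | x > q[2] + k * spread)
}

x <- c(10, 11, 12, 13, 14, 15, 100)
flags <- iqr_outliers(x)
which(flags)   # positions of flagged values
mean(flags)    # proportion of values flagged as outliers

# Shapiro-Wilk normality test from the stats package
# (requires between 3 and 5000 non-missing values).
sw <- shapiro.test(x)
sw$p.value
```

A small p-value from shapiro.test() indicates evidence against normality; with large samples even minor departures can produce small p-values, so the result is best read alongside a histogram.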