fastml (version 0.5.0)

fastexplore: Explore and Summarize a Dataset Quickly

Description

fastexplore runs a fast, comprehensive exploratory data analysis (EDA) workflow. It automatically detects variable types, checks for missing and duplicated data, suggests potential ID columns, and produces a variety of plots (histograms, boxplots, scatterplots, correlation heatmaps, etc.). It also offers optional outlier detection, normality testing, and feature engineering.

Usage

fastexplore(
  data,
  label = NULL,
  visualize = c("histogram", "boxplot", "barplot", "heatmap", "scatterplot"),
  save_results = TRUE,
  output_dir = NULL,
  sample_size = NULL,
  interactive = FALSE,
  corr_threshold = 0.9,
  auto_convert_numeric = TRUE,
  visualize_missing = TRUE,
  imputation_suggestions = FALSE,
  report_duplicate_details = TRUE,
  detect_near_duplicates = TRUE,
  auto_convert_dates = FALSE,
  feature_engineering = FALSE,
  outlier_method = c("iqr", "zscore", "dbscan", "lof"),
  run_distribution_checks = TRUE,
  normality_tests = c("shapiro"),
  pairwise_matrix = TRUE,
  max_scatter_cols = 5,
  grouped_plots = TRUE,
  use_upset_missing = TRUE
)
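A minimal call might look like the sketch below. It assumes the fastml package is installed and uses the built-in iris dataset purely for illustration; the names eda_data and label_col and the chosen argument values are illustrative assumptions, not recommendations.

```r
# Illustrative inputs: iris is used only because it ships with R.
eda_data  <- iris
label_col <- "Species"

# Run only if fastml is available, so the sketch is safe to source anywhere.
if (requireNamespace("fastml", quietly = TRUE)) {
  res <- fastml::fastexplore(
    eda_data,
    label = label_col,
    visualize = c("histogram", "heatmap"),
    save_results = FALSE,        # skip writing plots and the report to disk
    outlier_method = "iqr",
    normality_tests = "shapiro"
  )
  # The list is described as silent, so assign it to inspect its elements.
  print(names(res))
}
```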

Value

A (silent) list containing:

  • data_overview - A basic overview (head, unique values, skim summary).

  • summary_stats - Summary statistics for numeric columns.

  • freq_tables - Frequency tables for factor columns.

  • missing_data - Missing data overview (count, percentage).

  • duplicated_rows - Count of duplicated rows.

  • class_imbalance - Class distribution if label is provided and is categorical.

  • correlation_matrix - The correlation matrix for numeric variables.

  • zero_variance_cols - Columns with near-zero variance.

  • potential_id_cols - Columns with unique values in every row.

  • date_time_cols - Columns recognized as date/time.

  • high_corr_pairs - Pairs of variables with correlation above corr_threshold.

  • outlier_method - The chosen method for outlier detection.

  • outlier_summary - Outlier proportions or metrics (if computed).

If save_results = TRUE, the function also saves figures, a correlation heatmap, and an R Markdown report in the specified directory as side effects.
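To make a few of the list elements above concrete, the following base R sketch computes rough equivalents on a toy data frame. It is an approximation under stated assumptions, not the package's internal code: the toy data, the exact definitions, and the choice to check ID columns on de-duplicated rows are all illustrative.

```r
# Toy data: one missing value, one fully duplicated row, one constant column.
df <- data.frame(
  id    = c("a", "b", "c", "d", "d"),
  x     = c(1.5, NA, 1.5, 4.1, 4.1),
  const = rep(1, 5),
  stringsAsFactors = FALSE
)

# missing_data: count and percentage of missing values per column.
missing_data <- data.frame(
  column  = names(df),
  count   = colSums(is.na(df)),
  percent = 100 * colMeans(is.na(df))
)

# duplicated_rows: number of rows that repeat an earlier row exactly.
duplicated_rows <- sum(duplicated(df))

# zero_variance_cols: columns with at most one distinct non-missing value.
zero_variance_cols <- names(df)[vapply(
  df, function(col) length(unique(col[!is.na(col)])) <= 1, logical(1)
)]

# potential_id_cols: columns whose values never repeat.
# Assumption: duplicates are dropped first so that one repeated row
# does not hide an otherwise unique identifier.
unique_rows <- unique(df)
potential_id_cols <- names(unique_rows)[vapply(
  unique_rows, function(col) anyDuplicated(col) == 0, logical(1)
)]
```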

Arguments

data

A data.frame. The dataset to analyze.

label

A character string specifying the name of the target or label column (optional). If provided, certain grouped plots and class imbalance checks will be performed.

visualize

A character vector specifying which visualizations to produce. Possible values: c("histogram", "boxplot", "barplot", "heatmap", "scatterplot").

save_results

Logical. If TRUE, saves plots and a rendered report (HTML) into a timestamped EDA_Results_ folder inside output_dir.

output_dir

A character string specifying the output directory for saving results (if save_results = TRUE). Defaults to the current working directory.

sample_size

An integer specifying a random sample size for the data to be used in visualizations. If NULL, uses the entire dataset.
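The sampling behaviour described for sample_size can be sketched in base R as follows. The helper name sample_for_plots is hypothetical, and whether fastexplore samples in exactly this way (for example, with or without a fixed seed) is an assumption.

```r
# Hypothetical helper: return a random subset of rows for plotting,
# or the full data when sample_size is NULL or not smaller than the data.
sample_for_plots <- function(data, sample_size = NULL) {
  if (is.null(sample_size) || sample_size >= nrow(data)) {
    return(data)
  }
  data[sample(nrow(data), sample_size), , drop = FALSE]
}

set.seed(42)                               # only to make the sketch reproducible
plot_data <- sample_for_plots(mtcars, 10)  # 10 randomly chosen rows
full_data <- sample_for_plots(mtcars)      # NULL keeps every row
```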

interactive

Logical. If TRUE, attempts to produce interactive Plotly heatmaps and other interactive elements. If required packages are not installed, falls back to static plots.

corr_threshold

Numeric. Threshold above which correlations (in absolute value) are flagged as high. Defaults to 0.9.
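As an illustration of how such a threshold can be applied, the sketch below flags pairs whose absolute correlation exceeds corr_threshold using base R only. The toy data and the output column names (var1, var2, correlation) are assumptions; the structure of high_corr_pairs returned by the package may differ.

```r
# Toy data: y and w are exact linear functions of x; z is only weakly related.
df <- data.frame(
  x = 1:10,
  y = 2 * (1:10) + 1,
  w = -(1:10),
  z = rep(c(1, -1), 5)
)
corr_threshold <- 0.9

cm <- cor(df, use = "pairwise.complete.obs")

# Use the upper triangle so each pair is reported once,
# and abs() so strong negative correlations are flagged too.
idx <- which(abs(cm) > corr_threshold & upper.tri(cm), arr.ind = TRUE)

high_corr_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  correlation = cm[idx],
  stringsAsFactors = FALSE
)
```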

auto_convert_numeric

Logical. If TRUE, automatically converts factor/character columns that look numeric (only digits, minus sign, or decimal point) to numeric.
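One way such a check could work is sketched below. The helper looks_numeric and its regular expression are assumptions made for illustration; the pattern actually used by fastexplore is not documented here.

```r
# Hypothetical helper: TRUE when every non-missing, non-empty value consists
# of an optional minus sign, digits, and at most one decimal point.
looks_numeric <- function(x) {
  x <- as.character(x)
  x <- x[!is.na(x) & nzchar(x)]
  length(x) > 0 && all(grepl("^-?[0-9]*\\.?[0-9]+$", x))
}

df <- data.frame(
  a = c("1", "2.5", "-3"),   # looks numeric
  b = c("x", "2", "3"),      # contains a non-numeric value
  stringsAsFactors = FALSE
)

# Find character/factor columns that look numeric, then convert them.
to_convert <- names(df)[vapply(
  df,
  function(col) (is.character(col) || is.factor(col)) && looks_numeric(col),
  logical(1)
)]
df[to_convert] <- lapply(df[to_convert], function(col) as.numeric(as.character(col)))
```

Going through as.character() first matters for factors, since as.numeric() on a factor returns its internal level codes rather than the displayed values.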

visualize_missing

Logical. If TRUE, attempts to visualize missingness patterns (e.g., via an UpSet plot if UpSetR is available, or via VIM or naniar).

imputation_suggestions

Logical. If TRUE, prints simple text suggestions for imputation strategies.

report_duplicate_details

Logical. If TRUE, shows top duplicated rows and their frequency.

detect_near_duplicates

Logical. Placeholder for near-duplicate (fuzzy) detection. Currently not implemented.

auto_convert_dates

Logical. If TRUE, attempts to detect and convert date-like strings (YYYY-MM-DD) to Date format.
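A rough base R sketch of this kind of detection follows. The helper looks_like_date and the rule that every non-missing value must match are assumptions; the package may use a different heuristic.

```r
# Hypothetical helper: TRUE when every non-missing value has the shape
# YYYY-MM-DD and parses to a valid calendar date.
looks_like_date <- function(x) {
  x <- as.character(x)
  x <- x[!is.na(x)]
  if (length(x) == 0) return(FALSE)
  shaped <- all(grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x))
  shaped && !anyNA(as.Date(x, format = "%Y-%m-%d"))
}

df <- data.frame(
  when = c("2024-01-15", "2024-02-29"),
  note = c("2024-01", "n/a"),
  stringsAsFactors = FALSE
)

# Convert only the columns that pass the check.
date_cols <- names(df)[vapply(df, looks_like_date, logical(1))]
df[date_cols] <- lapply(df[date_cols], as.Date, format = "%Y-%m-%d")
```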

feature_engineering

Logical. If TRUE, automatically engineers derived features (day, month, year) from any date/time columns, and identifies potential ID columns.
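The derived date features mentioned here can be reproduced in base R as in the sketch below. The resulting column names (day, month, year) are illustrative; the names that fastexplore assigns to engineered columns are not specified in this page.

```r
dates <- as.Date(c("2024-01-15", "2023-12-31"))

# Extract calendar components as integers.
features <- data.frame(
  day   = as.integer(format(dates, "%d")),
  month = as.integer(format(dates, "%m")),
  year  = as.integer(format(dates, "%Y"))
)
```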

outlier_method

A character vector naming the outlier detection method to apply. Possible values: c("iqr", "zscore", "dbscan", "lof"). Only the first match is currently used, although the function is designed to handle multiple methods.
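The sketch below illustrates the two simplest methods and how a default vector can resolve to its first element. It assumes behaviour similar to base R's match.arg() and uses the conventional cutoffs (1.5 times the IQR, absolute z-score above 3); the helper flag_outliers and these cutoffs are assumptions, not the package's documented internals.

```r
# Hypothetical helper covering only the "iqr" and "zscore" methods.
flag_outliers <- function(x, outlier_method = c("iqr", "zscore")) {
  # With the default vector left untouched, match.arg() picks the first element.
  outlier_method <- match.arg(outlier_method)
  if (outlier_method == "iqr") {
    q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
    fence <- 1.5 * (q[2] - q[1])
    x < q[1] - fence | x > q[2] + fence
  } else {
    abs((x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)) > 3
  }
}

x <- c(10, 11, 12, 13, 14, 15, 100)
iqr_flags    <- flag_outliers(x)            # defaults to "iqr"
zscore_flags <- flag_outliers(x, "zscore")
```

On this small vector the IQR rule flags the value 100 while the z-score rule flags nothing, because the extreme value inflates the standard deviation. This is one reason the choice of method matters on small samples.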

run_distribution_checks

Logical. If TRUE, runs normality tests (e.g., Shapiro-Wilk) on numeric columns.

normality_tests

A character vector specifying which normality tests to run. Possible values include "shapiro" or "ks" (Kolmogorov-Smirnov). Only used if run_distribution_checks = TRUE.
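Both tests are available in base R's stats package, and a vector such as normality_tests can be dispatched over them as sketched below. How fastexplore parameterizes the Kolmogorov-Smirnov test is an assumption here; comparing against a normal distribution with the sample mean and standard deviation is one common choice, though p-values are then only approximate.

```r
set.seed(1)            # only to make the sketch reproducible
x <- rnorm(100)

normality_tests <- c("shapiro", "ks")

# Run each requested test and collect the results by name.
results <- lapply(normality_tests, function(test) {
  switch(
    test,
    shapiro = shapiro.test(x),
    # Assumption: reference distribution uses the sample mean and sd.
    ks = ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
  )
})
names(results) <- normality_tests

p_values <- vapply(results, function(r) r$p.value, numeric(1))
```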

pairwise_matrix

Logical. If TRUE, produces a scatterplot matrix (using GGally) for numeric columns.

max_scatter_cols

Integer. Maximum number of numeric columns to include in the pairwise matrix.

grouped_plots

Logical. If TRUE, produces grouped histograms, violin plots, and density plots by label (if the label is a factor).

use_upset_missing

Logical. If TRUE, attempts to produce an UpSet plot for missing data if UpSetR is available.

Details

This function automates many steps of EDA:

  1. Automatically detects numeric vs. categorical variables.

  2. Auto-converts columns that look numeric (and optionally date-like).

  3. Summarizes data structure, missingness, duplication, and potential ID columns.

  4. Computes correlation matrix and flags highly correlated pairs.

  5. (Optional) Outlier detection using IQR, Z-score, DBSCAN, or LOF methods.

  6. (Optional) Normality tests on numeric columns.

  7. Saves all results and an R Markdown report if save_results = TRUE.