air_pollution_do_analysis: Comprehensive Air Pollution Analysis Pipeline

Description

Master function that runs the complete air pollution analysis: data loading, preprocessing (including lag creation), modeling, plotting, attribution calculations against reference standards, power analysis, and descriptive statistics.

Usage

air_pollution_do_analysis(
  data_path,
  date_col = "date",
  region_col = "region",
  pm25_col = "pm25",
  deaths_col = "deaths",
  population_col = "population",
  humidity_col = "humidity",
  precipitation_col = "precipitation",
  tmax_col = "tmax",
  wind_speed_col = "wind_speed",
  categorical_others = NULL,
  continuous_others = NULL,
  Categorical_Others = NULL,
  Continuous_Others = NULL,
  max_lag = 14L,
  df_seasonal = 6,
  family = "quasipoisson",
  reference_standards = list(list(value = 15, name = "WHO")),
  output_dir = "air_pollution_results",
  save_outputs = TRUE,
  run_descriptive = TRUE,
  run_power = TRUE,
  moving_average_window = 3L,
  include_national = TRUE,
  years_filter = NULL,
  regions_filter = NULL,
  attr_thr = 95,
  plot_corr_matrix = TRUE,
  correlation_method = "pearson",
  plot_dist = TRUE,
  plot_na_counts = TRUE,
  plot_scatter = TRUE,
  plot_box = TRUE,
  plot_seasonal = TRUE,
  plot_regional = TRUE,
  plot_total = TRUE,
  detect_outliers = TRUE,
  calculate_rate = FALSE
)

Value

List containing:

data: Processed data with lag variables
meta_analysis: Meta-analysis results with AF/AN calculations
lag_analysis: Lag-specific analysis results
distributed_lag_analysis: Distributed lag model results (if requested)
plots: List of generated plots (forest, lags, distributed lags)
power_list: A list containing power information by area
exposure_response_plots: Exposure-response plots for each reference standard (if requested)
reference_specific_af_an: AF/AN calculations specific to each reference standard (if requested)
descriptive_stats: Summary statistics of key variables
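
The data element is described as containing lag variables. As a rough illustration of what a lagged PM2.5 column is, the sketch below shifts a short series by 1 to max_lag days. The helper lag_by and the column names pm25_lag1 and pm25_lag2 are hypothetical, and the package's own preprocessing may differ (for example, in how it handles regions or names its columns).

```r
# Toy daily PM2.5 series (illustrative values only).
pm25 <- c(10, 12, 9, 15, 20)
max_lag <- 2L

# Hypothetical helper: shift a series down by k days, padding the start with NA.
lag_by <- function(x, k) c(rep(NA_real_, k), head(x, -k))

# Build one lagged column per lag day, from 1 up to max_lag.
lagged <- data.frame(pm25 = pm25)
for (k in seq_len(max_lag)) {
  lagged[[paste0("pm25_lag", k)]] <- lag_by(pm25, k)
}

lagged
```

The first k rows of each lagged column are NA because no earlier observation exists for them.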

Arguments

data_path: Character. Path to CSV data file
date_col: Character. Name of date column
region_col: Character. Name of region column
pm25_col: Character. Name of PM2.5 column
deaths_col: Character. Name of deaths column
population_col: Character. Name of the population column.
humidity_col: Character. Name of humidity column
precipitation_col: Character. Name of precipitation column
tmax_col: Character. Name of temperature column
wind_speed_col: Character. Name of wind speed column
categorical_others: Optional character vector. Names of additional categorical variables.
continuous_others: Optional character vector. Names of additional continuous variables (e.g., "tmean")
Categorical_Others: Deprecated alias for categorical_others.
Continuous_Others: Deprecated alias for continuous_others.
max_lag: Integer. Maximum lag days. Defaults to 14.
df_seasonal: Integer. Degrees of freedom for seasonal spline. Default 6.
family: Character. Probability distribution for the outcome variable. Options include "quasipoisson". Default "quasipoisson".
reference_standards: List of reference standards, each a list with value (the PM2.5 value of the standard) and name (the name of the standard, e.g., "WHO").
output_dir: Directory to save outputs
save_outputs: Logical. Whether to save outputs
run_descriptive: Logical. Whether to run descriptive statistics
run_power: Logical. Whether to run power analysis
moving_average_window: Integer. Window for moving average in descriptive stats
include_national: Logical. Whether to include national results in plots. Default TRUE.
years_filter: Optional numeric vector of years to include (e.g., c(2020, 2021, 2022)). Filtering to at least 3 consecutive years is recommended so that the time series is long enough to analyze.
regions_filter: Optional character vector of regions to include
attr_thr: Numeric (0-100). Percentile threshold used in power analysis to evaluate attribution detectability. Default 95.
plot_corr_matrix: Logical. Plot correlation matrix. Default TRUE.
correlation_method: Character. Correlation method for the correlation matrix (e.g., "pearson", "spearman"). Default "pearson".
plot_dist: Logical. Plot distributions (hist/density) for key variables. Default TRUE.
plot_na_counts: Logical. Plot missingness/NA counts. Default TRUE.
plot_scatter: Logical. Plot scatter plots for key pairs. Default TRUE.
plot_box: Logical. Plot boxplots by region/season where applicable. Default TRUE.
plot_seasonal: Logical. Plot seasonal summaries. Default TRUE.
plot_regional: Logical. Plot regional summaries. Default TRUE.
plot_total: Logical. Plot overall totals where relevant. Default TRUE.
detect_outliers: Logical. Flag potential outliers in descriptive workflow. Default TRUE.
calculate_rate: Logical. Whether to calculate rate variables during descriptive stats (e.g., deaths per population). Default FALSE.
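
The reference_standards argument takes a list of lists, following the shape of the default list(list(value = 15, name = "WHO")). A minimal sketch of passing more than one standard is shown below. The second entry's value and name are hypothetical placeholders, and the structural check is only a suggestion, not something the package is known to require.

```r
# Two reference standards. The first mirrors the documented default;
# the second is a hypothetical placeholder, not a real regulatory value.
reference_standards <- list(
  list(value = 15, name = "WHO"),
  list(value = 35, name = "Hypothetical national standard")
)

# Optional sanity check: each entry has one numeric `value` and one character `name`.
is_valid <- vapply(
  reference_standards,
  function(s) {
    is.numeric(s$value) && length(s$value) == 1L &&
      is.character(s$name) && length(s$name) == 1L
  },
  logical(1)
)
stopifnot(all(is_valid))
```

The resulting object can then be supplied as reference_standards = reference_standards in the function call.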

Examples

# \donttest{
# Simulate 180 days of daily data for a single region
example_data <- data.frame(
  date = seq.Date(as.Date("2020-01-01"), by = "day", length.out = 180),
  province = "Example Province",
  pm25 = stats::runif(180, 8, 35),
  deaths = stats::rpois(180, lambda = 5),
  population = 500000,
  humidity = stats::runif(180, 40, 90),
  precipitation = stats::runif(180, 0, 20),
  tmax = stats::runif(180, 18, 35),
  wind_speed = stats::runif(180, 1, 8)
)
# Write the simulated data to a temporary CSV, since the function reads from a file path
example_path <- tempfile(fileext = ".csv")
utils::write.csv(example_data, example_path, row.names = FALSE)

results <- air_pollution_do_analysis(
  data_path = example_path,
  date_col = "date",
  region_col = "province",
  pm25_col = "pm25",
  deaths_col = "deaths",
  population_col = "population",
  humidity_col = "humidity",
  precipitation_col = "precipitation",
  tmax_col = "tmax",
  wind_speed_col = "wind_speed",
  continuous_others = NULL,
  max_lag = 7L,
  df_seasonal = 4,
  family = "quasipoisson",
  reference_standards = list(list(value = 15, name = "WHO")),
  years_filter = NULL,
  regions_filter = NULL,
  include_national = FALSE,
  output_dir = tempdir(),
  save_outputs = FALSE,
  run_descriptive = FALSE,
  run_power = FALSE,
  moving_average_window = 3L,
  attr_thr = 95,
  plot_corr_matrix = FALSE,
  correlation_method = "pearson",
  plot_dist = FALSE,
  plot_na_counts = FALSE,
  plot_scatter = FALSE,
  plot_box = FALSE,
  plot_seasonal = FALSE,
  plot_regional = FALSE,
  plot_total = FALSE,
  detect_outliers = FALSE,
  calculate_rate = FALSE
)
# }
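
The years_filter argument comes with a recommendation of at least 3 consecutive years. One way to check a candidate filter before calling the function is sketched below in base R. The helper has_three_consecutive and the simulated date range are hypothetical and are not part of the package.

```r
# Simulated daily dates spanning four calendar years (illustrative only).
dates <- seq.Date(as.Date("2019-01-01"), as.Date("2022-12-31"), by = "day")
years_available <- sort(unique(as.integer(format(dates, "%Y"))))

# Candidate filter to pass as `years_filter`.
years_filter <- c(2020, 2021, 2022)

# Hypothetical helper: TRUE if the vector contains a run of >= 3 consecutive years.
has_three_consecutive <- function(years) {
  years <- sort(unique(years))
  if (length(years) < 3L) return(FALSE)
  # Two adjacent year-to-year gaps of exactly 1 mean three consecutive years.
  runs <- rle(diff(years) == 1)
  any(runs$values & runs$lengths >= 2L)
}

# The requested years should exist in the data and form a long enough run.
stopifnot(all(years_filter %in% years_available))
ok <- has_three_consecutive(years_filter)
```

If ok is FALSE, the filter covers fewer than 3 consecutive years and may be worth widening before running the analysis.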
