eyeris (version 3.0.1)

eyeris_db_to_parquet: Split eyeris database into N parquet files by data type

Description

Utility function that takes an eyerisdb DuckDB database and splits it into N reasonably sized parquet files for easier management, downloading, and distribution (e.g., via GitHub). Data is first grouped by table type (timeseries, epochs, events, etc.), since each type has a different columnar structure; each group is then split into the requested number of files. Files are organized in folders matching the database name for easy identification.
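For orientation, the resulting files can be inspected directly; a minimal sketch, assuming the defaults for db_path and output_dir described under Arguments below:

# list the parquet files written for a database named "my-project"
# (output_dir defaults to bids_dir/derivatives/parquet, and files
# are grouped into a folder matching the database name)
list.files(
  file.path(bids_dir, "derivatives", "parquet", "my-project"),
  pattern = "\\.parquet$"
)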

Usage

eyeris_db_to_parquet(
  bids_dir,
  db_path = "my-project",
  n_files_per_type = 1,
  output_dir = NULL,
  max_file_size = 512,
  data_types = NULL,
  verbose = TRUE,
  include_metadata = TRUE,
  epoch_labels = NULL,
  group_by_epoch_label = TRUE
)

Value

A list containing information about the created parquet files.

Arguments

bids_dir

Path to the BIDS directory containing the database

db_path

Database name (defaults to "my-project", becomes "my-project.eyerisdb")

n_files_per_type

Number of parquet files to create per data type (default: 1)

output_dir

Directory to save parquet files (defaults to bids_dir/derivatives/parquet)

max_file_size

Maximum file size in MB per parquet file (default: 512). Used as a constraint when n_files_per_type would otherwise create files larger than this limit.

data_types

Vector of data types to include. If NULL (default), includes all available. Valid types: "timeseries", "epochs", "epoch_summary", "events", "blinks", "confounds_*"

verbose

Whether to print progress messages (default: TRUE)

include_metadata

Whether to include eyeris metadata columns in output (default: TRUE)

epoch_labels

Optional character vector of epoch labels to include (e.g., "prepostprobe"). Only applies to epoch-related data types. If NULL, includes all labels.

group_by_epoch_label

If TRUE, processes epoch-related data types separately by epoch label to reduce memory footprint and produce label-specific parquet files (default: TRUE).

Database Safety

This function creates temporary tables during parquet export when the arrow package is not available. All temporary tables are automatically cleaned up, but if the process crashes, leftover tables may remain. The function checks for and warns about existing temporary tables before starting.
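If you suspect a crashed export left temporary tables behind, you can inspect the database manually. A minimal sketch, assuming the .eyerisdb file sits directly inside bids_dir (its exact location, and how eyeris names its temporary tables, are assumptions; check the dbListTables() output yourself):

# open the database read-only and list its tables
con <- DBI::dbConnect(
  duckdb::duckdb(),
  dbdir = file.path(bids_dir, "my-project.eyerisdb"),
  read_only = TRUE
)
DBI::dbListTables(con)  # scan for anything that looks temporary
DBI::dbDisconnect(con, shutdown = TRUE)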

Examples

# create demo database
demo_data <- eyelink_asc_demo_dataset()
demo_data |>
  eyeris::glassbox() |>
  eyeris::epoch(
    events = "PROBE_{startstop}_{trial}",
    limits = c(-1, 1),
    label = "prePostProbe"
  ) |>
  eyeris::bidsify(
    bids_dir = tempdir(),
    participant_id = "001",
    session_num = "01",
    task_name = "memory",
    db_enabled = TRUE,
    db_path = "memory-task"
  )

# split into 3 parquet files per data type - creates memory-task/ folder
split_info <- eyeris_db_to_parquet(
  bids_dir = tempdir(),
  db_path = "memory-task",
  n_files_per_type = 3
)

# split with size constraint and specific data types using the same database
split_info <- eyeris_db_to_parquet(
  bids_dir = tempdir(),
  db_path = "memory-task",
  n_files_per_type = 5,
  max_file_size = 50,  # max 50MB per file
  data_types = c("timeseries", "epochs", "events")
)
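
# filter epoch-related outputs to a single label; a hedged sketch:
# "prepostprobe" assumes the label from epoch() above is stored in
# lowercase, matching the epoch_labels example under Arguments
split_info <- eyeris_db_to_parquet(
  bids_dir = tempdir(),
  db_path = "memory-task",
  epoch_labels = "prepostprobe"
)

# read one exported file back (assumes the arrow package is
# installed; the path follows the default output layout)
files <- list.files(
  file.path(tempdir(), "derivatives", "parquet", "memory-task"),
  pattern = "\\.parquet$",
  full.names = TRUE
)
head(arrow::read_parquet(files[1]))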