Steps 1-6 (all steps executed sequentially):
step 1 -- pulse_read()
step 2 -- pulse_split()
step 3 -- pulse_optimize()
step 4 -- pulse_heart()
step 5 -- pulse_doublecheck()
step 6 -- pulse_choose_keep()
extra step -- pulse_normalize()
extra step -- pulse_summarise()
visualization -- pulse_plot() and pulse_plot_raw()
This is a wrapper function that provides a shortcut to running all 6 steps of the PULSE multi-channel data processing pipeline in sequence, namely pulse_read() >> pulse_split() >> pulse_optimize() >> pulse_heart() >> pulse_doublecheck() >> pulse_choose_keep().
Please note that the heartbeatr package is designed specifically for PULSE systems commercialized by the non-profit co-op ElectricBlue (https://electricblue.eu/pulse) and is likely to fail if data from any other system is used as input without matching file formatting.
PULSE() takes a vector of paths to PULSE csv files produced by a PULSE system during a single experiment (either multi-channel or one-channel, but never both at the same time) and automatically computes the heartbeat frequencies in all target channels across user-defined time windows. The entire workflow may take less than 5 minutes to run on a small dataset (a few hours of data) if params are chosen with speed in mind and the code is run on a modern machine. Conversely, large datasets (spanning several days) may take hours or even days to run. In extreme situations, datasets may be too large for the machine to handle (due to memory limitations), and it may be better to process batches at a time (check PULSE_by_chunks() and consider implementing a parallel computing strategy).
PULSE(
paths,
window_width_secs = 30,
window_shift_secs = 60,
min_data_points = 0.8,
interpolation_freq = 40,
bandwidth = 0.2,
doublecheck = TRUE,
lim_n = 3,
lim_sd = 0.75,
raw_v_smoothed = TRUE,
correct = TRUE,
discard_channels = NULL,
keep_raw_data = TRUE,
subset = 0,
subset_seed = NULL,
subset_reindex = FALSE,
process_large = FALSE,
show_progress = TRUE,
max_dataset_size = 20
)
A tibble with nrows = (number of channels) * (number of windows in pulse_data_split) and 13 columns:
i, the order of each time window
smoothed, logical flagging smoothed data
id, PULSE channel IDs
time, time at the center of each time window
data, a list of tibbles with raw PULSE data for each combination of channel and window, with columns time, val and peak (TRUE in rows corresponding to wave peaks)
hz, heartbeat rate estimate (in Hz)
n, number of wave peaks identified
sd, standard deviation of the intervals between wave peaks
ci, confidence interval (hz ± ci)
keep, logical indicating whether data points meet N and SD criteria
d_r, ratio of consecutive asymmetric peaks
d_f, logical flagging data points where heart beat frequency is likely double the real value
character vector containing file paths to CSV files produced by a PULSE system during a single experiment.
numeric, in seconds, defaults to 30; the width of the time windows over which heart rate frequency will be computed.
numeric, in seconds, defaults to 60; by how much each subsequent window is shifted from the preceding one.
numeric, defaults to 0.8; decimal from 0 to 1, used as a threshold to discard incomplete windows where data is missing (e.g., if the sampling frequency is 20 and window_width_secs = 30, each window should include 600 data points, and so if min_data_points = 0.8, windows with less than 600 * 0.8 = 480 data points will be rejected).
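The threshold arithmetic above can be sketched in base R (the 20 Hz sampling frequency is an assumed example value, not a package default):

```r
# Minimum data points required per window, given an example
# sampling frequency of 20 Hz and the default parameters
sampling_freq     <- 20    # Hz (example value)
window_width_secs <- 30    # default
min_data_points   <- 0.8   # default

expected  <- sampling_freq * window_width_secs  # 600 points per complete window
threshold <- expected * min_data_points         # 480; windows below this are rejected
threshold
```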
numeric, defaults to 40; value expressing the frequency (in Hz) to which PULSE data should be interpolated. Can be set to 0 (zero) or any value equal to or greater than 40 (the default). If set to zero, no interpolation is performed.
numeric, defaults to 0.2; the bandwidth for the Kernel Regression Smoother. If equal to 0 (zero) no smoothing is applied. Normally kept low (0.1 - 0.3) so that only very high frequency noise is removed, but can be pushed up all the way to 1 or above (especially when the heartbeat rate is expected to be slow, as is typical of oysters, but double check the resulting data). Type ?ksmooth for additional info.
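The effect of the bandwidth can be previewed with base R's ksmooth() on synthetic data (a sketch of the smoothing step, not the package's internal call):

```r
# Synthetic 1 Hz "heartbeat" signal sampled at 40 Hz, with added noise
t   <- seq(0, 5, by = 1/40)
val <- sin(2 * pi * t) + rnorm(length(t), sd = 0.2)

# A small bandwidth removes only very high frequency noise;
# larger values smooth more aggressively
sm <- ksmooth(t, val, kernel = "normal", bandwidth = 0.2)
head(sm$y)
```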
logical, defaults to TRUE; should pulse_doublecheck() be used? (it is rare, but there are instances when it should be disabled).
numeric, defaults to 3; minimum number of peaks detected in each time window for it to be considered a "keep".
numeric, defaults to 0.75; maximum value for the standard deviation of the time intervals between detected peaks for a time window to be considered a "keep".
logical, defaults to TRUE; indicates whether or not to also compute heart rates before applying smoothing; this will increase the quality of the output but also double the processing time.
logical, defaults to TRUE; if FALSE, data points with hz values likely double the real value are flagged BUT NOT CORRECTED. If TRUE, hz (as well as data, n, sd and ci) are corrected accordingly. Note that the correction is not reversible!
character vector containing the names of channels to be discarded from the analysis. discard_channels is forced to lowercase, but other than that, the exact names must be provided. Discarding unused channels can greatly speed up the workflow!
logical, defaults to TRUE; If set to FALSE, $data is set to FALSE (i.e., raw data is discarded), dramatically reducing the amount of disk space required to store the final output (usually, by two orders of magnitude). HOWEVER, note that it won't be possible to use pulse_plot_raw() anymore!
numeric, defaults to 0; the number of time windows to keep from the entire dataset (or the number of entries to reject if set to a negative value); smaller subsets make the entire processing quicker and facilitate the execution of trial runs to optimize parameter selection before processing the entire dataset.
numeric, defaults to NULL; only used if subset is different from 0; subset_seed controls the seed used when extracting a subset of the available data; if set to NULL, a random seed is selected, resulting in rows being selected randomly; alternatively, the user can set a specific seed in order to always select the same rows (important when the goal is to compare the impact of different parameter combinations using the exact same data points).
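The effect of fixing versus randomizing the seed can be illustrated with base R's own sampling (set.seed() underlies this behaviour):

```r
# With a fixed seed, the same rows are selected on every run
set.seed(123)
rows_a <- sample(1:100, 10)
set.seed(123)
rows_b <- sample(1:100, 10)
identical(rows_a, rows_b)  # TRUE: identical subsets, comparable across runs
```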
logical, defaults to FALSE; only used if subset is different from 0; after extracting a subset of the available data, should rows be re-indexed (i.e., .$i made fully sequential)? Re-indexed rows make using pulse_plot_raw() easier, but row indices no longer match those from before subsetting.
logical, defaults to FALSE; if FALSE and the dataset used as input is large (i.e., combined file size greater than max_dataset_size MB, which is equivalent to roughly three files each with a full hour of PULSE data), PULSE() will not process the data and will instead suggest the use of PULSE_by_chunks(), which is designed to handle large datasets; if TRUE, PULSE() will attempt to process the dataset anyway, but the system's memory may become overloaded and R may never finish the job.
logical, defaults to TRUE. If set to TRUE, progress messages will be provided.
numeric, defaults to 20. Corresponds to the maximum combined size (in MB) that the dataset contained in the files in paths can be when process_large is set to FALSE. If that limit is exceeded, data processing is aborted with a message explaining the possible remedies. This is a fail-safe to prevent PULSE() from being asked to process a dataset larger than the user's machine can handle, a situation that typically leads to a stall (R doesn't fail, it just keeps trying without any progress being made). The conservative default of 20 allows only about 3 hours' worth of data to be processed (a PULSE csv file with 1 hour of data typically takes up to 7 MB). If the machine has a large amount of RAM available, a higher value can be used. Alternatively, consider using the function PULSE_by_chunks() instead.
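The combined size of the input files can be checked against max_dataset_size before calling PULSE(); this is a hypothetical pre-flight helper written in base R, not a function provided by the package:

```r
# Hypothetical helper: combined size (in MB) of the files in `paths`,
# compared against the max_dataset_size limit (default 20 MB)
check_dataset_size <- function(paths, max_dataset_size = 20) {
  combined_mb <- sum(file.size(paths)) / 1024^2
  if (combined_mb > max_dataset_size)
    message("Combined size ", round(combined_mb, 1),
            " MB exceeds the limit; consider PULSE_by_chunks()")
  invisible(combined_mb)
}
```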
The heartbeatr workflow must be applied to a single experiment each time. By experiment we mean a collection of PULSE data where all the relevant parameters are invariant, including (but not limited to):
the version of the firmware installed in the PULSE device (multi-channel or one-channel)
the names of all channels (including unused channels)
the frequency at which data was captured
Note also that even if two PULSE systems have been used in the same scientific experiment, data from each device must be processed independently and only merged at the end. There is no drawback to doing so; it is simply how data must be processed by the heartbeatr package.
pulse_normalize() and pulse_summarise() aren't included in PULSE() because they aren't essential to the PULSE data processing pipeline and choosing values for their parameters requires an initial look at the data. However, normalizing the heart rate estimates is very often crucial for making reliable comparisons across individuals, and reducing the number of data points before running statistical analyses is often important to avoid oversampling. Users should therefore consider running the output of PULSE() through both these functions before treating the data as fully processed and ready for subsequent analysis. Check both functions for additional details on their role in the processing pipeline (?pulse_normalize and ?pulse_summarise).
Check the help files of the underlying functions to obtain additional details about each of the steps implemented under PULSE(), namely:
pulse_read() describes constraints to the type of files that can be read with the heartbeatr-package and explains how time zones are handled.
pulse_split() provides important advice on how to set window_width_secs and window_shift_secs, what to expect when lower/higher values are used, and explains how to easily run the heartbeatr-package with parallel computing.
pulse_optimize() explains in detail how the optimization process (interpolation + smoothing) behaves and how it impacts the performance of the analysis.
pulse_heart() outlines the algorithm used to identify peaks in the heart beat wave data and some of its limitations.
pulse_doublecheck() explains the method used to detect situations where the algorithm's processing resulted in a heartbeat frequency double the real value.
pulse_choose_keep() selects the best estimates when raw_v_smoothed = TRUE and classifies data points as keep or reject.
pulse_normalize() for important info about individual variations on baseline heart rate.
pulse_summarise() for important info about oversampling and strategies to handle that.
PULSE_by_chunks() for processing large datasets.
To convert to Beats Per Minute (bpm), simply multiply hz and ci by 60.
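The conversion can be applied directly to the output columns; here on a mock data frame, since the real values come from PULSE():

```r
# Mock output with the relevant columns (real values come from PULSE())
x <- data.frame(hz = c(0.8, 1.2), ci = c(0.05, 0.08))

# Convert both the estimate and its confidence interval to bpm
x$bpm    <- x$hz * 60   # approx. 48 and 72 beats per minute
x$bpm_ci <- x$ci * 60
x
```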
approx() is used by pulse_interpolate() for the linear interpolation of PULSE data
ksmooth() is used by pulse_smooth() for the kernel smoothing of PULSE data
pulse_read(), pulse_split(), pulse_optimize(), pulse_heart(), pulse_doublecheck() and pulse_choose_keep() are the functions used in the complete heartbeatr processing workflow
pulse_normalize() and pulse_summarise() are important post-processing functions
pulse_plot() and pulse_plot_raw() can be used to inspect the processed data
## Begin prepare data ----
paths <- pulse_example()
chn <- paste0("c", formatC(1:10, width = 2, flag = "0"))
## End prepare data ----
# Execute the entire PULSE data processing pipeline with only one call
PULSE(
paths,
discard_channels = chn[-8],
raw_v_smoothed = FALSE,
show_progress = FALSE
)
# Equivalent to...
x <- pulse_read(paths)
multi <- x$multi
x$data <- x$data[,c("time", "c08")]
x <- pulse_split(x)
x <- pulse_optimize(x, raw_v_smoothed = FALSE, multi = multi)
x <- pulse_heart(x)
x <- pulse_doublecheck(x)
x <- pulse_choose_keep(x)
x