This function runs PULSE() file by file, instead of attempting to read all files at once. This is required when a dataset is too large (more than 20-30 files), as otherwise the system may stall under the amount of data that must be held in memory. Because the result of processing each hourly file in the dataset is saved to a job_folder, PULSE_by_chunks() has the added benefit of allowing the entire job to be stopped and resumed, so that progress is preserved even if a crash occurs.
PULSE_by_chunks(
folder,
allow_dir_create = FALSE,
chunks = 2,
bind_data = TRUE,
window_width_secs = 30,
window_shift_secs = 60,
min_data_points = 0.8,
interpolation_freq = 40,
bandwidth = 0.2,
doublecheck = TRUE,
lim_n = 3,
lim_sd = 0.75,
raw_v_smoothed = TRUE,
correct = TRUE,
discard_channels = NULL,
keep_raw_data = TRUE,
show_progress = TRUE
)

Returns a tibble with nrow = (number of channels) * (number of windows in pulse_data_split) and the following columns:
i, the order of each time window
smoothed, logical flagging smoothed data
id, PULSE channel IDs
time, time at the center of each time window
data, a list of tibbles with raw PULSE data for each combination of channel and window, with columns time, val and peak (TRUE in rows corresponding to wave peaks)
hz, heartbeat rate estimate (in Hz)
n, number of wave peaks identified
sd, standard deviation of the intervals between wave peaks
ci, confidence interval (hz ± ci)
keep, logical indicating whether data points meet the lim_n and lim_sd criteria
d_r, ratio of consecutive asymmetric peaks
d_f, logical flagging data points where heart beat frequency is likely double the real value
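The columns above can be used directly for downstream filtering and summarizing. A minimal sketch in base R (pulse_data here is a hypothetical name for the tibble returned when bind_data = TRUE):

```r
# Keep only estimates that passed the lim_n / lim_sd quality criteria
# and convert the heartbeat rate from Hz to beats per minute (bpm)
good <- subset(pulse_data, keep)
good$bpm <- good$hz * 60

# One simple summary: mean bpm per channel
aggregate(bpm ~ id, data = good, FUN = mean)
```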
folder: the path to a folder where several PULSE files are stored.
allow_dir_create: logical, defaults to FALSE. Only when set to TRUE does PULSE_by_chunks() actually do anything. This forces the user to accept that a job_folder will be created inside the folder supplied - without this folder PULSE_by_chunks() cannot operate. It is STRONGLY advised to keep a copy of the dataset being processed to avoid any inadvertent data loss. By setting allow_dir_create to TRUE, the user takes responsibility for the management of their files.
chunks: numeric, defaults to 2. The number of files processed at once during each iteration; higher numbers result in a quicker and more efficient operation, but shouldn't be set too high, or the system may become overwhelmed once more (which is precisely what PULSE_by_chunks() is designed to avoid).
bind_data: logical, defaults to TRUE. If set to TRUE, after processing all chunks, PULSE_by_chunks() will try to read all files in the job_folder and return a single unified tibble with all data. Be aware that if the dataset is very large, the machine may run out of memory and crash (even so, all files stored in the job_folder will remain intact, and code can be written to analyze the data in chunks as well). If set to FALSE, PULSE_by_chunks() returns nothing after processing all files in the dataset, and the user must instead manually handle the reading and collating of the processed data in the job_folder.
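When bind_data = FALSE (or when a single bind would exhaust memory), the per-file outputs saved in the job_folder can be collated manually. A hedged sketch, assuming each processed file was saved as an .rds file (the actual file format and naming are implementation details of the package; adjust the pattern and reader to match the files on disk):

```r
# Hypothetical manual collation of a job_folder produced by PULSE_by_chunks()
job_files <- list.files("my_dataset/job_folder",
                        pattern = "\\.rds$", full.names = TRUE)

# Read and combine in small batches to keep peak memory usage bounded
batches <- split(job_files, ceiling(seq_along(job_files) / 10))
pulse_data <- dplyr::bind_rows(
  lapply(batches, function(fs) dplyr::bind_rows(lapply(fs, readRDS)))
)
```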
window_width_secs: numeric, in seconds, defaults to 30; the width of the time windows over which the heart rate frequency is computed.
window_shift_secs: numeric, in seconds, defaults to 60; by how much each subsequent window is shifted from the preceding one.
min_data_points: numeric, defaults to 0.8; a decimal from 0 to 1, used as a threshold to discard incomplete windows where data is missing (e.g., if the sampling frequency is 20 and window_width_secs = 30, each window should include 600 data points, so with min_data_points = 0.8, windows with fewer than 600 * 0.8 = 480 data points are rejected).
interpolation_freq: numeric, defaults to 40; the frequency (in Hz) to which PULSE data should be interpolated. Can be set to 0 (zero) or to any value equal to or greater than 40 (the default). If set to zero, no interpolation is performed.
bandwidth: numeric, defaults to 0.2; the bandwidth for the Kernel Regression Smoother. If equal to 0 (zero), no smoothing is applied. Normally kept low (0.1 - 0.3) so that only very high frequency noise is removed, but it can be pushed all the way up to 1 or above (especially when the heartbeat rate is expected to be slow, as is typical of oysters; double-check the resulting data). Type ?ksmooth for additional info.
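The effect of bandwidth can be previewed directly with base R's ksmooth(), the smoother referenced above. A small self-contained illustration (the signal here is synthetic, not real PULSE data):

```r
# Synthetic noisy wave sampled at 40 Hz for 5 seconds
t   <- seq(0, 5, by = 1 / 40)
val <- sin(2 * pi * 1.5 * t) + rnorm(length(t), sd = 0.3)

# Low bandwidth (near the default): removes only high-frequency noise
sm_low  <- ksmooth(t, val, kernel = "normal", bandwidth = 0.2)

# High bandwidth: much stronger smoothing, may flatten genuine peaks
sm_high <- ksmooth(t, val, kernel = "normal", bandwidth = 1)

plot(t, val, type = "l", col = "grey")
lines(sm_low,  col = "blue")
lines(sm_high, col = "red")
```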
doublecheck: logical, defaults to TRUE; should pulse_doublecheck() be used? (it is rare, but there are instances when it should be disabled).
lim_n: numeric, defaults to 3; the minimum number of peaks that must be detected in a time window for it to be flagged as a "keep".
lim_sd: numeric, defaults to 0.75; the maximum value of the sd of the time intervals between detected peaks for a window to be flagged as a "keep".
raw_v_smoothed: logical, defaults to TRUE; whether to also compute heart rates before applying smoothing; this increases the quality of the output but also doubles the processing time.
correct: logical, defaults to TRUE; if FALSE, data points with hz values likely double the real value are flagged BUT NOT CORRECTED. If TRUE, hz (as well as data, n, sd and ci) is corrected accordingly. Note that the correction is not reversible!
discard_channels: character vector containing the names of channels to be discarded from the analysis. discard_channels is forced to lowercase, but other than that, the exact names must be provided. Discarding unused channels can greatly speed up the workflow!
keep_raw_data: logical, defaults to TRUE; if set to FALSE, $data is set to FALSE (i.e., the raw data is discarded), dramatically reducing the disk space required to store the final output (usually by two orders of magnitude). HOWEVER, note that it will no longer be possible to use pulse_plot_raw()!
show_progress: logical, defaults to TRUE (as shown in the usage above); if TRUE, progress messages are provided.
See also PULSE() for all the relevant information about the processing of PULSE data.
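A typical call, shown as a hedged sketch (the folder path and channel names are hypothetical; adapt them to your own files):

```r
# Process a large folder of PULSE files in chunks of 5 files at a time;
# a job_folder will be created inside "my_dataset/" (allow_dir_create = TRUE
# acknowledges this). Stopping and re-running the same call resumes the job.
heart_rates <- PULSE_by_chunks(
  folder           = "my_dataset",
  allow_dir_create = TRUE,
  chunks           = 5,
  bind_data        = TRUE,
  discard_channels = c("s5", "s6"),  # hypothetical unused channel names
  show_progress    = TRUE
)
```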