segment: Segment a sound

Description

Finds syllables and bursts separated by background noise in long recordings (up to 1-2 hours of audio per file). Syllables are defined as continuous segments that seem to be different from noise based on amplitude and/or spectral similarity thresholds. Bursts are defined as local maxima in signal envelope that are high enough both in absolute terms (relative to the global maximum) and with respect to the surrounding region (relative to local minima). See vignette('acoustic_analysis', package = 'soundgen') for details.
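
The two burst criteria above (absolute height relative to the global maximum, and height relative to nearby local minima) can be illustrated with a minimal base-R sketch. This is not the soundgen implementation: the helper find_bursts, its arguments minAbs_dB and halfWin, and the toy envelope are hypothetical, and the trough comparison assumes the behavior that troughLocation = 'either' suggests.

```r
# Hypothetical helper (NOT the soundgen code): pick local maxima of an
# envelope in dB that pass an absolute and a local (peak-to-trough) test.
find_bursts = function(env_dB, minAbs_dB = -20, peakToTrough = 3, halfWin = 5) {
  n = length(env_dB)
  globalMax = max(env_dB)
  bursts = integer(0)
  for (i in 2:(n - 1)) {
    # 1) must be a local maximum
    if (env_dB[i] <= env_dB[i - 1] || env_dB[i] < env_dB[i + 1]) next
    # 2) absolute criterion: no more than |minAbs_dB| below the global maximum
    if (env_dB[i] < globalMax + minAbs_dB) next
    # 3) local criterion: high enough above the trough on either side
    left = min(env_dB[max(1, i - halfWin):i])
    right = min(env_dB[i:min(n, i + halfWin)])
    if ((env_dB[i] - left >= peakToTrough) ||
        (env_dB[i] - right >= peakToTrough)) {
      bursts = c(bursts, i)
    }
  }
  bursts
}

# Toy envelope: two strong peaks (frames 3 and 8) and a weak ripple (frame 5)
env_dB = c(-40, -30, -5, -30, -28, -29, -40, -10, -35, -40)
b_default = find_bursts(env_dB)                    # frames 3 and 8
b_strict = find_bursts(env_dB, peakToTrough = 32)  # only frame 3
```

The ripple at frame 5 is rejected by the absolute criterion, and raising peakToTrough removes the weaker of the two remaining peaks.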

Usage

segment(
  x,
  samplingRate = NULL,
  scale = NULL,
  from = NULL,
  to = NULL,
  shortestSyl = 40,
  shortestPause = 40,
  method = c("env", "spec", "mel")[3],
  propNoise = NULL,
  SNR = NULL,
  noiseLevelStabWeight = c(1, 0.25),
  windowLength = 40,
  step = windowLength/5,
  overlap = 80,
  reverbPars = list(reverbDelay = 70, reverbSpread = 130, reverbLevel = -35,
    reverbDensity = 50),
  interburst = NULL,
  peakToTrough = SNR + 3,
  troughLocation = c("left", "right", "both", "either")[4],
  summaryFun = c("median", "sd"),
  maxDur = 30,
  ptvStep = NULL,
  ptvTime = 0.5,
  ptvFreq = 1,
  reportEvery = NULL,
  cores = 1,
  plot = FALSE,
  savePlots = NULL,
  saveAudio = NULL,
  addSilence = 50,
  main = NULL,
  xlab = "",
  ylab = "Signal, dB",
  showLegend = FALSE,
  width = 900,
  height = 500,
  units = "px",
  res = NA,
  maxPoints = c(1e+05, 5e+05),
  specPlot = list(colorTheme = "bw"),
  contourPlot = list(lty = 1, lwd = 2, col = "green"),
  sylPlot = list(lty = 1, lwd = 2, col = "blue"),
  burstPlot = list(pch = 8, cex = 3, col = "red"),
  ...
)

Value

Returns a list with the following components:

bursts: the time and amplitude of each burst plus the pause before it (ms)
summary: a summary of temporal descriptives per file, including PTV = sum of all syllable durations / total duration of audio
noise: the spectrum of background noise
ptv: contours of instantaneous proportion of time vocalizing: time = time in ms, on = 1 if this frame is part of a syllable and 0 otherwise, ptv_lowpass = "on" after applying a low-pass filter over ptvFreq Hz, ptv_conv = "on" after convolution with a half-Gaussian filter with SD = ptvTime
signal: a dataframe containing the time and value of the contour used to segment the sound (some form of envelope or the difference of spectrum from background noise, depending on the method)
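
As a rough illustration of the ptv component, the sketch below builds a binary on/off contour and smooths it with a half-Gaussian kernel. It is an assumption-laden approximation rather than the package's code: the frame step, the kernel SD in ms, the backward-looking (causal) direction of the kernel, and the renormalization at the start of the signal are all choices made here for illustration.

```r
# Illustrative only: a binary "vocalizing" contour sampled every `step` ms
step = 10                                                   # ms per frame (hypothetical)
on = rep(c(0, 1, 0, 1, 0), times = c(20, 30, 10, 30, 10))   # 100 frames = 1 s

# Overall proportion of time vocalizing = mean of the binary contour
ptv_overall = mean(on)

# Half-Gaussian weights: only lags >= 0, truncated at 3 SD, summing to 1
half_gauss = function(sd_ms, step) {
  lags = seq(0, 3 * sd_ms, by = step)
  w = dnorm(lags, mean = 0, sd = sd_ms)
  w / sum(w)
}
w = half_gauss(sd_ms = 100, step = step)

# Causal convolution: each frame is a weighted mean of itself and past frames
# (weights are renormalized near the start, where fewer past frames exist)
ptv_conv = sapply(seq_along(on), function(i) {
  idx = i - seq_along(w) + 1
  keep = idx >= 1
  sum(w[keep] * on[idx[keep]]) / sum(w[keep])
})
```

A larger SD gives a smoother contour that reacts more slowly to the onset and offset of syllables.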

Arguments

x: path to a folder, one or more wav or mp3 files c('file1.wav', 'file2.mp3'), Wave object, numeric vector, or a list of Wave objects or numeric vectors
samplingRate: sampling rate of x (only needed if x is a numeric vector)
scale: maximum possible amplitude of input used for normalization of input vector (only needed if x is a numeric vector)
from, to: if NULL (default), analyzes the whole sound, otherwise from...to (s)
shortestSyl: minimum acceptable length of syllables, ms
shortestPause: minimum acceptable break between syllables, ms (syllables separated by shorter pauses are merged)
method: the signal used to search for syllables: 'env' = Hilbert-transformed amplitude envelope, 'spec' = spectrogram, 'mel' = mel-transformed spectrogram (see tuneR::melfcc)
propNoise: the proportion of non-zero sound assumed to represent background noise, 0 to 1 (note that complete silence is not considered, so padding with silence won't affect the algorithm). Set to 0 to skip correcting SNR by background noise level
SNR: expected signal-to-noise ratio (dB above noise), which determines the threshold for syllable detection. The meaning of "dB" here is approximate since the "signal" may be different from sound intensity, depending on the method
noiseLevelStabWeight: a vector of length 2 specifying the relative weights of the overall signal level vs. stability when attempting to automatically locate the regions that represent noise. Increasing the weight of stability tends to accentuate the beginning and end of each syllable.
windowLength: length of FFT window, ms (multiple values in a vector produce a multi-resolution spectrogram)
step: you can override overlap by specifying FFT step, ms - a vector of the same length as windowLength (NB: because digital audio is sampled at discrete time intervals of 1/samplingRate, the actual step and thus the time stamps of STFT frames may be slightly different, e.g., 24.98866 instead of 25.0 ms)
overlap: overlap between successive FFT frames, %
reverbPars: parameters passed on to reverb to attempt to cancel the effects of reverberation or echo, which otherwise tend to merge short and loud segments like rapid barks
interburst: minimum time between two consecutive bursts (ms). Defaults to the average detected (syllable + pause) / 2
peakToTrough: to qualify as a burst, a local maximum has to be at least peakToTrough dB above the left and/or right local trough(s) (controlled by troughLocation) over the analysis window (controlled by interburst). Defaults to SNR + 3 dB
troughLocation: should local maxima be compared to the trough on the left and/or right of it? Values: 'left', 'right', 'both', 'either'
summaryFun: functions used to summarize each acoustic characteristic; see analyze
maxDur: long files are split into chunks maxDur s in duration to avoid running out of RAM; the outputs for all fragments are glued together, but plotting is switched off. Note that the noise profile is estimated in each chunk separately, so set maxDur low if the background noise is highly variable
ptvStep, ptvTime, ptvFreq: the instantaneous proportion of time vocalizing (PTV) is calculated by producing a binary (sound on/off) contour with a step of ptvStep ms and convolving it with a half-Gaussian filter with SD = ptvTime ($ptv_conv) and by low-pass filtering it over ptvFreq ($ptv_lowpass). ptvStep defaults to the same value as step
reportEvery: when processing multiple inputs, report estimated time left every ... iterations (NULL = default, NA = don't report)
cores: number of cores for parallel processing
plot: if TRUE, produces a segmentation plot
savePlots: full path to the folder in which to save the plots (NULL = don't save, '' = same folder as audio)
saveAudio: full path to the folder in which to save audio files (one per detected syllable)
addSilence: if syllables are saved as separate audio files, they can be padded with some silence (ms)
xlab, ylab, main: main plotting parameters
showLegend: if TRUE, shows a legend for thresholds
width, height, units, res: parameters passed to png if the plot is saved
maxPoints: the maximum number of "pixels" in the oscillogram (if any) and spectrogram; good for quickly plotting long audio files; defaults to c(1e5, 5e5); does not affect reassigned spectrograms
specPlot: a list of graphical parameters for displaying the spectrogram (if method = 'spec' or 'mel'); set to NULL to hide the spectrogram
contourPlot: a list of graphical parameters for displaying the signal contour used to detect syllables (see details)
sylPlot: a list of graphical parameters for displaying the syllables
burstPlot: a list of graphical parameters for displaying the bursts
...: other graphical parameters passed to graphics::plot
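
The relation between windowLength, overlap and step, and the small discrepancy between the requested and actual step, can be checked with a few lines of arithmetic. The sampling rate of 44100 Hz and the truncation to a whole number of samples are assumptions made for this sketch; they happen to reproduce the 24.98866 ms figure quoted for step, but the package may compute the step differently.

```r
# Default settings from the Usage section
windowLength = 40   # ms
overlap = 80        # %

# An overlap of 80% corresponds to a step of windowLength / 5
step_from_overlap = windowLength * (1 - overlap / 100)   # about 8 ms

# Requested step vs. the step that fits a whole number of samples
samplingRate = 44100    # Hz (assumed for illustration)
step_requested = 25     # ms
step_samples = floor(step_requested * samplingRate / 1000)   # 1102 samples
step_actual = step_samples / samplingRate * 1000             # about 24.98866 ms
```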

Details

Algorithm: for each chunk at most maxDur long, first the audio recording is partitioned into signal and noise regions: the quietest and most stable regions are located, and noise threshold is defined from a user-specified proportion of noise in the recording (propNoise) or, if propNoise = NULL, from the lowest local maximum in the density function of a weighted product of amplitude and stability (that is, we assume that quiet and stable regions are likely to represent noise).

Once we know what the noise looks like - in terms of its typical amplitude and/or spectrum - we derive signal contour as its difference from noise at each time point. If method = 'env', this is Hilbert transform minus noise, and if method = 'spec' or 'mel', this is the inverse of cosine similarity between the spectrum of each frame and the estimated spectrum of noise weighted by amplitude. By default, signal-to-noise ratio (SNR) is estimated as half-median of above-noise signal, but it is recommended that this parameter is adjusted by hand to suit the purposes of segmentation, as it is the key setting that controls the balance between false negatives (missing faint signals) and false positives (hallucinating signals that are actually noise). Note also that effects of echo or reverberation can be taken into account: syllable detection threshold may be raised following powerful acoustic bursts with the help of the reverbPars argument. At the final stage, continuous "islands" SNR dB above noise level are detected as syllables, and "peaks" on the islands are detected as bursts.

The algorithm is very flexible, but the parameters may be hard to optimize by hand. If you have an annotated sample of the sort of audio you are planning to analyze, with syllables and/or bursts counted manually, you can use it for automatic optimization of control parameters (see optimizePars).
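
The final stage described above (finding "islands" that exceed the noise level by SNR dB, merging syllables separated by short pauses, and discarding very short syllables) can be sketched in base R as follows. This is a simplified stand-in, not the package's code: the helper find_syllables and the toy contour are hypothetical, the contour is assumed to be already expressed in dB above noise, and reverberation handling is left out.

```r
# Hypothetical helper (NOT the soundgen code): threshold a contour given in
# dB above noise and return syllable boundaries in ms.
find_syllables = function(contour_dB, step, SNR = 3,
                          shortestSyl = 40, shortestPause = 40) {
  above = contour_dB > SNR
  # Fill interior pauses shorter than shortestPause (merges adjacent islands)
  r = rle(above)
  n = length(r$values)
  interior = seq_len(n) > 1 & seq_len(n) < n
  fill = !r$values & interior & (r$lengths * step < shortestPause)
  r$values[fill] = TRUE
  above = inverse.rle(r)
  # Locate the remaining islands and drop those shorter than shortestSyl
  r = rle(above)
  ends = cumsum(r$lengths)
  starts = ends - r$lengths + 1
  keep = r$values & (r$lengths * step >= shortestSyl)
  data.frame(start = (starts[keep] - 1) * step, end = ends[keep] * step)
}

# Toy contour with 10-ms frames: two islands split by a 20-ms pause,
# then a 100-ms pause followed by a 20-ms blip
contour_dB = c(rep(0, 5), rep(10, 6), rep(0, 2), rep(10, 5),
               rep(0, 10), rep(10, 2), rep(0, 5))
syl = find_syllables(contour_dB, step = 10)
```

With the default thresholds, the first two islands are merged into one syllable from 50 to 180 ms, and the 20-ms blip is discarded as too short.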

Examples

sound = soundgen(nSyl = 4, sylLen = 100, pauseLen = 70,
                 attackLen = 20, amplGlobal = c(0, -20),
                 pitch = c(368, 284), temperature = .001)
# add noise so SNR decreases from 20 to 0 dB from syl1 to syl4
sound = sound + runif(length(sound), -10 ^ (-20 / 20), 10 ^ (-20 / 20))
# osc(sound, samplingRate = 16000, dB = TRUE)
# spectrogram(sound, samplingRate = 16000, osc = TRUE)
# playme(sound, samplingRate = 16000)

s = segment(sound, samplingRate = 16000, plot = TRUE)
str(s)

# customizing the plot
segment(sound, samplingRate = 16000, plot = TRUE,
        sylPlot = list(lty = 2, col = 'gray20'),
        burstPlot = list(pch = 16, col = 'blue'),
        specPlot = list(col = rev(heat.colors(50))),
        xlab = 'Some custom label', cex.lab = 1.2,
        showLegend = TRUE,
        main = 'My awesome plot')

# plot the PTV contour (proportion of time vocalizing)
plot(s$ptv$time, s$ptv$on, type = 'l', xlab = 'Time, ms',
  ylab = 'Prop. time voc.')
points(s$ptv$time, s$ptv$ptv_conv, type = 'l', col = 'blue')
points(s$ptv$time, s$ptv$ptv_lowpass, type = 'l', col = 'red')
s$summary$ptv; mean(s$ptv$ptv_conv); mean(s$ptv$ptv_lowpass) # similar

if (FALSE) {
# set SNR manually to control detection threshold
s = segment(sound, samplingRate = 16000, SNR = 1, plot = TRUE)

# simple intensity threshold (anything >5 dB is signal)
segment(sound, 16000, method = 'env', SNR = 5, plot = TRUE,
  # very little smoothing for maximally precise timing
  windowLength = 5, step = 1,
  # don't correct SNR based on estimated background noise
  propNoise = 0,
  # don't use dynamic thresholds to cancel reverb
  reverbPars = NULL
)

# different ways to calculate instantaneous PTV
s2 = segment(sound, 16000, ptvTime = 2, ptvFreq = 5)
s3 = segment(sound, 16000, ptvTime = 0.05, ptvFreq = 0.5)
plot(s2$ptv$time, s2$ptv$on, type = 'l', xlab = 'Time, ms',
  ylab = 'Prop. time voc.')
points(s2$ptv$time, s2$ptv$ptv_conv, type = 'l', col = 'blue')
points(s2$ptv$time, s2$ptv$ptv_lowpass, type = 'l', col = 'yellow')
points(s3$ptv$time, s3$ptv$ptv_conv, type = 'l', col = 'purple')
points(s3$ptv$time, s3$ptv$ptv_lowpass, type = 'l', col = 'orange')

# Download 260 sounds from the supplements to Anikin & Persson (2017) at
# http://cogsci.se/publications.html
# unzip them into a folder, say '~/Downloads/temp260'
myfolder = '~/Downloads/temp260'  # 260 .wav files live here
s = segment(myfolder, propNoise = .05, SNR = 3)

# Check accuracy: import a manual count of syllables (our "key")
key = segmentManual  # a vector of 260 integers
trial = as.numeric(s$summary$nBursts)
cor(key, trial, use = 'pairwise.complete.obs')
boxplot(trial ~ as.integer(key), xlab = 'key')
abline(a = 0, b = 1, col = 'red')

# or look at the detected syllables instead of bursts:
cor(key, s$summary$nSyl, use = 'pairwise.complete.obs')
}