dpseg: dpseg : linear segmentation by dynamic programming

Description

dpseg splits a curve (x,y data) into linear segments by a straight forward dynamic programming recursion: $$S_j = max(S_{i-jumps} + score(i,j) -P)$$ where score is a measure of the goodnes of the fit of a linear regression (equiv. to lm(y~x)) between data points $i<j$. The default scoring function is simply the negative variance of residuals of the linear regression (see arguments type and scoref). P is a break-point penality that implicitly regulates the number of segments (higher P: longer segments), and jumps==1 allows for disjoint segments. The arguments minl and maxl specify minimal ($i\le j-minl$) and maximal ($i\ge j-maxl$) segment lengths, which allows to significantly decrease memory usage when expected segment lengths are known.

Usage

dpseg(x, y, maxl, jumps = FALSE, P = 0, minl = 3, S0 = 1,
  type = "var", scoref, verb = 1, move, store.values = TRUE,
  store.matrix = FALSE, add.lm = FALSE, recursion, backtrace, ...)

Arguments

x-values, not used if y is a scoring function matrix

y-values, or a pre-calculated scoring function matrix $SCR_{i,j}$ (eg. from a previous run of dpseg). See section "Value" below for details on the structure $SCR_{i,j}$.

maxl

maximal segment length, $i\ge j-maxl$

jumps

allow for jumps between segments, if TRUE segment ends are 1 index left of the segment starts

break-point penalty, increase to get longer segments with lower scores (eg. higher residual variance)

minl

minimal segment length, $i\le j-minl$

initialization of $S_0$, choose high enough to avoid length 1 cutoffs at start

type

type of scoring function: available are "var" for "variance of residuals", "cor" for Pearson correlation, or "r2" for r-squared; see the package vignette("dpseg") for details.

scoref

alternative scoring function

verb

print progress messages

move

logical indicating whether move is required in backtracing, required for the alternative recursion $S_i + score(i+1,j)$

store.values

store scoring values (linear regression results)

store.matrix

store the fitscore matrix

add.lm

add a linear fit using R base lm for final segments; may save memory/speed if store.values==FALSE

recursion

internal recursion function to be used for segmentation; used for debugging, benchmarking and development, and required for putative novel scoring functions scoref

backtrace

internal function to be used for back-tracing; used for debugging, benchmarking and development, and may be required to test novel scoring functions scoref and/or recursion

...

further arguments to recursion

Value

Returns a list object of class dpseg (with print.dpseg plot.dpseg and predict.dpseg methods). The main result of the algorithm is a table (data.frame) of predicted segments in list object segments. The original data, run parameters and (optionally) additional data calculated and used by the algorithm are also returned.

segments:: main result table: a data.frame that lists the start and end x-values of the segments, the start and end indices (i,j) in the data vectors, the linear regression coefficients and goodness-of-fit measures for the segments (intercept, slope, r-squared, variance of residuals). If dpseg was called with a pre-calculated scoring matrix, the table only contains start and end indices i,j. If option add.lm=TRUE or the result object was sent through function addLm the table additionally contains results from R's lm, indicated by an ".lm" suffix.
S:: results of the recursion, ie. $S_j$ in above equation.
imax:: vector $j=1,\dots,n$, storing the $i_{max}$ that yielded $S_j$, ie., the sole input for the backtracing function.
values:: linear regression coefficients and measures for the segment ending at $j$ and starting at $i_{max}(j)$. Only present if store.valus=TRUE.
SCR:: scoring function matrix $SCR_{i,j} = score(i,j)$ where positions j are the columns and i the rows; a banded matrix with non-NA values between $i\le j-minl$ and $i\ge j-maxl$. Note, that this matrix can be re-used in subsequent calls as dpseg(y=previous$SCR) which runs much faster and allows to efficiently scan for alternative parameters. Only present if store.matrix=TRUE.
fits:: result objects from lm. Only present if add.lm=TRUE.
traceback:: result of the call to the backtracing function: ends of the segments.
xy:: original x/y data (xy.coords).
removed:: index of NA/Inf values that were removed before running the alorithm.
parameters:: used parameters P, jumps, maxl and minl.

Dependencies

The package strictly depends only on RcppEigen. All other dependencies are usually present in a basic installation (stats, graphics, grDevices).

Details

See the vignette("dpseg") for the theory and details on the choice of scoring functions and selection of the penalty parameter P.

Examples

Run this code

# NOT RUN {
## calculate linear segments in semi-log bacterial growth data
## NOTE: library loads bacterial growth curve data as data.frame oddata
segs <- dpseg(x=oddata$Time, y=log(oddata$A3), minl=5, P=0.0001, verb=1)

## inspect resulting segments
print(segs)

## plot results (also see the movie method)
plot(segs, delog=TRUE, log="y")

## predict method
plot(predict(segs), type="l")

# }

Run the code above in your browser using DataLab