dpseg
splits a curve (x,y data) into linear segments by a
straight forward dynamic programming recursion: $$S_j =
max(S_{i-jumps} + score(i,j) -P)$$ where score is a measure of the
goodnes of the fit of a linear regression (equiv. to
lm(y~x)
) between data points \(i<j\). The default scoring
function is simply the negative variance of residuals of the linear
regression (see arguments type
and scoref
). P
is a break-point penality that implicitly regulates the number of
segments (higher P: longer segments), and jumps==1
allows
for disjoint segments. The arguments minl
and maxl
specify minimal (\(i\le j-minl\)) and maximal (\(i\ge j-maxl\))
segment lengths, which allows to significantly decrease memory
usage when expected segment lengths are known.
dpseg(x, y, maxl, jumps = FALSE, P = 0, minl = 3, S0 = 1,
type = "var", scoref, verb = 1, move, store.values = TRUE,
store.matrix = FALSE, add.lm = FALSE, recursion, backtrace, ...)
x-values, not used if y
is a scoring function
matrix
y-values, or a pre-calculated scoring function matrix
\(SCR_{i,j}\) (eg. from a previous run of dpseg
). See
section "Value" below for details on the structure \(SCR_{i,j}\).
maximal segment length, \(i\ge j-maxl\)
allow for jumps between segments, if TRUE
segment ends are 1 index left of the segment starts
break-point penalty, increase to get longer segments with lower scores (eg. higher residual variance)
minimal segment length, \(i\le j-minl\)
initialization of \(S_0\), choose high enough to avoid length 1 cutoffs at start
type of scoring function: available are "var" for
"variance of residuals", "cor" for Pearson correlation, or "r2"
for r-squared; see the package vignette("dpseg")
for
details.
alternative scoring function
print progress messages
logical indicating whether move is required in backtracing, required for the alternative recursion \(S_i + score(i+1,j)\)
store scoring values (linear regression results)
store the fitscore matrix
add a linear fit using R base lm
for final
segments; may save memory/speed if store.values==FALSE
internal recursion function to be used for
segmentation; used for debugging, benchmarking and development,
and required for putative novel scoring functions scoref
internal function to be used for back-tracing;
used for debugging, benchmarking and development, and may be
required to test novel scoring functions scoref
and/or
recursion
further arguments to recursion
Returns a list object of class dpseg
(with
print.dpseg
plot.dpseg
and predict.dpseg
methods). The main result of the algorithm is a table
(data.frame
) of predicted segments in list object
segments
. The original data, run parameters and
(optionally) additional data calculated and used by the
algorithm are also returned.
main result table: a data.frame
that lists
the start and end x-values of the segments, the start and end
indices (i,j) in the data vectors, the linear regression
coefficients and goodness-of-fit measures for the segments
(intercept, slope, r-squared, variance of residuals). If
dpseg
was called with a pre-calculated scoring matrix,
the table only contains start and end indices i,j. If option
add.lm=TRUE
or the result object was sent through
function addLm
the table additionally contains
results from R's lm
, indicated by an ".lm" suffix.
results of the recursion, ie. \(S_j\) in above equation.
vector \(j=1,\dots,n\), storing the \(i_{max}\) that yielded \(S_j\), ie., the sole input for the backtracing function.
linear regression coefficients and measures for the
segment ending at \(j\) and starting at
\(i_{max}(j)\). Only present if store.valus=TRUE
.
scoring function matrix \(SCR_{i,j} = score(i,j)\)
where positions j are the columns and i the rows; a banded
matrix with non-NA values between \(i\le j-minl\) and
\(i\ge j-maxl\). Note, that this matrix can be re-used in
subsequent calls as dpseg(y=previous$SCR)
which runs
much faster and allows to efficiently scan for alternative
parameters. Only present if store.matrix=TRUE
.
result objects from lm
. Only present if
add.lm=TRUE
.
result of the call to the backtracing function: ends of the segments.
original x/y data (xy.coords).
index of NA/Inf values that were removed before running the alorithm.
used parameters P
, jumps
,
maxl
and minl
.
The package strictly depends only on RcppEigen
.
All other dependencies are usually present in a
basic installation (stats
, graphics
, grDevices
).
See the vignette("dpseg")
for the theory and details on the
choice of scoring functions and selection of the penalty parameter
P.
# NOT RUN {
## calculate linear segments in semi-log bacterial growth data
## NOTE: library loads bacterial growth curve data as data.frame oddata
segs <- dpseg(x=oddata$Time, y=log(oddata$A3), minl=5, P=0.0001, verb=1)
## inspect resulting segments
print(segs)
## plot results (also see the movie method)
plot(segs, delog=TRUE, log="y")
## predict method
plot(predict(segs), type="l")
# }
Run the code above in your browser using DataLab