Learn R Programming

segmented (version 1.6-4)

selgmented: Selecting the number of breakpoints in segmented regression

Description

This function selects (and estimates) the number of breakpoints of the segmented relationship according to the BIC/AIC criterion or sequential hypothesis testing.

Usage

selgmented(olm, seg.Z, Kmax=2, type = c("score", "bic", "davies", "aic"), alpha = 0.05,
  control = seg.control(), refit = FALSE, stop.if = 6, return.fit = TRUE, 
  bonferroni = FALSE, msg = TRUE, plot.ic = FALSE, th = NULL, G = 1, check.dslope = TRUE)

Value

The returned object depends on argument return.fit. If FALSE, the returned object is a list with some information on the compared models (i.e. the BIC values), otherwise a classical 'segmented' object with the component selection.psi including the aforementioned information. See segmented for details.

Arguments

olm

A starting lm or glm object, or a simple numerical vector meaning the response variable.

seg.Z

A one-side formula for the segmented variable. Only one term can be included, and it can be omitted if olm includes just one covariate.

alpha

The fixed type I error probability when sequential hypothesis testing is carried out (i.e. type='score' or 'davies'). It is also used when type='bic' (or type='aic') and check.dslope=TRUE to remove the breakpoints based on the slope diffence t-value.

type

Which criterion should be used? Options score and davies allow to carry out sequential hypothesis testing with no more than 2 breakpoints (Kmax=2). Alternatively, the number of breakpoints can be selected via the BIC (or AIC) with virtually no upper bound for Kmax.

control

See seg.control.

refit

If TRUE, the final selected model is re-fitted using arguments in control, typically with bootstrap restarting. Set refit=FALSE to speed up computation (and possibly accepting near-optimal estimates). Ignored if type='score' or type='davies'.

stop.if

An integer. If, when trying an increasing number of breakpoints, there occur stop.if fits with higher AIC/BIC values, the search is interrupted. Set a large number, stop.if=100 say, if you want to assess the fits for all values 0, 1, 2, ..., Kmax. Ignored if type='score' or type='davies'.

return.fit

If TRUE, the fitted model (with the number of breakpoints selected according to type) is returned.

bonferroni

If TRUE, the Bonferroni correction is employed, i.e. alpha/Kmax (rather than alpha) is always taken as threshold value to reject or not. If FALSE, alpha is used in the second level of hypothesis testing. It is also effective when type="bic" (or 'aic') and check.dslope=TRUE, see Details.

Kmax

The maximum number of breakpoints being tested. If type='bic' or type='aic' any integer value can be specified; otherwise at most Kmax=2 breakpoints can be tested via the Score or Davies statistics.

msg

If FALSE the final fit is returned silently with the selected number of breakpoints, otherwise the message including information about the selection procedure (i.e. the BIC values), and the possible warnings are also printed.

plot.ic

If TRUE the information criterion values with respect to the number of breakpoints are plotted. Ignored if type='score' or type='davies'.

th

When a large number of breakpoints is being tested, it could happen that 2 estimated breakpoints are too close each other, and only one can be retained. Thus if the difference between two breakpoints is less or equal to th, one (the first) breakpoint is removed. Of course, th depends on the x scale: Integers, like 5 or 10, are appropriate if the covariate is the observation index. Default (NULL) means th=diff(range(x))/100.Ignored if type='score' or type='davies'.

G

Number of sub-intervals to consider to search for the breakpoints. See Details.

check.dslope

Logical. Effective only if type="bic" or 'aic'. After breakpoint selection performed by BIC/AIC, should the \(t\) values of the slope differences be checked? See Details.

Author

Vito M. R. Muggeo

Details

The function uses properly the functions segmented, pscore.test or davies.test to select the 'optimal' number of breakpoints 0,1,...,Kmax. If type='bic' or 'aic', the procedure stops if the last stop.if fits have increasing values of the information criterion. When \(G>1\) the dataset is split into \(G\) groups and the search is carried out separately within each group. \(G>1\), for instance G=3 or 4, is suggested when there are many breakpoints not evenly spaced in the covariate range.

Note Kmax is tacitely reduced in order to have at least 1 residual df in the model with Kmax changepoints. Namely, if \(n=20\) the largest Kmax allowed is 8.

References

Muggeo V (2020) Selecting number of breakpoints in segmented regression: implementation in the R package segmented https://www.researchgate.net/publication/343737604

See Also

segmented, pscore.test, davies.test

Examples

Run this code

set.seed(12)
xx<-1:100
zz<-runif(100)
yy<-2+1.5*pmax(xx-35,0)-1.5*pmax(xx-70,0)+15*pmax(zz-.5,0)+rnorm(100,0,2)
dati<-data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)

os <-selgmented(out.lm) #selection (Kmax=2) via the Score test (default)

os <-selgmented(out.lm, type="bic", Kmax=3) #BIC-based selection

if (FALSE) {
########################################
#Selecting a large number of breakpoints

b <- c(-1,rep(c(1.5,-1.5),l=15))
psi <- seq(.1,.9,l=15)
n <- 2000
x <- 1:n/n
X <- cbind(x, outer(x,psi,function(x,y)pmax(x-y,0)))
mu <- drop(tcrossprod(X,t(b)))
y<- mu + rnorm(n)*.02 
par(mfrow=c(1,2))

#select number of breakpoints via the BIC (and plot it)
o<-selgmented(y, Kmax=20, type='bic', plot.ic=TRUE, check.dslope = FALSE) 
plot(o, res=TRUE, col=2, lwd=3)

# select via the BIC + check on the slope differences (default)
o1 <-selgmented(y, Kmax=20, type='bic') #check.dslope = TRUE by default
plot(o1, add=TRUE, col=3)


##################################################
#a large number of breakpoints not evenly spaced.  
b <- c(-1,rep(c(2,-2),l=10))
psi <- seq(.5,.9,l=10)
n <- 2000
x <- 1:n/n
X <- cbind(x, outer(x,psi,function(x,y)pmax(x-y,0)))
mu <- drop(tcrossprod(X,t(b)))
y<- mu + rnorm(n)*.02 

#run selgmented with G>1. G=3 or 4 recommended. 
#note G=1 does not return the right number of breaks  
o1 <-selgmented(y, type="bic", Kmax=20, G=4)
}

Run the code above in your browser using DataLab