fixreg: Linear Regression Fixed Point Clusters

Description

Computes linear regression fixed point clusters (FPCs), i.e., subsets of the data that consist exactly of the non-outliers w.r.t. themselves and may be interpreted as generated from a homogeneous linear regression relation between independent and dependent variable. FPCs may overlap, are not necessarily exhaustive, and do not need a specification of the number of clusters.

Note that while fixreg has many parameters, usually only one or a few of them need to be specified, cf. the examples. The philosophy is to allow much flexibility, but to always provide sensible defaults.

Usage

fixreg(indep=rep(1,n), dep, n=length(dep),
                    p=ncol(as.matrix(indep)),
                    ca=NA, mnc=NA, mtf=3, ir=NA, irnc=NA,
                    irprob=0.95, mncprob=0.5, maxir=20000, maxit=5*n,
                    distcut=0.85, init.group=list(), 
                    ind.storage=FALSE, countmode=100, 
                    plot=FALSE)
## S3 method for class 'rfpc':
summary(object, ...)
## S3 method for class 'summary.rfpc':
print(x, maxnc=30, ...)
## S3 method for class 'rfpc':
plot(x, indep=rep(1,n), dep, no, bw=TRUE,
                      main=c("Representative FPC No. ",no),
                      xlab="Linear combination of independents",
                      ylab=deparse(substitute(indep)),
                      xlim=NULL, ylim=range(dep), 
                      pch=NULL, col=NULL,...)
## S3 method for class 'rfpc':
fpclusters(object, indep=NA, dep=NA, ca=object$ca, ...)
rfpi(indep, dep, p, gv, ca, maxit, plot)

Arguments

indep

numerical matrix or vector. Independent variables. Leave out for clustering one-dimensional data. fpclusters.rfpc does not need specification of indep if fixreg was run with ind.storage=TRUE.

dep

numerical vector. Dependent variable. fpclusters.rfpc does not need specification of dep if fixreg was run with ind.storage=TRUE.

n

optional positive integer. Number of cases.

p

optional positive integer. Number of independent variables.

ca

optional positive number. Tuning constant, specifying required cluster separation. By default determined automatically as a function of n and p, see function can, Henni

mnc

optional positive integer. Minimum size of clusters to be reported. By default determined automatically as a function of mncprob. See Hennig (2002a).

mtf

optional positive integer. FPCs must be found at least mtf times to be reported by summary.rfpc.

ir

optional positive integer. Number of algorithm runs. By default determined automatically as a function of n, p, irnc, irprob, mtf, maxir. See function

irnc

optional positive integer. Size of the smallest cluster to be found with approximated probability irprob.

irprob

optional value between 0 and 1. Approximated probability for a cluster of size irnc to be found.

mncprob

optional value between 0 and 1. Approximated probability for a cluster of size mnc to be found.

maxir

optional integer. Maximum number of algorithm runs.

maxit

optional integer. Maximum number of iterations per algorithm run (usually an FPC is found much earlier).

distcut

optional value between 0 and 1. A similarity measure between FPCs, given in Hennig (2002a), and the corresponding Single Linkage groups of FPCs with similarity larger than distcut are computed. A single representative FPC is s

init.group

optional list of logical vectors of length n. Every vector indicates a starting configuration for the fixed point algorithm. This can be used for datasets with high dimension, where the vectors of init.group indic

ind.storage

optional logical. If TRUE, then all indicator vectors of found FPCs are given in the value of fixreg. May need lots of memory, but is a bit faster.

countmode

optional positive integer. fixreg shows a message after every countmode algorithm runs.

plot

optional logical. If TRUE, you get a scatterplot of first independent vs. dependent variable at each iteration.

object

object of class rfpc, output of fixreg.

maxnc

positive integer. Maximum number of FPCs to be reported.

no

positive integer. Number of the representative FPC to be plotted.

bw

optional logical. If TRUE, the plot is black/white and the FPC is indicated by a different symbol. Otherwise the FPC is indicated in red.

main

plot title.

xlab

label for x-axis.

ylab

label for y-axis.

xlim

plotted range of x-axis. If NULL, the range of the plotted linear combination of independent variables is used.

ylim

plotted range of y-axis.

pch

plotting symbol, see par. If NULL, the default is used.

col

plotting color, see par. If NULL, the default is used.

gv

logical vector of length n. Indicates the initial configuration for the fixed point algorithm.

...

additional parameters to be passed to plot (no effects elsewhere).

Value

fixreg returns an object of class rfpc. This is a list containing the components nc, g, coefs, vars, nfound, er, tsc, ncoll, grto, imatrix, smatrix, stn, stfound, sfpc, ssig, sto, struc, n, p, ca, ir, mnc, mtf, distcut.
summary.rfpc returns an object of class summary.rfpc. This is a list containing the components coefs, vars, stfound, stn, sn, ser, tsc, sim, ca, ir, mnc, mtf.
fpclusters.rfpc returns a list of indicator vectors for the representative FPCs of stable groups.
rfpi returns a list with the components coef, var, g, coll, ca.

nc

integer. Number of FPCs.

g

list of logical vectors. Indicator vectors of FPCs. FALSE if ind.storage=FALSE.

coefs

list of numerical vectors. Regression coefficients of FPCs. In summary.rfpc, only for representative FPCs of stable groups and sorted according to stfound.

vars

list of numbers. Error variances of FPCs. In summary.rfpc, only for representative FPCs of stable groups and sorted according to stfound.

nfound

vector of integers. Number of findings for the FPCs.

er

numerical vector. Expectation ratios of FPCs. Can be taken as a stability measure.

tsc

integer. Number of algorithm runs leading to too small or too seldom found FPCs.

ncoll

integer. Number of algorithm runs where collinear regressor matrices occurred.

grto

vector of integers. Numbers of FPCs to which algorithm runs led, which were started by init.group.

imatrix

vector of integers. Size of intersection between FPCs. See sseg.

smatrix

numerical vector. Similarities between FPCs. See sseg.

stn

integer. Number of representative FPCs of stable groups. In summary.rfpc sorted according to stfound.

stfound

vector of integers. Number of findings of members of all groups of FPCs. In summary.rfpc sorted according to stfound.

sfpc

vector of integers. Numbers of representative FPCs.

ssig

vector of integers. As sfpc, but only for stable groups.

sto

vector of integers. Number of representative FPC of most, 2nd most, ..., often found group of FPCs.

struc

vector of integers. Number of group an FPC belongs to.

n

see arguments.

p

see arguments.

ca

see arguments.

ir

see arguments.

mnc

see arguments.

mtf

see arguments.

distcut

see arguments.

sn

vector of integers. Number of points of representative FPCs.

ser

numerical vector. Expectation ratio for stable groups.

sim

vector of integers. Size of intersections between representative FPCs of stable groups. See sseg.

coef

vector of regression coefficients.

var

error variance.

g

logical indicator vector of iterated FPC.

coll

logical. TRUE means that singular covariance matrices occurred during the iterations.

Details

A linear regression FPC is a data subset that reproduces itself under the following operation: compute the linear regression and the error variance estimator for the data subset, then determine all points of the dataset for which the squared residual is smaller than ca times the error variance. Fixed points of this operation can be considered as clusters, because they contain only non-outliers (as defined by the above procedure) and all other points are outliers w.r.t. the subset.

fixreg performs ir fixed point algorithms started from random subsets of size p+2 to look for FPCs. Additionally, an algorithm is started from the whole dataset, and algorithms are started from the subsets specified in init.group.

Usually some of the FPCs are unstable, and more than one FPC may correspond to the same significant pattern in the data. Therefore the number of FPCs is reduced: FPCs with fewer than mnc points are ignored. Then a similarity matrix is computed between the remaining FPCs. Similarity between sets is defined as the ratio between 2 times the size of the intersection and the sum of the sizes of both sets. The Single Linkage clusters (groups) of level distcut are computed, i.e. the connectivity components of the graph where edges are drawn between FPCs with similarity larger than distcut. Groups of FPCs whose members are found mtf times or more are considered stable enough. A representative FPC is chosen for every Single Linkage cluster of FPCs according to the maximum expectation ratio ser. The expectation ratio is the ratio between the number of findings of an FPC and the estimated expectation of the number of findings of an FPC of this size; it is computed by clusexpect. Usually only the representative FPCs of stable groups are of interest.

The choice of the involved tuning constants such as ca and ir is discussed in detail in Hennig (2002a). Statistical theory is presented in Hennig (2003).

Generally, the default settings are recommended for fixreg. In cases where they lead to too many algorithm runs (e.g., always for p>4), the use of init.group together with mtf=1 and ir=0 is useful. Occasionally, irnc may be chosen smaller than the default if smaller clusters are of interest, but this may lead to too many clusters and too many algorithm runs. Decreasing ca will often lead to too many clusters, even for homogeneous data. Increasing ca will produce only very strongly separated clusters. Both may be of interest occasionally.
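
The reduction step described above (similarity between FPCs, Single Linkage groups at level distcut, stability via mtf) can be illustrated with a minimal base R sketch. The helper names fpc_similarity and fpc_groups, the toy indicator vectors and the counts in nfound are assumptions made for illustration only; this is not the code used inside fixreg.

```r
# Similarity between two indicator vectors: 2 * |A and B| / (|A| + |B|).
fpc_similarity <- function(a, b) 2 * sum(a & b) / (sum(a) + sum(b))

# Hypothetical helper: connectivity components of the graph whose edges
# join FPCs with similarity larger than distcut.
fpc_groups <- function(fpcs, distcut) {
  k <- length(fpcs)
  sim <- outer(seq_len(k), seq_len(k),
               Vectorize(function(i, j) fpc_similarity(fpcs[[i]], fpcs[[j]])))
  adj <- sim > distcut
  diag(adj) <- TRUE                      # every FPC is connected to itself
  group <- seq_len(k)
  repeat {                               # propagate the smallest label
    old <- group
    for (i in seq_len(k)) group[adj[i, ]] <- min(group[adj[i, ]])
    if (identical(group, old)) break
  }
  match(group, unique(group))            # relabel groups as 1, 2, ...
}

# Toy FPCs on a dataset of 8 points (made up for illustration).
a <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
b <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
d <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

fpc_similarity(a, b)                     # 6/7, larger than distcut = 0.85
groups <- fpc_groups(list(a, b, d), distcut = 0.85)

# Stability: a group is stable if its members were found mtf times or more.
nfound <- c(4, 2, 1)                     # hypothetical numbers of findings
stfound <- tapply(nfound, groups, sum)
stable <- stfound >= 3                   # mtf = 3
```

Here a and b fall into one group because their similarity exceeds distcut, while d forms a group of its own that is not stable under mtf=3.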

rfpi is called by fixreg for a single fixed point algorithm and will usually not be executed alone.
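
To make the idea of a single fixed point algorithm concrete, here is a minimal base R sketch of the operation defined in the Details. It is not the code of rfpi: the helper names fpc_step and fpc_iterate, the simulated data, and the variance estimator (residual sum of squares divided by subset size minus number of coefficients) are assumptions.

```r
# One application of the fixed point operation (hypothetical helper).
fpc_step <- function(x, y, g, ca) {
  X <- cbind(1, x)                             # design matrix with intercept
  fit <- lm.fit(X[g, , drop = FALSE], y[g])    # regression on the subset g
  # Assumed error variance estimator: RSS / (subset size - no. of coefficients).
  s2 <- sum(fit$residuals^2) / (sum(g) - ncol(X))
  res <- as.vector(y - X %*% fit$coefficients) # residuals of all points
  res^2 < ca * s2                              # non-outliers w.r.t. the subset
}

# Iterate until the subset reproduces itself or maxit is reached.
fpc_iterate <- function(x, y, g, ca, maxit = 100) {
  p1 <- NCOL(x) + 1                            # number of coefficients
  for (it in seq_len(maxit)) {
    if (sum(g) <= p1) break                    # too few points for a variance
    g_new <- fpc_step(x, y, g, ca)
    if (identical(g_new, g))
      return(list(g = g, converged = TRUE, iterations = it))
    g <- g_new
  }
  list(g = g, converged = FALSE, iterations = it)
}

# Simulated data with two linear regimes (illustration only).
set.seed(1)
x <- runif(100)
y <- c(2 * x[1:60], 5 - 3 * x[61:100]) + rnorm(100, sd = 0.05)

start <- seq_along(y) %in% sample(100, 3)      # random subset of size p + 2
run <- fpc_iterate(x, y, start, ca = 10)
run$converged
sum(run$g)                                     # size of the final subset
```

If run$converged is TRUE, applying fpc_step to run$g returns run$g again, which is the defining property of an FPC.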

summary.rfpc gives a summary about the representative FPCs of stable groups.

plot.rfpc is a plot method for the representative FPC of stable group no. no. It produces a scatterplot of the linear combination of independent variables determined by the regression coefficients of the FPC vs. the dependent variable. The regression line and the region of non-outliers determined by ca are plotted as well.
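
The region of non-outliers follows directly from the definition of an FPC: a point is a non-outlier if its squared residual is smaller than ca times the error variance, i.e. if it lies within sqrt(ca * var) of the regression line. A minimal sketch of that band, with a hypothetical helper nonoutlier_band and made-up numbers (this is not how plot.rfpc is implemented):

```r
# Hypothetical helper: band of non-outliers around fitted values.
nonoutlier_band <- function(fitted, var, ca) {
  half_width <- sqrt(ca * var)   # |residual| < sqrt(ca * var)
  data.frame(lower = fitted - half_width, upper = fitted + half_width)
}

# Made-up fitted values, error variance and tuning constant.
band <- nonoutlier_band(fitted = c(0, 1, 2), var = 0.04, ca = 9)
band
```

With var=0.04 and ca=9 the half-width is sqrt(0.36) = 0.6, so larger values of ca widen the band and admit more points into the cluster.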

fpclusters.rfpc produces a list of indicator vectors for the representative FPCs of stable groups.

References

Hennig, C. (2002) Fixed point clusters for linear regression: computation and comparison, Journal of Classification 19, 249-276.

Hennig, C. (2003) Clusters, outliers and regression: fixed point clusters, Journal of Multivariate Analysis 86, 183-212.

Examples


set.seed(190000)
data(tonedata)
attach(tonedata)
tonefix <- fixreg(stretchratio,tuned,mtf=1,ir=20)
summary(tonefix)
# This is designed to have a fast example; default setting would be better.
# If you want to see more (and you have a bit more time),
# try out the following:
# set.seed(1000)
# tonefix <- fixreg(stretchratio,tuned)
## Default - good for these data
# summary(tonefix)
# plot(tonefix,stretchratio,tuned,1)
# plot(tonefix,stretchratio,tuned,2)
# plot(tonefix,stretchratio,tuned,3,bw=FALSE,pch=5) 
# toneclus <- fpclusters(tonefix,stretchratio,tuned)
# plot(stretchratio,tuned,col=1+toneclus[[2]])
# tonefix2 <- fixreg(stretchratio,tuned,distcut=1,mtf=1,countmode=50)
## Every found fixed point cluster is reported,
## no matter how unstable it may be.
# summary(tonefix2)
# tonefix3 <- fixreg(stretchratio,tuned,ca=7)
## ca defaults to 10.07 for these data.
# summary(tonefix3)
# subset <- c(rep(FALSE,5),rep(TRUE,24),rep(FALSE,121))
# tonefix4 <- fixreg(stretchratio,tuned,
#                    mtf=1,ir=0,init.group=list(subset))
# summary(tonefix4)
