regmix: Mixture Model ML for Clusterwise Linear Regression

Description

Computes an ML-estimator for clusterwise linear regression under a regression mixture model with Normal errors. Parameters are proportions, regression coefficients and error variances, all independent of the values of the independent variable, and all may differ for different clusters. Computation is by the EM-algorithm. The number of clusters is estimated via the Bayesian Information Criterion (BIC). Note that package flexmix has more sophisticated tools to do the same thing and is recommended. The functions are kept in here only for compatibility reasons.

Usage

regmix(indep, dep, ir=1, nclust=1:7, icrit=1.e-5, minsig=1.e-6, warnings=FALSE)
regem(indep, dep, m, cln, icrit=1.e-5, minsig=1.e-6, warnings=FALSE)

Arguments

indep

numerical matrix or vector. Independent variables.

dep

numerical vector. Dependent variable.

positive integer. Number of iteration runs for every number of clusters.

nclust

vector of positive integers. Numbers of clusters.

icrit

positive numerical. Stopping criterion for the iterations (difference of loglikelihoods).

minsig

positive numerical. Minimum value for the variance parameters (likelihood is unbounded if variances are allowed to converge to 0).

warnings

logical. If TRUE, warnings are given during the EM iteration in case of collinear regressors, too small mixture components and error variances smaller than minimum. In the former two cases, the algorithm is terminated without a result, but an optimal solution is still computed from other algorithm runs (if there are others). In the latter case, the corresponding variance is set to the minimum.

cln

positive integer. (Single) number of clusters.

matrix of positive numericals. Number of columns must be cln. Number of rows must be number of data points. Columns must add up to 1. Initial configuration for the EM iteration in terms of a probabilty vector for every point which gives its degree of membership to every cluster. As generated by randcmatrix.

Value

regmix returns a list containing the components clnopt, loglik, bic, coef, var, eps, z, g.

regem returns a list containing the components loglik, coef, var, z, g, warn.

clnopt

optimal number of clusters according to the BIC.

loglik

loglikelihood for the optimal model.

bic

vector of BIC values for all numbers of clusters in nclust.

coef

matrix of regression coefficients. First row: intercept parameter. Second row: parameter of first independent variable and so on. Columns corresponding to clusters.

var

vector of error variance estimators for the clusters.

eps

vector of cluster proportion estimators.

matrix of estimated a posteriori probabilities of the points (rows) to be generated by the clusters (columns). Compare input argument m.

integer vector of estimated cluster numbers for the points (via argmax over z).

warn

logical. TRUE if one of the estimated clusters has too few points and/or collinear regressors.

Details

The result of the EM iteration depends on the initial configuration, which is generated randomly by randcmatrix for regmix. regmix calls regem. To provide the initial configuration manually, use parameter m of regem directly. Take a look at the example about how to generate m if you want to specify initial parameters.

The original paper DeSarbo and Cron (1988) suggests the AIC for estimating the number of clusters. The use of the BIC is advocated by Wedel and DeSarbo (1995). The BIC is defined here as 2*loglik - log(n)*((p+3)*cln-1), p being the number of independent variables, i.e., the larger the better.

See the entry for the input parameter warnings for the treatment of several numerical problems.

References

DeSarbo, W. S. and Cron, W. L. (1988) A maximum likelihood methodology for clusterwise linear regression, Journal of Classification 5, 249-282.

Wedel, M. and DeSarbo, W. S. (1995) A mixture likelihood approach for generalized linear models, Journal of Classification 12, 21-56.

Examples

Run this code

# NOT RUN {
# This apparently gives slightly different
# but data-analytically fine results
# on some versions of R.
set.seed(12234)
data(tonedata)
attach(tonedata)
rmt1 <- regmix(stretchratio,tuned,nclust=1:2)
# nclust=1:2 makes the example fast;
# a more serious application would rather use the default.
rmt1$g
round(rmt1$bic,digits=2)
# start with initial parameter values
cln <- 3
n <- 150
initcoef <- cbind(c(2,0),c(0,1),c(0,2.5))
initvar <- c(0.001,0.0001,0.5)
initeps <- c(0.4,0.3,0.3)
# computation of m from initial parameters
m <- matrix(nrow=n, ncol=cln)
stm <- numeric(0)
for (i in 1:cln)
  for (j in 1:n){
    m[j,i] <- initeps[i]*dnorm(tuned[j],mean=initcoef[1,i]+
              initcoef[2,i]*stretchratio[j], sd=sqrt(initvar[i]))
  }
  for (j in 1:n){
    stm[j] <- sum(m[j,])
    for (i in 1:cln)
      m[j,i] <- m[j,i]/stm[j]
  } 
rmt2 <- regem(stretchratio, tuned, m, cln)
# }

Run the code above in your browser using DataLab