mcClustering: Markov Chain Clustering With And Without Mixtures-of-Experts Extension

Description

This function provides Markov chain clustering with or without multinomial logit model (mixtures-of-experts) extension (see References). That is an MCMC sampler for the mixtures-of-experts extension of Markov chain clustering. It requires four mandatory arguments: Data, Prior, Initial and Mcmc; each representing a list of (mandatory) arguments: Data contains data information, Prior contains prior information, Initial contains information about starting conditions (initial values) and Mcmc contains the setup for the MCMC sampler.

Usage

mcClust( 
    Data = list( 
        dataFile = stop(
 "'dataFile' (=> Njk.i) must be specified: either 'filename' (path) or data"), 
        storeDir = "try01", priorFile = NULL), 
    Prior = list( H = 4, e0 = 4, c = 1, cOff = 1, usePriorFile = FALSE, 
        xiPooled = FALSE, N0 = 5), 
    Initial = list( xi.start.ind = 3, pers = 0.7, S.i.start = NULL), 
    Mcmc = list( M = 50, M0 = 20, mOut = 5, mSave = 10, seed = 12345))


mcClustExtended( 
    Data = list( 
        dataFile = stop(
 "'dataFile' (=> Njk.i) must be specified: either 'filename' (path) or data"), 
        storeDir = "try01", priorFile = NULL, 
        X = stop("X (matrix of covariates) must be specified")), 
    Prior = list( H = 4, c = 1, cOff = 1, usePriorFile = FALSE, 
        xiPooled = FALSE, N0 = 5, betaPrior = "informative", 
        betaPriorMean = 0, betaPriorVar = 1), 
    Initial = list( xi.start.ind = 3, pers = 0.7, 
        S.i.start = rep(1:H, N), Beta.start = NULL), 
    Mcmc = list( M = 50, M0 = 20, mOut = 5, mSave = 10, 
                 seed = 12345))

Arguments

Value

A list containing (/the output file contains):workspaceFileA character indicating the name of and the path (based on the currend working directory) to the output file, wherein all the results are saved. The name of the output file starts with MCC_ or MCC_Logit_newAux_ respectively followed by the number of groups H, the number of iterations M and the particular point in time when the function was called, with format: yyyymmdd_hhmmss. E.g. MCC_H4_M10000_20110218_045254.RData or MCC_Logit_newAux_H4_M10000_20111121_165723.RData.DataThe argument Data.PriorThe argument Prior.InitialThe argument Initial.McmcThe argument Mcmc.Beta.mA 3-dimensional array of dimension $\code{ncol(X)} \times H \times M$ containing the draws for the regression coefficients $\beta_h$ in each $m$-th iteration step.bk0The prior parameters for the mean vectors of the normal (prior) distributions of the regression coefficients.Bk0invThe prior parameters for the inverse variance-covariance matrices of the normal (prior) distributions of the regression coefficients.c0A 3-dimensional array with dimension $(K+1) \times (K+1) \times H$ that contains the finally used a priori parameter values for $\boldsymbol{\xi}_h$.estTransProbA 3-dimensional array with dimension $(K+1) \times (K+1) \times H$ that contains the ergodic average of $\boldsymbol{\xi}_h$ for all groups (using draws from M0 to M without thinning parameter).fileNameA character value indicating the name of the output file (see also workspaceFile).freqmatrix-matching (pattern recognition): a numerical vector containing the frequencies of different (!) transition matrices. (in ascending order)indizesmatrix-matching (pattern recognition): a numerical vector containing the indices of different (!) transition matrices.KAn integer indicating the number of categories minus one (!). See Note.mccThe prior data (see Section Prior Data) provided with priorFile, NULL otherwise.NAn integer indicating $N$, the number of individuals/units/objects.Njk.iThe data (see Details) provided with dataFile.Njk.i.indmatrix-matching (pattern recognition): the resulting Njk.i after matrix-matching.Rmatrix-matching (pattern recognition): number of different (!) transition matrices.S.i.countsA $N \times H$-matrix containing the frequencies how often individual $i$ was allocated to a certain group during the iterations from M0+1 to code{M}.totalTimeA numeric value indicating the total time (in secs) used for the function call.xi.hatA matrix with dimension $(K+1) \times (K+1)$ containing the empirical transition probabilities (overall relative transition freqs).xi.mA 4-dimensional array of dimension $M \times (K+1) \times (K+1) \times H$ containing the draws for $\boldsymbol{\xi}_h$ in each $m$-th iteration step.xi.startA matrix of dimension $(K+1) \times (K+1)$ that contains the starting values for $\boldsymbol{\xi}_h$ (only if xi.start.ind = 3).xi.start.indAn integer indicating the used method to calculate/determine the starting values for $\boldsymbol{\xi}_h$.bkNThe posterior parameters (in the last iteration step) for the mean vectors of the normal (posterior) distributions from which the regression coefficients were drawn.BkNThe posterior parameters (in the last iteration step) for the variance-covariance matrices of the normal (posterior) distributions from which the regression coefficients were drawn.logLikeA vector containing the values of the log-likelihood calculated in each iteration step.logBetaPriorA vector containing the values of the prior distribution for the regression coefficients calculated in each iteration step.logXiPriorA vector containing the values of the prior distribution for the transition matrices calculated in each iteration step.logPostDensA vector containing the values of the posterior density calculated in each iteration step.mMaxAn integer giving the position (number of iteration) of the maximum value in the posterior density logPostDens.logClassLikeA vector containing the values of the log classification likelihood calculated in each iteration step.entropyA vector containing the values of the entropy calculated in each iteration step.eta.startEither a numeric value equal to 1/H or if xi.start.ind = 4 the corresponding data (vector) from the prior file.estGroupSizeA numerical vector containing the ergodic average of $\eta_h$ for all groups (using draws from M0+1 to code{M} without thinning parameter).eta.mA matrix of dimension $H \times M$ containing the draws for $\eta_h$ in each $m$-th iteration step.logEtaPriorA vector containing the values of the prior distribution for the mixing proportions (group sizes) calculated in each iteration step.

Prior Data

The prior data (called mcc in the following) -- to be passed via priorFile in argument-list Data -- has to be a list of lists, indexed by $1,\ldots,H,H+1,\ldots$. Note that, depending on parameter $H$ (the number of groups -- to be passed via H in argument-list Prior), there have to be at least $H$ entries (each a list). See mccXiPrior in LMEntryPaperData for example. Within a call to dmClustering or mcClustering, at least mcc[[H]] has to be provided as a list containing eta and xi. eta is a vector of length $H$ containing prior information about the relative group sizes of group $h=1,\ldots,H$. xi is a 3-dimensional array of dimension $(K+1) \times (K+1) \times H$, containing prior information in terms of probabilities about the transition probabilities of group $h=1,\ldots,H$ (see examples).

Reporting Progress (Log Protocol)

The log protocol keeps record of the progress of the estimation procedure and is shown on the screen. At first the name of the workspace file is documented. Then for each mOut-th iteration step (at least for $m=1, \ldots, 5, 10, 20, 50, 100, 200, 500$) information about the elapsed time and the expected time to the end is reported. Finally the total time is shown. For example: workspaceFile: tryN50000-sample02-01\MCC_Logit_newAux_H4_M10000_20111124_155650.RData (within current working directory!) m = 1 ; duration of iter proc so far: 13.75 sec. m = 2 ; duration of iter proc so far: 21.59 sec., exp time to end: 3597.97 min. m = 3 ; duration of iter proc so far: 29.48 sec., exp time to end: 2456.18 min. m = 4 ; duration of iter proc so far: 37.36 sec., exp time to end: 2074.93 min. m = 5 ; duration of iter proc so far: 45.25 sec., exp time to end: 1884.66 min. m = 10 ; duration of iter proc so far: 84.94 sec., exp time to end: 1571.55 min. m = 20 ; duration of iter proc so far: 164.5 sec., exp time to end: 1440.24 min. m = 50 ; duration of iter proc so far: 403.08 sec., exp time to end: 1364.3 min. m = 100 ; duration of iter proc so far: 801.15 sec., exp time to end: 1335.38 min. m = 200 ; duration of iter proc so far: 1530.5 sec., exp time to end: 1256.32 min. m = 400 ; duration of iter proc so far: 3074.03 sec., exp time to end: 1232.82 min. m = 500 ; duration of iter proc so far: 3804.67 sec., exp time to end: 1207.35 min. m = 600 ; duration of iter proc so far: 4532.04 sec., exp time to end: 1185.47 min. m = 800 ; duration of iter proc so far: 6075.54 sec., exp time to end: 1166.06 min. m = 1000 ; duration of iter proc so far: 7715.48 sec., exp time to end: 1158.61 min. ...

Warning

Note that there are no such things as default values (see Section Arguments)!

Details

Note that the values of the arguments indicated here have nothing to do with default values! For a call of these functions this lists-of-arguments structure requires a complete specification of all arguments! The following arguments which are lists have to be completely provided (note that there are no such things as default values within lists!): Data contains: [object Object],[object Object],[object Object],[object Object] Prior contains (see also Section Prior Data): [object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object] Initial contains: [object Object],[object Object],[object Object],[object Object] Mcmc contains: [object Object],[object Object],[object Object],[object Object],[object Object]

References

Sylvia Fruehwirth-Schnatter, Christoph Pamminger, Andrea Weber and Rudolf Winter-Ebmer, (2011), "Labor market entry and earnings dynamics: Bayesian inference using mixtures-of-experts Markov chain clustering". Journal of Applied Econometrics. DOI: 10.1002/jae.1249 http://onlinelibrary.wiley.com/doi/10.1002/jae.1249/abstract Christoph Pamminger and Sylvia Fruehwirth-Schnatter, (2010), "Model-based Clustering of Categorical Time Series". Bayesian Analysis, Vol. 5, No. 2, pp. 345-368. DOI: 10.1214/10-BA606 http://ba.stat.cmu.edu/journal/2010/vol05/issue02/pamminger.pdf Sylvia Fruehwirth-Schnatter and Rudolf Fruehwirth, (2010), "Data augmentation and MCMC for binary and multinomial logit models". In T. Kneib and G. Tutz (eds): Statistical Modelling and Regression Structures: Festschrift in Honour of Ludwig Fahrmeir. Physica Verlag, Heidelberg, pp. 111-132. DOI: 10.1007/978-3-7908-2413-1_7 http://www.springerlink.com/content/t4h810017645wh68/. See also: IFAS Research Paper Series 2010-48 (http://www.jku.at/ifas/content/e108280/e108491/e108471/e109880/ifas_rp48.pdf).

Examples

Run this code

#rm(list=ls(all=TRUE))

# ==================================================================================
if ( TRUE ) {
# ==================================================================================

# set working directory
oldDir <- getwd()
curDir <- tempdir()
setwd(curDir)

if ( !file.exists("bayesMCClust-wd") ) dir.create("bayesMCClust-wd")
setwd("bayesMCClust-wd")
myOutfilesDir <- "mcClust-Example-Outfiles" 

# load data 
data(MCCExampleData)

# ==================================================================================

# function call 
system.time(
  outList <- mcClust(    # parameter lists (every four) must be complete!
     Data=list(dataFile=MCCExampleData$Njk.i, 
               storeDir=myOutfilesDir,
               priorFile= NULL), 
     Prior=list(H=2, # sample(2:6, 1), # 4 
                e0=4, 
                c=1,
                cOff=1,
                usePriorFile=FALSE,
                xiPooled=FALSE,
                N0=5), 
     Initial=list(xi.start.ind=3, 
                  pers=0.7), 
     Mcmc=list(M=100, 
               M0=20, 
               mOut=5, 
               mSave=50, 
               seed=sample(1:100000, 1) # 123 
     ) 
  )
)

str(outList)

#outFileName
#results <- load(outFileName)
#results
#estTransProb

allocList <- calcAllocationsMCC(outList, thin=1, maxi=50) # , plotPathsForEta=TRUE
str(allocList)

myTransProbs <- calcTransProbs(outList, estGroupSize=allocList$estGroupSize, thin=1, 
    printXtable=FALSE, printSd=FALSE, printTogether=TRUE ) 
    # , plotPaths=TRUE, grLabels=paste("Group", 1:Prior$H)
str(myTransProbs)

myTransList <- plotTransProbs(outList, estTransProb=myTransProbs$estTransProb, 
    estGroupSize=allocList$estGroupSize, class=allocList$class, plotPooled=TRUE, 
    plotContTable=TRUE, printContTable=TRUE, plotContPooled=TRUE) 
    # , grLabels=paste("Group", 1:Prior$H)
str(myTransList)

(equiDist <- calcEquiDist(outList, thin=1, maxi=50)) 
#, printEquiDist=TRUE, plotEquiDist=TRUE , grLabels=paste("Group", 1:Prior$H)

myLongRunDistList <- calcLongRunDist(outList, 
    initialStateData=MCCExampleData$initialState, 
    class=allocList$class, equiDist=equiDist, maxi=50) 
    # , printLongRunDist=TRUE, grLabels=paste("Group", 1:Prior$H)
str(myLongRunDistList)

myTypicalMembs <- plotTypicalMembers(outList, moreTypMemb=c(10,25,40,55,70,85,100), 
    myObsList=MCCExampleData$obsList, classProbs=allocList$classProbs) # noTypMemb=7
str(myTypicalMembs)

plotScatter(outList, thin=1, xi11=c(1,1), xi12=c(2,2), xi21=c(2,2), xi22=c(3,3), 
    xi31=c(1,1), xi32=c(3,3) )

mySegPower <- calcSegmentationPower(outList, classProbs=allocList$classProbs, 
    class=allocList$class, printXtable=TRUE, calcSharp=TRUE, printSharpXtable=TRUE ) 
    # , grLabels=paste("Group", 1:Prior$H)
str(mySegPower)

myEntropy <- calcEntropy(outList, classProbs=allocList$classProbs, 
    class=allocList$class, printXtable=TRUE ) 
    # , grLabels=paste("Group", 1:Prior$H)
myEntropy

plotLikeliPaths(outList, from=10, by=1 )

myNumEffTables <- calcNumEff( outList, thin=1, printXi=TRUE, printE=TRUE, 
    printBeta=TRUE, grLabels=paste("Group", 1:outList$Prior$H) ) 
str(myNumEffTables)

myMSCrits <- calcMSCritMCC(workDir=myOutfilesDir, myLabel="mcClust-Example", H0=4, 
    whatToDoList=c("approxML", "approxMCL", "postMode") ) 
str(myMSCrits)

setwd(oldDir)

} # end if

# ==================================================================================
# ==================================================================================
# ==================================================================================

# ==================================================================================
if ( FALSE ) {
# ==================================================================================

rm(list=ls(all=TRUE))

# set working directory
oldDir <- getwd()
curDir <- tempdir()
setwd(curDir)

if ( !file.exists("bayesMCClust-wd") ) dir.create("bayesMCClust-wd")
setwd("bayesMCClust-wd") 
myOutfilesDir <- "mcClustExtended-Example-Outfiles"      

# load data 
data(MCCExtExampleData)
if (!is.element("MCCExtExampleData$covariates", search())) { 
    attach(MCCExtExampleData$covariates)
}

# ==================================================================================

groupNr <- 2 # sample(2:6, 1) # 3

# ==================================================================================

results <- kmeans( log( MCCExtExampleData$NjkiMat + 0.5 ) , groupNr, nstart=2)

# ==================================================================================

require(nnet, quietly = TRUE)
H <- groupNr
X = cbind( intercept=1, alrateBezNew, unskilled, skilled, angStart ) 

N <- dim(X)[1]
mX <- data.frame( cbind(group=as.factor( results$cluster ), X[,-1], 
    matrix(sample(1:H,H*N,replace=TRUE),N,H)) )

colnames(mX)[6:(6+groupNr-1)] <- 
    c( "as.1", "as.2", "as.3", "as.4", "as.5", "as.6" )[1:groupNr] 

tempMNom <- multinom(group ~ alrateBezNew+ unskilled+ skilled+ angStart, 
    data=as.data.frame(mX)) 

toStartBeta <- t(rbind(0,coef( tempMNom )))

# ==================================================================================
# function call 
outList <- mcClustExtended(      
     Data=list(dataFile=MCCExtExampleData$Njk.i, # parameter lists must be complete!!!
               storeDir=myOutfilesDir,
               priorFile= NULL,
               X = cbind( intercept=1, alrateBezNew, unskilled, skilled, angStart ) ), 
     Prior=list(H=groupNr, 
                c=1,
                cOff=1,
                usePriorFile=FALSE,
                xiPooled=FALSE,
                N0=5,
                betaPrior = "informative", # N(0,1)
                betaPriorMean = 0,
                betaPriorVar = 1),
     Initial=list(xi.start.ind=3, 
                  pers=0.7,
                  S.i.start = results$cluster,
                  Beta.start = toStartBeta ), 
     Mcmc=list(M=100, 
               M0=50, 
               mOut=10, 
               mSave=50, 
               seed=sample(1:100000, 1) # 69814651 
              ) 
     )

str(outList)

#outFileName <- outList$workspaceFile
#results <- load(outFileName)
#results
#estTransProb

allocList <- calcAllocationsMCCExt(outList, thin=1, maxi=50) 
str(allocList)

myTransProbs <- calcTransProbs(outList, estGroupSize=allocList$estGroupSize, thin=1, 
    printXtable=FALSE, printSd=FALSE, printTogether=TRUE ) 
    # plotPaths=TRUE, grLabels=paste("Group", 1:Prior$H)
str(myTransProbs)

myTransList <- plotTransProbs(outList, estTransProb=myTransProbs$estTransProb, 
    estGroupSize=allocList$estGroupSize, class=allocList$class, plotPooled=TRUE, 
    plotContTable=TRUE, printContTable=TRUE, plotContPooled=TRUE) 
    # , grLabels=paste("Group", 1:Prior$H)
str(myTransList)

(equiDist <- calcEquiDist(outList, thin=1, maxi=50)) 
# , printEquiDist=TRUE, plotEquiDist=TRUE, grLabels=paste("Group", 1:Prior$H)

myRegCoeffs <- calcRegCoeffs(outList, hBase=2, thin=1) 
#, M0=Mcmc$M0, grLabels=paste("Group", 1:Prior$H), 
# printHPD=TRUE, plotPaths=TRUE, plotACFs=TRUE
str(myRegCoeffs)

myLongRunDistList <- calcLongRunDist(outList, initialStateData=initialState, 
    class=allocList$class, equiDist=equiDist, maxi=50) 
    # , printLongRunDist=TRUE
str(myLongRunDistList)

myTypicalMembs <- plotTypicalMembers(outList, myObsList=MCCExtExampleData$obsList, 
    classProbs=allocList$classProbs) 
    # , noTypMemb=7, moreTypMemb=c(10,25,50,100,200,500,1000)
str(myTypicalMembs)

plotScatter(outList, thin=1, xi11=c(1,1), xi12=c(2,2), xi21=c(2,2), xi22=c(3,3), 
    xi31=c(1,1), xi32=c(3,3) )

mySegPower <- calcSegmentationPower(outList, classProbs=allocList$classProbs, 
    class=allocList$class, printXtable=TRUE, calcSharp=TRUE, printSharpXtable=TRUE ) 
    # , grLabels=paste("Group", 1:Prior$H)
str(mySegPower)

myEntropy <- calcEntropy(outList, classProbs=allocList$classProbs, 
    class=allocList$class, printXtable=TRUE ) 
    # , grLabels=paste("Group", 1:Prior$H)
myEntropy

plotLikeliPaths(outList, from=10, by=1 )

myNumEffTables <- calcNumEff( outList, thin=1, printXi=TRUE, printE=TRUE, 
    printBeta=TRUE, grLabels=paste("Group", 1:outList$Prior$H) ) 
str(myNumEffTables)

myMSCrits <- calcMSCritMCCExt(workDir=myOutfilesDir, NN=outList$N, 
    myLabel="mcClustExtended-Example", ISdraws=100, H0=3, 
    whatToDoList=c("approxML", "approxMCL", "postMode" ) ) 
str(myMSCrits)

setwd(oldDir)

# ==================================================================================

if (is.element("MCCExtExampleData$covariates", search())) { 
    detach(MCCExtExampleData$covariates)
}

# ==================================================================================
} # end if
# ==================================================================================

# ==================================================================================
# ==================================================================================

Run the code above in your browser using DataLab