dmClustering: Dirichlet Multinomial Clustering With And Without Mixtures-of-Experts Extension

Description

This function provides Dirichlet Multinomial Clustering with or without multinomial logit model (mixtures-of-experts) extension (see References). That is an MCMC sampler for the mixtures-of-experts extension of Dirichlet Multinomial clustering. It requires four mandatory arguments: Data, Prior, Initial and Mcmc; each representing a list of (mandatory) arguments: Data contains data information, Prior contains prior information, Initial contains information about starting conditions (initial values) and Mcmc contains the setup for the MCMC sampler.

Usage

dmClust( 
    Data = list( 
        dataFile = 
            stop("'dataFile' must be specified: filename or data"), 
        storeDir = "try01", mccFile = "mcc.RData"), 
    Prior = list( H = 4, alpha0 = 4, a0 = 1, alpha = 1, N0 = 10, 
        isPriorNegBin = FALSE, mccAsPrior = FALSE, 
        xiPooled = TRUE, persPrior = 7/10), 
    Initial = list( mccUse = FALSE, pers = 1/6, S.i.start = NULL), 
    Mcmc = list( kNo = 2, M = 50, M0 = 20, mOut = 5, mSave = 10, 
        showAcc = TRUE, monitor = FALSE, seed = 12345))
                      
dmClustExtended( 
    Data = list( 
        dataFile = 
            stop("'dataFile' must be specified: filename or data"), 
        storeDir = "try01", mccFile = "mcc.RData", 
        X = stop("X (matrix of covariates) must be specified")), 
    Prior = list( H = 4, a0 = 1, alpha = 1, N0 = 10, 
        isPriorNegBin = FALSE, mccAsPrior = FALSE, 
        xiPooled = TRUE, persPrior = 7/10, 
        betaPrior = "informative", betaPriorMean = 0, 
        betaPriorVar = 1), 
    Initial = list( mccUse = FALSE, pers = 1/6, 
        S.i.start = rep(1:H, N), Beta.start = NULL), 
    Mcmc = list( kNo = 2, M = 50, M0 = 20, mOut = 5, mSave = 10, 
        showAcc = TRUE, monitor = FALSE, seed = 12345))

Arguments

Value

A list containing (/the output file contains):workspaceFileA character indicating the name of and the path (based on the currend working directory) to the output file, wherein all the results are saved. The name of the output file starts with DMC_ or DMC_Logit_newAux_ respectively followed by the number of groups H, the number of iterations M and the particular point in time when the function was called, with format: yyyymmdd_hhmmss. E.g. DMC_H4_M10000_20110218_045254.RData or DMC_Logit_newAux_H4_M10000_20111121_165723.RData.acceptA 3-dimensional array with dimension $M \times H*(K+1) \times 2$. This array contains the (calculated) acceptance probabilities (accProb) of the M-H-algorithm and whether the draw(s) were accepted or not (accYesNo) for each row $j$ in each group $h$ in the $m$-th iteration. The first dimension indicates the $m$-th iteration, the second dim row $1,\ldots,K+1$ in group 1, then row $1,\ldots,K+1$ in group 2 and so on. The third dim indicates accProb and accYesNo.Beta.mA 3-dimensional array of dimension $\code{ncol(X)} \times H \times M$ containing the draws for the regression coefficients $\beta_h$ in each $m$-th iteration step.bk0The prior parameters for the mean vectors of the normal (prior) distributions of the regression coefficients.Bk0invThe prior parameters for the inverse variance-covariance matrices of the normal (prior) distributions of the regression coefficients.DataThe argument Data.e_h_0A 3-dimensional array with dimension $K+1 \times K+1 \times H$ containing the (calculated) initial values for $\bold{e}_h$.e_h_mA 4-dimensional array with dimension $K+1 \times K+1 \times H \times M$ containing the draws for $\bold{e}_h$ in the $m$-th iteration step.eta_mA matrix of dimension $M \times H$ containing the draws for $\eta_h$ in each $m$-th iteration step.fileNameA character value indicating the name of the output file (see also workspaceFile).InitialThe argument Initial.KAn integer indicating the number of categories minus one (!). See Note.logFileNameA character value indicating the name of the log file and the corresponding directory.mccThe prior data (see Section Prior Data) provided with mccFile, NULL otherwise.McmcThe argument Mcmc.NAn integer indicating $N$, the number of individuals/units/objects.Njk.iThe data (see Details) provided with dataFile.PriorThe argument Prior.S_i_freqA $H \times N$-matrix containing the frequencies how often individual $i$ was allocated to a certain group during the iterations from M0+1 to code{M}.xi_h_mA 4-dimensional array of dimension $(K+1) \times (K+1) \times H \times M$ containing the draws for $\boldsymbol{\xi}_h$ in each $m$-th iteration step.xi_priorA 3-dimensional array of dimension $(K+1) \times (K+1) \times H$ that contains the finally used a priori parameter values for $\boldsymbol{\xi}_h$.bkNThe posterior parameters (in the last iteration step) for the mean vectors of the normal (posterior) distributions from which the regression coefficients were drawn.BkNThe posterior parameters (in the last iteration step) for the variance-covariance matrices of the normal (posterior) distributions from which the regression coefficients were drawn.logLikeA vector containing the values of the log-likelihood calculated in each iteration step.logBetaPriorA vector containing the values of the prior distribution for the regression coefficients calculated in each iteration step.logEPriorA vector containing the values of the prior distribution for $\bold{e}$ calculated in each iteration step.logPostDensA vector containing the values of the posterior density calculated in each iteration step.mMaxAn integer giving the position (number of iteration) of the maximum value in the posterior density logPostDens.logClassLikeA vector containing the values of the log classification likelihood calculated in each iteration step.entropyA vector containing the values of the entropy calculated in each iteration step.logEtaPriorA vector containing the values of the prior distribution for the mixing proportions (group sizes) calculated in each iteration step.

Prior Data

The prior data (called mcc in the following) -- to be passed via mccFile in argument-list Data -- has to be a list of lists, indexed by $1,\ldots,H,H+1,\ldots$. Note that, depending on parameter $H$ (the number of groups -- to be passed via H in argument-list Prior), there have to be at least $H$ entries (each a list). See mccXiPrior in LMEntryPaperData for example. Within a call to dmClustering or mcClustering, at least mcc[[H]] has to be provided as a list containing eta and xi. eta is a vector of length $H$ containing prior information about the relative group sizes of group $h=1,\ldots,H$. xi is a 3-dimensional array of dimension $(K+1) \times (K+1) \times H$, containing prior information in terms of probabilities about the transition probabilities of group $h=1,\ldots,H$ (see examples).

Log File

The log file keeps record of the progress of the estimation procedure (which is also shown on the screen). At first some prior parameters and the MCMC-settings and the name of the output file are documented. Then for each mOut-th iteration step (at least for $m=1, \ldots, 5, 10, 20, 50, 100, 200, 500$) information about the elapsed time and the expected time to the end and optionally the current acceptance rate (showAcc=TRUE) is indicated. Finally the total time is shown. For example: Data loaded! Data Information: Datafile = no file name , N = 9809 , K = 5 Manual Settings: No of groups H = 4 , kNo = 2 MCMC Parameters: M = 10000 , M0 = 5000 , mOut = 200 , mSave = 5000 , seed = 123456 , showAcc = TRUE Prior Parameters for e_h (Neg Multinom): a0 = 1 , alpha = 1 , N0 = 10 , xi_prior (see below) Information on xi_prior (for Neg Bin/Neg Multinom Prior): with persPrior = 0.7 created xi_prior (equal in each group, mccAsPrior = FALSE & xiPooled = FALSE) Prior information and parameters set! Inital Values Information: mccUse = FALSE , pers = 0.7 Initial values set! Initialisations done! MCMC Iteration... m = 1 ; Acc Rate of first draws = 0.54 m = 2 ; duration of iter proc so far: 8.17 sec. , exp time to end: 1361.53 min. ; Acc Rate of last 2 draws = 0.5 m = 3 ; duration of iter proc so far: 16.45 sec. , exp time to end: 1370.56 min. ; Acc Rate of last 3 draws = 0.49 m = 4 ; duration of iter proc so far: 24.62 sec. , exp time to end: 1367.37 min. ; Acc Rate of last 4 draws = 0.49 m = 5 ; duration of iter proc so far: 32.84 sec. , exp time to end: 1367.79 min. ; Acc Rate of last 5 draws = 0.53 m = 10 ; duration of iter proc so far: 73.97 sec. , exp time to end: 1368.58 min. ; Acc Rate of last 10 draws = 0.57 m = 20 ; duration of iter proc so far: 156.61 sec. , exp time to end: 1371.16 min. ; Acc Rate of last 20 draws = 0.61 m = 50 ; duration of iter proc so far: 404.42 sec. , exp time to end: 1368.84 min. ; Acc Rate of last 50 draws = 0.59 m = 100 ; duration of iter proc so far: 815.86 sec. , exp time to end: 1359.9 min. ; Acc Rate of last 100 draws = 0.61 m = 200 ; duration of iter proc so far: 1635.61 sec. , exp time to end: 1342.6 min. ; Acc Rate of last 200 draws = 0.62 m = 400 ; duration of iter proc so far: 3270.83 sec. , exp time to end: 1311.75 min. ; Acc Rate of last 200 draws = 0.51 m = 500 ; duration of iter proc so far: 4087.97 sec. , exp time to end: 1297.25 min. ; Acc Rate of last 200 draws = 0.47 m = 1000 ; duration of iter proc so far: 8165.91 sec. , exp time to end: 1226.25 min. ; Acc Rate of last 200 draws = 0.45 ... m = 10000 ; duration of iter proc so far: 81362.58 sec. , exp time to end: 0.14 min. ; Acc Rate of last 200 draws = 0.46 Total time: 22 hours 36 min

Warning

Note that there are no such things as default values (see Section Arguments)!

Details

Note that the values of the arguments indicated here have nothing to do with default values! For a call of these functions this lists-of-arguments structure requires a complete specification of all arguments! The following arguments which are lists have to be completely provided (note that there are no such things as default values within lists!): Data contains: [object Object],[object Object],[object Object],[object Object] Prior contains (see also Section Prior Data): [object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object] Initial contains: [object Object],[object Object],[object Object],[object Object] Mcmc contains: [object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

References

Sylvia Fruehwirth-Schnatter, Christoph Pamminger, Andrea Weber and Rudolf Winter-Ebmer, (2011), "Labor market entry and earnings dynamics: Bayesian inference using mixtures-of-experts Markov chain clustering". Journal of Applied Econometrics. DOI: 10.1002/jae.1249 http://onlinelibrary.wiley.com/doi/10.1002/jae.1249/abstract Christoph Pamminger and Sylvia Fruehwirth-Schnatter, (2010), "Model-based Clustering of Categorical Time Series". Bayesian Analysis, Vol. 5, No. 2, pp. 345-368. DOI: 10.1214/10-BA606 http://ba.stat.cmu.edu/journal/2010/vol05/issue02/pamminger.pdf Sylvia Fruehwirth-Schnatter and Rudolf Fruehwirth, (2010), "Data augmentation and MCMC for binary and multinomial logit models". In T. Kneib and G. Tutz (eds): Statistical Modelling and Regression Structures: Festschrift in Honour of Ludwig Fahrmeir. Physica Verlag, Heidelberg, pp. 111-132. DOI: 10.1007/978-3-7908-2413-1_7 http://www.springerlink.com/content/t4h810017645wh68/. See also: IFAS Research Paper Series 2010-48 (http://www.jku.at/ifas/content/e108280/e108491/e108471/e109880/ifas_rp48.pdf).

Examples

Run this code

#rm(list=ls(all=TRUE))

# ==================================================================================
if ( FALSE ) {
# ==================================================================================

# set working directory
oldDir <- getwd()
curDir <- tempdir()
setwd(curDir)

if ( !file.exists("bayesMCClust-wd") ) dir.create("bayesMCClust-wd")
setwd("bayesMCClust-wd")
myOutfilesDir <- "dmClust-Example-Outfiles"

# load data
data(MCCExampleData)

# function call 
system.time(
  outList <- dmClust(   # parameter lists (every four) must be complete!
     Data = list( dataFile=MCCExampleData$Njk.i,  
                     storeDir=myOutfilesDir,       
                     mccFile=MCCExampleData$somePrior),                     
     Prior   = list( H=2, # sample(2:5, 1), # 3 
                     alpha0=4, 
                     a0=1,      
                     alpha=1, 
                     N0=10,
                     isPriorNegBin=FALSE, 
                     mccAsPrior=TRUE, 
                     xiPooled=FALSE, 
                     persPrior=0.7),                     
     Initial = list( mccUse=FALSE, 
                     pers=1/3 ),                     
     Mcmc    = list( kNo=2, 
                     M=100, 
                     M0=20, 
                     mOut=5, 
                     mSave=50, 
                     showAcc=TRUE, 
                     monitor=FALSE, 
                     seed=sample(1:100000, 1) # 12345
     ) 
  )
)

str(outList)

#outFileName
#results <- load(outFileName)
#results

if (outList$Prior$H > 1) { 
    apply(outList$xi_h_m[,,,seq(outList$Mcmc$M0, outList$Mcmc$M, 1)], c(1,2,3), mean) 
    } else { 
    apply(outList$xi_h_m[,,,seq(outList$Mcmc$M0,outList$Mcmc$M,1)], c(1, 2), mean)
}

allocList <- calcAllocationsDMC(outList, thin=1, maxi=50) # , plotPathsForEta=TRUE
str(allocList)

myTransProbs <- calcTransProbs(outList, estGroupSize=allocList$estGroupSize, thin=1, 
    printXtable=FALSE, printSd=FALSE, printTogether=TRUE ) 
    # grLabels=paste("Group", 1:Prior$H), plotPaths=TRUE
str(myTransProbs)

myTransList <- plotTransProbs(outList, estTransProb=myTransProbs$estTransProb, 
    estGroupSize=allocList$estGroupSize, class=allocList$class, plotPooled=TRUE, 
    plotContTable=TRUE, printContTable=TRUE, plotContPooled=TRUE) 
    # , grLabels=paste("Group", 1:Prior$H)
str(myTransList)

(equiDist <- calcEquiDist(outList, thin=1, maxi=50)) 
# , printEquiDist=TRUE, plotEquiDist=TRUE, grLabels=paste("Group", 1:Prior$H)

myVariation <- calcVariationDMC(outList, thin=1, maxi=50) 
# , printVarE=TRUE, printUnobsHet=TRUE, printUnobsHetSd=TRUE, 
# printUnobsHetAll=TRUE, printAllTogether=TRUE, grLabels=paste("Group", 1:Prior$H)
str(myVariation)

myPars <- calcParMatDMC(outList, thin=1) 
# , grLabels=paste("Group", 1:Prior$H), printPar=TRUE
str(myPars)

myLongRunDistList <- calcLongRunDist(outList, 
    initialStateData=MCCExampleData$initialState, 
    class=allocList$class, equiDist=equiDist, thin=1, maxi=5) 
    # , printLongRunDist=TRUE, , grLabels=paste("Group", 1:Prior$H)
str(myLongRunDistList)

myTypicalMembs <- plotTypicalMembers(outList, moreTypMemb=c(10,13,17,20,23,27,30), 
    myObsList=MCCExampleData$obsList, classProbs=allocList$classProbs) 
    # , noTypMemb=7, moreTypMemb=c(10,25,50,100,200,500,1000)
str(myTypicalMembs)

plotScatter(outList, thin=1, xi11=c(1,1), xi12=c(2,2), xi21=c(2,2), 
    xi22=c(3,3), xi31=c(1,1), xi32=c(3,3) )

mySegPower <- calcSegmentationPower(outList, classProbs=allocList$classProbs, 
    class=allocList$class, printXtable=TRUE, calcSharp=TRUE, printSharpXtable=TRUE ) 
    # , grLabels=paste("Group", 1:Prior$H)
str(mySegPower)

myEntropy <- calcEntropy(outList, classProbs=allocList$classProbs, 
    class=allocList$class, printXtable=TRUE ) 
    # , grLabels=paste("Group", 1:Prior$H)
myEntropy

plotLikeliPaths(outList, from=10, by=1 )

myNumEffTables <- calcNumEff( outList, thin=1, printXi=TRUE, printE=TRUE, 
    printBeta=TRUE, grLabels=paste("Group", 1:outList$Prior$H) ) 
str(myNumEffTables)

myMSCrits <- calcMSCritDMC(workDir=myOutfilesDir, myLabel="dmClust-Example", 
    myN0=paste("N0 =",outList$Prior$N0), 
    whatToDoList=c("postMode", "approxML", "approxMCL") )
str(myMSCrits)

setwd(oldDir)

} # end if

# ==================================================================================
# ==================================================================================
# ==================================================================================

# ==================================================================================
if ( FALSE ) {
# ==================================================================================

rm(list=ls(all=TRUE))

# set working directory
oldDir <- getwd()
curDir <- tempdir()
setwd(curDir)

if ( !file.exists("bayesMCClust-wd") ) dir.create("bayesMCClust-wd")
setwd("bayesMCClust-wd") 
myOutfilesDir <- "dmClustExtended-Example-Outfiles"      

# load data 
data(MCCExtExampleData)
if (!is.element("MCCExtExampleData$covariates", search())) {
    attach(MCCExtExampleData$covariates)
}

# ==================================================================================

groupNr <- 2 # sample(2:6, 1) # 3

# ==================================================================================

results <- kmeans( log( MCCExtExampleData$NjkiMat + 0.5 ) , groupNr, nstart=2)

# ==================================================================================

require(nnet, quietly = TRUE)
H <- groupNr
X = cbind( intercept=1, alrateBezNew, unskilled, skilled, angStart ) 

N <- dim(X)[1]
mX <- data.frame( cbind(group=as.factor( results$cluster ), X[,-1], 
    matrix(sample(1:H,H*N,replace=TRUE),N,H)) )

colnames(mX)[6:(6+groupNr-1)] <- 
    c( "as.1", "as.2", "as.3", "as.4", "as.5", "as.6" )[1:groupNr] 

tempMNom <- multinom(group ~ alrateBezNew+ unskilled+ skilled+ angStart, 
    data=as.data.frame(mX)) 

toStartBeta <- t(rbind(0,coef( tempMNom )))

outList <- dmClustExtended(      
     Data = list( dataFile=MCCExtExampleData$Njk.i,   
                  storeDir=myOutfilesDir,       
                  mccFile=NULL,   
                  X = cbind(intercept=1, alrateBezNew, unskilled, skilled, angStart )), 
     Prior   = list( H=groupNr, 
                     a0=1,      
                     alpha=1, 
                     N0=10,
                     isPriorNegBin=FALSE, 
                     mccAsPrior=FALSE, 
                     xiPooled=FALSE, 
                     persPrior=0.7,
                     betaPrior = "informative", # N(0,1)
                     betaPriorMean = 0,
                     betaPriorVar = 1),                     
     Initial = list( mccUse=FALSE, 
                     pers=1/3,
                     S.i.start = results$cluster,
                     Beta.start = toStartBeta ),      
     Mcmc    = list( kNo=2, 
                     M=100, 
                     M0=50, 
                     mOut=10, 
                     mSave=50,  
                     showAcc=TRUE, 
                     monitor=FALSE, 
                     seed=sample(1:100000, 1) # 564847
                   ) 
)

str(outList)

#outFileName <- outList$workspaceFile
#outFileName
#results <- load(outFileName)
#results

if (outList$Prior$H > 1) { 
    apply( outList$xi_h_m[,,,seq(outList$Mcmc$M0, outList$Mcmc$M, 1)], c(1,2,3), mean) 
    } else { 
    apply(outList$xi_h_m[,,,seq(outList$Mcmc$M0,outList$Mcmc$M,1)], c(1, 2), mean) 
}

allocList <- calcAllocationsDMCExt(outList, thin=1, maxi=50) 
str(allocList)

myTransProbs <- calcTransProbs(outList, estGroupSize=allocList$estGroupSize, thin=1, 
    printXtable=FALSE, printSd=FALSE, printTogether=TRUE ) 
    # grLabels=paste("Group", 1:Prior$H), plotPaths=TRUE
str(myTransProbs)

myTransList <- plotTransProbs(outList, estTransProb=myTransProbs$estTransProb, 
    estGroupSize=allocList$estGroupSize, class=allocList$class, plotPooled=TRUE, 
    plotContTable=TRUE, printContTable=TRUE, plotContPooled=TRUE) 
    # , grLabels=paste("Group", 1:Prior$H)
str(myTransList)

(equiDist <- calcEquiDist(outList, thin=1, maxi=50)) 
# , printEquiDist=TRUE, plotEquiDist=TRUE, grLabels=paste("Group", 1:Prior$H)

myVariation <- calcVariationDMC(outList, thin=1, maxi=50) 
# , printVarE=TRUE, printUnobsHet=TRUE, printUnobsHetSd=TRUE, 
# printUnobsHetAll=TRUE, printAllTogether=TRUE, grLabels=paste("Group", 1:Prior$H)
str(myVariation)

myPars <- calcParMatDMC(outList, thin=1) 
# , grLabels=paste("Group", 1:Prior$H), printPar=TRUE
str(myPars)

myRegCoeffs <- calcRegCoeffs(outList, hBase=2, thin=1) 
#, M0=Mcmc$M0, grLabels=paste("Group", 1:Prior$H), printHPD=TRUE, 
# plotPaths=TRUE, plotACFs=TRUE
str(myRegCoeffs)

myLongRunDistList <- calcLongRunDist(outList, initialStateData=initialState, 
    class=allocList$class, equiDist=equiDist, maxi=2) 
    # , printLongRunDist=TRUE
str(myLongRunDistList)

myTypicalMembs <- plotTypicalMembers(outList, myObsList=MCCExtExampleData$obsList, 
    classProbs=allocList$classProbs) 
    # , noTypMemb=7, moreTypMemb=c(10,25,50,100,200,500,1000)
str(myTypicalMembs)

plotScatter(outList, thin=1, xi11=c(1,1), xi12=c(2,2), xi21=c(2,2), 
    xi22=c(3,3), xi31=c(1,1), xi32=c(3,3) )

mySegPower <- calcSegmentationPower(outList, classProbs=allocList$classProbs, 
    class=allocList$class, printXtable=TRUE, calcSharp=TRUE, 
    printSharpXtable=TRUE ) 
    # , grLabels=paste("Group", 1:Prior$H)
str(mySegPower)

myEntropy <- calcEntropy(outList, classProbs=allocList$classProbs, 
    class=allocList$class, printXtable=TRUE ) 
    # , grLabels=paste("Group", 1:Prior$H)
myEntropy

plotLikeliPaths(outList, from=10, by=1 )

myNumEffTables <- calcNumEff( outList, thin=1, printXi=TRUE, printE=TRUE, 
    printBeta=TRUE, grLabels=paste("Group", 1:outList$Prior$H) ) 
str(myNumEffTables)

myMSCrits <- calcMSCritDMCExt(workDir=myOutfilesDir, myLabel="dmClustExtended-Example", 
    myN0=paste("N0 =",outList$Prior$N0), 
    whatToDoList=c("postMode", "approxML", "approxMCL") )
str(myMSCrits)

setwd(oldDir)

# ==================================================================================

if (is.element("MCCExtExampleData$covariates", search())) { 
    detach(MCCExtExampleData$covariates)
}

# ==================================================================================
} # end if
# ==================================================================================

# ==================================================================================

Run the code above in your browser using DataLab