BNPdensity v2019.9.11

0

Monthly downloads

0th

Percentile

Ferguson-Klass Type Algorithm for Posterior Normalized Random Measures

Bayesian nonparametric density estimation modeling mixtures by a Ferguson-Klass type algorithm for posterior normalized random measures.

Readme

BNPdensity

Bayesian nonparametric density estimation modeling mixtures by a Ferguson-Klass type algorithm for posterior normalized random measures.

Installation

You can install BNPdensity from github with:

# install.packages("devtools")
devtools::install_github("konkam/BNPdensity")

You will need to have the CRAN package devtools installed.

How to select the parameters of the Normalized Generalised Gamma process

We suggest the Normalised stable process, which corresponds to setting Alpha = 1, Kappa = 0 in the MixNRMIx functions. The stable process is a convenient model because its parameter γ has a convenient interpretation: it can be used to tune how informative the prior on the number of components is. Small values of Gama bring the process closer to a Dirichlet process, where the prior on the number of components is a relatively peaked distribution around (\alpha \log n). Larger values of Gama make this distribution flatter. More guidelines on how to choose the parameters may be found in Lijoi et al. (2007b), notably by considering the expected prior number of components.

We provide a function to compute the expected number of components for a normalised stable process:

library(BNPdensity)
expected_number_of_components_stable(100, 0.8)
#> 1 'mpfr' number of precision  5300   bits 
#> [1] 42.7094763347433064145139642374761953935253068800586645704061217932416521673216907205393634497387972550423672434248859853361487805996952132428560926301735979533962638142919404245491320418681146784117675560180180997897282151292943142921297120817286344277056532757201595800714590399801677580702348825476449555104774274897690717000458400780607335224957505642484513472533071356437832922468218811058624371598679991922247578979527622066152240312741621582983449454587742014738374261209928436865458013868459635836514415533408847942283699058466218465661304372077549346754754275121921814735346212725066178828226717075277232621519641895705391136625948612074142339580470342092448293688243480992850978992403497102847376421266264478072691640608266093779281787084116016688298257625264151386576179078300622956769752677608717950109926087638199943458107790372665551885265327164919674565823322119685775409502437206065096591074028037310013746802805763272997327361578989028373338101309163355379821181170147389233251989072209330180918520528393655319318639895074310114218241697995900064982050710406525820477683889052900277360670969502511993685089281685384529730226668468252315463797567163871977846036005950883423952365091248314552153703995699839012485091395887925071273380982174320743129811746851032817149541351629118801652203235928207673097109874143882905678985903497701028455475882246459109961351322459122178215039563311710681870538381442385456732732345958405084458427383775726002258077957996564219300933845643215966461641784547945616409630066291778349690565899480469847731894606043486473685568118112287972667409028307

This number may be compared to the prior number of components induced by a Dirichlet process with Alpha = 1:

expected_number_of_components_Dirichlet(100, 1.)
#> [1] 5.187378

We also provide a way to visualise the prior distribution on the number of components:

plot_prior_number_of_components(50, 0.4) 
#> Computing the prior probability on the number of clusters for the Dirichlet process
#> Computing the prior probability on the number of clusters for the Stable process

## How to fit a dataset

We illustrate the package by estimating the distribution of the acidity dataset.

library(BNPdensity)
data(acidity)
str(acidity)
#>  num [1:155] 2.93 3.91 3.73 3.69 3.82 ...
hist(acidity)

library(BNPdensity)
data(acidity)
fit = MixNRMI1(acidity, Nit = 3000)
#> MCMC iteration 500 of 3000 
#> MCMC iteration 1000 of 3000 
#> MCMC iteration 1500 of 3000 
#> MCMC iteration 2000 of 3000 
#> MCMC iteration 2500 of 3000 
#> MCMC iteration 3000 of 3000 
#>  >>> Total processing time (sec.):
#>    user  system elapsed 
#>  98.980   0.082  99.081

MixNRMI1() creates an object of class MixNRMI1, for which we provide common S3 methods.

print(fit)
#> Fit of a semiparametric normal mixture model on 155 data points.
#> The MCMC algorithm was run for 1500 iterations with 10 % discarded for burn-in.
plot(fit)

summary(fit)
#> Density estimation using a Normalized stable process with stability parameter Gamma = 0.4
#> A semiparametric normal mixture model was used.
#> There were 155 data points.
#> The MCMC algorithm was run for 1500 iterations with 10% discarded for burn-in.
#> To obtain information on the estimated number of clusters, please use summary(object, number_of_clusters = TRUE).

How to use the convergence diagnostics

We also provide an interface to run several chains in parallel, using the functions multMixNRMI1(). We interface our package with the coda package by providing a conversion method for the output this function. This allows for instance to compute the convergence diagnostics included in coda.

One detail is that due to the Nonparametric nature of the model, the number of parameters which could potentially be monitored for convergence of the chains varies. The location parameter of the clusters, for instance, vary at each iteration, and even the labels of the clusters vary, which makes them tricky to follow. However, it is possible to monitor the log-likelihood of the data along the iterations, the value of the latent variable u, the number of components and for the semi-parametric model, the value of the common scale parameter.

library(BNPdensity)
library(coda)
data(acidity)
fitlist = multMixNRMI1(acidity, Nit = 5000)
mcmc_list = as.mcmc(fitlist)
coda::traceplot(mcmc_list)

coda::gelman.diag(mcmc_list)
#> Potential scale reduction factors:
#> 
#>                 Point est. Upper C.I.
#> ncomp                 1.06       1.17
#> Sigma                 1.14       1.37
#> Latent_variable       1.06       1.14
#> log_likelihood        1.09       1.26
#> 
#> Multivariate psrf
#> 
#> 1.12

How to use the Goodness of fit plots

Non censored data

library(BNPdensity)
data(acidity)
fit = MixNRMI1(acidity, extras = TRUE)
#> MCMC iteration 500 of 1500 
#> MCMC iteration 1000 of 1500 
#> MCMC iteration 1500 of 1500 
#>  >>> Total processing time (sec.):
#>    user  system elapsed 
#>  40.746   0.035  40.783
GOFplots(fit)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Censored data

library(BNPdensity)
data(salinity)
fit = MixNRMI1cens(salinity$left,salinity$right, extras = TRUE)
#> MCMC iteration 500 of 1500 
#> MCMC iteration 1000 of 1500 
#> MCMC iteration 1500 of 1500 
#>  >>> Total processing time (sec.):
#>    user  system elapsed 
#>  61.339   0.013  61.358
GOFplots(fit)

Posterior analysis of the clustering structure

The MCMC algorithm provides a sample of the posterior distribution on the space of all clusterings. This is a very large discrete space, which is not ordered. This means that for any reasonably sized problem, each configuration in the posterior will have been explored no more than once or twice, and that many potentially good configurations will not be present in the MCMC sample. Moreover, the lack of ordering makes it not trivial to summarise the posterior by an optimal clustering and to provide credible sets.

We suggest using the approach developped in S. Wade and Z. Ghahramani, “Bayesian cluster analysis: Point estimation and credible balls (with discussion),” Bayesian Anal., vol. 13, no. 2, pp. 559–626, 2018.

The main proposal from this paper is to summarise the posterior on all possible clusterings by an optimal clustering where optimality is defined as minimising the posterior expectation of a specific loss function, the Variation of Information. Credible sets are also available.

We use the implementation described in R. Rastelli and N. Friel, “Optimal Bayesian estimators for latent variable cluster models,” Stat. Comput., vol. 28, no. 6, pp. 1169–1186, Nov. 2018, which is faster and implemented in the CRAN package GreedyEPL.

Using this approach requires installing the R package GreedyEPL, which can be achieved with the following command:

install.packages("GreedyEPL")

Note that investigating the clustering makes more sense for the fully Nonparametric NRMI model than for the Semiparametric. This is because to use a single scale parameters for all the clusters, the Semiparametric model may favour numerous small clusters, for flexibility. The larger number of clusters may render interpretation of the clusters more challenging.

The clustering structure may be visualised as follows:

data(acidity)
out <- MixNRMI2(acidity,  extras = TRUE)
#> MCMC iteration 500 of 1500 
#> MCMC iteration 1000 of 1500 
#> MCMC iteration 1500 of 1500 
#>  >>> Total processing time (sec.):
#>    user  system elapsed 
#>  22.760   0.044  22.804
clustering = compute_optimal_clustering(out)
plot_clustering_and_CDF(out, clustering)

Functions in BNPdensity

Name Description
Enzyme1.out Fit of MixNRMI1 function to the enzyme dataset
GOFplots_noncensored Plot Goodness of fits graphical checks for non censored data
Galaxy2.out Fit of MixNRMI2 function to the galaxy dataset
GOFplots Plot Goodness of fits graphical checks for censored data
Galaxy1.out Fit of MixNRMI1 function to the galaxy dataset
BNPdensity-package Bayesian nonparametric density estimation
MixNRMI1 Normalized Random Measures Mixture of Type I
Enzyme2.out Fit of MixNRMI2 function to the enzyme dataset
GOFplots_censored Plot Goodness of fits graphical checks for censored data
MixNRMI1cens Normalized Random Measures Mixture of Type I for censored data
Mv Continuous Jump heights function
cens_data_check Censoring data check
grid_from_data Create a plotting grid from censored or non-censored data.
dhalfcauchy Density half Cauchy
fcondXA Conditional density evaluation in the semiparametric model
MvInv Invert jump heights function
cpo Conditional predictive ordinate function
fcondXA2 Conditional density evaluation in the fully nonparametric model
MixNRMI2cens Normalized Random Measures Mixture of Type II for censored data
MixNRMI2 Normalized Random Measures Mixture of Type II
dtnorm Density truncated normal
grid_from_data_censored Create a plotting grid from censored data.
comp2 Ties function: bivariate
dk Kernel density function
compute_thinning_grid Compute the grid for thinning the MCMC chain
censor_code_rl Censor code right-left
dkcens2 Density of the chosen kernel
convert_to_mcmc Convert the output of multMixNRMI into a coda mcmc object
as.mcmc.multNRMI Convert the output of multMixNRMI into a coda mcmc object
fill_sigmas Repeat the common scale parameter of a semiparametric model to match the dimension of the location parameters.
asNumeric_no_warning If the function Rmpfr::asNumeric returns a warning about inefficiency, silence it.
fcondYZXAcens2 Conditional posterior distribution of the bivariate latents (Y,Z) in the case of censoring
dhalfnorm Density half normal
acidity Acidity Index Dataset
galaxy Galaxy Data Set
gs4 Resampling Ystar function
expected_number_of_components_Dirichlet Computes the expected number of components for a Dirichlet process.
expected_number_of_components_stable Computes the expected number of components for a stable process.
give_kernel_name Gives the kernel name from the integer code
phalft Distribution function half Student-t
dhalft Density half Student-t
gs4cens2 Resampling Ystar function in the case of censoring
pk Kernel distribution function
fcondYXA Conditional posterior distribution of the latents Y
plotCDF_noncensored Plot the empirical and fitted CDF for non censored data.
fcondXA2cens2 Conditional density evaluation in the fully nonparametric model for censored data
compute_optimal_clustering Compute the optimal clustering from an MCMC sample
gsHP Updates the hyper-parameters of py0
multMixNRMI2 Multiple chains of MixNRMI2
multMixNRMI1cens Multiple chains of MixNRMI1cens
gsYZstar Jointly resampling Ystar and Zstar function
print.NRMI2cens S3 method for class 'MixNRMI2cens'
plotPDF_censored Plot the density for censored data.
print.multNRMI S3 method for class 'multNRMI'
salinity Salinity tolerance
gs5 Conditional posterior distribution of sigma
plotPDF_noncensored Plot the density and a histogram for non censored data.
enzyme Enzyme Dataset
plot_clustering_and_CDF Plot the clustering and the Cumulative Distribution Function
grid_from_data_noncensored Create a plotting grid from non-censored data.
plot.NRMI1 Plot the density estimate and the 95% credible interval
plot.NRMI1cens Plot the density estimate and the 95% credible interval
rtnorm Random number generator for a truncated normal distribution
plot_prior_number_of_components This plots the prior distribution on the number of components for the stable process. The Dirichlet process is provided for comparison.
gs3 Conditional posterior distribution of latent U
ptnorm Distribution function truncated normal
qq_plot_censored Plot the quantile-quantile graph for censored data.
gsYZstarcens2 Jointly resampling Ystar and Zstar function in the case of censoring
qq_plot_noncensored Plot the quantile-quantile graph for non censored data.
add Add x and y
summary.NRMI1 S3 method for class 'MixNRMI1'
fcondYZXA Conditional posterior distribution of the bivariate latents (Y,Z)
comp1 Ties function: univariate
is_semiparametric Tests if a fit is a semi parametric or nonparametric model.
dkcens2_1val Density evaluation once
comment_on_NRMI_type Comment on the NRMI process depending on the value of the parameters
dt_ Non-standard student-t density
fcondYXAcens2 Conditional posterior distribution of the latents Y in the censoring case
gs5cens2 Conditional posterior distribution of sigma in the case of censoring
is_censored Test if the data is censored
rhalft Random number generator half Student-t
plotfit_censored Plot the density estimate and the 95% credible interval for censored data
rk Kernel density sampling function
plot.NRMI2 Plot the density estimate and the 95% credible interval
print.NRMI1 S3 method for class 'MixNRMI1'
qhalft Quantile function half Student-t
phalfnorm Distribution function half Normal
plot.NRMI2cens Plot the density estimate and the 95% credible interval
phalfcauchy Distribution function half Cauchy
qhalfnorm Quantile function half Normal
pp_plot_noncensored Plot the percentile-percentile graph for non censored data.
rfyzstar Conditional posterior distribution of the distinct vectors (Ystar,Zstar)
multMixNRMI1 Multiple chains of MixNRMI1
pt_ Distribution function non-standard student-t
rt_ Random number generator non-standard Student-t
print.NRMI2 S3 method for class 'MixNRMI2'
rhalfnorm Random number generator half Normal
qhalfcauchy Quantile function half Cauchy
p0 Centering function
print.NRMI1cens S3 method for class 'MixNRMI1cens'
multMixNRMI2cens Multiple chains of MixNRMI2cens
rhalfcauchy Random number generator half Cauchy
qgeneric Generic function to find quantiles of a distribution
rfyzstarcens2 Conditional posterior distribution of the distinct vectors (Ystar,Zstar) in the case of censoring
summarytext Common text for the summary S3 methods
summary.NRMI2cens S3 method for class 'MixNRMI2cens'
summary.multNRMI S3 method for class 'multNRMI'
traceplot Draw a traceplot for multiple chains
plotCDF_censored Plot the Turnbull CDF and fitted CDF for censored data.
pp_plot_censored Plot the percentile-percentile graph for non censored data, using the Turnbull estimator the position of the percentiles.
qtnorm Quantile function truncated normal
qt_ Quantile function non-standard Student-t
plotfit_noncensored Plot the density estimate and the 95% credible interval for noncensored data
plot.multNRMI Plot the density estimate and the 95% credible interval
rfystar Conditional posterior distribution of the distinct Ystar
summary.NRMI1cens S3 method for class 'MixNRMI1cens'
summary.NRMI2 S3 method for class 'MixNRMI2'
rfystarcens2 Conditional posterior distribution of the distinct Ystar in the case of censoring
No Results!

Last month downloads

Details

Type Package
License GPL (>= 2)
Encoding UTF-8
LazyData yes
ByteCompile yes
RoxygenNote 6.1.1
NeedsCompilation yes
Packaged 2019-09-11 10:46:36 UTC; guillaume
Repository CRAN
Date/Publication 2019-09-11 11:10:05 UTC

Include our badge in your README

[![Rdoc](http://www.rdocumentation.org/badges/version/BNPdensity)](http://www.rdocumentation.org/packages/BNPdensity)