Learn R Programming

BNPdensity

Bayesian nonparametric density estimation modeling mixtures by a Ferguson-Klass type algorithm for posterior normalized random measures.

Installation

You can install BNPdensity from github with:

# install.packages("devtools")
devtools::install_github("konkam/BNPdensity", dependencies = TRUE, upgrade = TRUE)

You will need to have the CRAN package devtools installed.

Problem setting

We consider a one-dimensional density estimation problem. As an example, we pick the acidity dataset, which contains an acidity index measured in a sample of 155 lakes in north-central Wisconsin (Crawford et al. (1992)).

We illustrate the package by estimating the distribution of the acidity dataset.

Let’s have a quick overview of the acidity dataset:

library(BNPdensity)
data(acidity)
str(acidity)
#>  num [1:155] 2.93 3.91 3.73 3.69 3.82 ...
hist(acidity)

This dataset shows clear signs of multimodality, and it might be tempting to fit a bimodal normal distribution. However, some asymmetry in the two apparent clusters, or the extreme point on the left suggest that a mixture with more components might be more appropriate.

A Bayesian Nonparametric approach avoids the need to specify an arbitrary number of clusters in advance, it rather aims at estimating an optimal number of clusters from the dataset by means of an infinite mixture model.

The most famous Bayesian Nonparametric model for infinite mixtures is the Dirichlet process. However, it implies a fairly informative a priori distribution on the number of components in the mixture: the mode of the prior distribution is proportional to the logarithm of the number of data points, and the distribution is fairly peaked. This might not accurately reflect the prior information available on the number of components, or might induce an unwanted bias in the case of prior mis-specification. Models more general than the Dirichlet process, such as Normalized Random Measure models (Barrios et al. (2013)) allow overcoming this limitation by specifying a less informative prior. In what follows, we present how to use Normalized Random Measure models for density estimation.

Obtaining a Bayesian Nonparametric density estimate in a few seconds

library(BNPdensity)
data(acidity)
fit = MixNRMI1(acidity, Nit = 3000)
#> MCMC iteration 500 of 3000 
#> MCMC iteration 1000 of 3000 
#> MCMC iteration 1500 of 3000 
#> MCMC iteration 2000 of 3000 
#> MCMC iteration 2500 of 3000 
#> MCMC iteration 3000 of 3000 
#>  >>> Total processing time (sec.):
#>    user  system elapsed 
#> 136.064   0.144 136.288

MixNRMI1() creates an object of class MixNRMI1, for which we provide common S3 methods.

print(fit)
#> Fit of a semiparametric normal mixture model on 155 data points.
#> The MCMC algorithm was run for 3000 iterations with 10 % discarded for burn-in.
plot(fit)
#> Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
#> ℹ Please use `after_stat(density)` instead.
#> ℹ The deprecated feature was likely used in the BNPdensity package.
#>   Please report the issue to the authors.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

How to select the parameters of the Normalized Generalized Gamma process

We suggest the Normalized stable process, which corresponds to setting Alpha = 1, Kappa = 0 in the MixNRMIx functions. The stable process is a convenient model because its parameter γ has a convenient interpretation: it can be used to tune how informative the prior on the number of components is. Small values of Gama bring the process closer to a Dirichlet process, where the prior on the number of components is a relatively peaked distribution around $\alpha \log n$. Larger values of Gama make this distribution flatter. More guidelines on how to choose the parameters may be found in Lijoi et al. (2007b), notably by considering the expected prior number of components.

We provide a function to compute the expected number of components for a normalized stable process:

library(BNPdensity)
expected_number_of_components_stable(100, 0.8)
#> 1 'mpfr' number of precision  5300   bits 
#> [1] 42.7094763347433063870164430179377602285097873537020509591067085108693089953308485568922629628925609670388816355409267145879811608727979635644774974037316163731591853953068391985644986476120011372960450707841949357628205832352455972108307720257846739029935831367778087474278900096286104350854420468743750780953727437388381753822866058766720173377298935849197133078525375521665344039604097969307794640101701067874475659220445305472235893927481095361462556416422489918801010270531398200441645390722656975243565592756687766475481859394293400264017298519219232910313654037149553666679890694710843958583376865920388291893659788102528801718172192946031922288386233821810410501186383355395149325749466413257183113733305353511545188982744656555395398972901331812210878443370420188050079296961588354717569901177367135749962447447224416420675294481931830396470678197938068495726792098492301861867848615495349033316737274460428156601729995582227093984927573684202948081480326602431090471715657360243449939490449

This number may be compared to the prior number of components induced by a Dirichlet process with Alpha = 1:

expected_number_of_components_Dirichlet(100, 1.)
#> [1] 5.187378

We also provide a way to visualise the prior distribution on the number of components:

plot_prior_number_of_components(50, 0.4) 
#> Computing the prior probability on the number of clusters for the Dirichlet process
#> Computing the prior probability on the number of clusters for the Stable process

## How to fit a dataset

We illustrate the package by estimating the distribution of the acidity dataset.

library(BNPdensity)
data(acidity)
str(acidity)
#>  num [1:155] 2.93 3.91 3.73 3.69 3.82 ...
hist(acidity)

library(BNPdensity)
data(acidity)
fit = MixNRMI1(acidity, Nit = 3000)
#> MCMC iteration 500 of 3000 
#> MCMC iteration 1000 of 3000 
#> MCMC iteration 1500 of 3000 
#> MCMC iteration 2000 of 3000 
#> MCMC iteration 2500 of 3000 
#> MCMC iteration 3000 of 3000 
#>  >>> Total processing time (sec.):
#>    user  system elapsed 
#> 132.491   0.047 132.594

MixNRMI1() creates an object of class MixNRMI1, for which we provide common S3 methods.

print(fit)
#> Fit of a semiparametric normal mixture model on 155 data points.
#> The MCMC algorithm was run for 3000 iterations with 10 % discarded for burn-in.
plot(fit)

summary(fit)
#> Density estimation using a Normalized stable process,
#> with stability parameter Gamma = 0.4
#> 
#> A semiparametric normal mixture model was used.
#> 
#> There were 155 data points.
#> 
#> The MCMC algorithm was run for 3000 iterations with 10% discarded for burn-in.
#> 
#> To obtain information on the estimated number of clusters,
#>  please use summary(object, number_of_clusters = TRUE).

How to use the convergence diagnostics

We also provide an interface to run several chains in parallel, using the functions multMixNRMI1(). We interface our package with the coda package by providing a conversion method for the output this function. This allows for instance to compute the convergence diagnostics included in coda.

One detail is that due to the Nonparametric nature of the model, the number of parameters which could potentially be monitored for convergence of the chains varies. The location parameter of the clusters, for instance, vary at each iteration, and even the labels of the clusters vary, which makes them tricky to follow. However, it is possible to monitor the log-likelihood of the data along the iterations, the value of the latent variable u, the number of components and for the semi-parametric model, the value of the common scale parameter.

library(BNPdensity)
library(coda)
#> 
#> Attaching package: 'coda'
#> The following object is masked from 'package:BNPdensity':
#> 
#>     traceplot
data(acidity)
fitlist = multMixNRMI1(acidity, Nit = 5000)
mcmc_list = as.mcmc(fitlist)
coda::traceplot(mcmc_list)

coda::gelman.diag(mcmc_list)

How to use the Goodness of fit plots

Non censored data

library(BNPdensity)
data(acidity)
fit = MixNRMI1(acidity, extras = TRUE)
#> MCMC iteration 500 of 1500 
#> MCMC iteration 1000 of 1500 
#> MCMC iteration 1500 of 1500 
#>  >>> Total processing time (sec.):
#>    user  system elapsed 
#>  57.618   0.099  57.746
GOFplots(fit)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Censored data

library(BNPdensity)
data(salinity)
fit = MixNRMI1cens(salinity$left,salinity$right, extras = TRUE)
#> MCMC iteration 500 of 1500 
#> MCMC iteration 1000 of 1500 
#> MCMC iteration 1500 of 1500 
#>  >>> Total processing time (sec.):
#>    user  system elapsed 
#>  55.512   0.005  55.526
GOFplots(fit)

Posterior analysis of the clustering structure

The MCMC algorithm provides a sample of the posterior distribution on the space of all clusterings. This is a very large discrete space, which is not ordered. This means that for any reasonably sized problem, each configuration in the posterior will have been explored no more than once or twice, and that many potentially good configurations will not be present in the MCMC sample. Moreover, the lack of ordering makes it not trivial to summarize the posterior by an optimal clustering and to provide credible sets.

We suggest using the approach developed in S. Wade and Z. Ghahramani, “Bayesian cluster analysis: Point estimation and credible balls (with discussion),” Bayesian Anal., vol. 13, no. 2, pp. 559–626, 2018.

The main proposal from this paper is to summarize the posterior on all possible clusterings by an optimal clustering where optimality is defined as minimizing the posterior expectation of a specific loss function, the Variation of Information. Credible sets are also available.

We use the implementation described in R. Rastelli and N. Friel, “Optimal Bayesian estimators for latent variable cluster models,” Stat. Comput., vol. 28, no. 6, pp. 1169–1186, Nov. 2018, which is faster and implemented in the CRAN package GreedyEPL.

Using this approach requires installing the R package GreedyEPL, which can be achieved with the following command:

install.packages("GreedyEPL")

Note that investigating the clustering makes more sense for the fully Nonparametric NRMI model than for the Semiparametric. This is because to use a single scale parameters for all the clusters, the Semiparametric model may favor numerous small clusters, for flexibility. The larger number of clusters may render interpretation of the clusters more challenging.

The clustering structure may be visualized as follows:

data(acidity)
out <- MixNRMI2(acidity,  extras = TRUE)
#> MCMC iteration 500 of 1500 
#> MCMC iteration 1000 of 1500 
#> MCMC iteration 1500 of 1500 
#>  >>> Total processing time (sec.):
#>    user  system elapsed 
#>  17.107   0.103  17.218
clustering = compute_optimal_clustering(out)
plot_clustering_and_CDF(out, clustering)

We now also offer the option of using the implementation described in:

D. B. Dahl, D. J. Johnson, and P. Müller (2022),
Search Algorithms and Loss Functions for Bayesian Clustering,
Journal of Computational and Graphical Statistics, 31(4), 1189–1201.
https://doi.org/10.1080/10618600.2022.2069779

This approach is implemented in the R package salso, and can be used as an alternative to GreedyEPL for computing the optimal clustering under the same loss functions (e.g., Variation of Information). SALSO by default performs multiple runs with different initialisations, which improves the robustness of the optimisation results.

To use this alternative, simply specify the method in the compute_optimal_clustering function:

#install.packages("salso")  # if not already installed

data(acidity)
out <- MixNRMI2(acidity, extras = TRUE)
#> MCMC iteration 500 of 1500 
#> MCMC iteration 1000 of 1500 
#> MCMC iteration 1500 of 1500 
#>  >>> Total processing time (sec.):
#>    user  system elapsed 
#>  20.550   0.046  20.626
clustering <- compute_optimal_clustering(out, method = "SALSO")
plot_clustering_and_CDF(out, clustering)

Copy Link

Version

Install

install.packages('BNPdensity')

Monthly Downloads

432

Version

2025.7.29

License

GPL (>= 2)

Maintainer

Guillaume Kon Kam King

Last Published

July 29th, 2025

Functions in BNPdensity (2025.7.29)

asNumeric_no_warning

If the function Rmpfr::asNumeric returns a warning about inefficiency, silence it.
MixNRMI2cens

Normalized Random Measures Mixture of Type II for censored data
as.mcmc.multNRMI

Convert the output of multMixNRMI into a coda mcmc object
add

Add x and y
MixPY1

Pitman-Yor process mixture of Type I
MixPY2

Pitman-Yor process mixture of Type II
MvInv

Invert jump heights function
acidity

Acidity Index Dataset
cens_data_check

Censoring data check
censor_code_rl

Censor code right-left
Mv

Continuous Jump heights function
compute_optimal_clustering

Compute the optimal clustering from an MCMC sample
compute_thinning_grid

Compute the grid for thinning the MCMC chain
cpo

Conditional predictive ordinate function
cpo.NRMI2

Extract the Conditional Predictive Ordinates (CPOs) from a fitted object
dhalfcauchy

Density half Cauchy
dhalfnorm

Density half normal
cpo.default

Extract the Conditional Predictive Ordinates (CPOs) from a fitted object
comment_on_NRMI_type

Comment on the NRMI process depending on the value of the parameters
cpo.multNRMI

Extract the Conditional Predictive Ordinates (CPOs) from a list of fitted objects
dist_name_k_index_converter

Convert distribution names to indices
dtnorm

Density truncated normal
comp2

Ties function: bivariate
comp1

Ties function: univariate
dhalft

Density half Student-t
grid_from_data_censored

Create a plotting grid from censored data.
grid_from_data

Create a plotting grid from censored or non-censored data.
enzyme

Enzyme Dataset
fcondYXAcens2

Conditional posterior distribution of the latents Y in the censoring case
give_kernel_name

Gives the kernel name from the integer code
galaxy

Galaxy Data Set
dk

Kernel density function
fcondXA2cens2

Conditional density evaluation in the fully nonparametric model for censored data
dkcens2

Density of the chosen kernel
gs3_adaptive3

Conditional posterior distribution of latent U
convert_to_mcmc

Convert the output of multMixNRMI into a coda mcmc object
fcondYXA

Conditional posterior distribution of the latents Y
gsHP

Updates the hyper-parameters of py0
gs3_log

Conditional posterior distribution of latent logU
dt_

Non-standard student-t density
cpo.NRMI1

Extract the Conditional Predictive Ordinates (CPOs) from a fitted object
is_semiparametric

Tests if a fit is a semi parametric or nonparametric model.
log_Vnk_PY

Calculate the Logarithm of the Vnk weights for the Pitman-Yor model
gsYZstar

Jointly resampling Ystar and Zstar function
fcondYZXA

Conditional posterior distribution of the bivariate latents (Y,Z)
dkcens2_1val

Density evaluation once
is_censored

Test if the data is censored
gs5

Conditional posterior distribution of sigma
logf_logu_cond_y

Contribution of the target logdensity of logU to the Metropolis-Hastings ratio
multMixNRMI2cens

Multiple chains of MixNRMI2cens
multMixNRMI2

Multiple chains of MixNRMI2
phalfnorm

Distribution function half Normal
plot.NRMI2

Plot the density estimate and the 95% credible interval
plot.PY1

Plot the density estimate and the 95% credible interval
logf_u_cond_y

Target logdensity of U given the data
multMixNRMI1

Multiple chains of MixNRMI1
gs4

Resampling Ystar function
plot_prior_number_of_components

This plots the prior distribution on the number of components for the stable process. The Dirichlet process is provided for comparison.
plot_clustering_and_CDF

Plot the clustering and the Cumulative Distribution Function
multMixNRMI1cens

Multiple chains of MixNRMI1cens
phalft

Distribution function half Student-t
gs4cens2

Resampling Ystar function in the case of censoring
fcondXA

Conditional density evaluation in the semiparametric model
print.NRMI1

S3 method for class 'MixNRMI1'
plot.NRMI1

Plot the density estimate and the 95% credible interval
pk

Kernel distribution function
grid_from_data_noncensored

Create a plotting grid from non-censored data.
pt_

Distribution function non-standard student-t
fcondXA2

Conditional density evaluation in the fully nonparametric model
ptnorm

Distribution function truncated normal
gs5cens2

Conditional posterior distribution of sigma in the case of censoring
logacceptance_ratio_logu

Metropolis-Hastings ratio for the conditional of logU
rprop_logu

Proposal distribution for logU
gs3

Conditional posterior distribution of latent U
rt_

Random number generator non-standard Student-t
fill_sigmas

Repeat the common scale parameter of a semiparametric model to match the dimension of the location parameters.
plotfit_noncensored

Plot the density estimate and the 95% credible interval for noncensored data
logdprop_logu

Contribution of the proposal kernel logdensity to the Metropolis-Hastings ratio
expected_number_of_components_stable

Computes the expected number of components for a stable process.
expected_number_of_components_Dirichlet

Computes the expected number of components for a Dirichlet process.
plotCDF_noncensored

Plot the empirical and fitted CDF for non censored data.
plotfit_censored

Plot the density estimate and the 95% credible interval for censored data
plotCDF_censored

Plot the Turnbull CDF and fitted CDF for censored data.
plot.multNRMI

Plot the density estimate and the 95% credible interval
plot.PY2

Plot the density estimate and the 95% credible interval
phalfcauchy

Distribution function half Cauchy
p0

Centering function
fcondYZXAcens2

Conditional posterior distribution of the bivariate latents (Y,Z) in the case of censoring
print.NRMI2

S3 method for class 'MixNRMI2'
gsYZstarcens2

Jointly resampling Ystar and Zstar function in the case of censoring
qhalfcauchy

Quantile function half Cauchy
rhalft

Random number generator half Student-t
print.PY1

S3 method for class 'PY1'
print.PY2

S3 method for class 'PY2'
rfyzstarcens2

Conditional posterior distribution of the distinct vectors (Ystar,Zstar) in the case of censoring
pp_plot_censored

Plot the percentile-percentile graph for non censored data, using the Turnbull estimator the position of the percentiles.
plotPDF_noncensored

Plot the density and a histogram for non censored data.
plotPDF_censored

Plot the density for censored data.
process_dist_name

Process the distribution name argument into a distribution index
qhalfnorm

Quantile function half Normal
pp_plot_noncensored

Plot the percentile-percentile graph for non censored data.
thresholdGG

Choosing the truncation level for the NGG process
rfyzstar

Conditional posterior distribution of the distinct vectors (Ystar,Zstar)
qgeneric

Generic function to find quantiles of a distribution
rfystar

Conditional posterior distribution of the distinct Ystar
qq_plot_censored

Plot the quantile-quantile graph for censored data.
rk

Kernel density sampling function
traceplot

Draw a traceplot for multiple chains
summary.NRMI2

S3 method for class 'MixNRMI2'
summary.NRMI1

S3 method for class 'MixNRMI1'
rhalfnorm

Random number generator half Normal
qhalft

Quantile function half Student-t
rfystarcens2

Conditional posterior distribution of the distinct Ystar in the case of censoring
rhalfcauchy

Random number generator half Cauchy
qq_plot_noncensored

Plot the quantile-quantile graph for non censored data.
summary.PY1

S3 method for class 'PY1'
summary.PY2

S3 method for class 'PY2'
summarytext

Common text for the summary S3 methods
print.multNRMI

S3 method for class 'multNRMI'
summary.multNRMI

S3 method for class 'multNRMI'
qt_

Quantile function non-standard Student-t
rtnorm

Random number generator for a truncated normal distribution
salinity

Salinity tolerance
qtnorm

Quantile function truncated normal
MixNRMI1cens

Normalized Random Measures Mixture of Type I for censored data
GOFplots_censored

Plot Goodness of fits graphical checks for censored data
GOFplots

Plot Goodness of fits graphical checks for censored data
GOFplots_noncensored

Plot Goodness of fits graphical checks for non censored data
Galaxy2.out

Fit of MixNRMI2 function to the galaxy dataset
Enzyme1.out

Fit of MixNRMI1 function to the enzyme dataset
MixNRMI1

Normalized Random Measures Mixture of Type I
MixNRMI2

Normalized Random Measures Mixture of Type II
Enzyme2.out

Fit of MixNRMI2 function to the enzyme dataset
Galaxy1.out

Fit of MixNRMI1 function to the galaxy dataset