bnpglm: Bayesian nonparametric generalized linear models

Description

Fits Dirichlet Process mixtures of joint response-covariate models, where the covariates are continuous while the discrete response is represented utilizing continuous latent variables. See `Details' section for a full model description.

Usage

bnpglm(formula,family=poisson,data,offset,sampler="slice",WorkingDir,
       ncomp,sweeps,burn,seed,V,Vdf,Mu.nu,Sigma.nu,Mu.mu,Sigma.mu,
       Alpha.gamma,Beta.gamma,Alpha.alpha,Beta.alpha,Turnc.alpha,
       Xpred,offsetPred,...)

Arguments

formula

a formula defining the response and the covariates e.g. y ~ x.

family

a description of the distribution of the response variable. family=poisson and family=binomial are the current options.

data

an optional data frame, list or environment (or object coercible by `as.data.frame' to a data frame) containing the variables in the model. If not found in `data', the variables are taken from `environment(formula)'.

offset

this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be `NULL' or a numeric vector of length equal to the sample size. One `offset' term can be included i

sampler

the MCMC algorithm to be utilized. The two options are sampler="slice" which implements the slice sampler (Walker, 2007; Papaspiliopoulos, 2008) and sampler="truncated" which proceeds

WorkingDir

a directory to store files with the posterior samples of all models parameters. If a directory is not provided, files are not written.

ncomp

number of mixture components. Defines where the countable mixture of densities [in (1) below] is truncated. Even if sampler="slice" is chosen, ncomp needs to be specified as it is used in the initialization process

sweeps

integer number of posterior samples required.

burn

integer number of samples in the burn-in period. In the files described below, sweeps-burn samples will be written.

seed

optional seed for the random generator.

scale matrix $V$ of the prior Wishart distribution assigned to precision matrix $T_h$. See `Details' section.

Vdf

degrees of freedom Vdf of the prior Wishart distribution assigned to precision matrix $T_h$. See `Details' section.

Mu.nu

prior mean $\mu_{\nu}$ of the covariance vector $\nu_h$. See `Details' section.

Sigma.nu

prior covariance matrix $\Sigma_{\nu}$ of $\nu_h$. See `Details' section.

Mu.mu

prior mean $\mu_{\mu}$ of the mean vector $\mu_h$. See `Details' section.

Sigma.mu

prior covariance matrix $\Sigma_{\mu}$ of $\mu_h$. See `Details' section.

Alpha.gamma

shape parameter $\alpha_{\gamma}$ of the Gamma prior assigned to the Poisson rate $\gamma_h$. See `Details' section.

Beta.gamma

rate parameter $\beta_{\gamma}$ of the Gamma prior assigned to the Poisson rate $\gamma_h$. See `Details' section.

Alpha.alpha

shape parameter $\alpha_{\alpha}$ of the Gamma prior assigned to the concentration parameter $\alpha$. See `Details' section.

Beta.alpha

rate parameter $\beta_{\alpha}$ of the Gamma prior assigned to concentration parameter $\alpha$. See `Details' section.

Turnc.alpha

truncation point $c_{\alpha}$ of the Gamma prior assigned to concentration parameter $\alpha$. See `Details' section.

Xpred

A design matrix the rows of which include the covariates $x$ for which the conditional distribution of $Y|x,D$ (where $D$ denotes the data) is calculated. These are treated as `new' covariates i.e. they do not contribute to the

offsetPred

The offset term associated with the new covariates Xpred. offsetPred is a vector of length equal to the rows of Xpred. If family=poisson, its entries are the associated

...

Other options that will be ignored.

Value

Function bnpglm returns the following:
callthe matched call.
seedthe seed that was used (in case replication of the results is needed).
meanRegif Xpred is specified, the function returns the conditional expectation of the response given each new covariate $x$.
medianRegif Xpred is specified, the function returns the conditional median of the response given each new covariate $x$.
q1Regif Xpred is specified, the function returns the conditional first quantile of the response given each new covariate $x$.
q3Regif Xpred is specified, the function returns the conditional third quantile of the response given each new covariate $x$.
modeRegif Xpred is specified, the function returns the conditional mode of the response given each new covariate $x$.
Further, function bnpglm creates files where the posterior samples are written. These files are (with all file names preceded by `BNSP.'):
Th.txtthis file contains samples from the posteriors of the $p \times p$ precision matrices $T_h, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p^2$ columns. In more detail, each sweep creates ncomp lines representing samples $T_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $T_h^{(sw)}$ are written in the columns of the file: the entries in the first $p$ columns of the file are those in the first column (or row) of $T_h^{(sw)}$, while the entries in the last $p$ columns of the file are those in the last column (or row) of $T_h^{(sw)}$.
Sigmah.txtthis file contains samples from the posteriors of the $p \times p$ covariance matrices $\Sigma_h, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p^2$ columns. In more detail, each sweep creates ncomp lines representing samples $\Sigma_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $\Sigma_h^{(sw)}$ are written in the columns of the file: the entries in the first $p$ columns of the file are those in the first column (or row) of $\Sigma_h^{(sw)}$, while the entries in the last $p$ columns of the file are those in the last column (or row) of $\Sigma_h^{(sw)}$.
SigmahI.txtthis file contains samples from the posteriors of the $p \times p$ precision matrices $\Sigma_h^{-1}, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p^2$ columns. In more detail, each sweep creates ncomp lines representing samples $(\Sigma_h^{-1})^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $(\Sigma_h^{-1})^{(sw)}$ are written in the columns of the file: the entries in the first $p$ columns of the file are those in the first column (or row) of $(\Sigma_h^{-1})^{(sw)}$, while the entries in the last $p$ columns of the file are those in the last column (or row) of $(\Sigma_h^{-1})^{(sw)}$.
nuh.txtthis file contains samples from the posteriors of the $p$-dimensional covariance vectors $\nu_h, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p$ columns. In more detail, each sweep creates ncomp lines representing samples $\nu_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $\nu_h^{(sw)}$ are written in the columns of the file.
muh.txtthis file contains samples from the posteriors of the $p$-dimensional mean vectors $\mu_h, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p$ columns. In more detail, each sweep creates ncomp lines representing samples $\mu_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $\mu_h^{(sw)}$ are written in the columns of the file.
gammah.txtthis file contains samples from the posteriors of the mean parameters $\gamma_h$, $h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and one column. Sweeps write in the file ncomp lines representing samples $\gamma_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep.
alpha.txtthis file contains samples from the posterior of the concentration parameters $\alpha$. The file is arranged in (sweeps-burn) lines and one column, each line including one posterior sample.
compAlloc.txtthis file contains the allocations or configurations obtained at each iteration of the sampler. It consists of (sweeps-burn) lines, that represent the posterior samples, and $n$ columns, that represent the sampling units. Entries in this file range from 0 to $ncomp-1$.
nmembers.txtthis file contains (sweeps-burn) lines and ncomp columns, where the lines represent posterior samples while the columns represent the components or clusters. The entries represent the number of sampling units allocated to the components.
Updated.txtthis file contains (sweeps-burn) lines with the number of components updated at each iteration of the sampler.
PD.txtthis file contains samples from the posterior conditional distribution $Y|x,D$ described in Xpred. The file has (sweeps-burn)*npred lines, where npred is the number of rows in Xpred. That is, at each iteration of the sampler, one line for each `new' covariate vector $x$ is written. The columns of the file represent the possible values of $Y$, staring from zero and continuing to a max number.

Details

Function bnpglm returns samples from the posterior distributions of the parameters of the model: $$f(y_i,x_i) = \sum_{h=0}^{\infty} \pi_h f(y_i,x_i|\theta_h), \hspace{80pt} (1)$$ where $y_i$ is a univariate response with distribution that belongs in the exponential family, $x_i$ is a $p$-dimensional vector of continuous covariates, and $\pi_h, h \geq 1,$ are obtained according to Sethuraman's (1994) stick-breaking construction: $\pi_1 = v_1$, and for $l \geq 2, \pi_l = v_l \prod_{j=1}^{l-1} (1-v_j)$, where $v_k$ are iid samples $v_k \sim$Beta$(1,\alpha), k \geq 1.$ The discrete responses $y_i$ are represented as discretized versions of continuous latent variables $y_i^*$. Observed discrete and continuous latent variables are connected by: $$y_{i} = q \iff c_{h,q-1} < y^*_{i} < c_{h,q},$$ where the cut-points are obtained as: $c_{h,-1} = -\infty$, while for $q \geq 0$, $c_{h,q} = c_{q}(\gamma_{h},H_i) = \Phi^{-1}{F(q;\gamma_{h},H_i)}.$ Here $\Phi(.)$ is the cumulative distribution function (cdf) of a standard normal variable, and $F(.;\gamma,H)$ denotes an appropriate cdf. For instance, for modeling count data, $F(.;\gamma,H)$ denotes the cdf of a Poisson$(H\gamma)$ variable, where $H$ denotes the offset term, while for modeling Binomial data, $F(.;\gamma,H)$ denotes the cdf of a Binomial$(H,\gamma)$ variable, where $H$ denotes the number of trials. Further, latent variables are assumed to independently follow a $y_i^* \sim N(0,1)$ distribution, where the mean and variance are restricted to be zero and one as they are non-identifiable by the data. Joint vectors $(y_i^{*},x_{i})$ are modeled utilizing Gaussian distributions. Then, with $\theta_h$ denoting model parameters associated with the $h$th component, the joint density $f(y_{i},x_{i}|\theta_h)$ takes the form $$f(y_{i},x_{i}|\theta_h) = \int_{c_{i,y_i-1}}^{c_{i,y_i}} N_{p+1}(y_{i}^{*},x_{i}|\mu_{h},C_h) dy_{i}^{*},$$ where $\mu_h$ and $C_h$ denote the mean vector and covariance matrix, respectively. The joint distribution of the latent variable $y_i^{*}$ and the covariates $x_{i}$ is $$(y_{i}^{*},x_{i}^T)^T|\theta_h \sim N_{p+1}\left( \begin{array}{ll} \left( \begin{array}{l} 0 \ \mu_h \ \end{array} \right), & C_h=\left[ \begin{array}{ll} 1 & \nu_h^T \ \nu_h & \Sigma_h \ \end{array} \right] \end{array}\right),$$ where $\nu_h$ denotes the vector of covariances cov$(y_{i}^{*},x_{i}|\theta_h)$. Sampling from the posterior of constrained covariance matrix $C_h$ is done using methods similar to those of McCulloch et al. (2000). Specifically, the conditional $x_{i}|y_{i}^{*} \sim N_{p}(\mu_h+y_{i}^{*}\nu_h, B_h = \Sigma_h - \nu_h \nu_h^T)$ simplifies matters as there are no constraints on matrix $B_h$ (other than positive definiteness). Given priors for $B_h$ and $\nu_h$, it is easy to sample from their posteriors, and thus obtain samples from the posterior of $\Sigma_h=B_h+\nu_h \nu_h^T$. Specification of the prior distributions:

Define$T_h=B_h^{-1} = (\Sigma_{h} - \nu_h \nu_h^T)^{-1}, h \geq 1$. We specify that a priori$T_h \sim$Wishart$_{p}(V,$Vdf$)$, where$V$is a$p \times p$scale matrix and Vdf is a scalar degrees of freedom parameter. Default values are:$V = I_{p}/p$and Vdf$=p$, however, these can be changed using argumentsVand Vdf.
The assumed prior is$\nu_h \sim N_p(\mu_{\nu},\Sigma_{\nu}), h \geq 1$, with default vaules$\mu_{\nu}=0$and$\Sigma_{\nu} = I_{p}$. ArgumentsMu.nuandSigma.nuallow the user to change the default values.
A priori$\mu_{h} \sim N_p(\mu_{\mu},\Sigma_{\mu}), h \geq 1$. Here the default values are$\mu_{\mu} = \bar{x}$where$\bar{x}$denotes the sample mean of the covariates, and$\Sigma_{\mu} = D$where$D$denotes a diagonal matrix with diagonal elements equal to the square of the observed range of the covariates. ArgumentsMu.muandSigma.muallow the user to change the default values.
For count data, withfamily=poisson, a priori we take$\gamma_{h} \sim$Gamma$(\alpha_{\gamma},\beta_{\gamma}), h \geq 1$. The default values are$\alpha_{\gamma}=1.0,\beta_{\gamma}=0.1$, that define a Gamma distribution with mean$\alpha_{\gamma}/\beta_{\gamma}=10$and variance$\alpha_{\gamma}/\beta_{\gamma}^2=100.$For binomial data, withfamily=binomial, a priori we take$\gamma_{h} \sim$Beta$(\alpha_{\gamma},\beta_{\gamma}), h \geq 1$. The default values are$\alpha_{\gamma}=1.0,\beta_{\gamma}=1.0$, that define a uniform distribution. Users can alter the default using using argumentsAlpha.gammaandBeta.gamma.
The concentration parameter$\alpha$is assigned a Gamma$(\alpha_{\alpha},\beta_{\alpha})$prior over the range$(c_{\alpha},\infty)$, that is,$f(\alpha) \propto \alpha^{\alpha_{\alpha}-1} \exp{-\alpha \beta_{\alpha}} I[\alpha > c_{\alpha}]$, where$I[.]$is the indicator function. The default values are$\alpha_{\alpha}=2.0, \beta_{\alpha}=4.0$, and$c_{\alpha}=0.25$. Users can alter the default using using argumentsAlpha.alpha,Beta.alphaandTurnc.alpha.

References

McCulloch, R. E., Polson, N. G., & Rossi, P. E. (2000). A Bayesian analysis of the multinomial probit model with fully identified parameters. Journal of Econometrics, 99(1), 173-193. Papageorgiou, G., Richardson, S. and Best, N. (2014). Bayesian nonparametric models for spatially indexed data of mixed type. Papaspiliopoulos, O. (2008). A note on posterior sampling from Dirichlet mixture models. Technical report, University of Warwick. Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639-650. Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics Simulation and Computation, 36(1), 45-54.

Examples

Run this code

# Bayesian nonparametric GLM with Binomial response Y and one predictor X
data(simD)
pred<-seq(with(simD,min(X))+0.1,with(simD,max(X))-0.1,length.out=30)
npred<-length(pred)
# fit1 and fit2 define the same model but with different numbers of
# components and posterior samples
fit1 <- bnpglm(cbind(Y,(E-Y))~X, family=binomial, data=simD, ncomp=30, sweeps=150,
               burn=100, Xpred=pred, offsetPred=rep(30,npred))
fit2 <- bnpglm(cbind(Y,(E-Y))~X, family=binomial, data=simD, ncomp=50, sweeps=5000,
               burn=1000, Xpred=pred, offsetPred=rep(30,npred))
plot(X,Y/E)
lines(pred,fit2$medianReg,col=3,lwd=2)

Run the code above in your browser using DataLab