bnpglm: Bayesian nonparametric generalized linear models

Description

Fits Dirichlet process mixtures of joint response-covariate models, where the covariates are continuous while the discrete responses are represented utilizing continuous latent variables. See `Details' section for a full model description.

Usage

bnpglm(formula,family,data,offset,sampler="slice",WorkingDir,
       ncomp,sweeps,burn,seed,V,Vdf,Mu.nu,Sigma.nu,Mu.mu,Sigma.mu,
       Alpha.xi,Beta.xi,Alpha.alpha,Beta.alpha,Turnc.alpha,
       Xpred,offsetPred,...)

Arguments

formula

a formula defining the response and the covariates e.g. y ~ x.

family

a description of the distribution of the response variable. Currently four options are supported: "poisson", "binomial", "negative binomial" and "beta binomial".

data

an optional data frame, list or environment (or object coercible by `as.data.frame' to a data frame) containing the variables in the model. If not found in `data', the variables are taken from `environment(formula)'.

offset

this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be `NULL' or a numeric vector of length equal to the sample size. One `offset' term can be included i

sampler

the MCMC algorithm to be utilized. The two options are sampler="slice" which implements a slice sampler (Walker, 2007; Papaspiliopoulos, 2008) and sampler="truncated" which proceeds

WorkingDir

a directory to store files with the posterior samples of all models parameters. If a directory is not provided, files are created in the current directory and removed when the sampler completes.

ncomp

number of mixture components. Defines where the countable mixture of densities [in (1) below] is truncated. Even if sampler="slice" is chosen, ncomp needs to be specified as it is used in the initialization process

sweeps

integer number of posterior samples required.

burn

integer number of samples in the burn-in period. In the files described below, sweeps-burn samples will be written.

seed

optional seed for the random generator.

optional scale matrix $V$ of the prior Wishart distribution assigned to precision matrix $T_h$. See `Details' section.

Vdf

optional degrees of freedom Vdf of the prior Wishart distribution assigned to precision matrix $T_h$. See `Details' section.

Mu.nu

optional prior mean $\mu_{\nu}$ of the covariance vector $\nu_h$. See `Details' section.

Sigma.nu

optional prior covariance matrix $\Sigma_{\nu}$ of $\nu_h$. See `Details' section.

Mu.mu

optional prior mean $\mu_{\mu}$ of the mean vector $\mu_h$. See `Details' section.

Sigma.mu

optional prior covariance matrix $\Sigma_{\mu}$ of $\mu_h$. See `Details' section.

Alpha.xi

an optional parameter that depends on the specified family.

Iffamily="poisson", this argument is parameter$\alpha_{\xi}$of the prior of the Poisson rate:$\xi \sim$Gamma($\alpha_{\xi},\beta_{\xi}$).
Iffam

Beta.xi

an optional parameter that depends on the specified family.

Iffamily="poisson", this argument is parameter$\beta_{\xi}$of the prior of the Poisson rate:$\xi \sim$Gamma($\alpha_{\xi},\beta_{\xi}$).
Iffamil

Alpha.alpha

optional shape parameter $\alpha_{\alpha}$ of the Gamma prior assigned to the concentration parameter $\alpha$. See `Details' section.

Beta.alpha

optional rate parameter $\beta_{\alpha}$ of the Gamma prior assigned to concentration parameter $\alpha$. See `Details' section.

Turnc.alpha

optional truncation point $c_{\alpha}$ of the Gamma prior assigned to concentration parameter $\alpha$. See `Details' section.

Xpred

an optional design matrix the rows of which include the covariates $x$ for which the conditional distribution of $Y|x,D$ (where $D$ denotes the data) is calculated. These are treated as `new' covariates i.e. they do not contrib

offsetPred

the offset term associated with the new covariates Xpred. offsetPred is a vector of length equal to the rows of Xpred. If family=poisson, its entries are the associated

...

Other options that will be ignored.

Value

Function bnpglm returns the following:
callthe matched call.
seedthe seed that was used (in case replication of the results is needed).
meanRegif Xpred is specified, the function returns the conditional expectation of the response given each new covariate $x$.
medianRegif Xpred is specified, the function returns the conditional median of the response given each new covariate $x$.
q1Regif Xpred is specified, the function returns the conditional first quantile of the response given each new covariate $x$.
q3Regif Xpred is specified, the function returns the conditional third quantile of the response given each new covariate $x$.
modeRegif Xpred is specified, the function returns the conditional mode of the response given each new covariate $x$.
Further, function bnpglm creates files where the posterior samples are written. These files are (with all file names preceded by `BNSP.'):
Th.txtthis file contains samples from the posteriors of the $p \times p$ precision matrices $T_h, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p^2$ columns. In more detail, each sweep creates ncomp lines representing samples $T_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $T_h^{(sw)}$ are written in the columns of the file: the entries in the first $p$ columns of the file are those in the first column (or row) of $T_h^{(sw)}$, while the entries in the last $p$ columns of the file are those in the last column (or row) of $T_h^{(sw)}$.
Sigmah.txtthis file contains samples from the posteriors of the $p \times p$ covariance matrices $\Sigma_h, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p^2$ columns. In more detail, each sweep creates ncomp lines representing samples $\Sigma_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $\Sigma_h^{(sw)}$ are written in the columns of the file: the entries in the first $p$ columns of the file are those in the first column (or row) of $\Sigma_h^{(sw)}$, while the entries in the last $p$ columns of the file are those in the last column (or row) of $\Sigma_h^{(sw)}$.
SigmahI.txtthis file contains samples from the posteriors of the $p \times p$ precision matrices $\Sigma_h^{-1}, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p^2$ columns. In more detail, each sweep creates ncomp lines representing samples $(\Sigma_h^{-1})^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $(\Sigma_h^{-1})^{(sw)}$ are written in the columns of the file: the entries in the first $p$ columns of the file are those in the first column (or row) of $(\Sigma_h^{-1})^{(sw)}$, while the entries in the last $p$ columns of the file are those in the last column (or row) of $(\Sigma_h^{-1})^{(sw)}$.
nuh.txtthis file contains samples from the posteriors of the $p$-dimensional covariance vectors $\nu_h, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p$ columns. In more detail, each sweep creates ncomp lines representing samples $\nu_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $\nu_h^{(sw)}$ are written in the columns of the file.
muh.txtthis file contains samples from the posteriors of the $p$-dimensional mean vectors $\mu_h, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p$ columns. In more detail, each sweep creates ncomp lines representing samples $\mu_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $\mu_h^{(sw)}$ are written in the columns of the file.
xih.txtthis file contains samples from the posteriors of parameters $\xi_h$, $h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and one or two columns, depending on the number of parameters in the selected $F(.;\lambda)$. Sweeps write in the file ncomp lines representing samples $\xi_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep.
alpha.txtthis file contains samples from the posterior of the concentration parameters $\alpha$. The file is arranged in (sweeps-burn) lines and one column, each line including one posterior sample.
compAlloc.txtthis file contains the allocations or configurations obtained at each iteration of the sampler. It consists of (sweeps-burn) lines, that represent the posterior samples, and $n$ columns, that represent the sampling units. Entries in this file range from 0 to $ncomp-1$.
nmembers.txtthis file contains (sweeps-burn) lines and ncomp columns, where the lines represent posterior samples while the columns represent the components or clusters. The entries represent the number of sampling units allocated to the components.
Updated.txtthis file contains (sweeps-burn) lines with the number of components updated at each iteration of the sampler.
PD.txtthis file contains samples from the posterior conditional distribution $Y|x,D$ described in Xpred. The file has (sweeps-burn)*npred lines, where npred is the number of rows in Xpred. That is, at each iteration of the sampler, one line for each `new' covariate vector $x$ is written. The columns of the file represent the possible values of $Y$, staring from zero and continuing to a max number.

Details

Function bnpglm returns samples from the posterior distributions of the parameters of the model: $$f(y_i,x_i) = \sum_{h=0}^{\infty} \pi_h f(y_i,x_i|\theta_h), \hspace{80pt} (1)$$ where $y_i$ is a univariate discrete response, $x_i$ is a $p$-dimensional vector of continuous covariates, and $\pi_h, h \geq 1,$ are obtained according to Sethuraman's (1994) stick-breaking construction: $\pi_1 = v_1$, and for $l \geq 2, \pi_l = v_l \prod_{j=1}^{l-1} (1-v_j)$, where $v_k$ are iid samples $v_k \sim$Beta $(1,\alpha), k \geq 1.$ The discrete responses $y_i$ are represented as discretized versions of continuous latent variables $y_i^*$. Observed discrete and continuous latent variables are connected by: $$y_{i} = q \iff c_{i,q-1} < y^*_{i} < c_{i,q},$$ where the cut-points are obtained as: $c_{i,-1} = -\infty$, while for $q \geq 0$, $c_{i,q} = c_{q}(\lambda_{i}) = \Phi^{-1}{F(q;\lambda_i)}.$ Here $\Phi(.)$ is the cumulative distribution function (cdf) of a standard normal variable and $F()$ denotes an appropriate cdf. Further, latent variables are assumed to independently follow a $N(0,1)$ distribution, where the mean and variance are restricted to be zero and one as they are non-identifiable by the data. Choices for $F()$ are described next. For counts, currently two options are supported. First, $F(.;\lambda_i)$ can be specified as the cdf of a Poisson$(H_i \xi_h)$ variable. Here $\lambda_i=(\xi_h,H_i)^T, \xi_h$ denotes the Poisson rate associated with cluster $h$, and $H_i$ the offset term associated with sampling unit $i$. Second, $F(.;\lambda_i)$ can be specified as the negative binomial cdf, where $\lambda=(\xi_{1h},\xi_{2h},H_i)^T$. For Binomial data, currently two options are supported. First, $F(.;\lambda_i)$ may be taken to be the cdf of a Binomial$(H_i,\xi_h)$ variable, with $\xi_h$ denoting the success probability of cluster $h$ and $H_i$ the number of trials associated with sampling unit $i$. Second, $F(.;\lambda_i)$ may be specified to be the beta-binomial cdf, where $\lambda=(\xi_{1h},\xi_{2h},H_i)^T$. Joint vectors $(y_i^{*},x_{i})$ are modeled utilizing Gaussian distributions. Then, with $\theta_h$ denoting model parameters associated with the $h$th cluster, the joint density $f(y_{i},x_{i}|\theta_h)$ takes the form $$f(y_{i},x_{i}|\theta_h) = \int_{c_{i,y_i-1}}^{c_{i,y_i}} N_{p+1}(y_{i}^{*},x_{i}|\mu_{h},C_h) dy_{i}^{*},$$ where $\mu_h$ and $C_h$ denote the mean vector and covariance matrix, respectively. The joint distribution of the latent variable $y_i^{*}$ and the covariates $x_{i}$ is $$(y_{i}^{*},x_{i}^T)^T|\theta_h \sim N_{p+1}\left( \begin{array}{ll} \left( \begin{array}{l} 0 \ \mu_h \ \end{array} \right), & C_h=\left[ \begin{array}{ll} 1 & \nu_h^T \ \nu_h & \Sigma_h \ \end{array} \right] \end{array}\right),$$ where $\nu_h$ denotes the vector of covariances cov$(y_{i}^{*},x_{i}|\theta_h)$. Sampling from the posterior of constrained covariance matrix $C_h$ is done using methods similar to those of McCulloch et al. (2000). Specifically, the conditional $x_{i}|y_{i}^{*} \sim N_{p}(\mu_h+y_{i}^{*}\nu_h, B_h = \Sigma_h - \nu_h \nu_h^T)$ simplifies matters as there are no constraints on matrix $B_h$ (other than positive definiteness). Given priors for $B_h$ and $\nu_h$, it is easy to sample from their posteriors, and thus obtain samples from the posterior of $\Sigma_h=B_h+\nu_h \nu_h^T$. Specification of the prior distributions:

Define$T_h=B_h^{-1} = (\Sigma_{h} - \nu_h \nu_h^T)^{-1}, h \geq 1$. We specify that a priori$T_h \sim$Wishart$_{p}(V,$Vdf$)$, where$V$is a$p \times p$scale matrix and Vdf is a scalar degrees of freedom parameter. Default values are:$V = I_{p}/p$and Vdf$=p$, however, these can be changed using argumentsVand Vdf.
The assumed prior for$\nu_h$is$N_p(\mu_{\nu},\Sigma_{\nu}), h \geq 1$, with default vaules$\mu_{\nu}=0$and$\Sigma_{\nu} = I_{p}$. ArgumentsMu.nuandSigma.nuallow the user to change the default values.
A priori$\mu_{h} \sim N_p(\mu_{\mu},\Sigma_{\mu}), h \geq 1$. Here the default values are$\mu_{\mu} = \bar{x}$where$\bar{x}$denotes the sample mean of the covariates, and$\Sigma_{\mu} = D$where$D$denotes a diagonal matrix with diagonal elements equal to the square of the observed range of the covariates. ArgumentsMu.muandSigma.muallow the user to change the default values.
For count data, withfamily="poisson", a priori we take$\xi_{h} \sim$Gamma$(\alpha_{\xi},\beta_{\xi}), h \geq 1$. The default values are$\alpha_{\xi}=1.0,\beta_{\xi}=0.1$, that define a Gamma distribution with mean$\alpha_{\xi}/\beta_{\xi}=10$and variance$\alpha_{\xi}/\beta_{\xi}^2=100.$Defaults can be altered using argumentsAlpha.xiandBeta.xi. For binomial data, withfamily="binomial", a priori we take$\xi_{h} \sim$Beta$(\alpha_{\xi},\beta_{\xi})$,$h \geq 1$. The default values are$\alpha_{\xi}=1.0,\beta_{\xi}=1.0$, that define a uniform distribution. Defaults can be altered using argumentsAlpha.xiandBeta.xi. For both count data withfamily="negative binomial"and binomial data withfamily="beta binomial", a priori we take$\xi_{jh} \sim$Gamma$(\alpha_{j\xi},\beta_{j\xi})$,$j=1,2, h \geq 1$. The default values are$\alpha_{j\xi}=1.0,\beta_{j\xi}=0.1$. Default values for${\alpha_{j\xi}: j=1,2}$can be altered using argumentAlpha.xi, and default values for${\beta_{j\xi}: j=1,2}$can be altered using argumentBeta.xi.
The concentration parameter$\alpha$is assigned a Gamma$(\alpha_{\alpha},\beta_{\alpha})$prior over the range$(c_{\alpha},\infty)$, that is,$f(\alpha) \propto \alpha^{\alpha_{\alpha}-1} \exp{-\alpha \beta_{\alpha}} I[\alpha > c_{\alpha}]$, where$I[.]$is the indicator function. The default values are$\alpha_{\alpha}=2.0, \beta_{\alpha}=4.0$, and$c_{\alpha}=0.25$. Users can alter the default using using argumentsAlpha.alpha,Beta.alphaandTurnc.alpha.

References

McCulloch, R. E., Polson, N. G., & Rossi, P. E. (2000). A Bayesian analysis of the multinomial probit model with fully identified parameters. Journal of Econometrics, 99(1), 173-193. Papageorgiou, G., Richardson, S. and Best, N. (2014). Bayesian nonparametric models for spatially indexed data of mixed type. Papaspiliopoulos, O. (2008). A note on posterior sampling from Dirichlet mixture models. Technical report, University of Warwick. Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639-650. Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics Simulation and Computation, 36(1), 45-54.

Examples

Run this code

# Bayesian nonparametric GLM with Binomial response Y and one predictor X
data(simD)
pred<-seq(with(simD,min(X))+0.1,with(simD,max(X))-0.1,length.out=30)
npred<-length(pred)
# fit1 and fit2 define the same model but with different numbers of
# components and posterior samples
fit1 <- bnpglm(cbind(Y,(E-Y))~X, family="binomial", data=simD, ncomp=30, sweeps=150,
               burn=100, Xpred=pred, offsetPred=rep(30,npred))
fit2 <- bnpglm(cbind(Y,(E-Y))~X, family="binomial", data=simD, ncomp=50, sweeps=5000,
               burn=1000, Xpred=pred, offsetPred=rep(30,npred))
plot(with(simD,X),with(simD,Y)/with(simD,E))
lines(pred,fit2$medianReg,col=3,lwd=2)

Run the code above in your browser using DataLab