Learn R Programming

BNSP (version 1.0.2)

bnpglm: Bayesian nonparametric generalized linear models

Description

Fits Dirichlet process mixtures of joint response-covariate models, where the covariates are continuous while the discrete responses are represented utilizing continuous latent variables. See `Details' section for a full model description.

Usage

bnpglm(formula,family,data,offset,sampler="slice",WorkingDir,
       ncomp,sweeps,burn,seed,V,Vdf,Mu.nu,Sigma.nu,Mu.mu,Sigma.mu,
       Alpha.xi,Beta.xi,Alpha.alpha,Beta.alpha,Turnc.alpha,
       Xpred,offsetPred,...)

Arguments

formula
a formula defining the response and the covariates e.g. y ~ x.
family
a description of the distribution of the response variable. Currently four options are supported: "poisson", "binomial", "negative binomial" and "beta binomial".
data
an optional data frame, list or environment (or object coercible by `as.data.frame' to a data frame) containing the variables in the model. If not found in `data', the variables are taken from `environment(formula)'.
offset
this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be `NULL' or a numeric vector of length equal to the sample size. One `offset' term can be included i
sampler
the MCMC algorithm to be utilized. The two options are sampler="slice" which implements a slice sampler (Walker, 2007; Papaspiliopoulos, 2008) and sampler="truncated" which proceeds
WorkingDir
a directory to store files with the posterior samples of all models parameters. If a directory is not provided, files are created in the current directory and removed when the sampler completes.
ncomp
number of mixture components. Defines where the countable mixture of densities [in (1) below] is truncated. Even if sampler="slice" is chosen, ncomp needs to be specified as it is used in the initialization process
sweeps
integer number of posterior samples required.
burn
integer number of samples in the burn-in period. In the files described below, sweeps-burn samples will be written.
seed
optional seed for the random generator.
V
optional scale matrix $V$ of the prior Wishart distribution assigned to precision matrix $T_h$. See `Details' section.
Vdf
optional degrees of freedom Vdf of the prior Wishart distribution assigned to precision matrix $T_h$. See `Details' section.
Mu.nu
optional prior mean $\mu_{\nu}$ of the covariance vector $\nu_h$. See `Details' section.
Sigma.nu
optional prior covariance matrix $\Sigma_{\nu}$ of $\nu_h$. See `Details' section.
Mu.mu
optional prior mean $\mu_{\mu}$ of the mean vector $\mu_h$. See `Details' section.
Sigma.mu
optional prior covariance matrix $\Sigma_{\mu}$ of $\mu_h$. See `Details' section.
Alpha.xi
an optional parameter that depends on the specified family.
  1. Iffamily="poisson", this argument is parameter$\alpha_{\xi}$of the prior of the Poisson rate:$\xi \sim$Gamma($\alpha_{\xi},\beta_{\xi}$).
  2. Iffam
Beta.xi
an optional parameter that depends on the specified family.
  1. Iffamily="poisson", this argument is parameter$\beta_{\xi}$of the prior of the Poisson rate:$\xi \sim$Gamma($\alpha_{\xi},\beta_{\xi}$).
  2. Iffamil
Alpha.alpha
optional shape parameter $\alpha_{\alpha}$ of the Gamma prior assigned to the concentration parameter $\alpha$. See `Details' section.
Beta.alpha
optional rate parameter $\beta_{\alpha}$ of the Gamma prior assigned to concentration parameter $\alpha$. See `Details' section.
Turnc.alpha
optional truncation point $c_{\alpha}$ of the Gamma prior assigned to concentration parameter $\alpha$. See `Details' section.
Xpred
an optional design matrix the rows of which include the covariates $x$ for which the conditional distribution of $Y|x,D$ (where $D$ denotes the data) is calculated. These are treated as `new' covariates i.e. they do not contrib
offsetPred
the offset term associated with the new covariates Xpred. offsetPred is a vector of length equal to the rows of Xpred. If family=poisson, its entries are the associated
...
Other options that will be ignored.

Value

  • Function bnpglm returns the following:
  • callthe matched call.
  • seedthe seed that was used (in case replication of the results is needed).
  • meanRegif Xpred is specified, the function returns the conditional expectation of the response given each new covariate $x$.
  • medianRegif Xpred is specified, the function returns the conditional median of the response given each new covariate $x$.
  • q1Regif Xpred is specified, the function returns the conditional first quantile of the response given each new covariate $x$.
  • q3Regif Xpred is specified, the function returns the conditional third quantile of the response given each new covariate $x$.
  • modeRegif Xpred is specified, the function returns the conditional mode of the response given each new covariate $x$.
  • Further, function bnpglm creates files where the posterior samples are written. These files are (with all file names preceded by `BNSP.'):
  • Th.txtthis file contains samples from the posteriors of the $p \times p$ precision matrices $T_h, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p^2$ columns. In more detail, each sweep creates ncomp lines representing samples $T_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $T_h^{(sw)}$ are written in the columns of the file: the entries in the first $p$ columns of the file are those in the first column (or row) of $T_h^{(sw)}$, while the entries in the last $p$ columns of the file are those in the last column (or row) of $T_h^{(sw)}$.
  • Sigmah.txtthis file contains samples from the posteriors of the $p \times p$ covariance matrices $\Sigma_h, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p^2$ columns. In more detail, each sweep creates ncomp lines representing samples $\Sigma_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $\Sigma_h^{(sw)}$ are written in the columns of the file: the entries in the first $p$ columns of the file are those in the first column (or row) of $\Sigma_h^{(sw)}$, while the entries in the last $p$ columns of the file are those in the last column (or row) of $\Sigma_h^{(sw)}$.
  • SigmahI.txtthis file contains samples from the posteriors of the $p \times p$ precision matrices $\Sigma_h^{-1}, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p^2$ columns. In more detail, each sweep creates ncomp lines representing samples $(\Sigma_h^{-1})^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $(\Sigma_h^{-1})^{(sw)}$ are written in the columns of the file: the entries in the first $p$ columns of the file are those in the first column (or row) of $(\Sigma_h^{-1})^{(sw)}$, while the entries in the last $p$ columns of the file are those in the last column (or row) of $(\Sigma_h^{-1})^{(sw)}$.
  • nuh.txtthis file contains samples from the posteriors of the $p$-dimensional covariance vectors $\nu_h, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p$ columns. In more detail, each sweep creates ncomp lines representing samples $\nu_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $\nu_h^{(sw)}$ are written in the columns of the file.
  • muh.txtthis file contains samples from the posteriors of the $p$-dimensional mean vectors $\mu_h, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p$ columns. In more detail, each sweep creates ncomp lines representing samples $\mu_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $\mu_h^{(sw)}$ are written in the columns of the file.
  • xih.txtthis file contains samples from the posteriors of parameters $\xi_h$, $h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and one or two columns, depending on the number of parameters in the selected $F(.;\lambda)$. Sweeps write in the file ncomp lines representing samples $\xi_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep.
  • alpha.txtthis file contains samples from the posterior of the concentration parameters $\alpha$. The file is arranged in (sweeps-burn) lines and one column, each line including one posterior sample.
  • compAlloc.txtthis file contains the allocations or configurations obtained at each iteration of the sampler. It consists of (sweeps-burn) lines, that represent the posterior samples, and $n$ columns, that represent the sampling units. Entries in this file range from 0 to $ncomp-1$.
  • nmembers.txtthis file contains (sweeps-burn) lines and ncomp columns, where the lines represent posterior samples while the columns represent the components or clusters. The entries represent the number of sampling units allocated to the components.
  • Updated.txtthis file contains (sweeps-burn) lines with the number of components updated at each iteration of the sampler.
  • PD.txtthis file contains samples from the posterior conditional distribution $Y|x,D$ described in Xpred. The file has (sweeps-burn)*npred lines, where npred is the number of rows in Xpred. That is, at each iteration of the sampler, one line for each `new' covariate vector $x$ is written. The columns of the file represent the possible values of $Y$, staring from zero and continuing to a max number.

Details

Function bnpglm returns samples from the posterior distributions of the parameters of the model: $$f(y_i,x_i) = \sum_{h=0}^{\infty} \pi_h f(y_i,x_i|\theta_h), \hspace{80pt} (1)$$ where $y_i$ is a univariate discrete response, $x_i$ is a $p$-dimensional vector of continuous covariates, and $\pi_h, h \geq 1,$ are obtained according to Sethuraman's (1994) stick-breaking construction: $\pi_1 = v_1$, and for $l \geq 2, \pi_l = v_l \prod_{j=1}^{l-1} (1-v_j)$, where $v_k$ are iid samples $v_k \sim$Beta $(1,\alpha), k \geq 1.$ The discrete responses $y_i$ are represented as discretized versions of continuous latent variables $y_i^*$. Observed discrete and continuous latent variables are connected by: $$y_{i} = q \iff c_{i,q-1} < y^*_{i} < c_{i,q},$$ where the cut-points are obtained as: $c_{i,-1} = -\infty$, while for $q \geq 0$, $c_{i,q} = c_{q}(\lambda_{i}) = \Phi^{-1}{F(q;\lambda_i)}.$ Here $\Phi(.)$ is the cumulative distribution function (cdf) of a standard normal variable and $F()$ denotes an appropriate cdf. Further, latent variables are assumed to independently follow a $N(0,1)$ distribution, where the mean and variance are restricted to be zero and one as they are non-identifiable by the data. Choices for $F()$ are described next. For counts, currently two options are supported. First, $F(.;\lambda_i)$ can be specified as the cdf of a Poisson$(H_i \xi_h)$ variable. Here $\lambda_i=(\xi_h,H_i)^T, \xi_h$ denotes the Poisson rate associated with cluster $h$, and $H_i$ the offset term associated with sampling unit $i$. Second, $F(.;\lambda_i)$ can be specified as the negative binomial cdf, where $\lambda=(\xi_{1h},\xi_{2h},H_i)^T$. For Binomial data, currently two options are supported. First, $F(.;\lambda_i)$ may be taken to be the cdf of a Binomial$(H_i,\xi_h)$ variable, with $\xi_h$ denoting the success probability of cluster $h$ and $H_i$ the number of trials associated with sampling unit $i$. Second, $F(.;\lambda_i)$ may be specified to be the beta-binomial cdf, where $\lambda=(\xi_{1h},\xi_{2h},H_i)^T$. Joint vectors $(y_i^{*},x_{i})$ are modeled utilizing Gaussian distributions. Then, with $\theta_h$ denoting model parameters associated with the $h$th cluster, the joint density $f(y_{i},x_{i}|\theta_h)$ takes the form $$f(y_{i},x_{i}|\theta_h) = \int_{c_{i,y_i-1}}^{c_{i,y_i}} N_{p+1}(y_{i}^{*},x_{i}|\mu_{h},C_h) dy_{i}^{*},$$ where $\mu_h$ and $C_h$ denote the mean vector and covariance matrix, respectively. The joint distribution of the latent variable $y_i^{*}$ and the covariates $x_{i}$ is $$(y_{i}^{*},x_{i}^T)^T|\theta_h \sim N_{p+1}\left( \begin{array}{ll} \left( \begin{array}{l} 0 \ \mu_h \ \end{array} \right), & C_h=\left[ \begin{array}{ll} 1 & \nu_h^T \ \nu_h & \Sigma_h \ \end{array} \right] \end{array}\right),$$ where $\nu_h$ denotes the vector of covariances cov$(y_{i}^{*},x_{i}|\theta_h)$. Sampling from the posterior of constrained covariance matrix $C_h$ is done using methods similar to those of McCulloch et al. (2000). Specifically, the conditional $x_{i}|y_{i}^{*} \sim N_{p}(\mu_h+y_{i}^{*}\nu_h, B_h = \Sigma_h - \nu_h \nu_h^T)$ simplifies matters as there are no constraints on matrix $B_h$ (other than positive definiteness). Given priors for $B_h$ and $\nu_h$, it is easy to sample from their posteriors, and thus obtain samples from the posterior of $\Sigma_h=B_h+\nu_h \nu_h^T$. Specification of the prior distributions:
  1. Define$T_h=B_h^{-1} = (\Sigma_{h} - \nu_h \nu_h^T)^{-1}, h \geq 1$. We specify that a priori$T_h \sim$Wishart$_{p}(V,$Vdf$)$, where$V$is a$p \times p$scale matrix and Vdf is a scalar degrees of freedom parameter. Default values are:$V = I_{p}/p$and Vdf$=p$, however, these can be changed using argumentsVand Vdf.
  2. The assumed prior for$\nu_h$is$N_p(\mu_{\nu},\Sigma_{\nu}), h \geq 1$, with default vaules$\mu_{\nu}=0$and$\Sigma_{\nu} = I_{p}$. ArgumentsMu.nuandSigma.nuallow the user to change the default values.
  3. A priori$\mu_{h} \sim N_p(\mu_{\mu},\Sigma_{\mu}), h \geq 1$. Here the default values are$\mu_{\mu} = \bar{x}$where$\bar{x}$denotes the sample mean of the covariates, and$\Sigma_{\mu} = D$where$D$denotes a diagonal matrix with diagonal elements equal to the square of the observed range of the covariates. ArgumentsMu.muandSigma.muallow the user to change the default values.
  4. For count data, withfamily="poisson", a priori we take$\xi_{h} \sim$Gamma$(\alpha_{\xi},\beta_{\xi}), h \geq 1$. The default values are$\alpha_{\xi}=1.0,\beta_{\xi}=0.1$, that define a Gamma distribution with mean$\alpha_{\xi}/\beta_{\xi}=10$and variance$\alpha_{\xi}/\beta_{\xi}^2=100.$Defaults can be altered using argumentsAlpha.xiandBeta.xi. For binomial data, withfamily="binomial", a priori we take$\xi_{h} \sim$Beta$(\alpha_{\xi},\beta_{\xi})$,$h \geq 1$. The default values are$\alpha_{\xi}=1.0,\beta_{\xi}=1.0$, that define a uniform distribution. Defaults can be altered using argumentsAlpha.xiandBeta.xi. For both count data withfamily="negative binomial"and binomial data withfamily="beta binomial", a priori we take$\xi_{jh} \sim$Gamma$(\alpha_{j\xi},\beta_{j\xi})$,$j=1,2, h \geq 1$. The default values are$\alpha_{j\xi}=1.0,\beta_{j\xi}=0.1$. Default values for${\alpha_{j\xi}: j=1,2}$can be altered using argumentAlpha.xi, and default values for${\beta_{j\xi}: j=1,2}$can be altered using argumentBeta.xi.
  5. The concentration parameter$\alpha$is assigned a Gamma$(\alpha_{\alpha},\beta_{\alpha})$prior over the range$(c_{\alpha},\infty)$, that is,$f(\alpha) \propto \alpha^{\alpha_{\alpha}-1} \exp{-\alpha \beta_{\alpha}} I[\alpha > c_{\alpha}]$, where$I[.]$is the indicator function. The default values are$\alpha_{\alpha}=2.0, \beta_{\alpha}=4.0$, and$c_{\alpha}=0.25$. Users can alter the default using using argumentsAlpha.alpha,Beta.alphaandTurnc.alpha.

References

McCulloch, R. E., Polson, N. G., & Rossi, P. E. (2000). A Bayesian analysis of the multinomial probit model with fully identified parameters. Journal of Econometrics, 99(1), 173-193. Papageorgiou, G., Richardson, S. and Best, N. (2014). Bayesian nonparametric models for spatially indexed data of mixed type. Papaspiliopoulos, O. (2008). A note on posterior sampling from Dirichlet mixture models. Technical report, University of Warwick. Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639-650. Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics Simulation and Computation, 36(1), 45-54.

Examples

Run this code
# Bayesian nonparametric GLM with Binomial response Y and one predictor X
data(simD)
pred<-seq(with(simD,min(X))+0.1,with(simD,max(X))-0.1,length.out=30)
npred<-length(pred)
# fit1 and fit2 define the same model but with different numbers of
# components and posterior samples
fit1 <- bnpglm(cbind(Y,(E-Y))~X, family="binomial", data=simD, ncomp=30, sweeps=150,
               burn=100, Xpred=pred, offsetPred=rep(30,npred))
fit2 <- bnpglm(cbind(Y,(E-Y))~X, family="binomial", data=simD, ncomp=50, sweeps=5000,
               burn=1000, Xpred=pred, offsetPred=rep(30,npred))
plot(with(simD,X),with(simD,Y)/with(simD,E))
lines(pred,fit2$medianReg,col=3,lwd=2)

Run the code above in your browser using DataLab