Learn R Programming

BNSP (version 1.0.0)

bnpglm: Bayesian nonparametric generalized linear models

Description

Fits Dirichlet Process mixtures of joint response-covariate models, where the covariates are continuous while the discrete response is represented utilizing continuous latent variables. See `Details' section for a full model description.

Usage

bnpglm(formula,family=poisson,data,offset,sampler="slice",WorkingDir,
       ncomp,sweeps,burn,seed,V,Vdf,Mu.nu,Sigma.nu,Mu.mu,Sigma.mu,
       Alpha.gamma,Beta.gamma,Alpha.alpha,Beta.alpha,Turnc.alpha,
       Xpred,offsetPred,...)

Arguments

formula
a formula defining the response and the covariates e.g. y ~ x.
family
a description of the distribution of the response variable. family=poisson and family=binomial are the current options.
data
an optional data frame, list or environment (or object coercible by `as.data.frame' to a data frame) containing the variables in the model. If not found in `data', the variables are taken from `environment(formula)'.
offset
this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be `NULL' or a numeric vector of length equal to the sample size. One `offset' term can be included i
sampler
the MCMC algorithm to be utilized. The two options are sampler="slice" which implements the slice sampler (Walker, 2007; Papaspiliopoulos, 2008) and sampler="truncated" which proceeds
WorkingDir
a directory to store files with the posterior samples of all models parameters. If a directory is not provided, files are not written.
ncomp
number of mixture components. Defines where the countable mixture of densities [in (1) below] is truncated. Even if sampler="slice" is chosen, ncomp needs to be specified as it is used in the initialization process
sweeps
integer number of posterior samples required.
burn
integer number of samples in the burn-in period. In the files described below, sweeps-burn samples will be written.
seed
optional seed for the random generator.
V
scale matrix $V$ of the prior Wishart distribution assigned to precision matrix $T_h$. See `Details' section.
Vdf
degrees of freedom Vdf of the prior Wishart distribution assigned to precision matrix $T_h$. See `Details' section.
Mu.nu
prior mean $\mu_{\nu}$ of the covariance vector $\nu_h$. See `Details' section.
Sigma.nu
prior covariance matrix $\Sigma_{\nu}$ of $\nu_h$. See `Details' section.
Mu.mu
prior mean $\mu_{\mu}$ of the mean vector $\mu_h$. See `Details' section.
Sigma.mu
prior covariance matrix $\Sigma_{\mu}$ of $\mu_h$. See `Details' section.
Alpha.gamma
shape parameter $\alpha_{\gamma}$ of the Gamma prior assigned to the Poisson rate $\gamma_h$. See `Details' section.
Beta.gamma
rate parameter $\beta_{\gamma}$ of the Gamma prior assigned to the Poisson rate $\gamma_h$. See `Details' section.
Alpha.alpha
shape parameter $\alpha_{\alpha}$ of the Gamma prior assigned to the concentration parameter $\alpha$. See `Details' section.
Beta.alpha
rate parameter $\beta_{\alpha}$ of the Gamma prior assigned to concentration parameter $\alpha$. See `Details' section.
Turnc.alpha
truncation point $c_{\alpha}$ of the Gamma prior assigned to concentration parameter $\alpha$. See `Details' section.
Xpred
A design matrix the rows of which include the covariates $x$ for which the conditional distribution of $Y|x,D$ (where $D$ denotes the data) is calculated. These are treated as `new' covariates i.e. they do not contribute to the
offsetPred
The offset term associated with the new covariates Xpred. offsetPred is a vector of length equal to the rows of Xpred. If family=poisson, its entries are the associated
...
Other options that will be ignored.

Value

  • Function bnpglm returns the following:
  • callthe matched call.
  • seedthe seed that was used (in case replication of the results is needed).
  • meanRegif Xpred is specified, the function returns the conditional expectation of the response given each new covariate $x$.
  • medianRegif Xpred is specified, the function returns the conditional median of the response given each new covariate $x$.
  • q1Regif Xpred is specified, the function returns the conditional first quantile of the response given each new covariate $x$.
  • q3Regif Xpred is specified, the function returns the conditional third quantile of the response given each new covariate $x$.
  • modeRegif Xpred is specified, the function returns the conditional mode of the response given each new covariate $x$.
  • Further, function bnpglm creates files where the posterior samples are written. These files are (with all file names preceded by `BNSP.'):
  • Th.txtthis file contains samples from the posteriors of the $p \times p$ precision matrices $T_h, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p^2$ columns. In more detail, each sweep creates ncomp lines representing samples $T_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $T_h^{(sw)}$ are written in the columns of the file: the entries in the first $p$ columns of the file are those in the first column (or row) of $T_h^{(sw)}$, while the entries in the last $p$ columns of the file are those in the last column (or row) of $T_h^{(sw)}$.
  • Sigmah.txtthis file contains samples from the posteriors of the $p \times p$ covariance matrices $\Sigma_h, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p^2$ columns. In more detail, each sweep creates ncomp lines representing samples $\Sigma_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $\Sigma_h^{(sw)}$ are written in the columns of the file: the entries in the first $p$ columns of the file are those in the first column (or row) of $\Sigma_h^{(sw)}$, while the entries in the last $p$ columns of the file are those in the last column (or row) of $\Sigma_h^{(sw)}$.
  • SigmahI.txtthis file contains samples from the posteriors of the $p \times p$ precision matrices $\Sigma_h^{-1}, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p^2$ columns. In more detail, each sweep creates ncomp lines representing samples $(\Sigma_h^{-1})^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $(\Sigma_h^{-1})^{(sw)}$ are written in the columns of the file: the entries in the first $p$ columns of the file are those in the first column (or row) of $(\Sigma_h^{-1})^{(sw)}$, while the entries in the last $p$ columns of the file are those in the last column (or row) of $(\Sigma_h^{-1})^{(sw)}$.
  • nuh.txtthis file contains samples from the posteriors of the $p$-dimensional covariance vectors $\nu_h, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p$ columns. In more detail, each sweep creates ncomp lines representing samples $\nu_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $\nu_h^{(sw)}$ are written in the columns of the file.
  • muh.txtthis file contains samples from the posteriors of the $p$-dimensional mean vectors $\mu_h, h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and $p$ columns. In more detail, each sweep creates ncomp lines representing samples $\mu_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $\mu_h^{(sw)}$ are written in the columns of the file.
  • gammah.txtthis file contains samples from the posteriors of the mean parameters $\gamma_h$, $h=1,2,\dots,ncomp$. The file is arranged in (sweeps-burn)*ncomp lines and one column. Sweeps write in the file ncomp lines representing samples $\gamma_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep.
  • alpha.txtthis file contains samples from the posterior of the concentration parameters $\alpha$. The file is arranged in (sweeps-burn) lines and one column, each line including one posterior sample.
  • compAlloc.txtthis file contains the allocations or configurations obtained at each iteration of the sampler. It consists of (sweeps-burn) lines, that represent the posterior samples, and $n$ columns, that represent the sampling units. Entries in this file range from 0 to $ncomp-1$.
  • nmembers.txtthis file contains (sweeps-burn) lines and ncomp columns, where the lines represent posterior samples while the columns represent the components or clusters. The entries represent the number of sampling units allocated to the components.
  • Updated.txtthis file contains (sweeps-burn) lines with the number of components updated at each iteration of the sampler.
  • PD.txtthis file contains samples from the posterior conditional distribution $Y|x,D$ described in Xpred. The file has (sweeps-burn)*npred lines, where npred is the number of rows in Xpred. That is, at each iteration of the sampler, one line for each `new' covariate vector $x$ is written. The columns of the file represent the possible values of $Y$, staring from zero and continuing to a max number.

Details

Function bnpglm returns samples from the posterior distributions of the parameters of the model: $$f(y_i,x_i) = \sum_{h=0}^{\infty} \pi_h f(y_i,x_i|\theta_h), \hspace{80pt} (1)$$ where $y_i$ is a univariate response with distribution that belongs in the exponential family, $x_i$ is a $p$-dimensional vector of continuous covariates, and $\pi_h, h \geq 1,$ are obtained according to Sethuraman's (1994) stick-breaking construction: $\pi_1 = v_1$, and for $l \geq 2, \pi_l = v_l \prod_{j=1}^{l-1} (1-v_j)$, where $v_k$ are iid samples $v_k \sim$Beta$(1,\alpha), k \geq 1.$ The discrete responses $y_i$ are represented as discretized versions of continuous latent variables $y_i^*$. Observed discrete and continuous latent variables are connected by: $$y_{i} = q \iff c_{h,q-1} < y^*_{i} < c_{h,q},$$ where the cut-points are obtained as: $c_{h,-1} = -\infty$, while for $q \geq 0$, $c_{h,q} = c_{q}(\gamma_{h},H_i) = \Phi^{-1}{F(q;\gamma_{h},H_i)}.$ Here $\Phi(.)$ is the cumulative distribution function (cdf) of a standard normal variable, and $F(.;\gamma,H)$ denotes an appropriate cdf. For instance, for modeling count data, $F(.;\gamma,H)$ denotes the cdf of a Poisson$(H\gamma)$ variable, where $H$ denotes the offset term, while for modeling Binomial data, $F(.;\gamma,H)$ denotes the cdf of a Binomial$(H,\gamma)$ variable, where $H$ denotes the number of trials. Further, latent variables are assumed to independently follow a $y_i^* \sim N(0,1)$ distribution, where the mean and variance are restricted to be zero and one as they are non-identifiable by the data. Joint vectors $(y_i^{*},x_{i})$ are modeled utilizing Gaussian distributions. Then, with $\theta_h$ denoting model parameters associated with the $h$th component, the joint density $f(y_{i},x_{i}|\theta_h)$ takes the form $$f(y_{i},x_{i}|\theta_h) = \int_{c_{i,y_i-1}}^{c_{i,y_i}} N_{p+1}(y_{i}^{*},x_{i}|\mu_{h},C_h) dy_{i}^{*},$$ where $\mu_h$ and $C_h$ denote the mean vector and covariance matrix, respectively. The joint distribution of the latent variable $y_i^{*}$ and the covariates $x_{i}$ is $$(y_{i}^{*},x_{i}^T)^T|\theta_h \sim N_{p+1}\left( \begin{array}{ll} \left( \begin{array}{l} 0 \ \mu_h \ \end{array} \right), & C_h=\left[ \begin{array}{ll} 1 & \nu_h^T \ \nu_h & \Sigma_h \ \end{array} \right] \end{array}\right),$$ where $\nu_h$ denotes the vector of covariances cov$(y_{i}^{*},x_{i}|\theta_h)$. Sampling from the posterior of constrained covariance matrix $C_h$ is done using methods similar to those of McCulloch et al. (2000). Specifically, the conditional $x_{i}|y_{i}^{*} \sim N_{p}(\mu_h+y_{i}^{*}\nu_h, B_h = \Sigma_h - \nu_h \nu_h^T)$ simplifies matters as there are no constraints on matrix $B_h$ (other than positive definiteness). Given priors for $B_h$ and $\nu_h$, it is easy to sample from their posteriors, and thus obtain samples from the posterior of $\Sigma_h=B_h+\nu_h \nu_h^T$. Specification of the prior distributions:
  1. Define$T_h=B_h^{-1} = (\Sigma_{h} - \nu_h \nu_h^T)^{-1}, h \geq 1$. We specify that a priori$T_h \sim$Wishart$_{p}(V,$Vdf$)$, where$V$is a$p \times p$scale matrix and Vdf is a scalar degrees of freedom parameter. Default values are:$V = I_{p}/p$and Vdf$=p$, however, these can be changed using argumentsVand Vdf.
  2. The assumed prior is$\nu_h \sim N_p(\mu_{\nu},\Sigma_{\nu}), h \geq 1$, with default vaules$\mu_{\nu}=0$and$\Sigma_{\nu} = I_{p}$. ArgumentsMu.nuandSigma.nuallow the user to change the default values.
  3. A priori$\mu_{h} \sim N_p(\mu_{\mu},\Sigma_{\mu}), h \geq 1$. Here the default values are$\mu_{\mu} = \bar{x}$where$\bar{x}$denotes the sample mean of the covariates, and$\Sigma_{\mu} = D$where$D$denotes a diagonal matrix with diagonal elements equal to the square of the observed range of the covariates. ArgumentsMu.muandSigma.muallow the user to change the default values.
  4. For count data, withfamily=poisson, a priori we take$\gamma_{h} \sim$Gamma$(\alpha_{\gamma},\beta_{\gamma}), h \geq 1$. The default values are$\alpha_{\gamma}=1.0,\beta_{\gamma}=0.1$, that define a Gamma distribution with mean$\alpha_{\gamma}/\beta_{\gamma}=10$and variance$\alpha_{\gamma}/\beta_{\gamma}^2=100.$For binomial data, withfamily=binomial, a priori we take$\gamma_{h} \sim$Beta$(\alpha_{\gamma},\beta_{\gamma}), h \geq 1$. The default values are$\alpha_{\gamma}=1.0,\beta_{\gamma}=1.0$, that define a uniform distribution. Users can alter the default using using argumentsAlpha.gammaandBeta.gamma.
  5. The concentration parameter$\alpha$is assigned a Gamma$(\alpha_{\alpha},\beta_{\alpha})$prior over the range$(c_{\alpha},\infty)$, that is,$f(\alpha) \propto \alpha^{\alpha_{\alpha}-1} \exp{-\alpha \beta_{\alpha}} I[\alpha > c_{\alpha}]$, where$I[.]$is the indicator function. The default values are$\alpha_{\alpha}=2.0, \beta_{\alpha}=4.0$, and$c_{\alpha}=0.25$. Users can alter the default using using argumentsAlpha.alpha,Beta.alphaandTurnc.alpha.

References

McCulloch, R. E., Polson, N. G., & Rossi, P. E. (2000). A Bayesian analysis of the multinomial probit model with fully identified parameters. Journal of Econometrics, 99(1), 173-193. Papageorgiou, G., Richardson, S. and Best, N. (2014). Bayesian nonparametric models for spatially indexed data of mixed type. Papaspiliopoulos, O. (2008). A note on posterior sampling from Dirichlet mixture models. Technical report, University of Warwick. Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639-650. Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics Simulation and Computation, 36(1), 45-54.

Examples

Run this code
# Bayesian nonparametric GLM with Binomial response Y and one predictor X
data(simD)
pred<-seq(with(simD,min(X))+0.1,with(simD,max(X))-0.1,length.out=30)
npred<-length(pred)
# fit1 and fit2 define the same model but with different numbers of
# components and posterior samples
fit1 <- bnpglm(cbind(Y,(E-Y))~X, family=binomial, data=simD, ncomp=30, sweeps=150,
               burn=100, Xpred=pred, offsetPred=rep(30,npred))
fit2 <- bnpglm(cbind(Y,(E-Y))~X, family=binomial, data=simD, ncomp=50, sweeps=5000,
               burn=1000, Xpred=pred, offsetPred=rep(30,npred))
plot(X,Y/E)
lines(pred,fit2$medianReg,col=3,lwd=2)

Run the code above in your browser using DataLab