bnpglm(formula,family=poisson,data,offset,sampler="slice",WorkingDir,
ncomp,sweeps,burn,seed,V,Vdf,Mu.nu,Sigma.nu,Mu.mu,Sigma.mu,
Alpha.gamma,Beta.gamma,Alpha.alpha,Beta.alpha,Turnc.alpha,
Xpred,offsetPred,...)y ~ x.family=poisson and
family=binomial are the current options.sampler="slice" which
implements the slice sampler (Walker, 2007; Papaspiliopoulos, 2008) and
sampler="truncated" which proceeds
sampler="slice" is chosen, ncomp needs to be specified as it is used in the initialization processsweeps-burn
samples will be written.Xpred. offsetPred is a vector of
length equal to the rows of Xpred. If family=poisson, its entries are the associated
bnpglm returns the following:Xpred is specified, the function returns the conditional expectation of the response given each new covariate $x$.Xpred is specified, the function returns the conditional median of the response given each new covariate $x$.Xpred is specified, the function returns the conditional first quantile of the response given each new covariate $x$.Xpred is specified, the function returns the conditional third quantile of the response given each new covariate $x$.Xpred is specified, the function returns the conditional mode of the response given each new covariate $x$.bnpglm creates files where the posterior samples are written. These files are (with all file names
preceded by `BNSP.'):(sweeps-burn)*ncomp lines and $p^2$ columns. In more detail, each sweep creates ncomp lines representing samples $T_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of
$T_h^{(sw)}$ are written in the columns of the file: the entries in the first $p$ columns of the file are those in the first column (or row) of $T_h^{(sw)}$, while the entries in the last $p$ columns of the file are those in the last column (or row) of $T_h^{(sw)}$.(sweeps-burn)*ncomp lines and $p^2$ columns. In more detail, each sweep creates ncomp lines representing samples $\Sigma_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of
$\Sigma_h^{(sw)}$ are written in the columns of the file: the entries in the first $p$ columns of the file are those in the first column (or row) of $\Sigma_h^{(sw)}$, while the entries in the last $p$ columns of the file are those in the last column (or row) of $\Sigma_h^{(sw)}$.(sweeps-burn)*ncomp lines and $p^2$ columns. In more detail, each sweep creates ncomp lines representing samples $(\Sigma_h^{-1})^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $(\Sigma_h^{-1})^{(sw)}$ are written in the columns of the file: the entries in the first $p$ columns of the file are those in the first column (or row) of $(\Sigma_h^{-1})^{(sw)}$, while the entries in the last $p$ columns of the file are those in the last column (or row) of $(\Sigma_h^{-1})^{(sw)}$.(sweeps-burn)*ncomp lines and $p$ columns. In more detail, each sweep creates ncomp lines representing samples $\nu_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $\nu_h^{(sw)}$ are written in the columns of the file.(sweeps-burn)*ncomp lines and $p$ columns. In more detail, each sweep creates ncomp lines representing samples $\mu_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep. The elements of $\mu_h^{(sw)}$ are written in the columns of the file.(sweeps-burn)*ncomp lines and one column. Sweeps write in the file ncomp lines representing samples $\gamma_h^{(sw)}, h=1,\dots,ncomp$, where superscript $sw$ represents a particular sweep.(sweeps-burn) lines and one column, each line including one posterior sample.(sweeps-burn) lines, that represent the posterior samples, and $n$ columns, that represent the sampling units. Entries in this file range from 0 to $ncomp-1$.(sweeps-burn) lines and ncomp columns, where the lines represent posterior samples while the columns represent the components or clusters. The entries represent the number of sampling units allocated to the components.(sweeps-burn) lines with the number of components updated at each iteration of the sampler.Xpred.
The file has (sweeps-burn)*npred lines, where npred is the number of rows in Xpred. That is,
at each iteration of the sampler, one line for each `new' covariate vector $x$ is written. The columns of the file represent
the possible values of $Y$, staring from zero and continuing to a max number.bnpglm returns samples from the posterior distributions of the parameters of the model:
$$f(y_i,x_i) = \sum_{h=0}^{\infty} \pi_h f(y_i,x_i|\theta_h), \hspace{80pt} (1)$$
where $y_i$ is a univariate response with distribution that belongs in the exponential family,
$x_i$ is a $p$-dimensional vector of continuous covariates, and $\pi_h, h \geq 1,$ are obtained according to Sethuraman's (1994) stick-breaking construction: $\pi_1 = v_1$, and for $l \geq 2, \pi_l = v_l \prod_{j=1}^{l-1} (1-v_j)$, where $v_k$ are iid samples $v_k \sim$Beta$(1,\alpha), k \geq 1.$
The discrete responses $y_i$ are represented as discretized versions of continuous latent variables $y_i^*$. Observed discrete and continuous latent variables are connected by:
$$y_{i} = q \iff c_{h,q-1} < y^*_{i} < c_{h,q},$$
where the cut-points are obtained as: $c_{h,-1} = -\infty$,
while for $q \geq 0$, $c_{h,q} = c_{q}(\gamma_{h},H_i) = \Phi^{-1}{F(q;\gamma_{h},H_i)}.$ Here $\Phi(.)$
is the cumulative distribution function (cdf) of a standard normal variable,
and $F(.;\gamma,H)$ denotes an appropriate cdf. For instance, for modeling count data,
$F(.;\gamma,H)$ denotes the cdf of a Poisson$(H\gamma)$ variable, where $H$ denotes the offset term,
while for modeling Binomial data, $F(.;\gamma,H)$ denotes the cdf of a
Binomial$(H,\gamma)$ variable, where $H$ denotes the number of trials. Further, latent variables are assumed to independently follow a $y_i^* \sim N(0,1)$ distribution, where the mean and variance are restricted to be zero and one as they are non-identifiable by the data.
Joint vectors $(y_i^{*},x_{i})$ are modeled utilizing Gaussian distributions. Then, with $\theta_h$ denoting model parameters associated with the $h$th component, the joint density $f(y_{i},x_{i}|\theta_h)$ takes the form
$$f(y_{i},x_{i}|\theta_h) = \int_{c_{i,y_i-1}}^{c_{i,y_i}} N_{p+1}(y_{i}^{*},x_{i}|\mu_{h},C_h) dy_{i}^{*},$$
where $\mu_h$ and $C_h$ denote the mean vector and covariance matrix, respectively.
The joint distribution of the latent variable $y_i^{*}$ and the covariates $x_{i}$ is
$$(y_{i}^{*},x_{i}^T)^T|\theta_h \sim N_{p+1}\left(
\begin{array}{ll}
\left(
\begin{array}{l}
0 \
\mu_h \
\end{array}
\right),
&
C_h=\left[
\begin{array}{ll}
1 & \nu_h^T \
\nu_h & \Sigma_h \
\end{array}
\right]
\end{array}\right),$$
where $\nu_h$ denotes the vector of covariances cov$(y_{i}^{*},x_{i}|\theta_h)$.
Sampling from the posterior of constrained covariance matrix $C_h$
is done using methods similar to those of McCulloch et al. (2000).
Specifically, the conditional $x_{i}|y_{i}^{*} \sim
N_{p}(\mu_h+y_{i}^{*}\nu_h, B_h = \Sigma_h - \nu_h \nu_h^T)$ simplifies matters as there
are no constraints on matrix $B_h$ (other than positive definiteness).
Given priors for $B_h$ and $\nu_h$, it is easy to sample from their posteriors, and thus obtain samples from the posterior of $\Sigma_h=B_h+\nu_h \nu_h^T$.
Specification of the prior distributions:
Vand
Vdf.Mu.nuandSigma.nuallow the user to change the default values.Mu.muandSigma.muallow the user to change the default values.family=poisson, a priori we take$\gamma_{h} \sim$Gamma$(\alpha_{\gamma},\beta_{\gamma}), h \geq 1$.
The default values are$\alpha_{\gamma}=1.0,\beta_{\gamma}=0.1$, that define a Gamma distribution
with mean$\alpha_{\gamma}/\beta_{\gamma}=10$and variance$\alpha_{\gamma}/\beta_{\gamma}^2=100.$For binomial data, withfamily=binomial, a priori we take$\gamma_{h} \sim$Beta$(\alpha_{\gamma},\beta_{\gamma}), h \geq 1$.
The default values are$\alpha_{\gamma}=1.0,\beta_{\gamma}=1.0$, that define a uniform distribution.
Users can alter the default using using argumentsAlpha.gammaandBeta.gamma.Alpha.alpha,Beta.alphaandTurnc.alpha.# Bayesian nonparametric GLM with Binomial response Y and one predictor X
data(simD)
pred<-seq(with(simD,min(X))+0.1,with(simD,max(X))-0.1,length.out=30)
npred<-length(pred)
# fit1 and fit2 define the same model but with different numbers of
# components and posterior samples
fit1 <- bnpglm(cbind(Y,(E-Y))~X, family=binomial, data=simD, ncomp=30, sweeps=150,
burn=100, Xpred=pred, offsetPred=rep(30,npred))
fit2 <- bnpglm(cbind(Y,(E-Y))~X, family=binomial, data=simD, ncomp=50, sweeps=5000,
burn=1000, Xpred=pred, offsetPred=rep(30,npred))
plot(X,Y/E)
lines(pred,fit2$medianReg,col=3,lwd=2)Run the code above in your browser using DataLab