dpweib: Dirichlet process mixture/Dependent Dirichlet process model for survival/competing risks data

Description

Use Dirichlet process mixture/dependent Dirichlet process Weibull model for survival data with/without competing risks. When regression covariates are present, the model is a dependent Dirichlet process model. For competing risks data we only consider two potential causes of events and the user can combine events of secondary interests. In competing risks regression model, the estimates provided focus on the primary cause (cause 1), and the user can switch the event indicator to get the estimates for the secondary cause.

Usage

dpweib(formula,data, high.pct = NULL, predtime = NULL, comp = FALSE,
alpha = 0.05, simultaneous = FALSE, burnin = 8000, iteration = 2000,
alpha00 = 1.354028, alpha0 = 0.03501257, lambda00 = 7.181247,
alphaalpha = 0.2, alphalambda = 0.1, a = 1, b = 1, gamma0 = 1, 
gamma1 = 1, thin = 10, betasl = 2.5, addgroup = 2)

Arguments

formula

A formula written in regular $y \sim x_1+x_2+ \dots +x_p$ regression format. $y$ is a Surv object for survival data (including interval censored data) and Hist object for competing risks data. The regression covaraites can be continuous or factors. Since the model is flexible enough, interaction terms are not necessary.

data

an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which dpweib is called.

high.pct

The estimated high percentile (95th) percentile of the data-generating distribution of the average population given by the user. If the user does not provide this value, we will look into the data. If there is no censoring, we take the 95th percentile of the observed data. If censoring takes less than 15% of the total observations, we use the maximum of the observed time. If the censoring takes more than 15%, we suggest a scaling parameter by first finding the time $t$ corresponding to the observed survival rate at the end of study from the plot of the median of the components (survmedian) generated by our LIO prior on a 0 to 10 scale, then set the scaling parameter to be the largest observation time multiplied by 10/t.

predtime

A vector given by the user to specify the time points where the inferences will be made. If the user does not provide it, we will take 40 time points located evenly from the beginning to the high.pct.

comp

A logical value indicating whether this is competing risks data or not. The default is FALSE.

alpha

$1-\alpha$ is the probability for constructing credible intervals. The default $\alpha$ is 0.05.

simultaneous

A logical value indicating whether to provide simultaneous credible intervals. The default is FALSE.

burnin

Number of burnin iterations. The default is 5000.

iteration

Number of iterations. The default is 5000.

alpha00

Parameter for the base distribution of $\lambda$ in non-competing risks data model and $\lambda_1$, $\lambda_2$ in competing risks data model. The default is 1.354028.

alpha0

Parameter for the base distribution of $\lambda$ in non-competing risks data model and $\lambda_1$, $\lambda_2$ in competing risks data model. The default is 0.03501257.

lambda00

Parameter for the base distribution of $\lambda$ in non-competing risks data model and $\lambda_1$, $\lambda_2$ in competing risks data model. The default is 7.181247.

alphaalpha

Parameter for the base distribution of $\alpha$ in non-competing risks data model and $\alpha_1$, $\alpha_2$ in competing risks data model. The default is 0.2.

alphalambda

Parameter for the base distribution of $\alpha$ in non-competing risks data model and $\alpha_1$, $\alpha_2$ in competing risks data model. THe default value is 0.1.

Parameter for the gamma prior of the concentration parameter of DP. The default is 1.

gamma0

Parameter for the base distribution of p in competing risks data model. The default value is 1.

gamma1

Parameter for the base distribution of p in competing risks data model. The default value is 1.

thin

Thinning. The default value is 10.

betasl

Parameter for the base distribution of the regression coefficients $\beta$ in non-competing risks data model and $\beta_1$ and $\beta_2$ in competing risks data model. The default value is 2.5.

addgroup

Number of new parameters proposed for each cluster assignment. The default is 2 (suggested by Neal).

Value

This function can generate 4 different kinds of output based on the data set given. They all share,

a vector, the cluster assignment in the last iteration, useful for the resumption of MCMC iteration

a vector, the number of observations in each cluster from the last iteration, useful for the resumption of MCMC iteration

emptybasket

only useful for the resumption of MCMC iteration

allbaskets

only useful for the resumption of MCMC iteration

ngrp

a vector, the number of clusters in each iteration, useful for the resumption of MCMC iteration

predtime

the time points where the inferences are made

high.pct

the scaling parameter of observations used in the model

usertime

a logic value, whether user provides time for estimation or not

$1-\alpha$ is the probability for constructing credible intervals.

simultaneous

Whether give simultaneous credible intervals.

For non-competing risks data, dpweib can generate two classes of output, dpm and ddp, for data with and without covariates separately. They both have

alpharec

a matrix, saved samples of $\alpha$s, the rows correspond to the iterations saved, the columns correspond to the observations

lambdarec

a matrix, saved samples of $\lambda$s, the rows correspond to the iterations saved, the columns correspond to the observations

lambda0rec

a matrix, saved samples of $\lambda_0$s, the rows correspond to the iterations saved, the columns correspond to the observations

lambdascaled

a matrix, saved samples of $\lambda$s under 0 to 10 scale, the rows correspond to the iterations saved, the columns correspond to the observations, only useful for the resumption of MCMC iteration

the left end point

the right end point

right censoring indicator

delta

exact observation indicator

For dpm output, it has

a matrix, the estimated survival function for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations

Spred

a vector, the estimated survival function at specified time points

Spredu

a vector, the estimated pointwise upper credible interval for survival function at specified time points

Spredl

a vector, the estimated pointwise lower credible interval for survival function at specified time points

a matrix, the estimated density function for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations

dpred

a vector, the estimated density function at specified time points

dpredu

a vector, the estimated pointwise upper credible interval for density function at specified time points

dpredl

a vector, the estimated pointwise lower credible interval for density function at specified time points

a matrix, the estimated hazard function for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations

hpred

a vector, the estimated hazard function at specified time points

hpredu

a vector, the estimated pointwise upper credible interval for hazard function at specified time points

hpredl

a vector, the estimated pointwise lower credible interval for hazard function at specified time points

When simultaneous is specified TRUE, the function also provides

Sbandu

a vector, the estimated simultaneous upper credible interval for survival function at specified time points

Sbandl

a vector, the estimated simultaneous lower credible interval for survival function at specified time points

dbandu

a vector, the estimated simultaneous upper credible interval for density function at specified time points

dbandl

a vector, the estimated simultaneous lower credible interval for density function at specified time points

hbandu

a vector, the estimated simultaneous upper credible interval for hazard function at specified time points

hbandl

a vector, the estimated simultaneous lower credible interval for hazard function at specified time points

For ddp output, it also has

betarec

a matrix, saved samples of $\beta$s, which is consist of horizontal-merged blocks. One block corresponds to one observation. Inside each block, the rows correspond to the iterations saved, the columns correspond to the covariates.

the covariate matrix

xmean

a vector, the mean for each covariate(including created binary dummy covariates)

xsd

a vector, the standized deviation for each covariate, if the covariate is binary, then it is set to be 0.5.(including created binary dummy covariates)

xscale

The matrix used to scale log hazard ratio

loghr

a matrix, the estimated log hazard ratio for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations

loghr.est

a vector, the estimated log hazard ratio at specified time points

loghru

a vector, the estimated pointwise upper credible interval for log hazard ratio at specified time points

loghrl

a vector, the estimated pointwise lower credible interval for log hazard ratio at specified time points

indicator

a vector, whether a covariate is binary

covnames

a vector, the names of covariates

When simultaneous is specified TRUE, the function also provides

loghrbandu

a vector, the estimated simultaneous upper credible interval for log hazard ratio at specified time points

loghrbandl

a vector, the estimated simultaneous lower credible interval for log hazard ratio at specified time points

For competing risks data, dpweib can generate two classes of output, dpmcomp and ddpcomp, for data with and without covariate separately. They both have

alpharec1

a matrix, saved samples of $\alpha_1$s, the rows correspond to the iterations saved, the columns correspond to the observations

lambdarec1

a matrix, saved samples of $\lambda_1$s, the rows correspond to the iterations saved, the columns correspond to the observations

lambda0rec1

a matrix, saved samples of $\lambda_{01}$s, the rows correspond to the iterations saved, the columns correspond to the observations

lambdascaled1

a matrix, saved samples of $\lambda_1$s under 0 to 10 scale, the rows correspond to the iterations saved, the columns correspond to the observations, only useful for the resumption of MCMC iteration

alpharec2

a matrix, saved samples of $\alpha_2$s, the rows correspond to the iterations saved, the columns correspond to the observations

lambdarec2

a matrix, saved samples of $\lambda_2$s, the rows correspond to the iterations saved, the columns correspond to the observations

lambda0rec2

a matrix, saved samples of $\lambda_{02}$s, the rows correspond to the iterations saved, the columns correspond to the observations

lambdascaled2

a matrix, saved samples of $\lambda_2$s under 0 to 10 scale, the rows correspond to the iterations saved, the columns correspond to the observations, only useful for the resumption of MCMC iteration

prec

a matrix, saved samples of $p$, the rows correspond to the iterations saved, the columns correspond to the observations

the observed time

event

the event indicator

For dpmcomp output, it has

CIF1

a matrix, the estimated cumulative incidence function for cause 1 for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations

CIF1.est

a vector, the estimated cumulative incidence function of cause 1 at specified time points

CIF1u

a vector, the estimated pointwise upper credible interval for cumulative incidence function of cause 1 at specified time points

CIF1l

a vector, the estimated pointwise lower credible interval for cumulative incidence function of cause 1 at specified time points

a matrix, the estimated cause-specific density function for cause 1 for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations

d1.est

a vector, the estimated cause-specific density function of cause 1 at specified time points

d1u

a vector, the estimated pointwise upper credible interval for cause-specific density function of cause 1 at specified time points

d1l

a vector, the estimated pointwise lower credible interval for cause-specific density function of cause 1 at specified time points

a matrix, the estimated subdistribution hazard function for cause 1 at specified time points, the columns correspond to time points, the rows correspond to saved iterations

h1.est

a vector, the estimated subdistribution hazard function of cause 1 at specified time points

h1u

a vector, the estimated pointwise upper credible interval for subdistribution hazard function of cause 1 at specified time points

h1l

a vector, the estimated pointwise lower credible interval for subdistribution hazard function of cause 1 at specified time points

CIF2

a matrix, the estimated cumulative incidence function for cause 2 for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations

CIF2.est

a vector, the estimated cumulative incidence function of cause 2 at specified time points

CIF2u

a vector, the estimated pointwise upper credible interval for cumulative incidence function of cause 2 at specified time points

CIF2l

a vector, the estimated pointwise lower credible interval for cumulative incidence function of cause 2 at specified time points

a matrix, the estimated cause-specific density function for cause 2 for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations

d2.est

a vector, the estimated cause-specific density function of cause 2 at specified time points

d2u

a vector, the estimated pointwise upper credible interval for cause-specific density function of cause 2 at specified time points

d2l

a vector, the estimated pointwise lower credible interval for cause-specific density function of cause 2 at specified time points

a matrix, the estimated subdistribution hazard function for cause 2 for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations

h2.est

a vector, the estimated subdistribution hazard function of cause 2 at specified time points

h2u

a vector, the estimated pointwise upper credible interval for subdistribution hazard function of cause 2 at specified time points

h2l

a vector, the estimated pointwise lower credible interval for subdistribution hazard function of cause 2 at specified time points

When simultaneous is specified TRUE, the function also provides

CIF1bandu

a vector, the estimated simultaneous upper credible interval for cumulative incidence function of cause 1 at specified time points

CIF1bandl

a vector, the estimated simultaneous lower credible interval for cumulative incidence function of cause 1 at specified time points

d1bandu

a vector, the estimated simultaneous upper credible interval for cause-specific density function of cause 1 at specified time points

d1bandl

a vector, the estimated simultaneous lower credible interval for cause-specific density function of cause 1 at specified time points

h1bandu

a vector, the estimated simultaneous upper credible interval for subdistribution hazard function of cause 1 at specified time points

h1bandl

a vector, the estimated simultaneous lower credible interval for subdistribution hazard function of cause 1 at specified time points

CIF2bandu

a vector, the estimated simultaneous upper credible interval for cumulative incidence function of cause 2 at specified time points

CIF2bandl

a vector, the estimated simultaneous lower credible interval for cumulative incidence function of cause 2 at specified time points

d2bandu

a vector, the estimated simultaneous upper credible interval for cause-specific density function of cause 2 at specified time points

d2bandl

a vector, the estimated simultaneous lower credible interval for cause-specific density function of cause 2 at specified time points

h2bandu

a vector, the estimated simultaneous upper credible interval for subdistribution hazard function of cause 2 at specified time points

h2bandl

a vector, the estimated simultaneous lower credible interval for subdistribution hazard function of cause 2 at specified time points

For ddpcomp output, it also has

betarec1

a matrix, saved samples of $\beta_1$s, which is consist of horizontal-merged blocks. One block corresponds to one observation. Inside each block, the rows correspond to the iterations saved, the columns correspond to the covariates.

betarec2

a matrix, saved samples of $\beta_2$s, which is consist of horizontal-merged blocks. One block corresponds to one observation. Inside each block, the rows correspond to the iterations saved, the columns correspond to the covariates.

xmean

a vector, the mean for each covariate(including created dummy covariates)

xsd

a vector, the standized deviation for each covariate, if the covariate is binary, then it is set to be 0.5(including created dummy covariates).

the covariate matrix

xscale

The matrix used to scale log hazard ratio

covnames

a vector, the names of covariates

loghr.est

the estimated log subdistribution hazard ratio at specified time points for cause 1

loghru

the estimated pointwise upper credible interval for log subdistribution hazard ratio at specified time points for cause 1

loghrl

the estimated pointwise lower credible interval for log subdistribution hazard ratio at specified time points for cause 1

indicator

a vector, whether a covariate is binary

When simultaneous is specified TRUE, the function also provides

loghrbandu

a vector, the estimated simultaneous upper credible interval for log subdistribution hazard ratio at specified time points

loghrbandl

a vector, the estimated simultaneous lower credible interval for log subdistribution hazard ratio at specified time points

Details

For no regression, no competing risks data, the function dpweib implements dirichlet process Weibull mixture model. The basic form of model is the following. $$ \begin{array}{rl} y_i|\alpha_i,\lambda_i&\sim Weib(t_i|\alpha_i,\lambda_i),\quad i=1,...,n\\ (\alpha_i,\lambda_i)|G&\sim G,\quad i=1,...,n\\ G&\sim DP(G_0,\nu)\\ G_0&=Ga(\lambda|\alpha_0,\lambda_0) I_{(f(\lambda),\infty)}(\alpha) Ga(\alpha_{\alpha},\lambda_{\alpha})\\ \lambda_0&\sim Ga(\alpha_{00},\lambda_{00})\\ \nu&\sim Ga(a,b)\\ \end{array}$$ where$f(\lambda)=max(0,\log\{\log(20)/\lambda\}/\log(25))$.

For regression data without competing risks, the method is a mixture of Cox model.

$$ \begin{array}{rl} y_i|\alpha_i,\lambda_i,\boldsymbol{\beta_i}, \mathbf{Z_i}&\sim Weib(y_i|\alpha_i,\lambda_i\exp(\mathbf{Z_i^T}\boldsymbol{\beta_i})),\quad i=1,...,n\\ (\alpha_i,\lambda_i,\boldsymbol{\beta_i})|G&\sim G,\quad i=1,...,n\\ G&\sim DP(G_0,\nu)\\ G_0&=Ga(\lambda|\alpha_0,\lambda_0) I_{(f(\lambda),u)}(\alpha) Ga(\alpha_{\alpha},\lambda_{\alpha}) q(\boldsymbol{\beta})\\ \lambda_0&\sim Ga(\alpha_{00},\lambda_{00})\\ \nu&\sim Ga(a,b)\\ \end{array}$$

The density function corresponding to this Weibull notation is $ p(y_i|\alpha_i,\lambda_i)=\lambda_i\alpha_i y_i^{\alpha_i-1}e^{-\lambda_i y_i^{\alpha_i}},\quad y_i>0,\quad \alpha_i>0,\quad \lambda_i>0. $ $[x]=Ga(\alpha,\lambda)$ denotes that the density function of $x$ is $\displaystyle\frac{\lambda^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-\lambda x}$, $\alpha>0$, $\lambda>0$, $x>0$. $q(\beta)$ is the base distribution for regression coefficients.The details of the choice of base distribution is described in our coming paper.

In competing risks data, the likelihood for each individual can be written as$$ L=\{f_1(t_i)\}^{I(c_i=1)}\{f_2(t_i)\}^{I(c_i=2)}\{1-F_1(t_i)-F_2(t_i)\}^{I(c_i=0)},$$ where $f_1(\cdot)$ and $f_2(\cdot)$ are the cause-specific density functions for cause 1 and 2 and survival function for the $i$th observation can be expressed as $1-F_1(t_i)-F_2(t_i)$. In order to model it, we introduce a parameter $p$, which is the cumulative incidence function of primary cause at $\infty$, $p=F_1(\infty)$. The likelihood can be written as $$ L=\{pd_1(t_i)\}^{I(c_i=1)}\{(1-p)d_2(t_i)\}^{I(c_i=2)}\{1-pD_1(t_i)-(1-p)D_2(t_i)\}^{I(c_i=0)} .$$

Here the $D_{1}$, $D_{2}$, $d_{1}$, $d_{2}$ are the normalized baseline cumulative incidence functions and cause-specific density functions and are modeled with Weibull mixtures as above, while $p$ is the normalizing parameter for the baseline distribution. When regression covariates are present in a competing risks data, we modify the above likelihood with respect to the value of covaraites, such that $$ F_1(t|\mathbf{Z},\boldsymbol{\beta_1},p) = 1-(1-pD_{01}(t))^{\exp(\mathbf{Z^T}\boldsymbol{\beta_1})}.$$ The cause-specific density function for cause 1 is$$ f_1(t|\mathbf{Z},\boldsymbol{\beta_1},p)=\exp(\mathbf{Z^T}\boldsymbol{\beta_1})[1-pD_{01}(t)]^{\exp(\mathbf{Z^T}\boldsymbol{\beta_1})-1}pd_{01}(t).$$ The model for the secondary cause is defined as $$F_2(t|\mathbf{Z},\boldsymbol{\beta_1},\boldsymbol{\beta_2},p)=(1-p)^{\exp(\mathbf{Z^T}\boldsymbol{\beta_1})} (1-(1-D_{02}(t))^{\exp(\mathbf{Z^T}\boldsymbol{\beta_2})}), $$ which leads to the cause-specific subdensity function for cause 2 as$$ f_2(t|\mathbf{Z},\boldsymbol{\beta_2},p)=(1-p)^{\exp(\mathbf{Z^T}\boldsymbol{\beta_1})}(1-D_{02}(t))^{\exp(\mathbf{Z^T}\boldsymbol{\beta_2})-1}\exp(\mathbf{Z^T}\boldsymbol{\beta_2})d_{02}(t).$$

Examples

Run this code

# NOT RUN {
library(survival)
library(DPWeibull)
data(veteran)

DPresult1<-dpweib(Surv(time,status)~1,data=veteran)
summary(DPresult1)
opar<-par(mfrow=c(1,3),
          mar=c(3.1, 3.1, 3.1, 5.1),
          mgp=c(2, 0.5, 0),
          oma=c(0, 0, 0, 4))
plot(DPresult1)
par(opar)

DPresult2<-dpweib(Surv(time,status)~factor(trt)+age,data=veteran)
summary(DPresult2)
opar<-par(mfrow=c(1,2),
          mar=c(3.1, 3.1, 3.1, 5.1),
          mgp=c(2, 0.5, 0),
          oma=c(0, 0, 0, 4))
plot(DPresult2)
par(opar)

newdata<-NULL
newdata$trt<-veteran$trt[c(1,70)]
newdata$age<-veteran$age[c(2,87)]
newdata<-data.frame(newdata)
DPpredict<-predict(DPresult2,newdata)
summary(DPpredict)
opar<-par(mfrow=c(2,3),
          mar=c(3.1, 3.1, 3.1, 5.1),
          mgp=c(2, 0.5, 0),
          oma=c(0, 0, 0, 4))
plot(DPpredict)
par(opar)

############################################################################
# Competing Risks Data
# Competing Risks Data
library(survival)
library(prodlim)
library(riskRegression)
library(DPWeibull)
data(Paquid)

Paquid<-Paquid[1:500,]
DPresult1<-dpweib(Hist(time, status)~1,data=Paquid,
                  predtime = seq(from=min(Paquid$time),to=max(Paquid$time),length=200))
opar<-par(mfrow=c(1,3),
          mar=c(3.1, 3.1, 3.1, 5.1),
          mgp=c(2, 0.5, 0),
          oma=c(0, 0, 0, 4))
plot(DPresult1)
par(opar)

DPresult2<-continue(DPresult1,simultaneous=TRUE)
summary(DPresult2)

DPresult3<-dpweib(Hist(time, status)~DSST+MMSE,data=Paquid,
                  predtime = seq(from=min(Paquid$time),to=max(Paquid$time),length=200))
summary(DPresult3)
opar<-par(mfrow=c(1,2),
          mar=c(3.1, 3.1, 3.1, 5.1),
          mgp=c(2, 0.5, 0),
          oma=c(0, 0, 0, 4))
plot(DPresult3)
par(opar)

newdata<-NULL
newdata$DSST<-Paquid$DSST[c(1,70)]
newdata$MMSE<-Paquid$MMSE[c(2,87)]
newdata<-data.frame(newdata)

DPpredict<-predict(DPresult3,newdata)
summary(DPpredict)
opar<-par(mfrow=c(2,3),
          mar=c(3.1, 3.1, 3.1, 5.1),
          mgp=c(2, 0.5, 0),
          oma=c(0, 0, 0, 4))
plot(DPpredict)
par(opar)

###############################################################

# An example of interval censored data
library(KMsurv)
library(survival)
library(DPWeibull)
data("bcdeter")

DPresult<-dpweib(Surv(lower, upper, type="interval2") ~ treat, data = bcdeter)
summary(DPresult)
plot(DPresult)
# }

Run the code above in your browser using DataLab