saBayes: Fits a non-parametric Bayesian model to source attribution data

Description

This function fits a non-parametric source attribution model for human cases of disease. It supports multiple types, sources, times and locations. The number of human cases for each type, time and location can follow a Poisson or Negative Binomial likelihood. $$y_{itl}\sim\textsf{Poisson}(\lambda_{itl})$$ or $$y_{itl}\sim\textsf{Negative Binomial}(\lambda_{itl},d)$$

Usage

saBayes(formula, time, location, type, data, priors, alpha_conc, prev, 
        likelihood_dist, n_iter, mcmc_params, initials, params_fix)

Arguments

formula

A formula object of the form y ~ x1 + x2 + ... + xm, where y is the name of the human cases column, and x1, ..., xm are the names of the source count columns in the data.

time

A formula object of the form ~t, where t is the name of the column containing the times in the data.

location

A formula object of the form ~l, where l is the name of the column containing the locations in the data.

type

A formula object of the form ~s, where s is the name of the column containing the (sub) types in the data.

data

Data frame with columns for human cases, source counts (one column for each source), time, location, and type. The data for the human cases and source counts must be integers. The data for the time, location and type columns must be factors. The source co

priors

A list giving the parameters of the priors for each parameter in the model with names a, q, alpha, r, and d. The priors for the source effect parameters and the relative prevalence parameter

alpha_conc

A single positive number giving the concentration parameter for the base distribution of the type effects: DP(Gamma(shape, rate), alpha_conc). See the vignette for further guidence on how to choose the concentration parameter.

An optional data frame with columns giving the prevalences (named value) and the times (named the name of the time column in the data). Prevalences must be between 0 and 1. Prevalences are the proportion of samples that were positive for any

likelihood_dist

Used to specify the likelihood distribution of the number of human cases from each type. Only Poisson ("pois") and Negative Binomial ("nbinom") are currently implemented. The model cannot have type effects if using a Negative Bin

n_iter

The total number of iterations if MCMC chain will run for.

mcmc_params

An optional list containing parameters relating to the mcmc chain. lll{ Item Description Default save_lambda If TRUE the lambda i and j parameters

initials

An optional list giving the starting values for the parameters. lll{ Parameter Description Default r A single positive number or a data frame with columns giving the prior values (named

params_fix

An optional list specifying whether to fix the r matrices (r), by default it is set to FALSE.

Value

Returns a list containing lists named posterior and acceptance_rate.
posteriorList with items named a, q, cluster, theta, r, d, li and lj.
lll{ Parameter Format and nesting Source effects (a) A nested list with a matrix (dimensions (n_iter, sources)) for each time and location. eg posterior$a[[t]][[l]][n,j] gives the value for a j for time t, location l at iteration n. Always returned. Type effects (q) A matrix (dimensions (n_iter, types)) containing the theta value for the cluster g that the type effect {qi} is assigned to in iteration n. eg posterior$q[n,i] gives the value for q i at iteration n. Always returned. Type effect cluster (cluster) A matrix containing the group each type effect was assigned to at each iteration (dimensions (n_iter, types)). eg posterior$cluster[n,i] gives the cluster for q i at iteration n. Returned if likelihood_dist == "pois". Theta (theta) A matrix containing the type effect parameters for each group (dimensions (n_iter, gmax)). eg posterior$theta[n,c] gives the theta value for cluster c at iteration n. Returned if likelihood_dist == "pois". r A nested list containing an array (with dimensions(types, sources, n_iter)) for each time. eg posterior$r[[t]][i,j,n] gives the value for r (type i, source j) at time t, iteration n. Returned if fix_r == FALSE. Dispersion parameter (d) A vector. Returned if likelihood_dist == "nbinom". Rate of human infection from type i (Lambda i: li) A nested list with a matrix (dimensions (n_iter, types)) for each time and location. eg posterior$li[[t]][[l]][n,i] gives the value for li i for time t, location l at iteration n. Returned if mcmc_params$save_lambda == TRUE. Rate of human infection from source j (Lambda j: lj) A nested list with a matrix (dimensions (n_iter, sources)) for each time and location. eg posterior$lj[[t]][[l]][n,j] gives the value for lj j for time t, location l at iteration n. Returned if mcmc_params$save_lambda == TRUE. }
acceptance_rateList with items named a, r, and d.
lll{ Parameter Format and nesting Source effects (a) A nested list with a positive numerical vector for each time and location. eg acceptance_rate$a_centered[[t]][[l]][3] gives the acceptance rate for a update for the third source effect at time t, location l. Always returned. These should all be around 0.45. r A nested list containing a matrix (with dimensions(types, sources)) for each time. eg acceptance_rate$r[[t]][i,j] gives the acceptance rate for r (type i, source j) at time t. Only returned if fix_r == FALSE. These should all be around 0.45. Dispersion parameter (d) positive numerical vector of length 1. Only returned if likelihood_dist == "nbinom" This should be around 0.45. }

Details

This function fits a source attribution model for human cases of disease. It supports multiple types, sources, times and locations. The number of human cases for each type, time and location follows a Poisson or Negative Binomial likelihood.

Model

$$y_{itl}\sim\textsf{Poisson}(\lambda_{itl})$$ or $$y_{itl}\sim\textsf{Negative Binomial}(\lambda_{itl},d)$$ where $$\lambda_{itl}=\sum_{j=1}^{m}\lambda_{ijtl}=q_{k(i)}\sum_{j=1}^{m}r_{ijt}\cdot \pi_{j}\cdot a_{jtl}$$

The parameters are defined as follows: $$a_{jtl}$$ is the unknown source effect for source $j$, time $t$, location $l$ $$q_{k(i)}$$ is the unknown type effect for type $i$ in group $k$. $$x_{ij}$$ is the known number of positive samples for each source $j$ type$i$ combination $$n_{ij}$$ is the known total number of samples for each source $j$ type $i$ combination $$\pi_{j}$$ is the fixed prevalence in source (i.e. the number of positive samples divided by the number of negative samples) $j$ $$r_{ijt}$$ is the unknown relative occurrence of type $i$ on source $j$. A point estimate of $r_{ijt}$ (which is used as the starting value for the MCMC chain) is $$r_{ijt}=x_{ij}/\sum_{i=1}^{n} x_{ij}$$

Priors $$x_{.j}\sim Multinomial(n_{j},p_{.j})$$ $$r_{.j}\sim Dirichlet(k_1,...,k_n)$$ $$a\sim Dirichlet(u_1,...,u_n)$$ $$q\sim DP(alpha, Gamma(alpha_{q},beta_{q}))$$

Guidence on choosing priors and the concentration parameter for the DP:

Dirichlet priors for the source effects and relative prevalences (r): symmetric Dirichlet priors (with a concentration parameter equal to 1) are often used when there is no prior knowledge favouring one component over another (it is uniform over all points in its support).

Dirichlet process parameters: a DP is specified by a base distribution (Gamma in this model) and a positive real number alpha_conc called the concentration parameter. The base distribution is the expected value of the process; the concentration parameter specifies how strong the prior grouping is. In the limit alpha_conc tends to 0, all types will be assigned to one group, increasing alpha_conc makes a larger number of groups increasingly likely. A relatively uninformative base distribution can be specified with parameters c(0.01, 0.00001).

See the vignette for further details about choosing priors.

Examples

Run this code

##########################################################################
## Access simulated data set #############################################
##########################################################################
#require(sourceR)
data(sim_SA)

##########################################################################
## Set priors ############################################################
##########################################################################

priors <- list(a = 1, r = 1, theta = c(0.01, 0.00001))

##########################################################################
## Run model #############################################################
##########################################################################
 
res <- saBayes(formula = Human~Source1+Source2+Source3+Source4+Source5, 
               time=~Time, location=~Location, type=~Type,
               data=sim_SA$data, priors = priors,
               alpha_conc = 1, prev = sim_SA$prev,
               likelihood_dist = "pois", n_iter = 20)
               
## or
## Run model using all parameters 
res <- saBayes(formula = Human~Source1+Source2+Source3+Source4+Source5, 
               time=~Time, location=~Location, type=~Type,
               data=sim_SA$data, priors = priors,
               alpha_conc = 1, prev = sim_SA$prev,
               likelihood_dist = "pois", n_iter = 5000,
               params_fix = list(type_effects = FALSE, r = FALSE),
               mcmc_params = list(save_lambda = TRUE, burn_in = 1, 
               thin = 1, n_r = 200))
               
##########################################################################
#### Results #############################################################
##########################################################################

## Example: Look at data #################################################
res$posterior$cluster

## Example: Convert rates into proportions, example lambda j #############

lambda_j_proportion_t1_l1 <- apply(res$posterior$lj[[1]][[1]], 1, 
                                    function(x) x/sum(x))

## Example: Trace plot ###################################################
# trace plot for lambda (type = 3, time = 1, location = 2)
plot(res$posterior$li$time1$locationB[, "3"], type="l")
## or
plot(res$posterior$li[[1]][[2]][, 3], type="l")

## Example: Density plot #################################################
# density plot for type effect (type = 3)
plot(density(res$posterior$q[,"3"]))

## Example: Residual plot ################################################

summary_li <- summary(res)$li
median_li <- c(summary_li$time1$locationA[, "median"], 
               summary_li$time1$locationB[, "median"],
               summary_li$time2$locationA[, "median"],
               summary_li$time2$locationB[, "median"])

plot(median_li, sim_SA$data[, "Human"], 
     xlab="lambda i", ylab="Human cases")
     abline(0,1)
  
## Example: Box plots showing the type effect grouping ###################
  require(ggplot2)
  library(cluster)
  # subset your data by removing a burn in (of 1000) from the total number 
  # of iterations (5000) and thinning to have a maximum of 10000 values as 
  # this can be very slow
  sub_hclust <- seq(1000, 5000, by=ceiling(5000/10000))
  groups <- as.data.frame(t(res$posterior$cluster)) 
  groups <- as.data.frame(apply(groups, 2, function(x) as.factor(x)))
  
  # compute dissimilarity matrix for the type effect clusters
  disim_clust_g <- daisy(groups[,sub_hclust])
  clu <- hclust(disim_clust_g, method="complete")
  # hclust_groups contains the group id for each type effect, 
  # if they were split into 3 groups using the tree generated by the
  # hclust command above
  hclust_groups <- cutree(clu, k=3) 
  
  # Put the data in the format ggplot2 requires it to be in
  # i.e. a column for the values for q at each iteration (thinned)
  # a column for the factor name (type i) so that it will make boxplots
  # and a column for the group
  q_full <- data.frame(qi=c(as.vector(res$posterior$q[sub_hclust,])), 
    i=as.factor(rep(1:50, each=dim(res$posterior$q[sub_hclust,])[1])), 
    group=as.factor(rep(as.vector(hclust_groups), 
    each=dim(res$posterior$q[sub_hclust,])[1])))
                                         
  g <- ggplot(q_full, aes(x=i,y=qi, fill=group)) + 
    geom_boxplot(outlier.size=0.5) +
    theme_classic()+ ylab("density") + xlab("Type effect i")
  print(g)
  
## Example: Dendrogram of type effect grouping ###########################
library(gplots)
  library(cluster)
  
  # subset your data by removing a burn in (of 1000) and thinning to 
  # have a maximum of 10000 values as this can be very slow
  sub_hclust <- seq(1000, 5000, by=ceiling(5000/10000))
  groups <- as.data.frame(t(res$posterior$cluster)) 
  groups <- as.data.frame(apply(groups, 2, function(x) as.factor(x)))
  rownames(groups) <- 1:50
  
  # compute dissimilarity matrix for the type effect clusters
  disim_clust_g <- daisy(groups[,sub_hclust])
  clu <- hclust(disim_clust_g, method="complete") 
  dend <- as.dendrogram(clu) 
  
  # OPTIONAL: change the colour of the heatmap. The lighter the colour,
  # the less likely two type effects are in the same group
  hmcols <- colorRampPalette(c("blue","white"))(299)
  # OPTIONAL: change the layout of the heatmap and dendrogram
  lmat <- rbind(c(0,3),c(2,1),c(0,4))
  lwid <- c(0.1,4)
  lhei <- c(1.5,4,0.4)
  layout(mat=lmat, widths=lwid, heights=lhei)

  heatmap_data <- as.matrix(disim_clust_g)

  rownames(heatmap_data) <- 1:50
  colnames(heatmap_data) <- 1:50
  
  heatmap.2(heatmap_data,
            density.info="none",  # turns off density plot inside color legend
            trace="none",         # turns off trace lines inside the heat map
            col=hmcols,           # use on color palette defined earlier 
            dendrogram="col",     # only draw a row dendrogram
            Colv=dend, 
            Rowv=dend,
            lmat=lmat, lwid=lwid, lhei=lhei
  )