sim_bnb: Simulate data from a BNB distribution

Description

Simulate data from the bivariate negative binomial (BNB) distribution. The BNB distribution is used to simulate count data where the event counts are jointly dependent (correlated). For independent data, see sim_nb().

Usage

sim_bnb(
  n,
  mean1,
  mean2,
  ratio,
  dispersion,
  nsims = 1L,
  return_type = "list",
  max_zeros = 0.99,
  ncores = 1L
)

Value

If nsims = 1 and the number of unique parameter combinations is one, the following objects are returned:

If return_type = "list", a list:

Slot Name Description

1 Simulated counts from sample 1.

2 Simulated counts from sample 2.
If return_type = "data.frame", a data frame:

Column Name Description

1 item Subject/item indicator.

2 condition Sample/condition indicator.

3 value Simulated counts.

If nsims > 1 or the number of unique parameter combinations is greater than one, each object described above is returned in a list-column named data in a depower simulation data frame:

Column	Name	Description
1	`n1`	Sample size of sample 1.
2	`n2`	Sample size of sample 2.
3	`mean1`	Mean for sample 1.
4	`mean2`	Mean for sample 2.
5	`ratio`	Ratio of means (sample 2 / sample 1).
6	`dispersion`	Gamma distribution shape parameter (dispersion).
7	`nsims`	Number of valid simulation iterations.
8	`distribution`	Distribution sampled from.
9	`data`	List-column of simulated data.

Arguments

n

(integer: [2, Inf))
The number(s) of paired observations.

mean1

(numeric: (0, Inf))
The mean(s) of sample 1 $(\mu_1)$.

mean2, ratio

(numeric: (0, Inf))
Only specify one of these arguments.

mean2: The mean(s) of sample 2 $(\mu_2)$.
ratio: The ratio(s) of means for sample 2 with respect to sample 1 $\left( r = \frac{\mu_2}{\mu_1} \right)$.

mean2 = ratio * mean1

dispersion

(numeric: (0, Inf))
The gamma distribution shape parameter(s) $(\theta)$ which control the dispersion and the correlation between sample 1 and 2. See 'Details' and 'Examples'.

nsims

(Scalar integer: 1L; [1, Inf))
The expected number of simulated data sets. If nsims > 1, the data is returned in a list-column of a depower simulation data frame. nsims may be reduced depending on max_zeros.

return_type

(string: "list"; c("list", "data.frame"))
The data structure of the simulated data. If "list" (default), a list object is returned. If "data.frame" a data frame in tall format is returned. The list object provides computational efficiency and the data frame object is convenient for formulas. See 'Value'.

max_zeros

(Scalar numeric: 0.99; [0, 1])
The maximum proportion of zeros each group in a simulated dataset is allowed to have. If the proportion of zeros is greater than this value, the corresponding data is excluded from the set of simulations. This is most likely to occur when the sample size is small and the dispersion parameter is small.

ncores

(Scalar integer: 1L; [1,Inf))
The number of cores (number of worker processes) to use. Do not set greater than the value returned by parallel::detectCores(). May be helpful when the number of parameter combinations is large and nsims is large.

Details

The negative binomial distribution may be defined using a gamma-Poisson mixture distribution. In this case, the Poisson parameter $\lambda$ is a random variable with gamma distribution. Equivalence between different parameterizations are demonstrated below:

# Define constants and their relationships
n <- 10000
dispersion <- 8
mu <- 4
p <- dispersion / (dispersion + mu)
q <- mu / (mu + dispersion)
variance <- mu + (mu^2 / dispersion)
rate <- p / (1 - p)
scale <- (1 - p) / p
# alternative formula for mu
mu_alt <- (dispersion * (1 - p)) / p
stopifnot(isTRUE(all.equal(mu, mu_alt)))
set.seed(20240321)
# Using built-in rnbinom with dispersion and mean
w <- rnbinom(n = n, size = dispersion, mu = mu)
# Using gamma-Poisson mixture with gamma rate parameter
x <- rpois(
  n = n,
  lambda = rgamma(n = n, shape = dispersion, rate = rate)
)
# Using gamma-Poisson mixture with gamma scale parameter
y <- rpois(
  n = n,
  lambda = rgamma(n = n, shape = dispersion, scale = scale)
)
# Using gamma-Poisson mixture with multiplicative mean and
# gamma scale parameter
z <- rpois(
  n = n,
  lambda = mu * rgamma(n = n, shape = dispersion, scale = 1/dispersion)
)
# Compare CDFs
par(mar=c(4,4,1,1))
plot(
  x = sort(w),
  y = (1:n)/n,
  xlim = range(c(w,x,y,z)),
  ylim = c(0,1),
  col = 'green',
  lwd = 4,
  type = 'l',
  main = 'CDF'
)
lines(x = sort(x), y = (1:n)/n, col = 'red', lwd = 2)
lines(x = sort(y), y = (1:n)/n, col = 'yellow', lwd = 1.5)
lines(x = sort(z), y = (1:n)/n, col = 'black')

The BNB distribution is implemented by compounding two conditionally independent Poisson random variables $X_1 \mid G = g \sim \text{Poisson}(\mu g)$ and $X_2 \mid G = g \sim \text{Poisson}(r \mu g)$ with a gamma random variable $G \sim \text{Gamma}(\theta, \theta^{-1})$. The probability mass function for the joint distribution of $X_1, X_2$ is

$$ P(X_1 = x_1, X_2 = x_2) = \frac{\Gamma(x_1 + x_2 + \theta)}{(\mu + r \mu + \theta)^{x_1 + x_2 + \theta}} \frac{\mu^{x_1}}{x_1!} \frac{(r \mu)^{x_2}}{x_2!} \frac{\theta^{\theta}}{\Gamma(\theta)} $$

where $x_1,x_2 \in \mathbb{N}^{\geq 0}$ are specific values of the count outcomes, $\theta \in \mathbb{R}^{> 0}$ is the dispersion parameter which controls the dispersion and level of correlation between the two samples (otherwise known as the shape parameter of the gamma distribution), $\mu \in \mathbb{R}^{> 0}$ is the mean parameter, and $r = \frac{\mu_2}{\mu_1} \in \mathbb{R}^{> 0}$ is the ratio parameter representing the multiplicative change in the mean of the second sample relative to the first sample. $G$ denotes the random subject effect and the gamma distribution scale parameter is assumed to be the inverse of the dispersion parameter ($\theta^{-1}$) for identifiability.

Correlation decreases from 1 to 0 as the dispersion parameter increases from 0 to infinity. For a given dispersion, increasing means also increases the correlation. See 'Examples' for a demonstration.

See 'Details' in sim_nb() for additional information on the negative binomial distribution.

References

yu_2020depower

rettiganti_2012depower

aban_2009depower

Examples

Run this code

#----------------------------------------------------------------------------
# sim_bnb() examples
#----------------------------------------------------------------------------
library(depower)

# Paired two-sample data returned in a data frame
sim_bnb(
  n = 10,
  mean1 = 10,
  ratio = 1.6,
  dispersion = 3,
  nsims = 1,
  return_type = "data.frame"
)

# Paired two-sample data returned in a list
sim_bnb(
  n = 10,
  mean1 = 10,
  ratio = 1.6,
  dispersion = 3,
  nsims = 1,
  return_type = "list"
)

# Two simulations of paired two-sample data
# returned as a list of data frames
sim_bnb(
  n = 10,
  mean1 = 10,
  ratio = 1.6,
  dispersion = 3,
  nsims = 2,
  return_type = "data.frame"
)

# Two simulations of Paired two-sample data
# returned as a list of lists
sim_bnb(
  n = 10,
  mean1 = 10,
  ratio = 1.6,
  dispersion = 3,
  nsims = 2,
  return_type = "list"
)

#----------------------------------------------------------------------------
# Visualization of the BNB distribution as dispersion varies.
#----------------------------------------------------------------------------
set.seed(1234)
data <- lapply(
  X = c(1, 10, 100, 1000),
  FUN = function(x) {
    d <- sim_bnb(
      n = 1000,
      mean1 = 10,
      ratio = 1.5,
      dispersion = x,
      nsims = 1,
      return_type = "data.frame"
    )
    cor <- cor(
      x = d[d$condition == "1", ]$value,
      y = d[d$condition == "2", ]$value
    )
    cbind(dispersion = x, correlation = cor, d)
  }
)

data <- do.call(what = "rbind", args = data)

ggplot2::ggplot(
  data = data,
  mapping = ggplot2::aes(x = value, fill = condition)
) +
  ggplot2::facet_wrap(
    facets = ggplot2::vars(.data$dispersion),
    ncol = 2,
    labeller = ggplot2::labeller(.rows = ggplot2::label_both)
  ) +
  ggplot2::geom_density(alpha = 0.3) +
  ggplot2::coord_cartesian(xlim = c(0, 60)) +
  ggplot2::geom_text(
    mapping = ggplot2::aes(
      x = 30,
      y = 0.12,
      label = paste0("Correlation: ", round(correlation, 2))
    ),
    check_overlap = TRUE
  ) +
  ggplot2::labs(
    x = "Value",
    y = "Density",
    fill = "Condition",
    caption = "Mean1=10, Mean2=15, ratio=1.5"
  )

Run the code above in your browser using DataLab

Slot	Name	Description
1		Simulated counts from sample 1.
2		Simulated counts from sample 2.

Column	Name	Description
1	`item`	Subject/item indicator.
2	`condition`	Sample/condition indicator.
3	`value`	Simulated counts.