# Bayesian Marginal Reconstruction

Suppose we are able to summarize the current state of scientific knowledge about the mean or proportion of each of several endpoints for a particular treatment using a distribution. The distribution for each endpoint may come from the elicitation of prior information from experts, from some initial dataset, or from some combination of different sources of information. In many cases it will be difficult to arrive at a joint distribution; only the marginal distribution of each endpoint can be calculated easily. This is particularly true when using external published data, where only marginal effects are known and no patient-level data are available. In general, assuming independence among the distributions of the endpoint means is not appropriate: because many endpoints are correlated, we would expect the distributions of their means to be correlated as well. The method described here may be used to simulate from an approximate joint distribution given the marginal distributions and an individual-level dataset. The correlation structure within the individual-level data is used to impute the joint distribution. The method also provides a way to simulate virtual trial data based on the marginals.
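To make the point about independence concrete, the following base R sketch simulates many hypothetical replicate trials with two correlated endpoints and looks at the correlation between the trial-level means. The effect sizes, sample sizes, and variable names are assumptions chosen for illustration only, and the sketch shows a sampling distribution rather than a posterior.

```r
# Illustration only: endpoint means inherit the correlation of the endpoints.
set.seed(1)
n_trials <- 2000   # number of hypothetical replicate trials
n_pat    <- 100    # patients per hypothetical trial

trial_means <- t(replicate(n_trials, {
  x <- rnorm(n_pat)                    # first endpoint
  y <- 0.6 * x + 0.8 * rnorm(n_pat)    # second endpoint, correlation 0.6 with x
  c(mean_x = mean(x), mean_y = mean(y))
}))

# Correlation between the two endpoint means across replicate trials.
cor_means <- cor(trial_means[, "mean_x"], trial_means[, "mean_y"])
cor_means
```

The correlation between the means should land close to the patient-level correlation of 0.6, so treating the two means as independent would discard real information.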

# A simulated example dataset

As an example, we simulate the following simple dataset with one continuous and two binary variables.

    library(tboot)
    set.seed(2020)
    quant1 <- rnorm(200) + 1
    bin1 <- ifelse((.5 * quant1 + .5 * rnorm(200)) > .5, 1, 0)
    bin2 <- ifelse((.5 * quant1 + .5 * rnorm(200)) > .5, 1, 0)
    simData <- data.frame(quant1, bin1, bin2)
    head(simData)

# Example

First, we create a list with simulations from the marginal distribution of each variable for two different treatments (active treatment and placebo).

    marginal_active <- list(quant1 = rnorm(5000, mean = .5, sd = .2),
                            bin1   = rbeta(5000, shape1 = 50, shape2 = 50),
                            bin2   = rbeta(5000, shape1 = 60, shape2 = 40))
    marginal_pbo <- list(quant1 = rnorm(5000, mean = .2, sd = .2),
                         bin1   = rbeta(5000, shape1 = 20, shape2 = 80),
                         bin2   = rbeta(5000, shape1 = 30, shape2 = 70))
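As a rough guide to what these inputs encode, the sketch below compares the analytic mean and standard deviation of a Beta distribution with the empirical values from 5000 draws. The helper `beta_summary` is hypothetical and not part of 'tboot'; it only applies the standard Beta moment formulas.

```r
# Hypothetical helper (not part of tboot): analytic mean and sd of a Beta(a, b).
beta_summary <- function(a, b) {
  c(mean = a / (a + b),
    sd   = sqrt(a * b / ((a + b)^2 * (a + b + 1))))
}

set.seed(2020)
# Same distributional form as the bin1 marginal for the active treatment.
draws <- rbeta(5000, shape1 = 50, shape2 = 50)

analytic  <- beta_summary(50, 50)
empirical <- c(mean = mean(draws), sd = sd(draws))
round(rbind(analytic, empirical), 3)
```

Under this reading, a Beta(50, 50) marginal is centred at a proportion of 0.5 with a standard deviation of about 0.05, and larger shape parameters would express more certainty about the proportion.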

We next use 'tweights_bmr' to calculate the correlation matrix from the data and prepare for marginal reconstruction. The calculation uses a call to the 'tweights' function.

    bmr_active <- tweights_bmr(dataset = simData, marginal = marginal_active)
    bmr_pbo <- tweights_bmr(dataset = simData, marginal = marginal_pbo)

To simulate from the posterior we use 'post_bmr':

    samples <- rbind(data.frame(trt = "active", post_bmr(nsims = 1e3, bmr_active)),
                     data.frame(trt = "pbo", post_bmr(nsims = 1e3, bmr_pbo)))
    head(samples)

The posterior samples show a correlation structure.

    # Colour by treatment arm; compare the trt column rather than the whole data frame.
    pairs(samples[, -1],
          col = ifelse(samples$trt == "active", "red", "blue"),
          pch = '.', cex = .5)

Marginally the posterior samples are equivalent to the simulations used as input (i.e., in the 'marginal' parameter).
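One way to picture how a reconstruction can keep the marginals intact while borrowing correlation from individual-level data is a Gaussian copula. The base R sketch below is an illustrative assumption and is not claimed to be what 'post_bmr' does internally: it uses the unweighted sample correlation rather than a tilted one, and the names `ind_data`, `marginal`, `R`, and `joint` are hypothetical.

```r
# Illustration only: a Gaussian-copula style reconstruction in base R.
# This is NOT the tboot implementation; all object names are hypothetical.
set.seed(2020)

# Individual-level data, following the same recipe as the example dataset.
quant1 <- rnorm(200) + 1
bin1 <- ifelse((.5 * quant1 + .5 * rnorm(200)) > .5, 1, 0)
bin2 <- ifelse((.5 * quant1 + .5 * rnorm(200)) > .5, 1, 0)
ind_data <- data.frame(quant1, bin1, bin2)

# Marginal simulations for one treatment.
marginal <- list(quant1 = rnorm(5000, mean = .5, sd = .2),
                 bin1   = rbeta(5000, shape1 = 50, shape2 = 50),
                 bin2   = rbeta(5000, shape1 = 60, shape2 = 40))

# 1. Correlation structure from the individual-level data (unweighted here).
R <- cor(ind_data)

# 2. Correlated standard normal draws via the Cholesky factor of R.
nsims <- 5000
z <- matrix(rnorm(nsims * ncol(R)), nrow = nsims) %*% chol(R)

# 3. Convert each column to uniform scores, then to quantiles of its marginal.
u <- pnorm(z)
joint <- sapply(seq_along(marginal), function(j) {
  quantile(marginal[[j]], probs = u[, j], names = FALSE)
})
colnames(joint) <- names(marginal)

round(cor(joint), 2)      # dependence carried over from R
round(colMeans(joint), 2) # marginal means stay close to the inputs
```

Because each column of `joint` is built from the quantiles of its own marginal simulation, the marginal distributions are reproduced up to Monte Carlo error, while the dependence comes entirely from `R`.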

## Justifying the algorithm

The algorithm described above may be justified in several ways. First, it is heuristically plausible: when two variables are correlated, one would expect a drug that influences one of them to most likely influence the other. Second, in some specific cases, the algorithm may be justified via Bayesian asymptotics using the Bernstein-von Mises theorem. This document will not attempt to fully work out this more theoretical approach.

## Considering the limits of 'tboot_bmr'

The following considerations are relevant when deciding whether to use 'tboot_bmr':

1. Is the relationship between variables found in the available individual level data generalizable to the treatment of interest? That is, if the individual data are tilted to reflect the expected mean of the treatment of interest, will the correlation be realistically similar to the correlation of the variables under the treatment of interest? In general, it is expected that the assumptions of 'tboot' will be more believable than the assumption of independence.
2. Is the individual level data sample size large enough to make inference about correlation?
3. Did the information about each variable come from different trials? In such cases it may be argued that, for large sample sizes, the distributions should be independent.
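For the second consideration, a quick base R check is to look at the confidence interval for a correlation at the available sample size. The sketch below uses data simulated with the same recipe as the example dataset and compares the full 200 patients with a hypothetical subset of 30. The interval from `cor.test` relies on a normal approximation, so with a binary variable it should be read as a rough guide only.

```r
# Illustration only: how precisely does a given sample size pin down a correlation?
set.seed(2020)
n <- 200
quant1 <- rnorm(n) + 1
bin1 <- ifelse((.5 * quant1 + .5 * rnorm(n)) > .5, 1, 0)

# Fisher-z based 95% confidence interval from base R's cor.test().
ci_200 <- cor.test(quant1, bin1)$conf.int

# Same calculation on the first 30 patients, to show the effect of a smaller sample.
idx <- seq_len(30)
ci_30 <- cor.test(quant1[idx], bin1[idx])$conf.int

rbind(n200 = ci_200, n30 = ci_30)
```

If the interval is wide relative to the correlations that matter for the decision at hand, the individual level data may be too small to support the reconstruction.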