# Tilted Bootstrap

The tilted bootstrap is a weighted resampling technique. The goal is to take a reasonably realistic sample of the rows of a dataset in such a way that the column means approximate a user-defined target. In other words, the user can control the marginal means while still preserving intricate relationships among the variables. The "Tilted Bootstrap" is related to the inferential methods proposed by Efron and others [@efron1981nonparametric]. However, the methods implemented in the 'tboot' package are used in a multivariate situation and are intended for simulating future outcomes rather than for making inference.

This vignette is a tutorial on the use of the 'tboot' and 'tweights' functions, which may be used for 'scenario based' frequentist simulations. That is, the user must specify the exact mean for each variable to perform the simulation. An alternative Bayesian approach, where the user specifies a marginal distribution, is also implemented in the 'tboot' package in the functions 'tboot_bmr', 'post_bmr', and 'tweights_bmr'. A separate vignette is available as a tutorial for the Bayesian approach.

# A simulated example dataset

As an example, we simulate the following simple dataset with some categorical, continuous, and binary variables.

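    library(tboot)
    set.seed(2018)

    # Categorical variable with three levels
    color <- sample(c("brown", "green", "blue"), 300, replace = TRUE)

    # Continuous variables: quant1 is shifted upward for blue rows,
    # and quant2 is correlated with quant1
    quant1 <- rnorm(300) + ifelse(color == "blue", 1, 0)
    quant2 <- rnorm(300) + quant1 * 0.5

    # Binary variables derived from the continuous ones plus noise
    bin1 <- ifelse(quant1 + rnorm(300) > 1, 1, 0)
    bin2 <- ifelse(quant2 + rnorm(300) > 1, 1, 0)

    simData <- data.frame(color, quant1, quant2, bin1, bin2)
    head(simData)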

To use the 'tboot' package we must first code the variables as a numeric matrix. To do this, we code the 'color' variable above with two dummy variables.

    # Replace 'color' (column 1) with two 0/1 dummy columns
    dataset <- as.matrix(cbind(
      colorBlue  = ifelse(simData$color == "blue", 1, 0),
      colorBrown = ifelse(simData$color == "brown", 1, 0),
      simData[, -1]
    ))
    colMeans(dataset)

# Example 1: Constraining the mean for all variables in the data

Suppose that, in the bootstrapped dataset, we want all the column means to be around 0.4.

    target <- c(colorBlue = 0.4, colorBrown = 0.4,
                quant1 = 0.4, quant2 = 0.4,
                bin1 = 0.4, bin2 = 0.4)

We now need to find the row-level resampling weights. The weights are as close to uniform as possible while still making sure the bootstrap approximates the target column means.

    weights <- tweights(dataset = dataset, target = target)
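To build intuition for what "as close to uniform as possible" can mean, here is a minimal base-R sketch of exponential tilting for a single variable. It is an illustration under stated assumptions, not the package's implementation: the helper `tilt_weights_1d` is hypothetical, and it uses the fact that weights of the form `exp(lambda * x)` minimize the Kullback-Leibler divergence from uniform weights subject to one mean constraint.

```r
# Hypothetical helper (not part of tboot): exponential tilting for one variable.
# Weights are proportional to exp(lambda * x_i); lambda is chosen so that the
# weighted mean of x equals the target.
tilt_weights_1d <- function(x, target) {
  stopifnot(target > min(x), target < max(x))  # target must be attainable
  tilted <- function(lambda) {
    z <- lambda * x
    w <- exp(z - max(z))   # subtract the max for numerical stability
    w / sum(w)
  }
  # The weighted mean increases with lambda, so the root can be bracketed.
  gap <- function(lambda) sum(tilted(lambda) * x) - target
  lambda <- uniroot(gap, interval = c(-50, 50),
                    extendInt = "upX", tol = 1e-10)$root
  tilted(lambda)
}

set.seed(1)
x <- rnorm(300)
w <- tilt_weights_1d(x, target = 0.4)
sum(w)        # weights sum to 1
sum(w * x)    # weighted mean matches the target up to solver tolerance
```

Rows with larger `x` receive more weight when the target lies above the sample mean, which is the one-dimensional analogue of what the multivariate weights do.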

Next, we bootstrap a very large sample using the weights.

    boot <- tboot(weights = weights, nrow = 1e5)

The column means are close to the target even though all the rows came from the original data.

    colMeans(boot)
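The bootstrap means land near the target because, when rows are drawn with replacement using probabilities `w`, the expected column means equal the weighted means of the original data. The following base-R check of that fact uses made-up data and arbitrary weights (both assumptions for illustration) rather than output from 'tweights':

```r
set.seed(42)
X <- cbind(a = rnorm(50), b = rbinom(50, 1, 0.3))  # toy data (assumed)
w <- runif(50)
w <- w / sum(w)                                     # arbitrary weights summing to 1

# Expected bootstrap mean of each column: sum over rows of w_i * x_ij
expected_means <- drop(crossprod(w, X))

# Draw rows with replacement, using the weights as selection probabilities
idx <- sample(nrow(X), size = 1e5, replace = TRUE, prob = w)
boot_means <- colMeans(X[idx, , drop = FALSE])

rbind(expected = expected_means, bootstrap = boot_means)
```

The two rows agree up to Monte Carlo error, which shrinks as the number of bootstrap rows grows.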

The weights for each sample will now differ:

    hist(weights$weights, breaks = 25)
    abline(v = 1/300, col = "red")

The red line in the histogram above represents the probability for uniform resampling. We recommend against using the resampling methods implemented in this package when the weight of any one sample is too high. In some cases, this may be improved by transforming variables or by removing or Winsorizing outlier samples. Using a different distance measure, as described below in the section on Methodology, may also help. In general, if the target value is far from the observed mean of your data, it is likely that a handful of samples will be highly weighted.

# Example 2: Constraining the mean for a subset of the variables

Suppose we wish to constrain the mean for only a subset of the variables. For example, we may wish to constrain only the variables 'quant1' and 'quant2' so that each has a mean value of 0.5. We would determine the weights as follows:

    weights <- tweights(dataset = dataset, target = c(quant1 = 0.5, quant2 = 0.5))

We can now bootstrap from the entire dataset as follows:

    boot <- tboot(weights, nrow = 1e5)
    rbind("dataset mean" = colMeans(dataset), "tbootstrap mean" = colMeans(boot))

As may be seen, by changing the target mean of 'quant1' and 'quant2' we have also significantly changed the mean of other variables such as 'bin1' and 'bin2'. In many cases this is exactly what we would want to do, but it may occasionally be problematic, for example if one of the variables is a baseline variable in a clinical trial. In this case, it is recommended that the user constrain the mean of the baseline covariate to be the same as in the observed data.

The actual weights for each sample may be extracted as 'weights$weights'. As an instructive look at what the weights represent, here is a graph of the weights. The red point marks the position of the new assumed mean, while the black point marks the position of the data sample mean.

## Unable to find an optimum

The algorithm may fail to converge properly, in which case an error or warning will be issued.
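One cheap check before searching for weights is whether each target lies strictly inside the observed range of its column. Resampling rows can never produce a column mean outside that range, so a target that fails this check cannot be reached. The helper below, `target_in_range`, is a hypothetical name and not part of 'tboot'; passing the check is necessary but not sufficient for an optimum to exist.

```r
# Hypothetical helper (not part of tboot): flag targets outside the observed range.
target_in_range <- function(dataset, target) {
  cols <- as.matrix(dataset)[, names(target), drop = FALSE]
  lo <- apply(cols, 2, min)
  hi <- apply(cols, 2, max)
  target > lo & target < hi   # named logical vector, one entry per target
}

toy <- data.frame(x = c(-1, 0, 2), flag = c(0, 0, 1))
target_in_range(toy, c(x = 0.5, flag = 0.4))   # both targets are inside the range
target_in_range(toy, c(x = 3, flag = 0.4))     # x = 3 exceeds max(x), so x fails
```

A target can pass this check column by column and still be unreachable jointly, for instance when two targets conflict with a strong correlation in the data.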

## Some samples have a very high weight

As previously mentioned, if any sample is too highly weighted, the tilted bootstrap approach is not recommended. See Example 1 above for more discussion of this issue.
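What counts as "too highly weighted" is a judgment call. Two common summaries that may help are the largest weight relative to uniform and the Kish effective sample size, which is `1 / sum(w^2)` for weights that sum to 1. The sketch below uses made-up weights, and `weight_diagnostics` is a hypothetical helper rather than a 'tboot' function; with 'tboot' one would presumably pass the extracted weight vector instead.

```r
# Hypothetical helper (not part of tboot): summarize how uneven a weight vector is.
weight_diagnostics <- function(w) {
  w <- w / sum(w)                        # normalize defensively
  c(n = length(w),
    max_ratio_to_uniform = max(w) * length(w),
    effective_sample_size = 1 / sum(w^2))
}

weight_diagnostics(rep(1, 300))            # uniform weights: ratio 1, ESS 300
weight_diagnostics(c(rep(1, 299), 300))    # one dominant row: ESS drops sharply
```

An effective sample size far below the number of rows indicates that the bootstrap is effectively resampling only a handful of observations.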

## Augmented Sampling to Improve Stability

In many cases, the problems described above can be avoided by the use of the 'Nindependent' option in the 'tweights' function. We suggest that typically, if this option is used, the user set Nindependent=1. This means that one additional "special" sample is assumed to exist in the original data. If 'tboot' draws this special sample, it will proceed to bootstrap each variable independently, so that the variables are independent for that sample (for example, that patient). The weights for each independent variable bootstrap are set so that (if possible) the "special" sample would average to the target.

In effect, the Nindependent option slightly weakens the dependence among the variables. In exchange for the small bias created, the user will be able to better achieve a target even when the available dataset is small or ill-conditioned. This method is used for the Bayesian marginal reconstruction method, as it helps eliminate errors in corners of the parameter space that do not fit well with the data but still occasionally occur. In these cases, the independent sample may occur frequently, so the simulation defaults towards greater independence when the data is not consistent with the parameter.

Here is an example of using the Nindependent option, where the correlation is slightly smaller with augmentation:

    # Two strongly correlated variables
    x1 <- rnorm(1000)
    x2 <- rnorm(1000) * sqrt(.1) + x1 * sqrt(.9)
    dataset2 <- data.frame(x1 = x1, x2 = x2)

    # Same target, without and with one augmented "special" sample
    weights_no_augmentation <- tweights(dataset = dataset2,
                                        target = c(x1 = 0.2, x2 = -0.2))
    weights_augmentation <- tweights(dataset = dataset2,
                                     target = c(x1 = 0.2, x2 = -0.2),
                                     Nindependent = 1)

    boot_no_augmentation <- tboot(weights_no_augmentation, nrow = 1e5)
    boot_augmentation <- tboot(weights_augmentation, nrow = 1e5)

    cor(boot_no_augmentation)
    cor(boot_augmentation)
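To see why augmentation lowers the correlation, here is a base-R sketch of the mechanism as described above. It is not the package's code: with some probability a draw hits the "special" sample, in which case each column is resampled independently; otherwise a whole row is resampled. The probability `p_special` and the uniform row weights are assumptions chosen for illustration.

```r
set.seed(7)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n) * sqrt(0.1) + x1 * sqrt(0.9)
dat <- cbind(x1 = x1, x2 = x2)

p_special <- 0.2     # assumed probability of drawing the "special" sample
n_boot <- 1e5

# Ordinary draws: resample whole rows, which keeps the dependence intact
boot <- dat[sample(n, n_boot, replace = TRUE), ]

# Special draws: resample each column separately, which breaks the dependence
special <- runif(n_boot) < p_special
k <- sum(special)
boot[special, "x1"] <- dat[sample(n, k, replace = TRUE), "x1"]
boot[special, "x2"] <- dat[sample(n, k, replace = TRUE), "x2"]

cor(dat)["x1", "x2"]     # correlation in the original data
cor(boot)["x1", "x2"]    # smaller, scaled by roughly (1 - p_special)
```

Because the marginal distributions are the same in both kinds of draw, only the covariance shrinks, which is the small bias the augmentation trades for stability.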