thin_2group: Binomial thinning in the two-group model.

Description

Given a matrix of real RNA-seq counts, this function will randomly assign samples to one of two groups, draw the log2-fold change in expression between two groups for each gene, and add this signal to the RNA-seq counts matrix. The user may specify the proportion of samples in each group, the proportion of null genes (where the log2-fold change is 0), and the signal function. This is a specific application of the general binomial thinning approach implemented in thin_diff.

Usage

thin_2group(
  mat,
  prop_null = 1,
  signal_fun = stats::rnorm,
  signal_params = list(mean = 0, sd = 1),
  group_prop = 0.5,
  corvec = NULL,
  alpha = 0,
  permute_method = c("optmatch", "hungarian", "marriage"),
  type = c("thin", "mult")
)

Arguments

mat

A numeric matrix of RNA-seq counts. The rows index the genes and the columns index the samples.

prop_null

The proportion of genes that are null.

signal_fun

A function that returns the signal. This should take as input n for the number of samples to return and then return only a vector of samples. Additional parameters may be passed through signal_params.

signal_params

A list of additional arguments to pass to signal_fun.

group_prop

The proportion of individuals that are in group 1.

corvec

A vector of target correlations. corvec[i] is the target correlation of the latent group assignment vector with the ith surrogate variable. The default is to set this to NULL, in which case group assignment is made independently of any unobserved confounding.

alpha

The scaling factor for the signal distribution from Stephens (2016). If \(x_1, x_2, ..., x_n\) are drawn from signal_fun, then the signal is set to \(x_1 s_1^{\alpha}, x_2 s_2^{\alpha}, ..., x_n s_n^{\alpha}\), where \(s_g\) is the empirical standard deviation of gene \(g\). Setting this to 0 means that the effects are exchangeable, setting this to 1 corresponds to the p-value prior of Wakefield (2009). You would rarely set this to anything but 0 or 1.

permute_method

Should we use the optimal matching technique from Hansen and Klopfer (2006) ("optmatch"), the Gale-Shapley algorithm for stable marriages ("marriage") (Gale and Shapley, 1962) as implemented in the matchingR package, or the Hungarian algorithm (Papadimitriou and Steiglitz, 1982) ("hungarian") as implemented in the clue package (Hornik, 2005)? The "optmatch" method works really well but does take a lot more computational time if you have, say, 1000 samples. If you use the "optmatch" option, you should note that the optmatch package uses a super strange license: https://cran.r-project.org/package=optmatch/LICENSE. If this license doesn't work for you (because you are not in academia, or because you don't believe in restrictive licenses), then try out the "hungarian" method.

type

Should we apply binomial thinning (type = "thin") or just naive multiplication of the counts (type = "mult"). You should always have this set to "thin".

Value

A list-like S3 object of class ThinData. Components include some or all of the following:

mat: The modified matrix of counts.
designmat: The design matrix of variables used to simulate signal. This is made by column-binding design_fixed and the permuted version of design_perm.
coefmat: A matrix of coefficients corresponding to designmat.
design_obs: Additional variables that should be included in your design matrix in downstream fittings. This is made by column-binding the vector of 1's with design_obs.
sv: A matrix of estimated surrogate variables. In simulation studies you would probably leave this out and estimate your own surrogate variables.
cormat: A matrix of target correlations between the surrogate variables and the permuted variables in the design matrix. This might be different from the target_cor you input because we pass it through fix_cor to ensure positive semi-definiteness of the resulting covariance matrix.
matching_var: A matrix of simulated variables used to permute design_perm if the target_cor is not NULL.

Details

The specific application of binomial thinning to the two-group model was used in Gerard and Stephens (2017) and Gerard and Stephens (2018). This is a specific case of the general method described in Gerard (2020).

References

Gale, David, and Lloyd S. Shapley. "College admissions and the stability of marriage." The American Mathematical Monthly 69, no. 1 (1962): 9-15.
Gerard, David and Matthew Stephens (2017). "Unifying and generalizing methods for removing unwanted variation based on negative controls." arXiv preprint arXiv:1705.08393.
David Gerard and Matthew Stephens (2018). "Empirical Bayes shrinkage and false discovery rate estimation, allowing for unwanted variation." Biostatistics, doi: 10.1093/biostatistics/kxy029.
Gerard, D (2020). "Data-based RNA-seq simulations by binomial thinning." BMC Bioinformatics. 21(1), 206. doi: 10.1186/s12859-020-3450-9.
Hansen, Ben B., and Stephanie Olsen Klopfer. "Optimal full matching and related designs via network flows." Journal of computational and Graphical Statistics 15, no. 3 (2006): 609-627.
Hornik K (2005). "A CLUE for CLUster Ensembles." Journal of Statistical Software, 14(12). doi: 10.18637/jss.v014.i12
C. Papadimitriou and K. Steiglitz (1982), Combinatorial Optimization: Algorithms and Complexity. Englewood Cliffs: Prentice Hall.
Stephens, Matthew. "False discovery rates: a new deal." Biostatistics 18, no. 2 (2016): 275-294.
Wakefield, Jon. "Bayes factors for genome-wide association studies: comparison with P-values." Genetic epidemiology 33, no. 1 (2009): 79-86.

Examples

Run this code

# NOT RUN {
## Simulate data from given matrix of counts
## In practice, you would obtain Y from a real dataset, not simulate it.
set.seed(1)
nsamp <- 10
ngene <- 1000
Y <- matrix(stats::rpois(nsamp * ngene, lambda = 50), nrow = ngene)
thinout <- thin_2group(mat           = Y,
                       prop_null     = 0.9,
                       signal_fun    = stats::rexp,
                       signal_params = list(rate = 0.5))

## 90 percent of genes are null
mean(abs(thinout$coef) < 10^-6)

## Check the estimates of the log2-fold change
Ynew <- log2(t(thinout$mat + 0.5))
X    <- thinout$designmat
Bhat <- coef(lm(Ynew ~ X))["X", ]
plot(thinout$coefmat,
     Bhat,
     xlab = "log2-fold change",
     ylab = "Estimated log2-fold change")
abline(0, 1, col = 2, lwd = 2)

# }

Run the code above in your browser using DataLab