CoGAPS: `CoGAPS` calls the C++ MCMC code through gapsRun to perform Bayesian matrix factorization, returning the two matrices that reconstruct the data matrix, and then calls calcCoGAPSStat to estimate gene set activity, with nPerm set to 500 by default

Description

CoGAPS calls the C++ MCMC code through gapsRun to perform Bayesian matrix factorization, returning the two matrices that reconstruct the data matrix. It then calls calcCoGAPSStat to estimate gene set activity, with nPerm set to 500 by default.

Usage

CoGAPS(data, unc, ABins = data.frame(), PBins = data.frame(), GStoGenes,
  nFactor = 7, simulation_id = "simulation", nEquil = 1000,
  nSample = 1000, nOutR = 1000, output_atomic = FALSE,
  fixedBinProbs = FALSE, fixedDomain = "N", sampleSnapshots = TRUE,
  numSnapshots = 100, plot = TRUE, nPerm = 500, alphaA = 0.01,
  nMaxA = 1e+05, max_gibbmass_paraA = 100, alphaP = 0.01, nMaxP = 1e+05,
  max_gibbmass_paraP = 100)

Arguments

data

data matrix

unc

uncertainty matrix (std devs for chi-squared of Log Likelihood)

ABins

a matrix of same size as A which gives relative probability of that element being non-zero

PBins

a matrix of same size as P which gives relative probability of that element being non-zero

GStoGenes

data.frame or list with gene sets

nFactor

number of patterns (basis vectors, metagenes)

simulation_id

name to attach to atoms files if created

nEquil

number of iterations for burn-in

nSample

number of iterations for sampling

nOutR

how often to print status into R by iterations

output_atomic

whether to write atom files (large)

fixedBinProbs

Boolean for using relative probabilities given in ABins and PBins

fixedDomain

character to indicate whether A or P is domain for relative probabilities

sampleSnapshots

Boolean to indicate whether to capture individual samples from Markov chain during sampling

numSnapshots

the number of individual samples to capture

plot

Boolean to indicate whether to produce output graphics

nPerm

number of permutations in gene set test

alphaA

sparsity parameter for A domain

nMaxA

currently unused; intended in future to limit the number of atoms

max_gibbmass_paraA

limit truncated normal to max size

alphaP

sparsity parameter for P domain

nMaxP

currently unused; intended in future to limit the number of atoms

max_gibbmass_paraP

limit truncated normal to max size
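
As a rough illustration of how the main inputs fit together, the sketch below builds toy stand-ins for data, unc and GStoGenes. The genes-in-rows layout follows the indexing used in Details; the toy values, the 10% uncertainty rule with a floor, and the gene and set names are illustrative assumptions, not package defaults.

```r
set.seed(1)
nGenes <- 6
nSamples <- 4

# Toy non-negative data matrix: genes in rows, samples in columns
D <- matrix(rexp(nGenes * nSamples), nrow = nGenes,
            dimnames = list(paste0("g", 1:nGenes), paste0("s", 1:nSamples)))

# Hypothetical uncertainty matrix: 10% of each value, with a floor of 0.1
# (an assumed rule for this sketch; use measured standard deviations if available)
S <- pmax(0.1 * D, 0.1)

# Gene sets as a named list of gene identifiers (one of the accepted forms)
GS <- list(setA = c("g1", "g2", "g3"), setB = c("g4", "g5"))

# Basic consistency checks before passing these to CoGAPS
stopifnot(identical(dim(D), dim(S)),
          all(S > 0),
          all(unlist(GS) %in% rownames(D)))
```

Objects shaped like these would be passed as data, unc and GStoGenes respectively.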

Value

A list containing:

meanChi2

Value of $\chi^2$ for Amean and Pmean.

D

Data matrix ${\bf{D}}$ input to factorization.

Sigma

Uncertainty matrix (std devs for chi-squared of Log Likelihood).

Amean

Sampled mean value of the amplitude matrix ${\bf{A}}$.

Asd

Sampled standard deviation of the amplitude matrix ${\bf{A}}$.

Pmean

Sampled mean value of the pattern matrix ${\bf{P}}$.

Psd

Sampled standard deviation of the pattern matrix ${\bf{P}}$.

GSUpreg

p-values for upregulation of each gene set in each pattern.

GSDownreg

p-values for downregulation of each gene set in each pattern.

GSActEst

p-values for activity of each gene set in each pattern.
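
To show how the returned matrices relate to one another, the following sketch uses toy stand-ins for Amean, Pmean, D and Sigma and computes a chi-squared value as the sum of squared residuals scaled by the uncertainties. This is the standard chi-squared form and is assumed here for illustration; this page does not confirm that meanChi2 is computed in exactly this way.

```r
set.seed(2)

# Toy stand-ins for the returned matrices (not output from a real run)
Amean <- matrix(rexp(6 * 2), nrow = 6)        # genes x patterns
Pmean <- matrix(rexp(2 * 4), nrow = 2)        # patterns x samples
Sigma <- matrix(0.5, nrow = 6, ncol = 4)      # per-element standard deviations

# Simulated data: the product A %*% P plus noise at the scale of Sigma
D <- Amean %*% Pmean + matrix(rnorm(6 * 4, sd = 0.5), nrow = 6)

# Reconstruction and chi-squared of the fit (assumed standard definition)
recon <- Amean %*% Pmean
chi2 <- sum(((D - recon) / Sigma)^2)
```

With a real result, results$Amean, results$Pmean, results$D and results$Sigma would take the place of the toy matrices.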

Details

CoGAPS first decomposes the data matrix ${\bf{D}}$ into a basis of underlying patterns using GAPS and then determines the gene set activity in each of these patterns. The GAPS decomposition is achieved by finding amplitude and pattern matrices (${\bf{A}}$ and ${\bf{P}}$, respectively) for which

$${\bf{D}} = {\bf{A}}{\bf{P}} + \Sigma,$$

where $\Sigma$ is the matrix of uncertainties given by unc. The matrices ${\bf{A}}$ and ${\bf{P}}$ are assumed to have the atomic prior described in Sibisi and Skilling (1997) and are found with MCMC sampling. Then, the patterns identified in the columns of ${\bf{P}}$ are linked to activity in each of the gene sets specified in GStoGenes using a novel z-score based statistic developed in Ochs et al. (2009). Specifically, the z-score for pattern $p$ and gene set $\mathcal{G}_{i}$ containing $G$ total genes is given by

$$Z_{i,p} = \frac{1}{G} \sum_{g \in \mathcal{G}_{i}} \frac{{\bf{A}}_{gp}}{Asd_{gp}},$$

where $g$ indexes the genes in the set and $Asd_{gp}$ is the standard deviation of ${\bf{A}}_{gp}$ obtained from MCMC sampling. CoGAPS then runs nPerm random sample tests to compute a consistent p-value estimate from that z-score. Note that the data from Ochs et al. (2009) are provided with this package in GIST_TS_20084.RData and TFGSList.RData for further validation.
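
The z-score described above can be sketched in a few lines of base R. The toy matrices, the helper zScore, and in particular the permutation scheme (random gene sets of the same size drawn from all genes) are assumptions made for illustration; the exact procedure used by calcCoGAPSStat is not described on this page.

```r
set.seed(3)
genes <- paste0("g", 1:8)

# Toy stand-ins for Amean and Asd (genes x patterns); not real CoGAPS output
Amean <- matrix(rexp(8 * 3), nrow = 8,
                dimnames = list(genes, paste0("p", 1:3)))
Asd <- matrix(runif(8 * 3, 0.2, 1), nrow = 8, dimnames = dimnames(Amean))

geneSet <- c("g1", "g3", "g4")   # hypothetical gene set

# Z for each pattern: mean over the set's genes of Amean / Asd
zScore <- function(ids) {
  colMeans(Amean[ids, , drop = FALSE] / Asd[ids, , drop = FALSE])
}
zObs <- zScore(geneSet)

# Assumed permutation scheme: compare against random gene sets of equal size
nPerm <- 500
zNull <- replicate(nPerm, zScore(sample(genes, length(geneSet))))

# Fraction of random sets scoring at least as high, per pattern
pUp <- rowMeans(zNull >= zObs)
```

Here zNull has one row per pattern and one column per permutation, so the comparison with zObs is made pattern by pattern.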

Examples

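## Load the package; SimpSim.D, SimpSim.S and GSets are assumed to be
## example datasets made available by CoGAPS
library(CoGAPS)

## Number of iterations used for both burn-in and sampling
nIter <- 5000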

## Run GAPS matrix decomposition with gene set statistic
results <- CoGAPS(data=SimpSim.D, unc=SimpSim.S,
                  GStoGenes=GSets,
                  nFactor=3,
                  nEquil=nIter, nSample=nIter,
                  plot=FALSE)


## Plot the results
plotGAPS(results$Amean, results$Pmean, 'GSFigs')