prepare.nbp: Prepare the Data Structure for NBP Test

Description

Create the NBP data structure, (optionally) normalize the counts, and thin the counts to make the effective library sizes equal.

Usage

prepare.nbp(counts, grp.ids, lib.sizes = colSums(counts), norm.factors = NULL, 
    thinning = TRUE, print.level = 1)

Arguments

counts

an $n$ by $r$ matrix of RNA-Seq read counts, with rows corresponding to genes (or exons, gene isoforms, etc.) and columns corresponding to libraries (independent biological samples).

grp.ids

an $r$ vector of treatment group identifiers (e.g. integers).

lib.sizes

library sizes. By default, library sizes are estimated by column sums.

norm.factors

normalization factors. If NULL (default), no normalization will be applied.

thinning

a logical value. If TRUE (default), the counts will be randomly downsampled (thinned) to make the effective library sizes approximately equal.

print.level

controls the amount of messages printed: 0 for suppressing all messages, 1 (default) for basic progress messages, and 2 to 5 for increasingly more detailed messages.
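
As a rough illustration of the shapes these arguments are expected to have, the sketch below builds toy inputs in base R. All values are hypothetical, and the assumption that norm.factors holds one factor per library is ours (inferred from the fact that the factors multiply lib.sizes); it is not a statement about how prepare.nbp validates its input.

```r
# Toy inputs (hypothetical values, for illustration only)
set.seed(42)
n <- 5  # number of genes (rows)
r <- 6  # number of libraries (columns)
counts <- matrix(rpois(n * r, lambda = 20), nrow = n, ncol = r)

grp.ids <- c(1, 1, 1, 2, 2, 2)  # one treatment group identifier per library
lib.sizes <- colSums(counts)    # what the default lib.sizes evaluates to
norm.factors <- rep(1, r)       # assumed: one factor per library (1 = no adjustment)

# Shape checks implied by the argument descriptions above
stopifnot(is.matrix(counts),
          length(grp.ids) == ncol(counts),
          length(lib.sizes) == ncol(counts),
          length(norm.factors) == ncol(counts))
```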

Value

A list containing the following components:

counts

the count matrix, same as input.

lib.sizes

column sums of the count matrix.

grp.ids

a vector of identifiers of treatment groups, same as input.

eff.lib.sizes

effective library sizes, lib.sizes multiplied by the normalization factors.

pseudo.counts

count matrix after thinning.

pseudo.lib.sizes

effective library sizes of pseudo counts, i.e., column sums of the pseudo count matrix multiplied by the normalization factors.

Details

Normalization

We take gene expression to be indicated by the relative frequency of RNA-Seq reads mapped to a gene, relative to library sizes (column sums of the count matrix). Since the relative frequencies sum to 1 in each library (one column of the count matrix), increased relative frequencies of truly overexpressed genes in a column must be accompanied by decreased relative frequencies of other genes, even when those other genes are not truly differentially expressed. Robinson and Oshlack (2010) presented examples where this problem is noticeable.

A simple fix is to compute the relative frequencies relative to effective library sizes, that is, library sizes multiplied by normalization factors. Several authors (e.g., Robinson and Oshlack (2010), Anders and Huber (2010)) propose estimating the normalization factors based on the assumption that most genes are NOT differentially expressed.

By default, prepare.nbp does not estimate the normalization factors, but can incorporate user specified normalization factors through the argument norm.factors.
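
The composition effect described above can be seen in a small base-R sketch. The toy counts and the hand-picked normalization factors below are hypothetical and chosen only to make the arithmetic obvious; in practice the factors would be estimated by one of the cited methods.

```r
# Toy data: 4 genes, 2 libraries. Only gene 1 is truly overexpressed in library 2.
counts <- matrix(c(100, 100, 100, 100,
                   400, 100, 100, 100), ncol = 2)
lib.sizes <- colSums(counts)

# Relative frequencies with respect to the raw library sizes:
# genes 2 to 4 look underexpressed in library 2 although their counts are unchanged.
rel.raw <- sweep(counts, 2, lib.sizes, "/")

# Hypothetical normalization factors, picked by hand so that the
# effective library sizes agree for the genes that did not change.
norm.factors <- c(1, 400 / 700)
eff.lib.sizes <- lib.sizes * norm.factors

# Relative frequencies with respect to the effective library sizes:
# genes 2 to 4 now have equal relative frequencies in both libraries.
rel.eff <- sweep(counts, 2, eff.lib.sizes, "/")
```

Here rel.raw shows the spurious decrease for genes 2 to 4, while rel.eff removes it. Factors like these would be passed to prepare.nbp through the norm.factors argument.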

Library Size Adjustment

The exact test requires that the effective library sizes (column sums of the count matrix multiplied by normalization factors) are approximately equal. By default, prepare.nbp will thin (downsample) the counts to make the effective library sizes equal. Thinning may lose statistical efficiency, but is unlikely to introduce bias.
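
One way to picture thinning is binomial downsampling, where each read in a library is kept independently with a probability chosen so that all effective library sizes shrink to the smallest one. The base-R sketch below illustrates that idea on simulated counts; it is an assumption-laden illustration, not necessarily the procedure prepare.nbp uses internally, and the names keep.prob and scale are ours.

```r
set.seed(1)
n <- 1000                 # number of genes
scale <- c(1, 2, 1.5)     # hypothetical sequencing depths of three libraries
counts <- sapply(scale, function(s) rpois(n, lambda = 50 * s))

norm.factors <- c(1, 1, 1)  # no normalization in this toy example
eff.lib.sizes <- colSums(counts) * norm.factors

# Probability of keeping a read in each library: the smallest library is
# left untouched (probability 1), larger ones are thinned towards it.
keep.prob <- min(eff.lib.sizes) / eff.lib.sizes

# Binomial thinning, one column at a time
pseudo.counts <- sapply(seq_len(ncol(counts)), function(j) {
  rbinom(n, size = counts[, j], prob = keep.prob[j])
})

# Effective library sizes of the thinned counts are now approximately equal
pseudo.lib.sizes <- colSums(pseudo.counts) * norm.factors
```

Because reads are only discarded, never added, the thinned counts can be no larger than the originals, which is where the loss of statistical efficiency comes from.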

Examples

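## Load the package that provides prepare.nbp and the arab data set
library(NBPSeq)

## Load Arabidopsis data
data(arab)

## Specify treatment groups (three libraries per group)
grp.ids <- c(1, 1, 1, 2, 2, 2)

## Prepare an NBP object, adjust the library sizes by thinning the counts.
## Thinning is random, so set a seed for reproducibility.
set.seed(999)
obj <- prepare.nbp(arab, grp.ids, print.level = 5)

## Print the NBP object
print.nbp(obj)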