findClusters: Find Clusters of Epigenetically Modified Genes

Description

Given a table of gene positions that has a score column, genes will first be sorted into positional order and consecutive windows of high or low scores will be reported.

Usage

findClusters(stats, score.col = NULL, w.size = NULL, n.med = NULL, n.consec = NULL, cut.samps = NULL, maxFDR = 0.05, trend = c("down", "up"), n.perm = 100, getFDRs = FALSE, verbose = TRUE)

Arguments

stats

A data.frame with (at least) column chr, and a column of scores. Genes must be sorted in positional order.

score.col

A number that gives the column in stats which contains the scores.

w.size

The number of consecutive genes in each window. Must be odd.

n.med

Minimum number of genes in a window whose median score (calculated over the window centred on each gene) is above the cutoff.

n.consec

Minimum cluster size.

cut.samps

A vector of score cutoffs to calculate the FDR at.

maxFDR

The highest FDR level still deemed to be significant.

trend

Whether the clusters must have all positive scores (enrichment), or all negative scores (depletion).

n.perm

How many random tables to generate to use in the FDR calculations.

getFDRs

If TRUE, will also return the table of FDRs at a variety of score cutoffs, from which the score cutoff for calling clusters was chosen.

verbose

Whether to print progress of computations.

Value

table: The table stats with the additional column cluster.
FDR: The table of score cutoffs tried, and their FDRs. Returned only if getFDRs is TRUE.

Details

First, the median score over a rolling window of w.size genes is calculated and associated with the middle gene of the window. Windows are then run over the genes a second time, and the gene at the centre of a window is significant if the window that surrounds it contains at least n.med genes with representative medians above the score cutoff. These marker genes are extended outwards for as long as the score has the same sign. For the FDR calculations, the order of the stats rows is randomised, and the same process is done for every randomisation.
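The steps described above can be sketched in base R for a single chromosome and the "up" trend. This is only one reading of the description, not the package's code: the helper names rollMedian, markerGenes and extendMarkers are hypothetical, window ends are simply left as NA, and chromosome boundaries are ignored.

```r
# Hypothetical helper: median of each window of w.size scores, stored at the
# window's middle gene. Genes too close to either end are left as NA
# (an assumption; the end handling of findClusters is not described here).
rollMedian <- function(scores, w.size) {
  stopifnot(w.size %% 2 == 1)
  half <- (w.size - 1) / 2
  n <- length(scores)
  meds <- rep(NA_real_, n)
  if (n >= w.size) {
    for (i in (half + 1):(n - half)) {
      meds[i] <- median(scores[(i - half):(i + half)])
    }
  }
  meds
}

# Hypothetical helper: a gene is a marker if the window around it holds at
# least n.med genes whose rolling median is above the cutoff.
markerGenes <- function(meds, w.size, n.med, cutoff) {
  half <- (w.size - 1) / 2
  n <- length(meds)
  above <- !is.na(meds) & meds > cutoff
  vapply(seq_len(n), function(i) {
    idx <- max(1, i - half):min(n, i + half)
    sum(above[idx]) >= n.med
  }, logical(1))
}

# Hypothetical helper: extend markers outwards while the score stays positive,
# i.e. keep every run of positive scores that contains at least one marker.
extendMarkers <- function(scores, markers) {
  positive <- scores > 0
  runs <- rle(positive)
  runId <- rep(seq_along(runs$lengths), runs$lengths)
  hasMarker <- as.logical(tapply(markers & positive, runId, any))
  positive & hasMarker[runId]
}

scores <- c(-1, 0.5, 3, 4, 5, 4, 0.2, -2, 1, -1)
meds <- rollMedian(scores, w.size = 3)
markers <- markerGenes(meds, w.size = 3, n.med = 3, cutoff = 2)
cluster <- extendMarkers(scores, markers)
which(cluster)  # indices of the genes placed in the cluster
```

Treating each run of same-signed scores as a unit is equivalent to walking outwards from a marker gene one gene at a time, and it avoids an explicit loop over markers.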

The procedure for calling clusters is done at a range of score cutoffs. The first score cutoff to give an FDR below maxFDR is chosen as the cutoff to use, and clusters are then called based on this cutoff.
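The cutoff selection can be illustrated with a minimal sketch. It assumes that the FDR at a cutoff is estimated as the mean number of clusters called in row-shuffled data divided by the number called in the real data; this page does not give the formula, so that ratio is an assumption. The helpers countRuns and chooseCutoff are hypothetical, and countRuns is a deliberately simplified stand-in for the full cluster-calling procedure.

```r
# Hypothetical stand-in for cluster calling: count runs of at least n.consec
# consecutive genes whose score is above the cutoff.
countRuns <- function(scores, cutoff, n.consec) {
  runs <- rle(scores > cutoff)
  sum(runs$values & runs$lengths >= n.consec)
}

# Hypothetical helper: estimate an FDR at each cutoff from n.perm shuffles of
# the scores, then pick the first cutoff whose FDR is below maxFDR.
chooseCutoff <- function(scores, cut.samps, n.consec, n.perm, maxFDR) {
  fdr <- vapply(cut.samps, function(cutoff) {
    observed <- countRuns(scores, cutoff, n.consec)
    permuted <- replicate(n.perm, countRuns(sample(scores), cutoff, n.consec))
    # Assumed FDR estimate: expected false calls over observed calls
    if (observed == 0) NA_real_ else min(1, mean(permuted) / observed)
  }, numeric(1))
  passing <- which(!is.na(fdr) & fdr < maxFDR)
  list(FDR = data.frame(cutoff = cut.samps, FDR = fdr),
       cutoff = if (length(passing) > 0) cut.samps[passing[1]] else NA_real_)
}

set.seed(1)
scores <- rnorm(200)
scores[51:60] <- 5  # a planted block of high-scoring genes
result <- chooseCutoff(scores, cut.samps = 1:4, n.consec = 3,
                       n.perm = 20, maxFDR = 0.05)
result$FDR     # cutoffs tried and their estimated FDRs
result$cutoff  # first cutoff below maxFDR, or NA if none qualifies
```

With few permutations the FDR estimates are coarse, so a small n.perm is suitable only for quick checks.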

References

Saul Bert, in preparation

Examples

  # Simulate 500 genes spread over five chromosomes
  chrs <- sample(paste("chr", 1:5, sep = ""), 500, replace = TRUE)
  starts <- sample(1:10000000, 500, replace = TRUE)
  ends <- starts + 10000
  genes <- data.frame(chr = chrs, start = starts, end = ends, strand = '+')
  # Sort the genes into positional order
  genes <- genes[order(genes$chr, genes$start), ]
  # Background scores, with a block of ten high-scoring genes
  genes$t.stat <- rnorm(500, 0, 2)
  genes$t.stat[21:30] <- rnorm(10, 4, 1)
  findClusters(genes, score.col = 5, w.size = 5, n.med = 2, n.consec = 3,
               cut.samps = seq(1, 10, 1), trend = "up", n.perm = 2)
