
Repitools (version 1.18.0)

findClusters: Find Clusters of Epigenetically Modified Genes

Description

Given a table of gene positions that has a score column, genes will first be sorted into positional order and consecutive windows of high or low scores will be reported.

Usage

findClusters(stats, score.col = NULL, w.size = NULL, n.med = NULL, n.consec = NULL, cut.samps = NULL, maxFDR = 0.05, trend = c("down", "up"), n.perm = 100, getFDRs = FALSE, verbose = TRUE)

Arguments

stats
A data.frame with (at least) column chr, and a column of scores. Genes must be sorted in positional order.
score.col
A number that gives the column in stats which contains the scores.
w.size
The number of consecutive genes in each window. Must be odd.
n.med
Minimum number of genes in a window for which the median score of the window centred on them is above the cutoff.
n.consec
Minimum cluster size.
cut.samps
A vector of score cutoffs to calculate the FDR at.
maxFDR
The highest FDR level still deemed to be significant.
trend
Whether the clusters must have all positive scores (enrichment), or all negative scores (depletion).
n.perm
How many random tables to generate to use in the FDR calculations.
getFDRs
If TRUE, will also return the table of FDRs at a variety of score cutoffs, from which the score cutoff for calling clusters was chosen.
verbose
Whether to print progress of computations.

Value

If getFDRs is FALSE, only the stats table is returned, with an additional column, cluster. If getFDRs is TRUE, a list is returned with elements:
table
The table stats with the additional column cluster.
FDR
The table of score cutoffs tried, and their FDRs.

Details

First, the median score over a rolling window of w.size genes is calculated and assigned to the gene in the middle of the window. Windows are then run over the genes a second time, and the gene at the centre of a window is deemed significant if at least n.med genes in that window have medians above the score cutoff. These marker genes are extended outwards for as long as the score has the same sign. To estimate the FDR, the order of the stats rows is randomised n.perm times, and the same process is done for every randomisation.
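The three steps described above (rolling median, marker genes, outward extension) can be sketched in base R. This is only an illustration under stated assumptions, not the Repitools implementation: the helper names rollMedian, markGenes and extendMarked are hypothetical, edge positions without a full window are assumed to get no median, and only the "up" direction (positive scores) is handled.

```r
# Illustrative sketch only; these helpers are hypothetical, not Repitools functions.

# Centred rolling median; positions without a full window are left as NA (assumption).
rollMedian <- function(x, w) {
  stopifnot(w %% 2 == 1, length(x) >= w)
  half <- (w - 1) / 2
  out <- rep(NA_real_, length(x))
  for (i in (half + 1):(length(x) - half)) {
    out[i] <- median(x[(i - half):(i + half)])
  }
  out
}

# Mark genes whose surrounding window holds at least n.med medians above the cutoff.
markGenes <- function(scores, w, n.med, cutoff) {
  meds <- rollMedian(scores, w)
  half <- (w - 1) / 2
  n <- length(scores)
  marked <- logical(n)
  for (i in seq_len(n)) {
    win <- meds[max(1, i - half):min(n, i + half)]
    marked[i] <- sum(win > cutoff, na.rm = TRUE) >= n.med
  }
  marked
}

# Extend each marked gene outwards while neighbouring scores stay positive.
extendMarked <- function(scores, marked) {
  inCluster <- marked
  n <- length(scores)
  for (i in which(marked)) {
    j <- i - 1
    while (j >= 1 && scores[j] > 0) { inCluster[j] <- TRUE; j <- j - 1 }
    j <- i + 1
    while (j <= n && scores[j] > 0) { inCluster[j] <- TRUE; j <- j + 1 }
  }
  inCluster
}

# Toy scores, already in positional order
scores <- c(-1, -2, 0.5, 3, 4, 5, 4, 3, 0.5, -1, -2, -1)
marked <- markGenes(scores, w = 3, n.med = 2, cutoff = 2)
inCluster <- extendMarked(scores, marked)
```

With these toy scores, genes 4 to 8 are marked, and extension adds genes 3 and 9, whose scores are positive but whose medians fall below the cutoff.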

The procedure for calling clusters is done at a range of score cutoffs. The first score cutoff to give an FDR below maxFDR is chosen as the cutoff to use, and clusters are then called based on this cutoff.
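A rough sketch of how a cutoff might be chosen from permutation-based FDRs follows. The exact FDR formula used by findClusters is not given here, so the estimate below (mean number of clustered genes over permuted orders, divided by the number in the observed order) is an assumption, and countClustered is a simplified, hypothetical stand-in for the cluster caller.

```r
# Illustrative sketch only; none of these helpers are Repitools functions.

# Hypothetical stand-in for the cluster caller: counts genes lying in runs of
# at least n.consec consecutive scores above the cutoff.
countClustered <- function(scores, cutoff, n.consec) {
  runs <- rle(scores > cutoff)
  sum(runs$lengths[runs$values & runs$lengths >= n.consec])
}

# Assumed FDR estimate: mean count over permuted orders / count in observed order.
estimateFDRs <- function(scores, cutoffs, n.consec, n.perm) {
  sapply(cutoffs, function(cutoff) {
    observed <- countClustered(scores, cutoff, n.consec)
    if (observed == 0) return(NA_real_)  # nothing called, so no FDR to report
    permuted <- replicate(n.perm, countClustered(sample(scores), cutoff, n.consec))
    min(1, mean(permuted) / observed)
  })
}

# Choose the first cutoff whose estimated FDR is below maxFDR, or NA if none is.
chooseCutoff <- function(cutoffs, fdrs, maxFDR) {
  ok <- which(!is.na(fdrs) & fdrs < maxFDR)
  if (length(ok) == 0) NA_real_ else cutoffs[ok[1]]
}

set.seed(1)
scores <- rnorm(200, 0, 1)
scores[51:60] <- rnorm(10, 5, 0.5)  # planted run of high scores
cutoffs <- seq(1, 4, 1)
fdrs <- estimateFDRs(scores, cutoffs, n.consec = 3, n.perm = 50)
chosen <- chooseCutoff(cutoffs, fdrs, maxFDR = 0.05)
```

Because cutoffs are tried in the order given, the choice depends on how cut.samps is ordered; an increasing sequence selects the most lenient cutoff that still meets maxFDR.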

References

Saul Bert, in preparation

Examples

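  library(Repitools)

  # Simulate 500 genes at random positions on five chromosomes
  chrs <- sample(paste("chr", 1:5, sep = ""), 500, replace = TRUE)
  starts <- sample(1:10000000, 500, replace = TRUE)
  ends <- starts + 10000
  genes <- data.frame(chr = chrs, start = starts, end = ends, strand = '+')
  # Sort genes into positional order, as findClusters expects
  genes <- genes[order(genes$chr, genes$start), ]
  # Background scores, with a run of ten high-scoring genes in rows 21 to 30
  genes$t.stat <- rnorm(500, 0, 2)
  genes$t.stat[21:30] <- rnorm(10, 4, 1)
  # Scores are in column 5; look for clusters of positive scores
  findClusters(genes, score.col = 5, w.size = 5, n.med = 2, n.consec = 3,
               cut.samps = seq(1, 10, 1), trend = "up", n.perm = 2)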
