compHistDists: Compute distances between pairs of histograms

Description

This function computes, for each peak, pairwise distances between histograms according to the specified method. Currently implemented are Maximum Mean Discrepancy (MMD), Generalized Minimum Distance (GMD) and simple Pearson correlation (Pearson).
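
As a rough illustration of what a kernel-based distance such as MMD measures, the following base R sketch computes a biased estimate of the squared MMD between two samples using a Gaussian kernel. The helper names (gauss_kernel, mmd2), the kernel form, the bandwidth parameterization and the simulated read positions are assumptions made for illustration only. They are not taken from the package, whose implementation may differ.

```r
# Illustrative sketch only: not the package's implementation.

# Gaussian (RBF) kernel matrix between two numeric vectors.
# 'sigma' is the kernel bandwidth (hypothetical parameterization).
gauss_kernel <- function(x, y, sigma) {
  exp(-outer(x, y, "-")^2 / (2 * sigma^2))
}

# Biased estimate of the squared MMD between samples x and y.
mmd2 <- function(x, y, sigma = 1) {
  mean(gauss_kernel(x, x, sigma)) +
    mean(gauss_kernel(y, y, sigma)) -
    2 * mean(gauss_kernel(x, y, sigma))
}

set.seed(1)
pos_a <- rnorm(200, mean = 0)    # simulated read positions, sample A
pos_b <- rnorm(200, mean = 0.1)  # sample with a similar profile
pos_c <- rnorm(200, mean = 2)    # sample with a shifted profile

d_same  <- mmd2(pos_a, pos_a)    # identical samples
d_close <- mmd2(pos_a, pos_b)    # similar profiles
d_far   <- mmd2(pos_a, pos_c)    # clearly different profiles
```

In this toy setting the estimate is zero for identical samples and grows as the two profiles move apart. A larger bandwidth makes the kernel less sensitive to small shifts.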

Usage

compHistDists(DBA, method = 'MMD', CompIDs = NULL, Usefiltered = TRUE, PeakIDs = NULL, NormMethod = 'DESeq', overWrite = FALSE, HistField = 'PeakRawHists', run.parallel = TRUE, verbose = 2, save.file = TRUE, out.dir = '.', sigma = NULL)

Arguments

DBA

DBA object, after running getPeakProfiles. Specifically, it uses the element MD, which contains a list of histogram matrices (see the getPeakProfiles documentation for more information about this data type).

method

specifies which method is used to determine distances between histograms: 'MMD' [1], 'GMD' [2] or simple 'Pearson' correlation

CompIDs

2 x nComps matrix specifying the sample ids of the pairwise comparisons, one comparison per column

Usefiltered

If TRUE, only peaks that have passed the outlier filter are considered. findOutlier() must be run first; otherwise all peaks are used.

PeakIDs

specifies a subset of peaks for which distances should be computed

NormMethod

specifies which normalization method is used; currently only the 'DESeq' method [3] is implemented. Note that unless NormMethod = NULL, getNormFactors has to be called first.

overWrite

if TRUE, overwrites earlier computed distances.

HistField

name of the element in MD that is used to determine distances. This element should again be a list of nPeaks peaks, each containing a matrix of histograms (nSamples x nbins). It can be generated by running getPeakProfiles. Note that nbins may vary between peaks if they have different lengths.

run.parallel

if TRUE, the computation is distributed over the available CPUs

verbose

for debugging, set to 3 for some extra output

save.file

if TRUE, DBA objects are saved

out.dir

directory for saving output files

sigma

parameter controlling the Kernel size
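
The CompIDs argument described above expects a 2 x nComps matrix of sample ids. One way to build such a matrix for all unordered pairs is the base R function combn. This is a minimal sketch that assumes the sample ids of the Cfp1 example data used in the Examples section.

```r
# Sample ids as they appear in the Cfp1 example data
samples <- c("WT.AB2", "Null.AB2", "Resc.AB2")

# combn() returns one column per unordered pair,
# i.e. a 2 x nComps character matrix
CompIDs <- combn(samples, 2)

dim(CompIDs)  # 2 rows, one column per comparison
```

The resulting matrix has the same layout as one assembled by hand with cbind, and a subset of its columns can be selected to restrict the comparisons.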

Value

DBA object with an additional list element DISTS added to MD. DISTS in turn contains a list element named after the method applied (e.g. MMD). This element is a matrix (nPeaks x nComps) containing all pairwise distances.
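
To make the layout of the return value concrete, the sketch below builds a small mock list that mimics only the nesting described above (MD, then DISTS, then a matrix named after the method) and reads the distances back out. The object mock, the peak and comparison names and the distance values are all invented for illustration. It is also an assumption that a real DBA object can be accessed with the same list-style syntax.

```r
# Mock object mimicking only the documented nesting; a real DBA object
# returned by compHistDists contains much more. All values are made up.
mock <- list(MD = list(DISTS = list(
  GMD = matrix(c(0.1, 0.8, 0.3,
                 0.2, 0.9, 0.1),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("peak1", "peak2"),
                               c("WT vs Null", "WT vs Resc", "Null vs Resc")))
)))

# Extract the distance matrix for one method (nPeaks x nComps)
dists <- mock$MD$DISTS[["GMD"]]
dim(dists)

# For each comparison, find the peak with the largest distance
top_peak <- rownames(dists)[apply(dists, 2, which.max)]
```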

References

[1] Gretton A. et al. (2006). A kernel method for the two-sample-problem. In NIPS, pages 513-520, MIT Press.

[2] Zhao et al. (2012). GMD: Measuring the distance between histograms with applications on high-throughput sequencing reads. Bioinformatics, 28 (8): 1164-1165.

[3] Anders S. and Huber W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11 (10): R106.

Examples

# load DBA objects with peak profiles 
data(Cfp1Profiles)

# get normalization factors
Cfp1Norm <- getNormFactors(Cfp1Profiles)

# get all pairwise distances for the samples WT, Null and Resc, i.e. WT
# vs Null, WT vs Resc and Null vs Resc. The recommended method is 'MMD'
# [1]; however, this may take a little while. Here, we compute the GMD
# distance instead [2].

Cfp1Dists <- compHistDists(Cfp1Norm, method = 'GMD', 
           NormMethod = 'DESeq') 




# You can also specify which pairwise distances you are interested in,
# e.g.:

CompIDs <- cbind(c("WT.AB2", "Null.AB2"),
                 c("WT.AB2", "Resc.AB2"),
                 c("Null.AB2", "Resc.AB2"))

Cfp1Dists2 <- compHistDists(Cfp1Norm, method='GMD', CompIDs=CompIDs,
            NormMethod='DESeq')




# To view pairwise distances you can use the function plotHistDists. For
# example, treating WT and Resc as control replicates and Null as a
# treatment group, you can contrast the 'within-group' distances with 
# 'between-group' distances:

group1 <- c("WT.AB2", "Resc.AB2")
group2 <- c("Null.AB2")
plotHistDists(Cfp1Dists, group1 = group1, group2 = group2, method = 'GMD')

# see detPeakPvals to determine which peaks are significantly different
# between the two groups.
