distGPS: Compute matrix with pairwise distances between objects. Several GPS metrics are available.

Description

The function computes pairwise distances between individuals (e.g. samples or genes) according to a user-specified metric. Several metrics are available. The precise definition of each metric depends on the class of the first argument (see the Details section).

Usage

distGPS(x, metric='tanimoto', weights, uniqueRows=FALSE, genomelength=NULL, mc.cores=1)

Arguments

x

Object for which we want to compute distances.

metric

Desired distance metric. Valid options for chroGPS-factors maps are 'tanimoto', 'avgdist', 'chisquare' and 'chi' (see details). For chroGPS-genes maps, metrics 'wtanimoto', 'euclidean' and 'manhattan' are also available.

weights

For signature(x='matrix'), an unnamed numeric vector with weights applied to every sample (column) in the original data. The typical example is when we have a sample (epigenetic factor) with several replicates available (biological or technical replicate, different antibody, etc.), and we want to treat them together (for instance giving a 1/nreplicates weight to each one). If not supplied, each replicate is considered as an individual sample (using 1 as weight for every sample).
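The 1/nreplicates weighting described above can be sketched in base R as follows. The column labels in `factor_of_column` are hypothetical and only illustrate how replicate columns might be grouped; they are not part of the package.

```r
# Hypothetical labels: the epigenetic factor each column (replicate) of x belongs to
factor_of_column <- c("H3K4me3", "H3K4me3", "H3K27me3", "HP1a", "HP1a", "HP1a")

# Number of replicates available for the factor of each column
n_replicates <- table(factor_of_column)[factor_of_column]

# Weight 1/nreplicates per column, returned as an unnamed numeric vector
weights <- unname(as.numeric(1 / n_replicates))
weights
```

With these weights the replicates of each factor sum to 1, so a factor with three replicates counts as much as a factor with one. The vector could then be supplied through the weights argument of distGPS.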

uniqueRows

If set to TRUE and x is a matrix or data.frame, duplicated rows are removed prior to distance calculation. This can save substantial computing time and memory. Notice however that the dimension of the distance matrix is then equal to the number of unique rows in x, instead of nrow(x).
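The effect of removing duplicated rows on the size of the result can be previewed in base R. This is only a sketch of the idea; it assumes nothing about how distGPS removes duplicates internally.

```r
# Same toy matrix as in the Examples section: rows a and b are identical
x <- rbind(c(rep(0, 15), rep(1, 5)),
           c(rep(0, 15), rep(1, 5)),
           c(rep(0, 19), 1),
           c(rep(1, 5), rep(0, 15)))
rownames(x) <- letters[1:4]

# Keep only the first occurrence of each distinct row
xu <- x[!duplicated(x), , drop = FALSE]

nrow(x)   # 4 rows in the original data
nrow(xu)  # 3 rows after removing duplicates
```

A distance matrix computed on the unique rows would therefore be 3 x 3 rather than 4 x 4.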

genomelength

For 'chi' and 'chisquare' metrics, numeric value indicating the length of the genome. If not given, the function uses the minimum length necessary to fit the total length of the result.

mc.cores

If mc.cores>1 and the parallel package is loaded, computations are performed in parallel with mc.cores processors when possible.

Value

Object of class distGPS, with matrix of pairwise dissimilarities (distances) between objects.

Methods

signature(x='RangedDataList'): Each element in x is assumed to indicate the binding sites for a different sample, e.g. epigenetic factor. Typically space(x) indicates the chromosome, start(x) the start position and end(x) the end position (in bp). Strand information is ignored.
signature(x='matrix'): Rows in x contain individuals for which we want to compute distances. Columns in x contain the variables, and should only contain either 0's and 1's or FALSE and TRUE.

Details

For RangedDataList objects, distances are defined as follows. Let a1 and a2 be two RangedData objects. Define as n1 the number of a1 intervals overlapping with some interval in a2. Define n2 analogously. The Tanimoto distance between a1 and a2 is defined as (n1+n2)/(nrow(a1)+nrow(a2)). The average distance between a1 and a2 is defined as .5*(n1/nrow(a1) + n2/nrow(a2)). The wtanimoto distance in chroGPS-genes weights each epigenetic factor (table columns) according to its frequency (table rows). The chi-square distance is defined as the usual chi-square distance on a binary matrix B which is automatically computed by distGPS. The binary matrix B is the matrix with length(x) rows and number of columns equal to the genome length, where B[i,j]==1 indicates that element i has a binding site at base pair j. The chi distance is simply defined as the square root of the chi-square distance. Finally, the euclidean and manhattan metrics have the same definition as in the base R function dist.
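The Tanimoto and average formulas above can be worked through on a toy example. The sketch below uses plain data frames with hypothetical start and end columns on a single chromosome instead of RangedData objects, and the helper `count_overlapping` is an illustration, not a chroGPS function.

```r
# Two hypothetical sets of binding sites on one chromosome (positions in bp)
a1 <- data.frame(start = c(1, 50, 200), end = c(20, 80, 250))
a2 <- data.frame(start = c(10, 300),    end = c(30, 400))

# Number of intervals in 'a' that overlap at least one interval in 'b'
count_overlapping <- function(a, b) {
  hits <- vapply(seq_len(nrow(a)), function(i) {
    any(a$start[i] <= b$end & a$end[i] >= b$start)
  }, logical(1))
  sum(hits)
}

n1 <- count_overlapping(a1, a2)  # a1 intervals overlapping some a2 interval
n2 <- count_overlapping(a2, a1)  # a2 intervals overlapping some a1 interval

# Quantities as written in the text
tanimoto    <- (n1 + n2) / (nrow(a1) + nrow(a2))
avg_overlap <- 0.5 * (n1 / nrow(a1) + n2 / nrow(a2))
```

Here only the first interval of each set overlaps the other set, so n1 = n2 = 1, giving 2/5 for the Tanimoto formula and 5/12 for the average formula.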

When choosing a metric one should consider the effect of outliers, i.e. samples with a large distance to all other samples. Tanimoto and Average Distance take values between 0 and 1, so outlying distances have a limited effect. Chi-square and Chi distances are not bounded between 0 and 1, i.e. some distances may be much larger than others. The Chi metric is slightly more robust to outliers than the Chi-square metric.

For matrix or data.frame objects, x must contain only 0's and 1's (or FALSE and TRUE). The usual definitions are used for Tanimoto (which is equivalent to Jaccard's index), Chi-square and Chi. The average overlap between rows i and j is simply the average of the proportion of elements in i also in j and the proportion of elements in j also in i.
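For a binary matrix, Jaccard's index and the average overlap can be computed directly in base R. This is a minimal sketch on a made-up matrix and is not the package's implementation; it also assumes that no row consists only of zeros, since that would cause a division by zero.

```r
# Hypothetical binary matrix: 3 individuals (rows), 5 variables (columns)
x <- rbind(a = c(1, 1, 1, 0, 0),
           b = c(1, 1, 0, 1, 0),
           c = c(0, 0, 0, 1, 1))

shared <- x %*% t(x)   # shared[i, j]: columns equal to 1 in both row i and row j
sizes  <- rowSums(x)   # number of 1's in each row

# Jaccard's index: size of the intersection divided by size of the union
union_size <- outer(sizes, sizes, "+") - shared
jaccard <- shared / union_size

# Average overlap: mean of (proportion of i also in j) and (proportion of j also in i)
prop <- shared / sizes            # prop[i, j] = shared[i, j] / sizes[i]
avg_overlap <- 0.5 * (prop + t(prop))
```

Both quantities lie between 0 and 1, which illustrates why a single unusual row has a limited effect under these metrics.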

Examples

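library(chroGPS)

# Binary matrix with 4 individuals (rows) and 20 variables (columns); rows a and b are identical
x <- rbind(c(rep(0,15),rep(1,5)),c(rep(0,15),rep(1,5)),c(rep(0,19),1),c(rep(1,5),rep(0,15)))
rownames(x) <- letters[1:4]

# Tanimoto distances, with and without removing duplicated rows
d <- distGPS(x,metric='tanimoto')
du <- distGPS(x,metric='tanimoto',uniqueRows=TRUE)
mds1 <- mds(d)
mds1
plot(mds1)

# Chi-square distances on the same data
d <- distGPS(x,metric='chisquare')
mds1 <- mds(d)
mds1
plot(mds1)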
