clusterSites: Assigns CpG cluster memberships on CpG sites within `BSraw` objects

Description

Within a BSraw object, clusterSites searches for agglomerations of CpG sites across all samples. In a first step, the data is reduced to CpG sites covered in at least round(perc.samples*ncol(object)) samples; these are called 'frequently covered CpG sites'. In a second step, regions are detected where at least min.sites frequently covered CpG sites are sufficiently close to each other (max.dist). Note that the frequently covered CpG sites only define the boundaries of the CpG clusters. The subsequent analysis uses the methylation data of all CpG sites within these clusters.

Usage

clusterSites(object, groups, perc.samples, min.sites, max.dist,
mc.cores, ...)

Arguments

object

A BSraw object.

groups

OPTIONAL. A factor specifying two or more sample groups within the given object. See Details.

perc.samples

A numeric between 0 and 1. Passed to filterBySharedRegions.

min.sites

A numeric. Each cluster must comprise at least min.sites CpG sites that are covered in at least perc.samples of the samples; clusters with fewer such sites are dropped.

max.dist

A numeric. Within a cluster, CpG sites that are covered in at least perc.samples of the samples must be no more than max.dist bp apart from their nearest neighbors.

mc.cores

Passed to mclapply. Default is 1.

...

Further arguments passed to the filterBySharedRegions function.

Value

A BSraw object reduced to CpG sites within CpG cluster regions. A cluster.id metadata column on the rowRanges assigns cluster memberships per CpG site.

Details

Three parameters are important: perc.samples, min.sites and max.dist. For example, if perc.samples=0.5, the algorithm detects all CpG sites that are covered in at least 50% of the samples. Those CpG sites are called frequently covered CpG sites. In the next step, the algorithm determines the distances between neighbouring frequently covered CpG sites. When two of them are no more than max.dist base pairs apart, those frequently covered CpG sites and all other, less frequently covered CpG sites in between belong to the same cluster. In the third step, each cluster is checked for the number of frequently covered CpG sites. If this number is less than min.sites, the cluster is discarded.

In other words:

1. The perc.samples parameter defines which CpG sites are frequently covered.
2. The frequently covered CpG sites determine the boundaries of the clusters, depending on their distance to each other.
3. Clusters are discarded if they contain too few frequently covered CpG sites.
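The three steps can be sketched in base R on toy data. This is only an illustration of the logic described above, not the BiSeq implementation: the positions, the coverage matrix, and the integer cluster labels are made up, and a single chromosome is assumed.

```r
# Toy data (all values hypothetical): 8 CpG positions on one chromosome and
# whether each site is covered (TRUE) in each of 4 samples.
pos <- c(100, 120, 150, 400, 420, 900, 930, 960)
covered <- matrix(c(
  TRUE,  TRUE,  TRUE,  TRUE,   # 100
  TRUE,  FALSE, FALSE, FALSE,  # 120, rarely covered
  TRUE,  TRUE,  TRUE,  FALSE,  # 150
  TRUE,  FALSE, FALSE, FALSE,  # 400, rarely covered
  TRUE,  TRUE,  TRUE,  TRUE,   # 420
  TRUE,  TRUE,  TRUE,  TRUE,   # 900
  FALSE, FALSE, TRUE,  FALSE,  # 930, rarely covered
  TRUE,  TRUE,  TRUE,  TRUE    # 960
), ncol = 4, byrow = TRUE)

perc.samples <- 0.5
min.sites    <- 2
max.dist     <- 100

# Step 1: frequently covered sites (covered in at least n.required samples).
n.required <- round(perc.samples * ncol(covered))
frequent   <- rowSums(covered) >= n.required
freq.pos   <- pos[frequent]

# Step 2: a gap larger than max.dist between neighbouring frequently covered
# sites starts a new candidate cluster.
run <- cumsum(c(TRUE, diff(freq.pos) > max.dist))

# Step 3: keep only candidates with at least min.sites frequently covered sites.
keep.runs <- which(tabulate(run) >= min.sites)
starts <- as.vector(tapply(freq.pos, run, min))[keep.runs]
ends   <- as.vector(tapply(freq.pos, run, max))[keep.runs]

# All sites inside the boundaries join the cluster, including rarely covered ones.
cluster.id <- rep(NA_integer_, length(pos))
for (i in seq_along(starts)) {
  cluster.id[pos >= starts[i] & pos <= ends[i]] <- i
}
result <- data.frame(pos = pos, cluster.id = cluster.id)[!is.na(cluster.id), ]
print(result)
```

In this toy run, the site at 420 forms a candidate cluster on its own and is dropped by min.sites, while the rarely covered sites at 120 and 930 are retained because they lie between frequently covered neighbours.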

If the argument groups is given, perc.samples (or no.samples, if passed on to filterBySharedRegions) is applied to each group level.
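A minimal base R sketch of what a per-group threshold could look like follows. It assumes that a site counts as frequently covered only if it meets the threshold in every group level; that reading, the coverage matrix, and the group labels are assumptions made for illustration and are not taken from the BiSeq source.

```r
# Hypothetical coverage: 3 CpG sites (rows) x 6 samples (columns).
covered <- matrix(c(
  TRUE,  TRUE,  TRUE,  TRUE,  TRUE,  FALSE,  # well covered in both groups
  TRUE,  TRUE,  TRUE,  FALSE, FALSE, FALSE,  # covered in group A only
  TRUE,  FALSE, FALSE, TRUE,  TRUE,  TRUE    # sparse in group A
), ncol = 6, byrow = TRUE)
groups <- factor(c("A", "A", "A", "B", "B", "B"))
perc.samples <- 2/3

# Apply the threshold within each group level separately.
frequent.by.group <- sapply(levels(groups), function(g) {
  in.group   <- groups == g
  n.required <- round(perc.samples * sum(in.group))
  rowSums(covered[, in.group, drop = FALSE]) >= n.required
})

# Assumption: a site must pass in all group levels.
frequent <- apply(frequent.by.group, 1, all)

# For comparison: the same threshold applied to all samples pooled.
pooled <- rowSums(covered) >= round(perc.samples * ncol(covered))

print(frequent.by.group)
print(frequent)
print(pooled)
```

Under this assumption, the third site passes the pooled threshold but not the per-group one, because it is covered in only one of the three samples in group A.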

Examples

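library(BiSeq)

# Load the example BSraw object shipped with the package
data(rrbs)
# Cluster CpG sites, applying the coverage threshold per sample group
rrbs.clust <- clusterSites(object = rrbs, groups = colData(rrbs)$group,
                           perc.samples = 4/5, min.sites = 20,
                           max.dist = 100)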