tallyRanges: Tallying function with a `GRanges` interface.

Description

Functions for tallying BAM files in genomic intervals provided as GRanges objects. Special versions of the function exist for writing directly to a tally file and for computation on a cluster.

Usage

tallyRanges(bamfiles, ranges, reference, q = 25, ncycles = 10, max.depth = 1e+06)
tallyRangesToFile(tallyFile, study, bamfiles, ranges, reference, samples = NULL, q = 25, ncycles = 0, max.depth=1e6)
tallyRangesBatch(tallyFile, study, bamfiles, ranges, reference, q = 25, ncycles = 10, max.depth=1e6, regID = "Tally", res = list("ncpus" = 2, "memory" = 24000, "queue"="research-rh6"), written = c(), wrfile = "written.jobs.RDa", waitTime = Inf)

Arguments

bamfiles

Character vector giving the locations of the bam files to be tallied

ranges

A GRanges object describing the ranges in which tallies shall be generated, e.g. the result of a call to binGenome, or a set of exon or gene annotations provided by a TxDb object.

reference

BSgenome object describing the reference genome that the alignments were made against.

samples

The indices (within the HDF5 datasets) corresponding to the samples that the data represents. You can use this option to write sub-sets of samples from a cohort.

q

Read alignment quality cut-off.

ncycles

Number of cycles from the front and back of the reads that should be considered unreliable for mismatch detection

max.depth

Maximum depth of coverage to consider

tallyFile

Filename of the HDF5 tally file that the data shall be written to

study

The location within the HDF5 file that corresponds to the HDF5-group representing the study we are working on.

regID

Identifier for a BatchJobs registry which will be used to store and organise the cluster jobs used for parallelisation of the work.

res

Resource list specifying the compute resources to be requested for each of the cluster jobs.

written

Numerical vector giving the job IDs of jobs whose results have already been written to the tally file. This can be used to resume writing after a crash.

wrfile

Name of a file in which to store the IDs of jobs that have already been written. This can be used to resume writing after a crash.

waitTime

How long the function shall wait for cluster jobs to finish before giving up. The default is to wait forever.
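The res, written and wrfile arguments can be illustrated with a minimal base R sketch. It does not call h5vc or BatchJobs. The resource values, the queue name, the job IDs and the layout of the bookkeeping file (a single vector saved under the name written) are assumptions made for this illustration, not documented behaviour of tallyRangesBatch.

```r
# Hypothetical resource request: the values and the queue name are placeholders
res <- list(ncpus = 4, memory = 16000, queue = "my-queue")

# Hypothetical bookkeeping file: the stored object layout is an assumption
wrfile <- file.path(tempdir(), "written.jobs.RDa")

# Pretend jobs 1 to 3 were written before a crash and record their IDs
written <- c(1, 2, 3)
save(written, file = wrfile)

# On restart, recover the recorded IDs into a separate environment
recovered <- c()
if (file.exists(wrfile)) {
  env <- new.env()
  load(wrfile, envir = env)
  recovered <- env$written
}

# Only jobs that are not yet recorded still need to be written
allJobs <- 1:5
pending <- setdiff(allJobs, recovered)
print(pending)
```

Under these assumptions, a vector such as recovered is the kind of value one would pass as written when restarting, alongside the same wrfile.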

Value

For tallyRanges, the return value is a list of lists: the top level corresponds to the ranges provided as input, and each element is a list of datasets in a compatible format that can be written directly to an HDF5 file using the writeToTallyFile function. The other two functions perform the writing directly and return
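The list-of-lists shape described for tallyRanges can be explored with ordinary list tools. The following sketch uses a mock object built in base R. The element names (range1, datasetA and so on) and the array dimensions are placeholders chosen for illustration and do not reflect the actual dataset names or shapes produced by h5vc.

```r
# Mock return value: one element per range, each holding a list of datasets.
# Names and dimensions are placeholders, not the actual h5vc layout.
mockTally <- list(
  range1 = list(datasetA = array(0L, dim = c(2, 3)),
                datasetB = array(0L, dim = c(3))),
  range2 = list(datasetA = array(0L, dim = c(2, 5)),
                datasetB = array(0L, dim = c(5)))
)

# Top level: one entry per input range
nRanges <- length(mockTally)

# Second level: summarise the dimensions of each dataset for every range
dims <- lapply(mockTally, function(perRange) {
  vapply(perRange, function(ds) paste(dim(ds), collapse = "x"), character(1))
})

print(nRanges)
print(dims)
```

The same lapply pattern should apply to a real result, where str() on a single element is usually the quickest way to see which datasets it contains.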

Details

tallyRanges returns the tallies corresponding to the specified ranges; tallyRangesToFile performs the same task but writes the results directly to the tally file. tallyRangesBatch uses the BatchJobs package to set up cluster jobs for tallying, then collects the results of those jobs and writes them to the tally file. This requires a properly configured cluster, including a .BatchJobs.R configuration file and a template file; see the BatchJobs documentation for details.

Examples

# Load h5vc and the HDF5 interface quietly
suppressPackageStartupMessages(library("h5vc"))
suppressPackageStartupMessages(library("rhdf5"))
# Locate the example BAM files shipped with the h5vcData package
files <- list.files( system.file("extdata", package = "h5vcData"), "Pt.*bam$" )
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
# Load the reference genome and the GRanges infrastructure
suppressPackageStartupMessages(library(BSgenome.Hsapiens.NCBI.GRCh38))
suppressPackageStartupMessages(library(GenomicRanges))
# Read the dnmt3a.txt table, convert it to GRanges and merge overlapping ranges
dnmt3a <- read.table(system.file("extdata", "dnmt3a.txt", package = "h5vcData"), header=TRUE, stringsAsFactors = FALSE)
dnmt3a <- with( dnmt3a, GRanges(seqname, ranges = IRanges(start = start, end = end)))
dnmt3a <- reduce(dnmt3a)
# Register a multicore BiocParallel backend
library(BiocParallel)
register(MulticoreParam())
# Tally the first three ranges and inspect the structure of the result
theData <- tallyRanges( bamFiles, ranges = dnmt3a[1:3], reference = Hsapiens )
str(theData)
