
h5vc (version 2.6.3)

batchTallies: Tallying bam files in parallel using BatchJobs on high performance compute clusters (HPC)

Description

These functions tally a set of bam files in blocks spanning a specified region and write the results to an HDF5 tally file; they use BatchJobs for parallel computation on HPC clusters.

Usage

batchTallyParam(
  bamFiles,
  destination,
  group,
  chrom, start, stop,
  blocksize = 100000,
  registryDir = tempdir(),
  resources = list("queue" = "research-rh6", "memory"="4000", "ncpus"="4", walltime="90:00"),
  q=25, ncycles = 0, max.depth=1000000,
  reference = NULL,
  sleep = 5
)

batchTallies( confList = batchTallyParam() )

rerunBatchTallies( confList, tryCollect = TRUE )

collectTallies(blocks, confList, registries )

Arguments

bamFiles
A character vector of filenames of the bam files that should be tallied. Note that for writing to an HDF5 file the order of this vector must match the order of the Column field in the sampledata object that corresponds to the dataset - see setSampleData for details.
reference
A DNAString object containing the reference sequence corresponding to the region that is to be tallied -- if this is NULL, a consensus vote is used to estimate the reference at each position, which means variants with an allele frequency >= 0.5 can no longer be detected; you should therefore specify the reference, especially when tallying more than one bam file
destination
Filename of the HDF5 tally file that will be written to -- this file must already contain all the required groups and datasets -- see prepareTallyFile for details
group
Location within the tally file where the data will be written -- e.g. "/ExampleStudy/22"
chrom
Chromosome in which to tally
start
First position of the tally
stop
Last position of the tally
q
quality cut-off for considering a base call
ncycles
number of sequencing cycles from the front and back of the read that should be considered unreliable - used for stratifying the nucleotide counts
max.depth
only tally a position if fewer than this many reads overlap it - this can prevent long runtimes in unreliable regions
blocksize
Size, in bases, of the blocks in which tallying is performed; this influences the number of jobs sent to the cluster
registryDir
Directory in which the registries created by BatchJobs will be stored; this can be a temporary directory, since the registries are deleted once tallying is finished
resources
A named list specifying the resource requirements of the cluster jobs; it must contain entries for the fields used in the cluster configuration file -- see the BatchJobs documentation for details and the configuration sketch after this argument list for an example
confList
A configuration list as returned by a call to batchTallyParam()
sleep
Number of seconds to sleep before checking whether blocks are finished; increase this if you have large blocks and find the output of batchTallies too verbose
tryCollect
Boolean flag specifying whether the rerunBatchTallies function should try to collect data from the specified registries before re-submitting.
blocks
A data.frame defining the blocks to tally in, as returned by a call to defineBlocks
registries
A list mapping registry IDs to the work paths of the corresponding registries
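
As an illustration of how these arguments fit together, the sketch below builds a configuration list with an explicit reference sequence and a custom resources specification. The bam file names, tally file, group and queue name are placeholders, and the fields of the resources list must match whatever your BatchJobs cluster configuration expects.

library(h5vc)
library(Biostrings)

## Placeholder inputs -- substitute your own bam files, tally file and group
bamFiles <- c("Sample1.bam", "Sample2.bam")

## Reference sequence for the tallied region (here a dummy poly-A sequence of
## the right length); in practice this would e.g. be extracted from a BSgenome
reference <- DNAString(paste(rep("A", 115259515 - 115247090 + 1), collapse = ""))

conf <- batchTallyParam(
  bamFiles    = bamFiles,
  destination = "my.tally.hfs5",   # must already be set up, see prepareTallyFile
  group       = "/MyStudy/1",
  chrom       = "1",
  start       = 115247090,
  stop        = 115259515,
  blocksize   = 50000,             # smaller blocks mean more (smaller) cluster jobs
  registryDir = tempdir(),
  resources   = list("queue" = "myqueue", "memory" = "8000",
                     "ncpus" = "2", "walltime" = "120:00"),
  reference   = reference,         # avoids the consensus-vote fallback
  sleep       = 10
)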

Value

  • [None] -- the functions print progress messages along the way.

Details

This is a wrapper function for applying tallyBAM to a set of bam files specified in the bamFiles argument. The order of samples along the sample dimension is the same as the order of the file names (i.e. the order of the bamFiles argument). The function uses BatchJobs to dispatch tallying in blocks along the genome to an HPC, collects the results and writes them into the HDF5 tally file specified in the destination parameter.
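
As a small illustration of this ordering requirement, a sample data sheet for two bam files could look like the sketch below; the column names follow the convention used elsewhere in h5vc, but the exact requirements should be checked in the setSampleData documentation. The important point is that the Column values reflect the positions of the files in the bamFiles vector.

## bamFiles <- c("NRAS.AML.bam", "NRAS.Control.bam")
sampleData <- data.frame(
  Sample = c("AML", "Control"),
  Column = c(1, 2),                # 1st and 2nd element of bamFiles
  Type   = c("Case", "Control"),
  stringsAsFactors = FALSE
)
## This data.frame would then be attached to the tally group with setSampleData;
## see ?setSampleData for the exact requirements.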

rerunBatchTallies can be used to re-submit failed blocks.

collectTallies can be used to manually collect tally data from the registries created by batchTallies.
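
A minimal sketch of the recovery workflow, assuming conf is the same configuration list that was passed to the original batchTallies call and that the BatchJobs registries of the failed run are still present under registryDir:

## Re-submit failed blocks; with tryCollect = TRUE, results of blocks that did
## finish are collected from the existing registries before re-submission
rerunBatchTallies(conf, tryCollect = TRUE)

## Manual collection is also possible if the blocks (from defineBlocks) and the
## registry work paths were kept, e.g.:
## collectTallies(blocks, conf, registries)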

Examples

library(h5vc)
files <- c("NRAS.AML.bam","NRAS.Control.bam")
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
chrom <- "1"
startpos <- 115247090
endpos <- 115259515
## destination and group below are placeholders; the tally file must already
## contain the required groups and datasets (see prepareTallyFile)
batchTallies( batchTallyParam(
  bamFiles,
  destination = "NRAS.tally.hfs5",
  group       = "/NRAS/1",
  chrom = chrom, start = startpos, stop = endpos
) )
