
h5vc (version 2.6.3)

batchTallies: Tallying bam files in parallel using BatchJobs on high performance compute clusters (HPC)

Description

These functions tally a set of bam files in blocks spanning a specified region and write the results to an HDF5 tally file; they use BatchJobs for parallel computation on HPC clusters.

Usage

batchTallyParam(
  bamFiles,
  destination,
  group,
  chrom, start, stop,
  blocksize = 100000,
  registryDir = tempdir(),
  resources = list("queue" = "research-rh6", "memory"="4000", "ncpus"="4", walltime="90:00"),
  q=25, ncycles = 0, max.depth=1000000,
  reference = NULL,
  sleep = 5
)

batchTallies( confList = batchTallyParam() )

rerunBatchTallies( confList, tryCollect = TRUE )

collectTallies(blocks, confList, registries )

Arguments

bamFiles
A character vector of filenames of the bam files that should be tallied. Note that for writing to an HDF5 file the order of this vector must match the order of the Column field in the sampledata object that corresponds to the dataset - see setSampleData for details.
reference
A DNAString object containing the reference sequence corresponding to the region that is to be tallied -- if this is NULL, a consensus vote is used to estimate the reference at each position, which means variants with an allele frequency >= 0.5 can no longer be detected; you should therefore specify the reference, especially when tallying more than one bam file
destination
Filename of the HDF5 tally file that will be written to -- this file must already contain all the required groups and datasets -- see prepareTallyFile for details
group
Location within the tally file where the data will be written -- e.g. "/ExampleStudy/22"
chrom
Chromosome in which to tally
start
First position of the tally
stop
Last position of the tally
q
quality cut-off for considering a base call
ncycles
number of sequencing cycles from the front and back of the read that should be considered unreliable - used for stratifying the nucleotide counts
max.depth
only tally a position if fewer than this many reads overlap it - this can prevent long runtimes in unreliable regions
blocksize
Size, in bases, of the blocks in which tallying is performed; this influences the number of jobs sent to the cluster
registryDir
Directory in which the registries created by BatchJobs will be stored; this can be a temporary directory, since the registries are deleted once tallying is finished
resources
A named list specifying the resource requirements of the cluster jobs; it must contain entries for the fields used in the cluster configuration file -- see the BatchJobs documentation for details and the configuration sketch after this argument list for an example
confList
A configuration list as returned by a call to batchTallyParam()
sleep
Number of seconds to sleep before checking whether blocks are finished; increase this if you have large blocks and find the output of batchTallies too verbose
tryCollect
Boolean flag specifying whether the rerunBatchTallies function should try to collect data from the specified registries before re-submitting.
blocks
A data.frame defining the blocks to tally in, as returned by a call to defineBlocks
registries
A list mapping registry IDs to the work paths of the corresponding registries
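
As an illustration of how these arguments fit together, the sketch below builds a configuration list with an explicit reference sequence and a custom resources specification. The bam file names, tally file, group and queue name are placeholders, and the fields of the resources list must match whatever your BatchJobs cluster configuration expects.

library(h5vc)
library(Biostrings)

## Placeholder inputs -- substitute your own bam files, tally file and group
bamFiles <- c("Sample1.bam", "Sample2.bam")

## Reference sequence for the tallied region (here a dummy poly-A sequence of
## the right length); in practice this would e.g. be extracted from a BSgenome
reference <- DNAString(paste(rep("A", 115259515 - 115247090 + 1), collapse = ""))

conf <- batchTallyParam(
  bamFiles    = bamFiles,
  destination = "my.tally.hfs5",   # must already be set up, see prepareTallyFile
  group       = "/MyStudy/1",
  chrom       = "1",
  start       = 115247090,
  stop        = 115259515,
  blocksize   = 50000,             # smaller blocks mean more (smaller) cluster jobs
  registryDir = tempdir(),
  resources   = list("queue" = "myqueue", "memory" = "8000",
                     "ncpus" = "2", "walltime" = "120:00"),
  reference   = reference,         # avoids the consensus-vote fallback
  sleep       = 10
)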

Value

  • [None] -- the functions print progress messages along the way.

Details

This is a wrapper function for applying tallyBAM to a set of bam files specified in the bamFiles argument. The order of samples along the sample dimension is the same as the order of the file names (i.e. the order of the bamFiles argument). The function uses BatchJobs to dispatch tallying in blocks along the genome to an HPC, collects the results and writes them into the HDF5 tally file specified in the destination parameter.
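
As a small illustration of this ordering requirement, a sample data sheet for two bam files could look like the sketch below; the column names follow the convention used elsewhere in h5vc, but the exact requirements should be checked in the setSampleData documentation. The important point is that the Column values reflect the positions of the files in the bamFiles vector.

## bamFiles <- c("NRAS.AML.bam", "NRAS.Control.bam")
sampleData <- data.frame(
  Sample = c("AML", "Control"),
  Column = c(1, 2),                # 1st and 2nd element of bamFiles
  Type   = c("Case", "Control"),
  stringsAsFactors = FALSE
)
## This data.frame would then be attached to the tally group with setSampleData;
## see ?setSampleData for the exact requirements.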

rerunBatchTallies can be used to re-submit failed blocks.

collectTallies can be used to manually collect tally data from the registries created by batchTallies.
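
A minimal sketch of the recovery workflow, assuming conf is the same configuration list that was passed to the original batchTallies call and that the BatchJobs registries of the failed run are still present under registryDir:

## Re-submit failed blocks; with tryCollect = TRUE, results of blocks that did
## finish are collected from the existing registries before re-submission
rerunBatchTallies(conf, tryCollect = TRUE)

## Manual collection is also possible if the blocks (from defineBlocks) and the
## registry work paths were kept, e.g.:
## collectTallies(blocks, conf, registries)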

Examples

library(h5vc)
files <- c("NRAS.AML.bam","NRAS.Control.bam")
bamFiles <- file.path( system.file("extdata", package = "h5vcData"), files)
chrom <- "1"
startpos <- 115247090
endpos <- 115259515
## destination and group below are placeholders; the tally file must already
## contain the required groups and datasets (see prepareTallyFile)
batchTallies( batchTallyParam(
  bamFiles,
  destination = "NRAS.tally.hfs5",
  group       = "/NRAS/1",
  chrom = chrom, start = startpos, stop = endpos
) )
