processPipeline: A pipeline to accurately transform aligned reads to genomic enrichment scores

Description

This function automatizes the processing of chromatin sequecing data analysis by chaining all required steps from the reading of aligned reads to the writing of enrichment scores in WIG/GFF files.

Usage

processPipeline(
        # I/O GENERAL PARAMETERS
        INPUTFilesList,
        resultSubFolder             = "Results_Pasha",
        reportFilesSubFolder        = ifelse(length(resultSubFolder)>1,
                                                resultSubFolder[2],
                                                "ReportFiles"),
        WIGfs                       = TRUE,
        WIGvs                       = FALSE,
        GFF                         = FALSE,
        # COMPLEX PARAMETERS (SINGLE OR VECTORS OR LIST OF IT)
        incrArtefactThrEvery        = 7000000,
        binSize                     = 50,
        elongationSize              = NA,
        rangeSelection              = IRanges(0,-1), 
        annotationFilesGFF          = NA, # GFF files
        annotationGenomeFiles       = NA, # path to file or "mm9", "hg19"... 
        # SINGLE PARAMETERS
        elongationEstimationRange   = c(mini=150, maxi=400, by=10),
        rehabilitationStep          = c("orphans","orphansFromArtefacts"),
        removeChrNamesContaining    = "random|hap",
        ignoreInsertsOver           = 500,
        nbCPUs                      = 1,
        keepTemp                    = TRUE, # Keep the intermediary files
                                            # that led to the final ones
                                            # (rehab and multireads) 
        logTofile                   = "./log.txt", 
        eraseLog                    = FALSE, 
        # LIST PARAMETERS (one element per expName)
        multiLocFilesList           = list()) # A list with experiments
                                              # names and associated
                                              # filenames to treat

Arguments

INPUTFilesList

A named list of list. Each element name will be used as experiment ID to generate output files and messages. Each sublist must provide the following elements as described (see details for more information on individual elements) : 'folderName', '

resultSubFolder

A single character string. Name of the subfolder to be created in the folder where each experiment file is. The results files will be stored in this subfolder

reportFilesSubFolder

A single character string. Name of the subfolder in which log files and QC reports will be stored. This name is used to create a filder under resultSubFolder

WIGfs

A single boolean. If TRUE, a result file in WIG fixed step format (see binSize parameter) will be generated

WIGvs

A single boolean. If TRUE, a result file in WIG variable step format will be generated

GFF

A single boolean. If TRUE, a result file in GFF format (see binSize parameter) will be generated. Typically useless except for compatibility with some peak detection algorithms such as ones used for chIP-on-chip experiments

incrArtefactThrEvery

A complex parameter (see details). A numeric value or NA. A positive numeric value defines a threshold to consider piles like 'artefacts' as 'number of reads in the experiment devided by the value'. A negative value allows to directly set the thr

binSize

A complex parameter (see details). A striclty positive value (>0) defining the size of the steps in the resulting WIG file. When 1 there is one score per genomic coordinate (no binning)

elongationSize

A complex parameter (see details). A numeric value that tells how much each read must be extended to fit the actual insert size. If NA, the elongation estimation module will be used to automatically determine the overrepresented insert size in th

rangeSelection

A complex parameter (see details). An 'IRanges' object that defines the different groups of reads to be made. In case of single-end experiments, these ranges are applied to reads size whereas in case of paired-ends experiments, the groups are mad

annotationFilesGFF

A complex parameter (see details). A named vector of gff file paths. If this argument and annotationGenomeFiles are provided, the pipeline will generate or each rangeSelection (and total) a pdf file summarizing reads occupancy among annotations i

annotationGenomeFiles

A complex parameter (see details). Needed for plotting reads statistics on annotations (see argument 'annotationFilesGFF'). A single path to a genome reference file or the ID of a genome for which a reference is bundled in the package (to see bun

elongationEstimationRange

A numeric vector with 3 named elements : 'mini', 'maxi', 'by'. These values define the range and the granularity that will be used for elongation estimation

rehabilitationStep

A character vector. Can contain 'orphans', 'orphansFromArtefacts'. See 'Rehabilitation steps' in 'details' section

removeChrNamesContaining

A single regular expression. Defines a pattern that will match chromosome (seqnames) names to be removed from the experiment

ignoreInsertsOver

A single strictly positive integer, or NA. In case of paired-ends experiments, one might want to ignore inserts above a certain size (which are probably the result of misalignment)

nbCPUs

A single strictly positive integer.If several cores are available, the program can work on several chromosomes in parallel. This decreases the time needed for processing but uses more memory

keepTemp

A single logical. If TRUE, after the merge, the intermediary files will not be erased. It concerns the wig files of subgroups such as 'orphans', 'orphansFromArtefacts', 'multireads'. If FALSE only the merged result will be kept

logTofile

A single string, or NULL. If the string defines a valid filename, the log messages for all experiments will be written in it. Note that there is a local copy of the log for each experiment in the result folder

eraseLog

A single logical. In its default behavior, the pipeline stops if a log file with the same name already exists. This is a security to avoid deleting previously computed results. One can disable this security by putting this parameter as TRUE

multiLocFilesList

A named (or empty) list. Names of elements must match the experiments names defined in 'INPUTFilesList'. Each element must be the full path to a text file resulting from multiread processing (see multiread specific commands in the package)

Value

A list containing information referring to all processed experiments. Each element contains at least one nested list element called "execTime", which summarize the time spent on important data processing steps (see log file).

describe

filenamethe file name of the wig file (fixed step WIG).
inputFolderthe path to the wig file.
outputFolderthe path to the folder where analysis results must be stored.
repeatMaskerFilePaththe path to the file containing the repeat annotations (Repeat Masker file).
isRegexIf TRUE, the filename parameter is interpreted as a regular expression (LC_SYNTAX) and the script will search for a unique file corresponding to the provided regular expression. If no or several file are found, the scripts ends with error.

itemize

first line the number of chromosomes in the considered genome

item

second line, the list of chromosomes names (separated by spaces)
third line, the list of chromosomes sizes (separated by spaces)

Details

The 'pipeline' covers most of the required steps to convert chromatin aligned reads information to piled-up enrichments scores (ie. WIG files). It makes use of the package functions to :

Read and store in memory dataset(s) provided by the user

Generate eventual subgroups of interest based on the insert size (in case of paired-end experiment) or on the reads size, providing statistics about the regions covered and the number of reads concerned Help to identify and remove artefactual enrichments Estimate in-silico the size of original DNA fragments (inserts, in case of single-end experiments), or plotting the insert size distribution otherwise Extension and piling of reads with several options to obtain a score per genomic coordinates Writing results as WIG variable, WIG fixed-step, or GFF files

Examples

Run this code

# This first (don't run) part of example aims at presenting all
  # available parameters and some variations, see at the bottom of
  # this section for a running example
  
  # Define an experiment description list with a classic epigenetic
  # mark and one MNase experiment that will be seen as nucleosome
  # 'density' or nucleosome 'positionning' (see midpoint parameter)
  
  myExps <- list()
  myExps[["mES_H3K4me3"]] <- list('folderName'="/home/exp",
                                  'fileName'="SRR432543.BAM",
                                  'fileType'="BAM",
                                  'chrPrefix'="chr",
                                  'chrSuffix'="",
                                  'pairedEnds'=FALSE,
                                  'midPoint'=FALSE)
                                             
  myExps[["mES_MNase"]] <- list('folderName'="/home/exp",
                                'fileName'="SPT543426.BAM",
                                'fileType'="BAM", 
                                'chrPrefix'="chr", 
                                'chrSuffix'="", 
                                'pairedEnds'=TRUE, 
                                'midPoint'=FALSE)
                                              
  myExps[["mES_MNase_MIDPOINT"]] <- list('folderName'="/home/exp",
                                         'fileName'="SPT543426.BAM",
                                         'fileType'="BAM", 
                                         'chrPrefix'="chr", 
                                         'chrSuffix'="",
                                         'pairedEnds'=TRUE, 
                                         'midPoint'=TRUE)
  
  # Call the pipeline for the three experiments with basic parameters
      
  processPipeline(
  
  #### I/O GENERAL PARAMETERS
  
  # Experiments description list
  INPUTFilesList              = myExps,
  
  # name of the folder that will contain results
  resultSubFolder             = "Results",
          
  # name of the folder that wil contain logs and figures
  reportFilesSubFolder        = "ReportFiles",
      
  # generate results as WIG fixed steps
  WIGfs                       = TRUE,
               
  # generate results as WIG variable steps
  WIGvs                       = TRUE,
               
  # generate results as GFF files
  GFF                         = FALSE,
              
  #### COMPLEX PARAMETERS (SINGLE OR VECTORS OR LIST OF IT)
  
  # The threshold to detect artefactual piles will be incremented
  # by one every 10Million reads aligned for each experiment
  incrArtefactThrEvery        = 10000000,
           
  # Along the genome one score every 50 basepairs will be computed
  binSize                     = 50,
                 
  # The reads will be extended according to the in-silico estimation
  # algorithm or based on the pairs alignments (insert size) 
  elongationSize              = NA,
   
  # No subgroups selection for specific inserts or reads size
  rangeSelection              = IRanges(0,-1),
  
  # no GFF files given, the module plotting statistics on reads and
  # annotations will not be loaded
  annotationFilesGFF          = NA,
      
  # path to file or "mm9", "hg19"... This argument is needed only if 
  # gff files are specified in 'annotationFilesGFF' argument
  annotationGenomeFiles       = NA,
  
  #### SINGLE PARAMETERS
  
  # For single-end experiments, the fragment size will be estimated
  # between 50 and 400 with a resolution of 10bp
  elongationEstimationRange   = c(mini=50, maxi=400, by=10),
   
  # The pipeline will try to save half-pairs from alignment and the 
  # ones broken during 'artefact' removal
  rehabilitationStep          = c("orphans","orphansFromArtefacts"),
  
  # Remove chromosomes with names containing "random" or "un" 
  removeChrNamesContaining    = "random|un",
        
  # For paired-ends ignore inserts > 500bp according to alignment
  ignoreInsertsOver           = 500,
  
  # Use 1 cpu (recommended as first try to estimate the memory usage)              
  nbCPUs                      = 1,
                  
  # Do not erase pileups and result files from subcategories prior to
  # merging (orphans etc...)
  keepTemp                    = TRUE,
               
  # make a copy of the log for all experiments
  logTofile                   = "./log.txt",
        
  # In case the same computation is restarted, do not warn the user
  # and erase previous results
  eraseLog                    = TRUE,
               
  #### LIST PARAMETERS (one element per expName)
  
  # An eventual list of multiread repartition results
  multiLocFilesList           = "");
                
  ########
  # The four "complex parameters" could have been declared like this
  # to generate more results
  
  # as a vector of values, each value will be used sequentially for
  # all experiments
  # incrArtefactThrEvery <- c(10000000,NA, -10)
  
  # as a list for specifying arguments for individual experiments
  # binSize              <- list("mES_H3K4me3"=200, 
  #                                   "mES_MNase"=50,
                                  "mES_MNase_MIDPOINT"=50)
  
  # mixed, some experiment have one value, others have several
  # elongationSize       <- list("mES_H3K4me3"=c(NA,0),
  #                                   "mES_MNase"=c(146,NA),
  #                                   "mES_MNase_MIDPOINT"=NA) 
                                     
  # Compute without elongating reads (0), a fixed numeric value (not
  # recommended), or estimate in-silico (or based on pairs) the
  # optimal elongation (NA) 
  # rangeSelection <- list("mES_H3K4me3" =IRanges(0,-1), 
  #                        "mES_MNase"=c(IRanges(0,-1),
  #                                      IRanges(0,100), 
  #                                      IRanges(100,1000)),
  #                        "mES_MNase_MIDPOINT"=c(IRanges(0,-1),
  #                                               IRanges(0,100), 
  #                                               IRanges(100,1000)))  


#############################################
#### Actual runnable example on BAM file ####
#############################################

# Define a temporary directory where the example will run
exampleFolder <- tempdir()

# Get the path to the example BAM file and copy it (with the index)
testFileBAM_fileName <- "embedDataTest.bam"
testFileBAM_fullPath <- system.file("extdata",
                                    testFileBAM_fileName,
                                    package="Pasha")
file.copy(testFileBAM_fullPath, exampleFolder)

testFileBAI_fileName <- "embedDataTest.bam.bai"
testFileBAI_fullPath <- system.file("extdata",
                                    testFileBAI_fileName,
                                    package="Pasha")
file.copy(testFileBAI_fullPath, exampleFolder)
# Create the data structure containing information on the experiments
INPUTFilesList <- list()
INPUTFilesList[["testBAM"]] <- list(folderName=exampleFolder, 
                                    fileName=testFileBAM_fileName,
                                    fileType="BAM", 
                                    chrPrefix="chr", 
                                    chrSuffix="", 
                                    pairedEnds=TRUE, 
                                    midPoint=FALSE)
# Start the pipeline using default parameters
processPipeline(INPUTFilesList)

Run the code above in your browser using DataLab