mosaicsRunAll: Analyze ChIP-seq data using the MOSAiCS framework

Description

Construct bin-level ChIP-sep data from aligned read files of ChIP and matched control samples, fit a MOSAiCS model, call peaks, export peak calling results, and generate reports for diagnostics.

Usage

mosaicsRunAll(  chipFile=NULL, chipFileFormat=NULL,  controlFile=NULL, controlFileFormat=NULL,  binfileDir=NULL,  peakFile=NULL, peakFileFormat=NULL, reportSummary=FALSE, summaryFile=NULL,  reportExploratory=FALSE, exploratoryFile=NULL,  reportGOF=FALSE, gofFile=NULL,  PET=FALSE, byChr=FALSE, useChrfile=FALSE, chrfile=NULL, excludeChr=NULL,  FDR=0.05, fragLen=200, binSize=200, capping=0, bgEst="rMOM", d=0.25,  signalModel="BIC", maxgap=200, minsize=50,  thres=10, parallel=FALSE, nCore=8 )

Arguments

chipFile

Name of the aligned read file of ChIP sample to be processed.

chipFileFormat

Format of the aligned read file of ChIP sample to be processed. Currently, mosaicsRunAll permits the following aligned read file formats: "eland_result" (Eland result), "eland_extended" (Eland extended), "eland_export" (Eland export), "bowtie" (default Bowtie), "sam" (SAM), "bed" (BED), and "csem" (CSEM BED). Note that "csem" does not mean CSEM output file format, but CSEM BED file format.

controlFile

Name of the aligned read file of matched control sample to be processed.

controlFileFormat

Format of the aligned read file of matched control sample to be processed. Currently, mosaicsRunAll permits the following aligned read file formats: "eland_result" (Eland result), "eland_extended" (Eland extended), "eland_export" (Eland export), "bowtie" (default Bowtie), "sam" (SAM), "bed" (BED), and "csem" (CSEM BED). Note that "csem" does not mean CSEM output file format, but CSEM BED file format.

binfileDir

Directory to store processed bin-level files.

peakFile

Name of the peak list generated from the analysis.

peakFileFormat

Format of the peak list generated from the analysis. Possible values are "txt", "bed", and "gff".

reportSummary

Report the summary of model fitting and peak calling? Possible values are TRUE (YES) and FALSE (NO). Default is FALSE (NO).

summaryFile

File name of the summary report of model fitting and peak calling. The summary report is a text file.

reportExploratory

Report the exploratory analysis plots? Possible values are TRUE (YES) and FALSE (NO). Default is FALSE (NO).

exploratoryFile

Name of the file for exploratory analysis plots. The exploratory analysis results are exported as a PDF file.

reportGOF

Report the goodness of fit (GOF) plots? Possible values are TRUE (YES) and FALSE (NO). Default is FALSE (NO).

gofFile

Name of the file for goodness of fit (GOF) plots. The GOF plots are exported as a PDF file.

PET

Is the file paired-end tag (PET) data? If PET=FALSE, it is assumed that the file is SET data. If PET=TRUE, it is assumed that the file is PET data. Default is FALSE (SET data).

byChr

Analyze ChIP-seq data for each chromosome separately or analyze it genome-wide? Possible values are TRUE (chromosome-wise) and FALSE (genome-wide). Default is FALSE (genome-wide analysis).

useChrfile

Is the file for chromosome info provided? Possible values are TRUE or FALSE. If useChrfile=FALSE, it is assumed that the file for chromosome info is not provided. If useChrfile=TRUE, it is assumed that the file for chromosome info is provided. Default is FALSE.

chrfile

Name of the file for chromosome info. In this file, the first and second columns are ID and size of each chromosome, respectively.

excludeChr

Vector of chromosomes that will be excluded from the analysis.

FDR

False discovery rate. Default is 0.05.

fragLen

Average fragment length. Default is 200.

binSize

Size of bins. Default is 200.

capping

Maximum number of reads allowed to start at each nucleotide position. To avoid potential PCR amplification artifacts, the maximum number of reads that can start at a nucleotide position is capped at capping. Capping is not applied if non-positive capping is used. Default is 0 (no capping).

bgEst

Parameter to determine background estimation approach. Possible values are "matchLow" (estimation using bins with low tag counts) and "rMOM" (estimation using robust method of moment (MOM)). If bgEst="automatic", this method tries to make the best guess for bgEst, based on the data provided. Default is bgEst="rMOM".

Parameter for estimating background distribution. Default is 0.25.

signalModel

Signal model. Possible values are "BIC" (automatic model selection using BIC), "1S" (one-signal-component model), and "2S" (two-signal-component model). Default is "BIC".

maxgap

Initial nearby peaks are merged if the distance (in bp) between them is less than maxgap. Default is 200.

minsize

An initial peak is removed if its width is narrower than minsize. Default is 50.

thres

A bin within initial peak is removed if its ChIP tag counts are less than thres. Default is 10.

parallel

Utilize multiple CPUs for parallel computing using "parallel" package? Possible values are TRUE (use multiple CPUs) or FALSE (do not use multiple CPUs). Default is FALSE (do not use multiple CPUs).

nCore

Number of maximum number of CPUs used for the analysis. Default is 8.

Value

Processed bin-level files are exported to the directory specified in binfileDir argument. If byChr=FALSE (genome-wide analysis), one bin-level file is generated for each of ChIP and matched control samples, where file names are [chipFile]_fragL[fragLen]_bin[binSize].txt and [controlFile]_fragL[fragLen]_bin[binSize].txt, respectively, for SET data (PET = FALSE). For PET data (PET = TRUE), file names for each of ChIP and matched control samples are [chipFile]_bin[binSize].txt and [controlFile]_bin[binSize].txt, respectively. If byChr=TRUE (chromosome-wise analysis), bin-level files are generated for each chromosome of each of ChIP and matched control samples, where file names are [chipFile]_fragL[fragLen]_bin[binSize]_[chrID].txt and [controlFile]_fragL[fragLen]_bin[binSize]_[chrID].txt, respectively, for SET data (PET = FALSE) ([chrID] is chromosome IDs that reads align to). For PET data (PET = TRUE), file names for each of ChIP and matched control samples are [chipFile]_bin[binSize]_[chrID].txt and [controlFile]_bin[binSize]_[chrID].txt, respectively.The peak list generated from the analysis are exported to the file with the name specified in peakFile. If reportSummary=TRUE, the summary of model fitting and peak calling is exported to the file with the name specified in summaryFile (text file). If reportExploratory=TRUE, the exploratory analysis plots are exported to the file with the name specified in exploratoryFile (PDF file). If reportGOF=TRUE, the goodness of fit (GOF) plots are exported to the file with the name specified in gofFile (PDF file).

Details

This method implements the work flow for the two-sample analysis of ChIP-seq data using the MOSAiCS framework (without using mappability and GC content scores). It imports aligned read files of ChIP and matched control samples, processes them into bin-level files, fits MOSAiCS model, calls peaks, exports the peak lists to text files, and generates reports for diagnostics. This method is a wrapper function of constructBins, readBins, mosaicsFit, mosaicsPeak, export functions, and methods of BinData, MosaicsFit, and MosaicsPeak classes.

See the vignette of the package for the illustration of the work flow and the description of employed methods and their options. Exploratory analysis plots and goodness of fit (GOF) plots are generated using the methods plot of the classes BinData and MosaicsFit, respectively. See the help of constructBins for details of the options PET, chipFileFormat, controlFileFormat, byChr, useChrfile, chrfile, excludeChr, fragLen, binSize, and capping. See the help of mosaicsFit for details of the options bgEst and d. See the help of mosaicsPeak for details of the options FDR, signalModel, maxgap, minsize, and thres. See the help of export for details of the option peakFileFormat.

When the data contains multiple chromosomes, parallel computing can be utilized for faster preprocessing and model fitting if parallel=TRUE and parallel package is loaded. nCore determines number of CPUs used for parallel computing.

References

Kuan, PF, D Chung, G Pan, JA Thomson, R Stewart, and S Keles (2011), "A Statistical Framework for the Analysis of ChIP-Seq Data", Journal of the American Statistical Association, Vol. 106, pp. 891-903.

Chung, D, Zhang Q, and Keles S (2014), "MOSAiCS-HMM: A model-based approach for detecting regions of histone modifications from ChIP-seq data", Datta S and Nettleton D (eds.), Statistical Analysis of Next Generation Sequencing Data, Springer.

Examples

Run this code

## Not run: 
# # minimal input (without any reports for diagnostics)
# 
# mosaicsRunAll( 
#     chipFile = "/scratch/eland/STAT1_eland_results.txt", 
#     chipFileFormat = "eland_result", 
#     controlFile = "/scratch/eland/input_eland_results.txt", 
#     controlFileFormat = "eland_result", 
#     binfileDir = "/scratch/bin/", 
#     peakFile = "/scratch/peak/STAT1_peak_list.bed", 
#     peakFileFormat = "bed" )
#     
# # generate all reports for diagnostics  
# 
# library(parallel)    
# mosaicsRunAll( 
#     chipFile = "/scratch/eland/STAT1_eland_results.txt", 
#     chipFileFormat = "eland_result", 
#     controlFile = "/scratch/eland/input_eland_results.txt", 
#     controlFileFormat = "eland_result", 
#     binfileDir = "/scratch/bin/", 
#     peakFile = "/scratch/peak/STAT1_peak_list.bed", 
#     peakFileFormat = "bed",
#     reportSummary = TRUE, 
#     summaryFile = "/scratch/reports/mosaics_summary.txt", 
#     reportExploratory = TRUE, 
#     exploratoryFile = "/scratch/reports/mosaics_exploratory.pdf", 
#     reportGOF = TRUE, 
#     gofFile = "/scratch/reports/mosaics_GOF.pdf",
#     PET = FALSE, byChr = FALSE, useChrfile=FALSE, chrfile=NULL, excludeChr = "chrM", 
#     FDR = 0.05,  fragLen = 200, capping = 0, bgEst="automatic", thres=10,
#     parallel = TRUE, nCore = 8 )
# ## End(Not run)

Run the code above in your browser using DataLab