Learn R Programming

mosaics (version 2.10.0)

mosaicsRunAll: Analyze ChIP-seq data using the MOSAiCS framework

Description

Construct bin-level ChIP-sep data from aligned read files of ChIP and matched control samples, fit a MOSAiCS model, call peaks, export peak calling results, and generate reports for diagnostics.

Usage

mosaicsRunAll( chipFile=NULL, chipFileFormat=NULL, controlFile=NULL, controlFileFormat=NULL, binfileDir=NULL, peakFile=NULL, peakFileFormat=NULL, reportSummary=FALSE, summaryFile=NULL, reportExploratory=FALSE, exploratoryFile=NULL, reportGOF=FALSE, gofFile=NULL, PET=FALSE, byChr=FALSE, useChrfile=FALSE, chrfile=NULL, excludeChr=NULL, FDR=0.05, fragLen=200, binSize=200, capping=0, bgEst="rMOM", d=0.25, signalModel="BIC", maxgap=200, minsize=50, thres=10, parallel=FALSE, nCore=8 )

Arguments

chipFile
Name of the aligned read file of ChIP sample to be processed.
chipFileFormat
Format of the aligned read file of ChIP sample to be processed. Currently, mosaicsRunAll permits the following aligned read file formats: "eland_result" (Eland result), "eland_extended" (Eland extended), "eland_export" (Eland export), "bowtie" (default Bowtie), "sam" (SAM), "bed" (BED), and "csem" (CSEM BED). Note that "csem" does not mean CSEM output file format, but CSEM BED file format.
controlFile
Name of the aligned read file of matched control sample to be processed.
controlFileFormat
Format of the aligned read file of matched control sample to be processed. Currently, mosaicsRunAll permits the following aligned read file formats: "eland_result" (Eland result), "eland_extended" (Eland extended), "eland_export" (Eland export), "bowtie" (default Bowtie), "sam" (SAM), "bed" (BED), and "csem" (CSEM BED). Note that "csem" does not mean CSEM output file format, but CSEM BED file format.
binfileDir
Directory to store processed bin-level files.
peakFile
Name of the peak list generated from the analysis.
peakFileFormat
Format of the peak list generated from the analysis. Possible values are "txt", "bed", and "gff".
reportSummary
Report the summary of model fitting and peak calling? Possible values are TRUE (YES) and FALSE (NO). Default is FALSE (NO).
summaryFile
File name of the summary report of model fitting and peak calling. The summary report is a text file.
reportExploratory
Report the exploratory analysis plots? Possible values are TRUE (YES) and FALSE (NO). Default is FALSE (NO).
exploratoryFile
Name of the file for exploratory analysis plots. The exploratory analysis results are exported as a PDF file.
reportGOF
Report the goodness of fit (GOF) plots? Possible values are TRUE (YES) and FALSE (NO). Default is FALSE (NO).
gofFile
Name of the file for goodness of fit (GOF) plots. The GOF plots are exported as a PDF file.
PET
Is the file paired-end tag (PET) data? If PET=FALSE, it is assumed that the file is SET data. If PET=TRUE, it is assumed that the file is PET data. Default is FALSE (SET data).
byChr
Analyze ChIP-seq data for each chromosome separately or analyze it genome-wide? Possible values are TRUE (chromosome-wise) and FALSE (genome-wide). Default is FALSE (genome-wide analysis).
useChrfile
Is the file for chromosome info provided? Possible values are TRUE or FALSE. If useChrfile=FALSE, it is assumed that the file for chromosome info is not provided. If useChrfile=TRUE, it is assumed that the file for chromosome info is provided. Default is FALSE.
chrfile
Name of the file for chromosome info. In this file, the first and second columns are ID and size of each chromosome, respectively.
excludeChr
Vector of chromosomes that will be excluded from the analysis.
FDR
False discovery rate. Default is 0.05.
fragLen
Average fragment length. Default is 200.
binSize
Size of bins. Default is 200.
capping
Maximum number of reads allowed to start at each nucleotide position. To avoid potential PCR amplification artifacts, the maximum number of reads that can start at a nucleotide position is capped at capping. Capping is not applied if non-positive capping is used. Default is 0 (no capping).
bgEst
Parameter to determine background estimation approach. Possible values are "matchLow" (estimation using bins with low tag counts) and "rMOM" (estimation using robust method of moment (MOM)). If bgEst="automatic", this method tries to make the best guess for bgEst, based on the data provided. Default is bgEst="rMOM".
d
Parameter for estimating background distribution. Default is 0.25.
signalModel
Signal model. Possible values are "BIC" (automatic model selection using BIC), "1S" (one-signal-component model), and "2S" (two-signal-component model). Default is "BIC".
maxgap
Initial nearby peaks are merged if the distance (in bp) between them is less than maxgap. Default is 200.
minsize
An initial peak is removed if its width is narrower than minsize. Default is 50.
thres
A bin within initial peak is removed if its ChIP tag counts are less than thres. Default is 10.
parallel
Utilize multiple CPUs for parallel computing using "parallel" package? Possible values are TRUE (use multiple CPUs) or FALSE (do not use multiple CPUs). Default is FALSE (do not use multiple CPUs).
nCore
Number of maximum number of CPUs used for the analysis. Default is 8.

Value

Processed bin-level files are exported to the directory specified in binfileDir argument. If byChr=FALSE (genome-wide analysis), one bin-level file is generated for each of ChIP and matched control samples, where file names are [chipFile]_fragL[fragLen]_bin[binSize].txt and [controlFile]_fragL[fragLen]_bin[binSize].txt, respectively, for SET data (PET = FALSE). For PET data (PET = TRUE), file names for each of ChIP and matched control samples are [chipFile]_bin[binSize].txt and [controlFile]_bin[binSize].txt, respectively. If byChr=TRUE (chromosome-wise analysis), bin-level files are generated for each chromosome of each of ChIP and matched control samples, where file names are [chipFile]_fragL[fragLen]_bin[binSize]_[chrID].txt and [controlFile]_fragL[fragLen]_bin[binSize]_[chrID].txt, respectively, for SET data (PET = FALSE) ([chrID] is chromosome IDs that reads align to). For PET data (PET = TRUE), file names for each of ChIP and matched control samples are [chipFile]_bin[binSize]_[chrID].txt and [controlFile]_bin[binSize]_[chrID].txt, respectively.The peak list generated from the analysis are exported to the file with the name specified in peakFile. If reportSummary=TRUE, the summary of model fitting and peak calling is exported to the file with the name specified in summaryFile (text file). If reportExploratory=TRUE, the exploratory analysis plots are exported to the file with the name specified in exploratoryFile (PDF file). If reportGOF=TRUE, the goodness of fit (GOF) plots are exported to the file with the name specified in gofFile (PDF file).

Details

This method implements the work flow for the two-sample analysis of ChIP-seq data using the MOSAiCS framework (without using mappability and GC content scores). It imports aligned read files of ChIP and matched control samples, processes them into bin-level files, fits MOSAiCS model, calls peaks, exports the peak lists to text files, and generates reports for diagnostics. This method is a wrapper function of constructBins, readBins, mosaicsFit, mosaicsPeak, export functions, and methods of BinData, MosaicsFit, and MosaicsPeak classes.

See the vignette of the package for the illustration of the work flow and the description of employed methods and their options. Exploratory analysis plots and goodness of fit (GOF) plots are generated using the methods plot of the classes BinData and MosaicsFit, respectively. See the help of constructBins for details of the options PET, chipFileFormat, controlFileFormat, byChr, useChrfile, chrfile, excludeChr, fragLen, binSize, and capping. See the help of mosaicsFit for details of the options bgEst and d. See the help of mosaicsPeak for details of the options FDR, signalModel, maxgap, minsize, and thres. See the help of export for details of the option peakFileFormat.

When the data contains multiple chromosomes, parallel computing can be utilized for faster preprocessing and model fitting if parallel=TRUE and parallel package is loaded. nCore determines number of CPUs used for parallel computing.

References

Kuan, PF, D Chung, G Pan, JA Thomson, R Stewart, and S Keles (2011), "A Statistical Framework for the Analysis of ChIP-Seq Data", Journal of the American Statistical Association, Vol. 106, pp. 891-903.

Chung, D, Zhang Q, and Keles S (2014), "MOSAiCS-HMM: A model-based approach for detecting regions of histone modifications from ChIP-seq data", Datta S and Nettleton D (eds.), Statistical Analysis of Next Generation Sequencing Data, Springer.

See Also

constructBins, readBins, mosaicsFit, mosaicsPeak, export, BinData, MosaicsFit, MosaicsPeak.

Examples

Run this code
## Not run: 
# # minimal input (without any reports for diagnostics)
# 
# mosaicsRunAll( 
#     chipFile = "/scratch/eland/STAT1_eland_results.txt", 
#     chipFileFormat = "eland_result", 
#     controlFile = "/scratch/eland/input_eland_results.txt", 
#     controlFileFormat = "eland_result", 
#     binfileDir = "/scratch/bin/", 
#     peakFile = "/scratch/peak/STAT1_peak_list.bed", 
#     peakFileFormat = "bed" )
#     
# # generate all reports for diagnostics  
# 
# library(parallel)    
# mosaicsRunAll( 
#     chipFile = "/scratch/eland/STAT1_eland_results.txt", 
#     chipFileFormat = "eland_result", 
#     controlFile = "/scratch/eland/input_eland_results.txt", 
#     controlFileFormat = "eland_result", 
#     binfileDir = "/scratch/bin/", 
#     peakFile = "/scratch/peak/STAT1_peak_list.bed", 
#     peakFileFormat = "bed",
#     reportSummary = TRUE, 
#     summaryFile = "/scratch/reports/mosaics_summary.txt", 
#     reportExploratory = TRUE, 
#     exploratoryFile = "/scratch/reports/mosaics_exploratory.pdf", 
#     reportGOF = TRUE, 
#     gofFile = "/scratch/reports/mosaics_GOF.pdf",
#     PET = FALSE, byChr = FALSE, useChrfile=FALSE, chrfile=NULL, excludeChr = "chrM", 
#     FDR = 0.05,  fragLen = 200, capping = 0, bgEst="automatic", thres=10,
#     parallel = TRUE, nCore = 8 )
# ## End(Not run)

Run the code above in your browser using DataLab