GUIDEseqAnalysis: Analysis pipeline for GUIDE-seq dataset

Description

A wrapper function that uses the UMI sequence plus the first few bases of each R1 read to estimate the starting sequence library, piles up reads with a user-defined window and step size, identifies the insertion sites (a proxy for cleavage sites), merges insertion sites from the plus and minus strands, and then performs off-target analysis of extended regions around the identified insertion sites.

Usage

GUIDEseqAnalysis(alignment.inputfile, umi.inputfile,
    alignment.format = c("auto", "bam", "bed"),
    umi.header = FALSE, read.ID.col = 1, umi.col = 2, umi.sep = "\t",
    BSgenomeName, gRNA.file, outputDir, n.cores.max = 6,
    keep.R1only = TRUE, keep.R2only = TRUE, concordant.strand = TRUE,
    max.paired.distance = 1000, min.mapping.quality = 30,
    max.R1.len = 130, max.R2.len = 130, apply.both.max.len = FALSE,
    same.chromosome = TRUE, distance.inter.chrom = -1,
    min.R1.mapped = 20, min.R2.mapped = 20, apply.both.min.mapped = FALSE,
    max.duplicate.distance = 0,
    umi.plus.R1start.unique = TRUE, umi.plus.R2start.unique = TRUE,
    window.size = 20L, step = 20L, bg.window.size = 5000L,
    min.reads = 5L, min.reads.per.lib = 1L, min.SNratio = 2, maxP = 0.05,
    stats = c("poisson", "nbinom"),
    p.adjust.methods = c("none", "BH", "holm", "hochberg", "hommel",
        "bonferroni", "BY", "fdr"),
    distance.threshold = 40L, max.overlap.plusSig.minusSig = 10L,
    plus.strand.start.gt.minus.strand.end = TRUE,
    gRNA.format = "fasta", overlap.gRNA.positions = c(17, 18),
    upstream = 50, downstream = 50, PAM.size = 3, gRNA.size = 20,
    PAM = "NGG", PAM.pattern = "(NAG|NGG|NGA)$",
    max.mismatch = 6, allowed.mismatch.PAM = 2, overwrite = TRUE,
    weights = c(0, 0, 0.014, 0, 0, 0.395, 0.317, 0, 0.389, 0.079, 0.445,
        0.508, 0.613, 0.851, 0.732, 0.828, 0.615, 0.804, 0.685, 0.583),
    orderOfftargetsBy = c("predicted_cleavage_score", "n.mismatch"),
    descending = c(TRUE, FALSE), keepTopOfftargetsOnly = TRUE)

Arguments

alignment.inputfile

The alignment file. Currently, bam files and bed files with CIGAR information are supported. We suggest running the workflow binReads.sh, which sequentially runs barcode binning, adaptor removal, alignment to the genome, alignment quality filtering, and bed file conversion. Please download the workflow function and its helper scripts at http://mccb.umassmed.edu/GUIDE-seq/binReads/

umi.inputfile

A text file containing at least two columns: one is the read identifier and the other is the UMI, or the UMI plus the first few bases of the R1 reads. We suggest using getUMI.sh to generate this file. Please download the script and its helper scripts at http://mccb.umassmed.edu/GUIDE-seq/getUMI/

alignment.format

The format of the alignment input file, one of auto, bam, or bed. Default auto. A bed file in the expected format is generated by binReads.sh

umi.header

Indicates whether the umi input file contains a header line or not. Default to FALSE

read.ID.col

The index of the column containing the read identifier in the umi input file, default to 1

umi.col

The index of the column containing the umi or umi plus the first few bases of sequence from the R1 reads, default to 2

umi.sep

column separator in the umi input file, default to tab
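
As a rough illustration of how umi.header, read.ID.col, umi.col, and umi.sep fit together, the sketch below writes a small two-column file and reads it back with base R. The file contents and read names are made up, and this is not necessarily how GUIDEseqAnalysis parses the file internally.

```r
# Hypothetical UMI file: read identifier, then UMI (tab-separated, no header)
umi.path <- tempfile(fileext = ".txt")
writeLines(c("read1\tACGTACGT", "read2\tTTGACCAA", "read3\tACGTACGT"), umi.path)

# Values mirroring the documented defaults
umi.header  <- FALSE
read.ID.col <- 1
umi.col     <- 2
umi.sep     <- "\t"

# Read the file, then pick out the two columns of interest by index
umi.table <- read.table(umi.path, header = umi.header, sep = umi.sep,
                        stringsAsFactors = FALSE)
umi <- data.frame(readID = umi.table[[read.ID.col]],
                  UMI    = umi.table[[umi.col]],
                  stringsAsFactors = FALSE)
print(umi)
```

If the file has extra columns or a different column order, only read.ID.col and umi.col need to change.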

BSgenomeName

BSgenome object. Please refer to available.genomes in BSgenome package. For example, BSgenome.Hsapiens.UCSC.hg19 for hg19, BSgenome.Mmusculus.UCSC.mm10 for mm10, BSgenome.Celegans.UCSC.ce6 for ce6, BSgenome.Rnorvegicus.UCSC.rn5 for rn5, BSgenome.Drerio.UCSC.danRer7 for Zv9, and BSgenome.Dmelanogaster.UCSC.dm3 for dm3

gRNA.file

gRNA input file path or a DNAStringSet object that contains gRNA plus PAM sequences used for genome editing

outputDir

the directory where the off target analysis and reports will be written to

n.cores.max

The maximum number of cores to use in multicore mode, i.e., parallel processing. Default 6. Please set it to 1 to disable multicore processing for small datasets.

keep.R1only

Specify whether to include alignment with only R1 without paired R2. Default TRUE

keep.R2only

Specify whether to include alignment with only R2 without paired R1. Default TRUE

concordant.strand

Specify whether R1 and R2 should align to the same strand or to opposite strands. Default TRUE, which requires opposite strands

max.paired.distance

Specify the maximum distance allowed between paired R1 and R2 reads. Default 1000 bp

min.mapping.quality

Specify the minimum mapping quality of acceptable alignments, default 30

max.R1.len

The maximum retained R1 length to be considered for downstream analysis, default 130 bp. Please note that the default of 130 works well when the read length is 150 bp. Please also note that the retained R1 length is not necessarily equal to the mapped R1 length

max.R2.len

The maximum retained R2 length to be considered for downstream analysis, default 130 bp. Please note that the default of 130 works well when the read length is 150 bp. Please also note that the retained R2 length is not necessarily equal to the mapped R2 length

apply.both.max.len

Specify whether to apply maximum length requirement to both R1 and R2 reads, default FALSE

same.chromosome

Specify whether the paired reads are required to align to the same chromosome, default TRUE

distance.inter.chrom

Specify the distance value to assign to the paired reads that are aligned to different chromosome, default -1

min.R1.mapped

The minimum mapped R1 length to be considered for downstream analysis, default 20 bp.

min.R2.mapped

The minimum mapped R2 length to be considered for downstream analysis, default 20 bp.

apply.both.min.mapped

Specify whether to apply minimum mapped length requirement to both R1 and R2 reads, default FALSE

max.duplicate.distance

Specify the maximum distance apart for two reads to be considered as duplicates, default 0. Currently only 0 is supported

umi.plus.R1start.unique

Specify whether two mapped reads are considered unique if both contain the same UMI and the same alignment start for the R1 read, default TRUE.

umi.plus.R2start.unique

Specify whether two mapped reads are considered unique if both contain the same UMI and the same alignment start for the R2 read, default TRUE.
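
One plausible reading of the two umi.plus.*start.unique arguments is that reads sharing a UMI and an alignment start are collapsed into a single event. The sketch below illustrates that reading with base R on made-up data; the column names are hypothetical and the package's internal logic may differ.

```r
# Made-up alignments: UMI plus R1 and R2 alignment starts
reads <- data.frame(
    readID  = c("read1", "read2", "read3", "read4"),
    UMI     = c("ACGTACGT", "ACGTACGT", "ACGTACGT", "TTGACCAA"),
    R1start = c(1000L, 1000L, 1005L, 1000L),
    R2start = c(1200L, 1200L, 1200L, 1200L),
    stringsAsFactors = FALSE
)

# Assumption: reads with the same UMI, R1 start and R2 start are duplicates;
# keep only the first occurrence of each combination
is.dup <- duplicated(reads[, c("UMI", "R1start", "R2start")])
unique.reads <- reads[!is.dup, ]
print(unique.reads$readID)
```

Here read2 repeats the UMI and both starts of read1, so it is dropped, while read3 (different R1 start) and read4 (different UMI) are kept.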

window.size

window size to calculate coverage

step

step size to calculate coverage

bg.window.size

window size to calculate local background
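
To make the roles of window.size and step concrete, here is a minimal base R sketch that counts made-up insertion positions in fixed windows. It assumes windows start at position 1 and advance by step, which may not match the package's own implementation.

```r
# Made-up insertion positions on one strand of one chromosome
positions <- c(103L, 105L, 108L, 118L, 131L, 133L, 190L)
window.size <- 20L
step <- 20L

# Window starts from 1 up to the last observed position, advancing by `step`
starts <- seq(1L, max(positions), by = step)
ends <- starts + window.size - 1L

# Number of positions falling inside each window
counts <- vapply(seq_along(starts), function(i) {
    sum(positions >= starts[i] & positions <= ends[i])
}, integer(1))

print(data.frame(start = starts, end = ends, count = counts))
```

With step equal to window.size the windows tile the region without overlap; a smaller step would make consecutive windows overlap.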

min.reads

minimum number of reads to be considered as a peak

min.reads.per.lib

minimum number of reads in each library (usually two libraries) to be considered as a peak

min.SNratio

minimum signal noise ratio, which is the coverage normalized by local background

maxP

Maximum p-value to be considered as significant

stats

Statistical test, default poisson

p.adjust.methods

Adjustment method for multiple comparisons, default none
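
The filtering arguments above can be pictured with a small base R sketch. The counts and background values are made up, and the exact test, the way the local background is estimated, and whether the raw or adjusted p-value is thresholded are assumptions here rather than a description of the package internals.

```r
count <- c(120, 6, 15)     # reads in each candidate window (made up)
bg    <- c(3, 4, 1)        # expected reads from the local background (made up)

min.SNratio <- 2
maxP <- 0.05

# Signal-to-noise ratio: coverage normalized by local background
SNratio <- count / bg

# Upper-tail Poisson probability of observing at least `count` reads
p.value <- ppois(count - 1, lambda = bg, lower.tail = FALSE)

# Multiple-testing adjustment; "none" would leave p.value unchanged
adjusted.p <- p.adjust(p.value, method = "BH")

# Assumption: a window is kept if it passes both the ratio and the p-value cutoff
keep <- SNratio >= min.SNratio & p.value <= maxP
print(data.frame(count, bg, SNratio, p.value, adjusted.p, keep))
```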

distance.threshold

Specify the maximum gap allowed between the plus strand peak and the minus strand peak, default 40. We suggest setting it to twice the window.size used for peak calling.

max.overlap.plusSig.minusSig

Specify the maximum overlap (cushion distance) between the plus strand peak and the minus strand peak. Default 10L, to allow for sequencing errors and imprecise integration. Only applicable if plus.strand.start.gt.minus.strand.end is set to TRUE.

plus.strand.start.gt.minus.strand.end

Specify whether the plus strand peak start is required to be greater than the paired minus strand peak end. Default TRUE
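
The three peak-pairing arguments can be read as constraints on the gap between a plus strand peak start and a minus strand peak end. The following base R sketch shows one such interpretation on made-up coordinates; it is an assumption about how the constraints combine, not the package's actual merging code.

```r
# Made-up peaks on the same chromosome
plus  <- data.frame(start = c(1050L, 5000L), end = c(1069L, 5019L))
minus <- data.frame(start = c(1020L, 4000L), end = c(1045L, 4019L))

distance.threshold <- 40L
max.overlap.plusSig.minusSig <- 10L

# All plus/minus combinations, by row index
idx <- expand.grid(p = seq_len(nrow(plus)), m = seq_len(nrow(minus)))

# Gap between plus peak start and minus peak end (negative means overlap)
gap <- plus$start[idx$p] - minus$end[idx$m]

# Assumption: accept pairs whose gap is within the threshold and whose
# overlap does not exceed the cushion distance
ok <- gap <= distance.threshold & gap >= -max.overlap.plusSig.minusSig
paired <- cbind(idx[ok, ], gap = gap[ok])
print(paired)
```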

gRNA.format

Format of the gRNA input file. Currently, fasta is supported

PAM.size

PAM length, default 3

gRNA.size

The size of the gRNA, default 20

PAM

PAM sequence after the gRNA, default NGG

overlap.gRNA.positions

The required overlap positions of gRNA and restriction enzyme cut site, default 17 and 18 for SpCas9.

max.mismatch

Maximum mismatch allowed in off target search, default 6

PAM.pattern

Regular expression of protospacer-adjacent motif (PAM), default (NAG|NGG|NGA)$ for off target search

allowed.mismatch.PAM

Number of degenerate bases in the PAM sequence, default 2 for N[A|G]G PAM
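
To show what the default PAM.pattern accepts, the sketch below expands the IUPAC code N to any base and tests a few made-up 23-mers with base R. Expanding N by hand is an assumption made for this illustration; how the package translates the pattern internally is not documented here.

```r
PAM.pattern <- "(NAG|NGG|NGA)$"

# Assumption: N stands for any of A, C, G, T
pam.regex <- gsub("N", "[ACGT]", PAM.pattern, fixed = TRUE)

# Made-up 20-nt protospacer followed by three different 3-nt PAM candidates
candidates <- c("GACCCCCTCCACCCCGCCTCCGG",   # ends in CGG
                "GACCCCCTCCACCCCGCCTCTAG",   # ends in TAG
                "GACCCCCTCCACCCCGCCTCTTT")   # ends in TTT

matches <- grepl(pam.regex, candidates)
print(matches)
```

The first two sequences end in an NGG or NAG motif and match; the third does not.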

upstream

upstream offset from the peak start to search for off targets, default 50

downstream

downstream offset from the peak end to search for off targets, default 50

overwrite

overwrite the existing files in the output directory or not, default TRUE

weights

a numeric vector whose length equals the gRNA length, default c(0, 0, 0.014, 0, 0, 0.395, 0.317, 0, 0.389, 0.079, 0.445, 0.508, 0.613, 0.851, 0.732, 0.828, 0.615, 0.804, 0.685, 0.583) for the SpCas9 system, as used in Hsu et al., 2013. Please make sure that the number of elements in this vector is the same as gRNA.size, e.g., pad 0s at the beginning of the vector.
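
The padding advice can be sketched in a few lines of base R. The gRNA.size of 22 is a hypothetical value chosen only to show the padding.

```r
weights <- c(0, 0, 0.014, 0, 0, 0.395, 0.317, 0, 0.389, 0.079, 0.445,
             0.508, 0.613, 0.851, 0.732, 0.828, 0.615, 0.804, 0.685, 0.583)
gRNA.size <- 22   # hypothetical guide longer than the default 20

# Pad zeros at the beginning so the vector length matches gRNA.size
if (length(weights) < gRNA.size) {
    weights <- c(rep(0, gRNA.size - length(weights)), weights)
}

stopifnot(length(weights) == gRNA.size)
print(weights)
```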

orderOfftargetsBy

Criteria to order the offtargets by. By default, order by predicted_cleavage_score (descending order) followed by n.mismatch (ascending order). Users can change the order of these two criteria and change descending accordingly.

descending

Whether to sort each criterion in descending or ascending order. By default, predicted cleavage score is sorted in descending order and number of mismatches in ascending order. When altering orderOfftargetsBy, please also modify descending accordingly.
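
As an illustration of how orderOfftargetsBy and descending pair up, the base R sketch below sorts a made-up table by score (descending) and then by mismatch count (ascending). The data frame and its name column are hypothetical; only the two sort columns come from the argument defaults.

```r
# Made-up off-target table
offtargets <- data.frame(
    name = c("a", "b", "c", "d"),
    predicted_cleavage_score = c(10, 50, 50, 5),
    n.mismatch = c(3, 4, 2, 6),
    stringsAsFactors = FALSE
)

orderOfftargetsBy <- c("predicted_cleavage_score", "n.mismatch")
descending <- c(TRUE, FALSE)

# Negate numeric columns that should sort in descending order
keys <- Map(function(col, desc) {
    if (desc) -offtargets[[col]] else offtargets[[col]]
}, orderOfftargetsBy, descending)

sorted <- offtargets[do.call(order, unname(keys)), ]
print(sorted)
```

Rows b and c tie on score, so the lower mismatch count places c first.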

keepTopOfftargetsOnly

Specify whether to output all offtargets or only the top offtarget according to the orderOfftargetsBy criteria. Default TRUE (top offtarget only)

Value

offTargets: a data frame, containing all input peaks with potential gRNA binding sites, mismatch number and positions, alignment to the input gRNA and predicted cleavage score.
merged.peaks: merged peaks as GRanges
peaks: GRanges with count (peak height), bg (local background), SNratio (signal noise ratio), p-value, and optionally adjusted p-value
uniqueCleavages: Cleavage sites with one site per UMI as GRanges with metadata column total set to 1 for each range
read.summary: One table per input mapping file that contains the number of reads for each chromosome location

References

Shengdar Q Tsai and J Keith Joung et al. GUIDE-seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases. Nature Biotechnology 33, 187-197 (2015)

Examples

if(interactive())
    {
        library("BSgenome.Hsapiens.UCSC.hg19")
        umiFile <- system.file("extdata", "UMI-HEK293_site4_R1.txt",
           package = "GUIDEseq")
        alignFile <- system.file("extdata","bowtie2.HEK293_site4.sort.bed" ,
            package = "GUIDEseq")
        gRNA.file <- system.file("extdata","gRNA.fa", package = "GUIDEseq")
        guideSeqRes <- GUIDEseqAnalysis(
            alignment.inputfile = alignFile,
            umi.inputfile = umiFile, gRNA.file = gRNA.file,
            BSgenomeName = Hsapiens, min.reads = 80, n.cores.max = 1)
        names(guideSeqRes)
   }