cleanUpdTSeq-package: This package classifies putative polyadenylation sites.

Description

The 3' ends of transcripts have generally been poorly annotated. With the advent of deep sequencing, many methods have been developed to identify 3' ends. Most of these methods use an oligo(dT) primer, which can bind to internal adenine-rich sequences and lead to artifactual identification of polyadenylation sites. Heuristic filtering methods rely on a certain number of As downstream of a putative polyadenylation site to classify the site as true or oligo(dT)-primed. This package provides a robust method to classify putative polyadenylation sites using a Naive Bayes classifier.
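The heuristic filtering mentioned above can be sketched in a few lines of base R. This is only an illustration of the idea, not code from cleanUpdTSeq: the helper names `countA` and `isLikelyInternalPriming`, the threshold `min.a = 6`, and the two example sequences are all made up for this sketch.

```r
# Hypothetical helpers (not part of cleanUpdTSeq): a minimal sketch of an
# A-count heuristic applied to the sequence downstream of a putative pA site.

# Count the adenines in each sequence of a character vector.
countA <- function(seqs) {
  seqs <- toupper(seqs)
  vapply(strsplit(seqs, "", fixed = TRUE),
         function(x) sum(x == "A"),
         integer(1))
}

# Flag a site as likely oligo(dT)-primed when its downstream window holds
# at least `min.a` adenines. The default threshold is an arbitrary assumption.
isLikelyInternalPriming <- function(downstream.seqs, min.a = 6L) {
  countA(downstream.seqs) >= min.a
}

# Made-up downstream sequences, 20 nt each.
downstream <- c(site1 = "AAAAAAGAAAAACGTAAAAA",
                site2 = "GTCTGCTTGACCTGTGATCC")

a.counts <- countA(downstream)
flagged <- isLikelyInternalPriming(downstream)
```

A fixed threshold like this depends on a single hand-picked number, which is the limitation the Naive Bayes classifier in this package is meant to address.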

Details

Package: cleanUpdTSeq

Type: Package

Version: 1.0

Date: 2013-07-22

License: GPL-2

References

1. Meyer, D., et al., e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. 2012.

2. Pages, H., BSgenome: Infrastructure for Biostrings-based genome data packages.

3. Sheppard, S., Lawson, N.D., and Zhu, L.J., Accurate identification of polyadenylation sites from 3' end deep sequencing using a naïve Bayes classifier. Bioinformatics, 2013. Under revision.

Examples

# First install the package using the following command:
#   biocLite("cleanUpdTSeq")
if (interactive()) {
    library(cleanUpdTSeq)

    # Read in a test set
    testFile <- system.file("extdata", "test.bed", package = "cleanUpdTSeq")
    testSet <- read.table(testFile, sep = "\t", header = TRUE)

    # Option 1: convert the test set to GRanges with upstream and
    # downstream sequence information (taken from columns 7 and 8)
    peaks <- BED2GRangesSeq(testSet, upstream.seq.ind = 7,
                            downstream.seq.ind = 8, withSeq = TRUE)
    # Build the feature vector for the test set with sequence information
    testSet.NaiveBayes <- buildFeatureVector(peaks, BSgenomeName = Drerio,
        upstream = 40, downstream = 30, wordSize = 6, alphabet = c("ACGT"),
        sampleType = "unknown", replaceNAdistance = 30,
        method = "NaiveBayes", ZeroBasedIndex = 1, fetchSeq = FALSE)

    # Option 2: convert the test set to GRanges without upstream and
    # downstream sequence information
    peaks <- BED2GRangesSeq(testSet, withSeq = FALSE)
    # Build the feature vector for the test set without sequence information;
    # this replaces the feature vector built under option 1
    testSet.NaiveBayes <- buildFeatureVector(peaks, BSgenomeName = Drerio,
        upstream = 40, downstream = 30, wordSize = 6, alphabet = c("ACGT"),
        sampleType = "unknown", replaceNAdistance = 30,
        method = "NaiveBayes", ZeroBasedIndex = 1, fetchSeq = TRUE)

    # Predict the test set using the bundled training data
    data(data.NaiveBayes)
    predictTestSet(data.NaiveBayes$Negative, data.NaiveBayes$Positive,
                   testSet.NaiveBayes,
                   outputFile = "test-predNaiveBayes.tsv",
                   assignmentCutoff = 0.5)
}
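The `assignmentCutoff = 0.5` argument in the example can be read as a threshold on the classifier's probability for the true-site class. The base R sketch below illustrates that reading on made-up numbers. It is an assumption about how the cutoff behaves, not the package's implementation: the variable names and probabilities are invented, and how a value exactly equal to the cutoff is treated is guessed here.

```r
# Illustration only: made-up probabilities that each peak is a true pA site.
prob.true <- c(peak1 = 0.93, peak2 = 0.50, peak3 = 0.08)

# Hypothetical cutoff, mirroring assignmentCutoff = 0.5 in the example.
assignment.cutoff <- 0.5

# Assumption: a peak is assigned to the true class (1) only when its
# probability is strictly greater than the cutoff, otherwise to false (0).
pred.class <- ifelse(prob.true > assignment.cutoff, 1L, 0L)
```

Raising the cutoff makes the call more conservative: fewer sites are labelled true, at the cost of discarding some genuine polyadenylation sites.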