polyester (version 1.6.0)

generate_fragments: generate a set of fragments from a set of transcripts

Description

Convert each sequence in a DNAStringSet to a "fragment" (a randomly selected subsequence).

Usage

generate_fragments(tObj, fraglen = 250, fragsd = 25, readlen = 100, distr = "normal", custdens = NULL, bias = "none")

Arguments

tObj
DNAStringSet of sequences from which fragments should be extracted
fraglen
Mean fragment length, if drawing fragment lengths from a normal distribution.
fragsd
Standard deviation of fragment lengths, if drawing lengths from a normal distribution. Note: fraglen and fragsd are ignored unless distr is 'normal'.
readlen
Read length. Default 100. Used only to label read positions.
distr
One of 'normal', 'empirical', or 'custom'. If 'normal', draw fragment lengths from a normal distribution with mean fraglen and standard deviation fragsd. If 'empirical', draw fragment lengths from a fragment length distribution estimated from a real data set. If 'custom', draw fragment lengths from a custom distribution, provided as the custdens argument, which should be a density fitted using logspline.
custdens
If distr is 'custom', draw fragments from this density. Should be an object of class logspline.
bias
One of 'none', 'rnaf', or 'cdnaf' (default 'none'). 'none' represents uniform fragment selection (every possible fragment in a transcript has equal probability of being in the experiment); 'rnaf' represents positional bias that arises in protocols using RNA fragmentation, and 'cdnaf' represents positional bias arising in protocols that use cDNA fragmentation (Li and Jiang 2012). Using the 'rnaf' model, coverage is higher in the middle of the transcript and lower at both ends, and in the 'cdnaf' model, coverage increases toward the 3' end of the transcript. The probability models used come from Supplementary Figure S3 of Li and Jiang (2012).
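
The default settings (distr = 'normal', bias = 'none') can be illustrated with a minimal base-R sketch for a single transcript. This is not polyester's source code: the helper draw_fragment and the toy transcript are hypothetical, and returning the whole transcript when the drawn length does not fit is a simplifying assumption made here.

```r
# Illustrative sketch only: not polyester's implementation.
set.seed(42)

# Hypothetical helper: draw one fragment from a single transcript string.
draw_fragment <- function(transcript, fraglen = 250, fragsd = 25) {
  tx_len <- nchar(transcript)
  # Fragment length from a normal distribution, rounded to whole bases
  frag_len <- round(rnorm(1, mean = fraglen, sd = fragsd))
  # Assumption: a length that does not fit yields the whole transcript
  if (frag_len < 1 || frag_len >= tx_len) {
    return(transcript)
  }
  # Uniform selection: every valid start position is equally likely
  n_starts <- tx_len - frag_len + 1
  start <- sample.int(n_starts, 1)
  substr(transcript, start, start + frag_len - 1)
}

# Toy transcript of 60 random bases
transcript <- paste(sample(c("A", "C", "G", "T"), 60, replace = TRUE),
                    collapse = "")
fragment <- draw_fragment(transcript, fraglen = 15, fragsd = 3)
nchar(fragment)
```

Under 'rnaf' or 'cdnaf', the start position would instead be drawn with non-uniform probabilities along the transcript; those probabilities are not reproduced in this sketch.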

Value

DNAStringSet consisting of one randomly selected subsequence per element of tObj.

Details

The empirical fragment length distribution was estimated using 7 randomly selected RNA-seq samples from the GEUVADIS dataset ('t Hoen et al 2013), one sample from each laboratory that performed sequencing for that data set. We used Picard's "CollectInsertSizeMetrics" (http://broadinstitute.github.io/picard/), version 1.121, to estimate the insert size distribution based on the read alignments.

References

't Hoen PA, et al. (2013): Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nature Biotechnology 31(11): 1015-1022.

Li W and Jiang T (2012): Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics 28(22): 2914-2921.

See Also

logspline

Examples

library(polyester)
library(Biostrings)
data(srPhiX174)

## get fragments with lengths drawn from a normal distribution
set.seed(174)
srPhiX174_fragments = generate_fragments(srPhiX174, fraglen=15, fragsd=3,
    readlen=4)
srPhiX174_fragments
srPhiX174

## get fragments with lengths drawn from an empirical distribution
empirical_frags = generate_fragments(srPhiX174, distr='empirical')
empirical_frags

## get fragments with lengths from a normal distribution, but include
## positional bias from cDNA fragmentation:
biased_frags = generate_fragments(srPhiX174, bias='cdnaf')
biased_frags
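
The examples above do not cover distr = 'custom'. A possible sketch follows, assuming the logspline package is installed. The observed_lengths vector is simulated purely for illustration and stands in for fragment lengths measured from real data.

```r
library(polyester)
library(Biostrings)
library(logspline)

data(srPhiX174)
set.seed(174)

## hypothetical fragment lengths; replace with lengths from real data
observed_lengths <- rnorm(500, mean = 20, sd = 4)

## fit a logspline density to the observed lengths
custom_density <- logspline(observed_lengths)

## draw fragment lengths from the fitted density
custom_frags <- generate_fragments(srPhiX174, distr = 'custom',
    custdens = custom_density)
custom_frags
```

As stated in the Arguments section, fraglen and fragsd are ignored here because distr is not 'normal'.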