generate_fragments: generate a set of fragments from a set of transcripts

Description

Convert each sequence in a DNAStringSet to a "fragment" (subsequence).

Usage

generate_fragments(tObj, fraglen = 250, fragsd = 25, readlen = 100, distr = "normal", custdens = NULL, bias = "none")

Arguments

tObj

DNAStringSet of sequences from which fragments should be extracted

fraglen

Mean fragment length, if drawing fragment lengths from a normal distribution.

fragsd

Standard deviation of fragment lengths, if drawing lengths from a normal distribution. Note: fraglen and fragsd are ignored unless distr is 'normal'.

readlen

Read length. Default 100. Used only to label read positions.

distr

One of 'normal', 'empirical', or 'custom'. If 'normal', draw fragment lengths from a normal distribution with mean fraglen and standard deviation fragsd. If 'empirical', draw fragment lengths from a fragment length distribution estimated from a real data set. If 'custom', draw fragment lengths from a custom distribution, provided as the custdens argument, which should be a density fitted using logspline.

custdens

If distr is 'custom', draw fragments from this density. Should be an object of class logspline.

bias

One of 'none', 'rnaf', or 'cdnaf' (default 'none'). 'none' represents uniform fragment selection (every possible fragment in a transcript has equal probability of being in the experiment); 'rnaf' represents positional bias that arises in protocols using RNA fragmentation; and 'cdnaf' represents positional bias arising in protocols that use cDNA fragmentation (Li and Jiang 2012). Under the 'rnaf' model, coverage is higher in the middle of the transcript and lower at both ends; under the 'cdnaf' model, coverage increases toward the 3' end of the transcript. The probability models used come from Supplementary Figure S3 of Li and Jiang (2012).

Value

DNAStringSet consisting of one randomly selected subsequence per element of tObj.
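
To make "one randomly selected subsequence" concrete, the following base-R sketch mimics the simplest case (distr = 'normal', bias = 'none') on a plain character string. It is an illustration only, not polyester's implementation: pick_fragment is a hypothetical helper, and clamping the drawn length to the sequence length is an assumption made here so that short sequences are returned whole.

```r
# Hypothetical helper, NOT part of polyester: pick one fragment from a
# plain character string with a uniformly chosen start position.
pick_fragment <- function(seq, fraglen = 250, fragsd = 25) {
  n <- nchar(seq)
  # Draw a fragment length from a normal distribution, rounded to whole bases
  len <- round(rnorm(1, mean = fraglen, sd = fragsd))
  # Assumption: clamp the length to [1, n]
  len <- max(1, min(len, n))
  # Every valid start position is equally likely (uniform selection)
  start <- sample.int(n - len + 1, 1)
  substr(seq, start, start + len - 1)
}

set.seed(174)
# Toy 500-base "transcript"
tx <- paste(sample(c("A", "C", "G", "T"), 500, replace = TRUE), collapse = "")
frag <- pick_fragment(tx, fraglen = 50, fragsd = 5)
nchar(frag)
```

The 'rnaf' and 'cdnaf' options would replace the uniform draw of the start position with a position-dependent one; that part is not sketched here.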

Details

The empirical fragment length distribution was estimated using 7 randomly selected RNA-seq samples from the GEUVADIS data set ('t Hoen et al. 2013), one sample from each laboratory that performed sequencing for that data set. We used Picard's "CollectInsertSizeMetrics" (http://broadinstitute.github.io/picard/), version 1.121, to estimate the insert size distribution based on the read alignments.

References

't Hoen PA, et al. (2013): Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nature Biotechnology 31(11): 1015-1022.

Li W and Jiang T (2012): Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics 28(22): 2914-2921.

Examples


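  library(polyester)   # provides generate_fragments()
  library(Biostrings)  # provides DNAStringSet and the srPhiX174 example data
  data(srPhiX174)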

  ## get fragments with lengths drawn from a normal distribution
  set.seed(174)
  srPhiX174_fragments = generate_fragments(srPhiX174, fraglen=15, fragsd=3,
      readlen=4)
  srPhiX174_fragments
  srPhiX174

  ## get fragments with lengths drawn from an empirical distribution
  empirical_frags = generate_fragments(srPhiX174, distr='empirical')
  empirical_frags

  ## get fragments with lengths from a normal distribution, but include
  ## positional bias from cDNA fragmentation:
  biased_frags = generate_fragments(srPhiX174, bias='cdnaf')
  biased_frags
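
The examples above do not exercise distr = 'custom'. A minimal sketch of that case follows, assuming the logspline package is installed. The "observed" lengths are simulated purely for illustration; in practice they would come from real insert-size estimates.

```r
library(polyester)
library(Biostrings)
library(logspline)
data(srPhiX174)

set.seed(174)
# Made-up fragment lengths, for illustration only
observed_lengths <- rnorm(500, mean = 20, sd = 4)

# Fit a density of class "logspline" to pass as custdens
custom_density <- logspline(observed_lengths)

## get fragments with lengths drawn from the custom density
custom_frags <- generate_fragments(srPhiX174, distr = 'custom',
    custdens = custom_density)
custom_frags
```

As noted under fragsd, fraglen and fragsd have no effect in this call because distr is not 'normal'.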
