simulate_experiment: simulate RNA-seq experiment using negative binomial model

Description

create FASTA files containing RNA-seq reads simulated from provided transcripts, with optional differential expression between two groups

Usage

simulate_experiment(fasta = NULL, gtf = NULL, seqpath = NULL, num_reps = 10, fraglen = 250, fragsd = 25, readlen = 100, error_rate = 0.005, paired = TRUE, reads_per_transcript = 300, fold_changes, size = NULL, outdir = ".", write_info = TRUE, transcriptid = NULL, seed = NULL, ...)

Arguments

fasta

path to FASTA file containing transcripts from which to simulate reads. See details.

gtf

path to GTF file containing transcript structures from which reads should be simulated. See details.

seqpath

path to folder containing one FASTA file (.fa extension) for each chromosome in gtf. See details.

num_reps

How many biological replicates should be in each group? If num_reps is a single integer, num_reps replicates will be simulated in each group. Otherwise, num_reps can be a length-2 vector, where num_reps[1] and num_reps[2] replicates will be simulated in each of the two groups.

fraglen

Mean RNA fragment length. Sequences will be read off the end(s) of these fragments.

fragsd

Standard deviation of fragment lengths.

readlen

Read length.

error_rate

Sequencing error rate. Must be between 0 and 1. A uniform error model is assumed.

paired

If TRUE, paired-end reads are simulated; else single-end reads are simulated.

reads_per_transcript

baseline mean number of reads to simulate from each transcript. Can be an integer, in which case this many reads are simulated from each transcript, or an integer vector whose length matches the number of transcripts in fasta.

fold_changes

Vector of multiplicative fold changes between groups, one entry per transcript in fasta. A fold change > 1 means the transcript is overexpressed in the first num_reps (or num_reps[1]) samples. Fold change < 1 means transcript is overexpressed in the last num_reps (or num_reps[2]) samples. The change is in the mean number of reads generated from the transcript, between groups.

size

the negative binomial size parameter (see NegBinomial) for the number of reads drawn per transcript. If left blank, defaults to reads_per_transcript / 3. Negative binomial variance is mean + mean^2 / size. Can either be left at default, a vector of the same length as number of transcripts in fasta, if the two groups should have the same size parameters, or a list with 2 elements, each of which is a vector with length equal to the number of transcripts in fasta, which represent the size parameters for each transcript in groups 1 and 2, respectively.

outdir

character, path to folder where simulated reads should be written, with *no* slash at the end. By default, reads are written to current working directory.

write_info

If TRUE, write a file matching transcript IDs to differential expression status into the file outdir/sim_info.txt.

transcriptid

optional vector of transcript IDs to be written into sim_info.txt and used as transcript identifiers in the fasta files. Defaults to names(readDNAStringSet(fasta)). This option is useful if default names are very long or contain special characters.

seed

Optional seed to set before simulating reads, for reproducibility.

...

additional arguments to pass to seq_gtf if using gtf and seqpath

Value

No return, but simulated reads and a simulation info file are written to outdir.

Details

Reads can either be simulated from a FASTA file of transcripts (provided with the fasta argument) or from a GTF file plus DNA sequences (provided with the gtf and seqpath arguments). Simulating from a GTF file and DNA sequences may be a bit slower: it took about 6 minutes to parse the GTF/sequence files for chromosomes 1-22, X, and Y in hg19.

Examples

Run this code


  ## simulate a few reads from chromosome 22

  fastapath = system.file("extdata", "chr22.fa", package="polyester")
  numtx = count_transcripts(fastapath)
  set.seed(4)
  fold_changes = sample(c(0.5, 1, 2), size=numtx,
     prob=c(0.05, 0.9, 0.05), replace=TRUE)
  library(Biostrings)
  # remove quotes from transcript IDs:
  tNames = gsub("'", "", names(readDNAStringSet(fastapath)))

  simulate_experiment(fastapath, reads_per_transcript=10,
     fold_changes=fold_changes, outdir='simulated_reads',
     transcriptid=tNames, seed=12)

Run the code above in your browser using DataLab