polyester (version 1.8.3)

simulate_experiment: simulate RNA-seq experiment using negative binomial model


create FASTA files containing RNA-seq reads simulated from provided transcripts, with optional differential expression between two groups


simulate_experiment(fasta = NULL, gtf = NULL, seqpath = NULL, num_reps = 10, fraglen = 250, fragsd = 25, readlen = 100, error_rate = 0.005, paired = TRUE, reads_per_transcript = 300, fold_changes, size = NULL, outdir = ".", write_info = TRUE, transcriptid = NULL, seed = NULL, ...)


path to FASTA file containing transcripts from which to simulate reads. See details.
path to GTF file containing transcript structures from which reads should be simulated. See details.
path to folder containing one FASTA file (.fa extension) for each chromosome in gtf. See details.
How many biological replicates should be in each group? If num_reps is a single integer, num_reps replicates will be simulated in each group. Otherwise, num_reps can be a length-2 vector, where num_reps[1] and num_reps[2] replicates will be simulated in each of the two groups.
Mean RNA fragment length. Sequences will be read off the end(s) of these fragments.
Standard deviation of fragment lengths.
Read length.
Sequencing error rate. Must be between 0 and 1. A uniform error model is assumed.
If TRUE, paired-end reads are simulated; else single-end reads are simulated.
baseline mean number of reads to simulate from each transcript. Can be an integer, in which case this many reads are simulated from each transcript, or an integer vector whose length matches the number of transcripts in fasta.
Vector of multiplicative fold changes between groups, one entry per transcript in fasta. A fold change > 1 means the transcript is overexpressed in the first num_reps (or num_reps[1]) samples. Fold change < 1 means transcript is overexpressed in the last num_reps (or num_reps[2]) samples. The change is in the mean number of reads generated from the transcript, between groups.
the negative binomial size parameter (see NegBinomial) for the number of reads drawn per transcript. If left blank, defaults to reads_per_transcript / 3. Negative binomial variance is mean + mean^2 / size. Can either be left at default, a vector of the same length as number of transcripts in fasta, if the two groups should have the same size parameters, or a list with 2 elements, each of which is a vector with length equal to the number of transcripts in fasta, which represent the size parameters for each transcript in groups 1 and 2, respectively.
character, path to folder where simulated reads should be written, with *no* slash at the end. By default, reads are written to current working directory.
If TRUE, write a file matching transcript IDs to differential expression status into the file outdir/sim_info.txt.
optional vector of transcript IDs to be written into sim_info.txt and used as transcript identifiers in the fasta files. Defaults to names(readDNAStringSet(fasta)). This option is useful if default names are very long or contain special characters.
Optional seed to set before simulating reads, for reproducibility.
additional arguments to pass to seq_gtf if using gtf and seqpath


No return, but simulated reads and a simulation info file are written to outdir.


Reads can either be simulated from a FASTA file of transcripts (provided with the fasta argument) or from a GTF file plus DNA sequences (provided with the gtf and seqpath arguments). Simulating from a GTF file and DNA sequences may be a bit slower: it took about 6 minutes to parse the GTF/sequence files for chromosomes 1-22, X, and Y in hg19.


Run this code

  ## simulate a few reads from chromosome 22

  fastapath = system.file("extdata", "chr22.fa", package="polyester")
  numtx = count_transcripts(fastapath)
  fold_changes = sample(c(0.5, 1, 2), size=numtx,
     prob=c(0.05, 0.9, 0.05), replace=TRUE)
  # remove quotes from transcript IDs:
  tNames = gsub("'", "", names(readDNAStringSet(fastapath)))

  simulate_experiment(fastapath, reads_per_transcript=10,
     fold_changes=fold_changes, outdir='simulated_reads',
     transcriptid=tNames, seed=12)

Run the code above in your browser using DataCamp Workspace