fitBiasModels: Fit bias models over single-isoform genes

Description

This function estimates parameters for one or more bias models for a single sample over a set of single-isoform genes. ~100 medium to highly expressed genes should be sufficient to estimate the parameters robustly.

Usage

fitBiasModels(genes, bam.file, fragtypes, genome, models, readlength, minsize, maxsize, speedglm = TRUE)

Arguments

genes

a GRangesList with the exons of different single-isoform genes

bam.file

a character string pointing to an indexed BAM file

fragtypes

the output of buildFragtypes. must contain the potential fragment types for the genes named in genes

genome

a BSgenome object

models

a list of lists: the outer list describes multiple models each element of the inner list has two elements: formula and offset. formula should be a character strings of an R formula describing the bias models, e.g. "count ~ ns(gc) + gene". offset should be a character vector listing possible bias offsets to be used ("fraglen" or "vlmm"). Either offset or formula can be NULL for a model. See vignette for recommendations and details.

readlength

the read length

minsize

the minimum fragment length to model

maxsize

the maximum fragment length to model

speedglm

logical, whether to use speedglm to estimate the coefficients. Default is TRUE.

Value

a list with elements: coefs, summary, models, and optional offets: fraglen.density, vlmm.fivep, and vlmm.threep.

coefs gives the estimated coefficients for the different models that specified formula.
summary gives the tables with coefficients, standard errors and p-values,
models stores the incoming models list,
fraglen.density is a estimated density object for the fragment length distribution,
vlmm.fivep and vlmm.threep store the observed and expected tabulations for the different orders of the VLMM for read start bias.

References

The complete bias model including fragment sequence bias:

Love, M.I., Hogenesch, J.B., and Irizarry, R.A., Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation. bioRxiv (2015) doi: 10.1101/025767

The read start VLMM:

Roberts, A., Trapnell, C., Donaghey, J., Rinn, J.L., and Pachter, L., Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology (2011) doi: 10.1186/gb-2011-12-3-r22

Examples

Run this code


# see vignette for a more realistic example

# these next lines just write out a BAM file from R
# typically you would already have a BAM file
library(alpineData)
library(GenomicAlignments)
library(rtracklayer)
gap <- ERR188088()
dir <- system.file(package="alpineData", "extdata")
bam.file <- c("ERR188088" = file.path(dir,"ERR188088.bam"))
export(gap, con=bam.file)

library(GenomicRanges)
library(BSgenome.Hsapiens.NCBI.GRCh38)
data(preprocessedData)

readlength <- 100
minsize <- 125 # see vignette how to choose
maxsize <- 175 # see vignette how to choose

# here a very small subset, should be ~100 genes
gene.names <- names(ebt.fit)[6:8] 
names(gene.names) <- gene.names
fragtypes <- lapply(gene.names, function(gene.name) {
                      buildFragtypes(ebt.fit[[gene.name]],
                                     Hsapiens, readlength,
                                     minsize, maxsize)
})
models <- list(
  "GC" = list(formula = "count ~ ns(gc,knots=gc.knots,
                Boundary.knots=gc.bk) + gene",
              offset=c("fraglen"))
)

fitpar <- fitBiasModels(genes=ebt.fit[gene.names],
                        bam.file=bam.file,
                        fragtypes=fragtypes,
                        genome=Hsapiens,
                        models=models,
                        readlength=readlength,
                        minsize=minsize,
                        maxsize=maxsize)

Run the code above in your browser using DataLab