runJunctionSeqAnalyses: Run a JunctionSeq analysis.

Description

This function runs a complete analysis from start to finish. It internally calls functions readAnnotationData, readJunctionSeqCounts, estimateJunctionSeqSizeFactors, estimateJunctionSeqDispersions, fitJunctionSeqDispersionFunction, testForDiffUsage, and estimateEffectSizes.

Usage

runJunctionSeqAnalyses(sample.files, sample.names, condition,  flat.gff.file,  analysis.type = c("junctionsAndExons","junctionsOnly","exonsOnly"), meanCountTestableThreshold = "auto", nCores = 1, use.covars,  test.formula0 = formula(~ sample + countbin),  test.formula1 = formula(~ sample + countbin + condition : countbin), effect.formula = formula(~ condition + countbin + condition : countbin), geneLevel.formula = formula(~ condition), use.exons, use.junctions,  use.known.junctions = TRUE,  use.novel.junctions = TRUE,  use.multigene.aggregates = FALSE,  gene.names,  method.GLM = c(c("advanced","DESeq2-style"),  c("simpleML","DEXSeq-v1.8.0-style")), method.dispFit = c("parametric", "local", "mean"),  method.dispFinal = c("shrink","max","fitted","noShare"), method.sizeFactors = c("byGenes","byCountbins"), method.countVectors = c("geneLevelCounts","sumOfAllBinsForGene", "sumOfAllBinsOfSameTypeForGene"), method.expressionEstimation = c("feature-vs-gene", "feature-vs-otherFeatures"), method.cooksFilter = TRUE, optimizeFilteringForAlpha = 0.01, fitDispersionsForExonsAndJunctionsSeparately = TRUE, keep.hypothesisTest.fit = FALSE, keep.estimation.fit = FALSE, replicateDEXSeqBehavior.useRawBaseMean = FALSE, verbose = TRUE, debug.mode = FALSE)

Arguments

sample.files

Character vector. The filenames of the count files generated by QoRTs. The counts must all be generated using equivalent QoRTs parameters. The strandedness must be the same, as well as the inclusion of novel junctions.

sample.names

A character vector of sample names. This must have the same length as sample.files, and should be in the same order.

condition

A factor vector of condition values. This must have the same length as sample.files and sample.names, and should be listed in the same order.

flat.gff.file

A flattened gff-formatted annotation file from which the gene counts were generated. Technically optional, but STRONGLY RECOMMENDED, as the annotation data WILL be required by plotting functions.

analysis.type

Character string. One of "junctionsAndExons", "junctionsOnly", or "exonsOnly". This parameter determines what type of analysis is to be performed. By default JunctionSeq tests both splice junction loci and exonic regions for differential usage (a "hybrid" analysis). This parameter can be used to limit analyses specifically to either splice junction loci or exonic regions.

meanCountTestableThreshold

"auto" or Numeric value. Features with a total mean normalized count of less than this value will be excluded from the analyses. If left as the default ("auto"), then the cutoff threshold will be determined automatically using the DESeq2 independent filtering method.

nCores

Either an integer or a BiocParallelParam object. Either way, this determines The number of cores to use. Note that multicore functionality may not be available on all platforms. If parallel execution is not available then JunctionSeq will automatically fallback to single-core execution. See the BiocParallel package for more information.

use.covars

Optional: for advanced users. A data frame containing covariate factors. The names must be included in the model formulas.

test.formula0

For advanced users. The base formula for the null hypothesis model used in the hypothesis tests. NOTE: the biological condition to be tested must be named "condition".

test.formula1

For advanced users. The base formula for the alternate hypothesis model used in the hypothesis tests. NOTE: the biological condition to be tested must be named "condition".

effect.formula

For advanced users. The base formula for the model used for effect size estimation. NOTE: the biological condition to be tested must be named "condition".

geneLevel.formula

For advanced users. The base formula for the model used to estimate total gene-level expression. NOTE: the biological condition to be tested must be named "condition".

use.exons

Logical value. This is an alternate parameterization of the analysis.type parameter. If TRUE, then exonic region loci will be included in the analyses and will be tested for differential usage. If this parameter is set, then parameter use.junctions must also be set.

use.junctions

Logical value. This is an alternate parameterization of the analysis.type parameter. If TRUE, then splice junction loci will be included in the analyses and will be tested for differential usage. If this parameter is set, then parameter use.exons must also be set.

use.known.junctions

Logical value. If TRUE, then known splice junctions will not be filtered out prior to analysis. Note: this is overidden if use.junctions is FALSE or if analysis.type is set to "exonsOnly".

use.novel.junctions

Logical value. If TRUE, then novel splice junctions will not be filtered out prior to analysis. Note: this is overidden if use.junctions is FALSE or if analysis.type is set to "exonsOnly".

use.multigene.aggregates

Logical value. Whether to attempt to test "aggregate genes" which consist of multiple genes that overlap with one another. Note that inclusion of aggregate genes may affect the false discovery rate, since by their very nature aggregate genes will often show differential splice junction usage, as the two genes will often be regulated independently.

gene.names

data.frame. This optional parameter can be used to decoder the gene id's used in the actual analysis into gene symbols or gene names for general readability. This must be a text file name or data.frame with two columns of character strings. The first must be the gene ID's, and the second must be the gene names (as you wish them to appear in the plots). Genes are allowed to have multiple gene names, in which case they will be separated by commas. The gene names will be used in the plots and figures.

method.GLM

Character string. Can be used to apply alternative methodologies or implementations. Intended for advanced users who have strong opinions about the underlying statistical methodologies. The default is "advanced" or, equivalently, "DESeq2-style". This uses the dispersion estimation methodology used by DESeq2 and DEXSeq v1.12.0 or higher to generate the initial (feature-specific) dispersion estimates. The alternative method is "simpleML" or, equivalently, "DEXSeq-v1.8.0-style". This uses a simpler maximum-likelihood-based method used by the original DESeq and by DEXSeq v1.8.0 or less.

method.dispFit

Character string. Can be used to apply alternative methodologies or implementations. Intended for advanced users who have strong opinions about the underlying statistical methodologies. Determines the method used to generated "fitted" dispersion estimates. One of "parametric" (the default), "local", or "mean". See the DESeq2 documentation for more information.

method.dispFinal

Character string. Can be used to apply alternative methodologies or implementations. Intended for advanced users who have strong opinions about the underlying statistical methodologies.

Determines the method used to arrive at a "final" dispersion estimate. The default, "shrink" uses the maximum a posteriori estimate, combining information from both the fitted and feature-specific dispersion estimates. This is the method used by DESeq2 and DEXSeq v1.12.0 and above.

method.sizeFactors

Character string. Can be used to apply alternative methodologies or implementations. Intended for advanced users who have strong opinions about the underlying statistical methodologies.

Determines the method used to calculate normalization size factors. By default JunctionSeq uses gene-level expression. As an alternative, feature-level counts can be used as they are in DEXSeq. In practice the difference is almost always negligible.

method.countVectors

Character string. Can be used to apply alternative methodologies or implementations. Intended for advanced users who have strong opinions about the underlying statistical methodologies.

Determines the type of count vectors to be used in the model framework. By default JunctionSeq compares the counts for a specific feature against the counts across the rest of the gene minus the counts for the specific feature. Alternatively, the sum of all other features on the gene can be used, like in DEXSeq. The advantage to the default JunctionSeq behavior is that no read or read-pair is ever counted more than once in any model. Under DEXSeq, some reads may cover many exonic segments and thus be counted repeatedly.

method.expressionEstimation

Character string. Can be used to apply alternative methodologies or implementations. Intended for advanced users who have strong opinions about the underlying statistical methodologies.

Determines the methodology used to generate feature expression estimates and relative fold changes. By default each feature is modeled separately. Under the default count-vector method, this means that the resultant relative fold changes will be a measure of the relative fold change between the feature and the gene as a whole. Alternatively, the "feature-vs-otherFeatures" method builds a large, complex model containing all features belonging to the gene. The coefficients for each feature are then "balanced" using linear contrasts weighted by the inverse of their variance. In general we have found this method to produce very similar results but less efficiently and less consistently. Additionally, this alternative method "multi-counts" reads that cover more than one feature. This can result in over-weighting of exonic regions with a large number of annotated variations in a small genomic area, as each individual read or read-pair may be counted many times in the model. Under the default option, no read or read-pair is ever counted more than once in a given model.

method.cooksFilter

Logical value. if TRUE, use the cook's filter to detect and remove outliers.

optimizeFilteringForAlpha

Numeric value between 0 and 1. If meanCountTestableThreshold is set to "auto" then this sets the adjusted-p-value threshold to optimize against.

fitDispersionsForExonsAndJunctionsSeparately

When running a "junctionsAndExons" type analysis in which both exons and splice junctions are being tested simultaniously, this parameter determines whether a single fitted dispersion model should be fitted for both exons and splice junctions, or if separate fitted dispersions should be calculated for each. By default the dispersions are run separately.

keep.hypothesisTest.fit

Logical value. If TRUE, save both complete hypothesis test model fits for every gene. This will require a lot of memory, but may be useful for statistical diagnostics. Default: FALSE.

keep.estimation.fit

Logical value. If TRUE, save the complete model fits for every gene. This will require a lot of memory, but may be useful for statistical diagnostics. Default: FALSE.

verbose

if TRUE, send debugging and progress messages to the console / stdout.

debug.mode

if TRUE, send even more debugging and progress messages to the console / stdout.

replicateDEXSeqBehavior.useRawBaseMean

USED ONLY FOR INTERNAL TESTING! NOT INTENDED FOR ACTUAL USE! This variable activates an alternative mode in which a (very minor) bug in DEXSeq v1.14.0 and earlier is replicated. If TRUE, the baseMean and baseVar variables will be computed using raw counts rather than normalized counts. This is used for internal tests in which DEXSeq functionality is replicated precisely and the results are compared against equivalent DEXSeq results. Without this option the results would differ slightly (generally by less than 1 hundreth of a percent). USED ONLY FOR INTERNAL TESTING! NOT INTENDED FOR ACTUAL USE!

Value

A JunctionSeqCountSet object, containing the complete analysis dataset and results.

Examples

Run this code

## Not run: 
# ########################################
# #Set up example data:
# decoder.file <- system.file(
#                   "extdata/annoFiles/decoder.bySample.txt",
#                   package="JctSeqData");
# decoder <- read.table(decoder.file,
#                   header=TRUE,
#                   stringsAsFactors=FALSE);
# gff.file <- system.file(
#             "extdata/cts/withNovel.forJunctionSeq.gff.gz",
#             package="JctSeqData");
# countFiles <- system.file(paste0("extdata/cts/",
#      decoder$sample.ID,
#      "/QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz"),
#      package="JctSeqData");
# ########################################
# 
# jscs <- runJunctionSeqAnalyses(sample.files = countFiles,
#            sample.names = decoder$sample.ID,
#            condition=factor(decoder$group.ID),
#            flat.gff.file = gff.file,
#            analysis.type = "junctionsAndExons"
# );
# 
# ## End(Not run)

Run the code above in your browser using DataLab