Learn R Programming

EBSeq Q & A

Read in data

csv file:

In=read.csv("FileName", stringsAsFactors=F, row.names=1, header=T)

Data=data.matrix(In)

txt file:

In=read.table("FileName", stringsAsFactors=F, row.names=1, header=T)

Data=data.matrix(In)

check str(Data) and make sure it is a matrix instead of data frame. You may need to play around with the row.names and header option depends on how the input file was generated.

GetDEResults() function not found

You may on an earlier version of EBSeq. The GetDEResults function was introduced since version 1.7.1. The latest release version could be found at: http://www.bioconductor.org/packages/devel/bioc/html/EBSeq.html And you may check your package version by typing packageVersion("EBSeq")

Visualizing DE genes/isoforms

To generate a heatmap, you may consider the heatmap.2 function in gplots package. For example, you may run

heatmap.2(NormalizedMatrix[GenesOfInterest,], scale="row", trace="none", Colv=F)

The normalized matrix may be obtained from GetNormalizedMat() function.

My favorite gene/isoform has NA in PP (status "NoTest")

The NoTest status comes from two sources

  1. Using the default parameter settings of EBMultiTest(), the function

will not test on genes with more than 75% values < 10 to ensure better model fitting. To disable this filter, you may set Qtrm=1 and QtrmCut=0.

  1. numerical over/underflow in R. That happens when the within

condition variance is extremely large or small. I did implemented a numerical approximation step to calculate the approximated PP for these genes with over/underflow. Here I use 10^-10 to approximate the parameter p in the NB distribution for these genes (I set it to a small value since I want to cover more over/underflow genes with low within-condition variation). You may try to tune this value (to a larger value) in the approximation by setting ApproxVal in EBTest() or EBMultiTest() function.

Can I run more than 5 iterations when running EBSeq via RSEM wrapper?

Yes you may modify the script rsem-for-ebseq-find-DE under RSEM/EBSeq change line 36

EBOut <- EBTest(Data = DataMat, NgVector = ngvector, Conditions =
conditions, sizeFactors = Sizes, maxround = 5)

to

EBOut <- EBTest(Data = DataMat, NgVector = ngvector, Conditions =
conditions, sizeFactors = Sizes, maxround = 10)

If you are running multiple condition analysis, you will need to change line 53:

MultiOut <- EBMultiTest(Data = DataMat, NgVector = ngvector,
Conditions = conditions, AllParti = patterns, sizeFactors = Sizes,
maxround = 5)
MultiOut <- EBMultiTest(Data = DataMat, NgVector = ngvector,
Conditions = conditions, AllParti = patterns, sizeFactors = Sizes,
maxround = 10)

You will need to redo make after you make the changes.

I saw a gene has significant FC but is not called as DE by EBSeq, why does that happen?

EBSeq calls a gene as DE (assign high PPDE) if the across-condition variability is significantly larger than the within-condition variability. In the cases that a gene has large within-condition variation, although the FC across two conditions is large (small), the across-condition difference could still be explained by biological variation within condition. In these cases the gene/isoform will have a moderate PPDE.

Can I look at TPMs/RPKMs/FPKMs across samples?

In general, it is not appropriate to perform cross sample comparisons using TPM, FPKM or RPKM without further normalization. Instead, you may use normalized counts (It can be generated by GetNormalizedMat() function from raw count, note EBSeq testing functions takes raw counts and library size factors)

Here is an example:

Suppose there are 2 samples S1 and S2 from different conditions. Each has 5 genes. For simplicity, we assume each of 5 genes contains only one isoform and all genes have the same length.

Assume only gene 5 is DE and the gene expressions of these 5 genes are:

Sampleg1g2g3g4g5
S11010101010
S220202020100

Then the TPM/FPKM/RPKM will be (note sum TPM/FPKM/RPKM of all genes should be 10^6 ):

Sampleg1g2g3g4g5
S12x10^52x10^52x10^52x10^52x10^5
S21.1x10^51.1x10^51.1x10^51.1x10^55.6x10^5

Based on TPM/FPKM/RPKM, an investigator may conclude that the first 4 genes are down-regulated and the 5th gene is up-regulated. Then we will get 4 false positive calls.

Cross-sample TPM/FPKM/RPKM comparisons will be feasible only when no hypothetical DE genes present across samples (Or when assuming the DE genes are sort of 'symmetric' regarding up and down regulation).

RealFC vs PostFC

The posterior fold change estimations will give less extreme values for low expressers. e.g. if gene1 has mean1 = 5000 and mean2 = 1000, its FC and PostFC will both be 5. If gene2 has mean1 = 5 and mean2 = 1, its FC will be 5 but its PostFC will be < 5 and closer to 1. Therefore when we sort the PostFC, gene2 will be less significant than gene1.

Copy Link

Version

Version

1.12.0

License

Artistic-2.0

Maintainer

Ning Leng

Last Published

February 15th, 2017

Functions in EBSeq (1.12.0)

EBSeq_NingLeng-package

EBSeq: RNA-Seq Differential Expression Analysis on both gene and isoform level
GetNg

Ng Vector
GetMultiFC

Calculate the Fold Changes for Multiple Conditions
EBHMMNBMultiEM_2chain

Run EBSeqHMM model with a fixed expected FC
LikefunMulti

Likelihood Function of the NB-Beta Model In Multiple Condition Test
LogNMulti

EM algorithm for the NB-beta model in the multiple condition test
f0

The Prior Predictive Distribution of being EE
LogN

The function to run EM (one round) algorithm for the NB-beta model.
MedianNorm

Median Normalization
QuantileNorm

Quantile Normalization
EBSeqHMM-package

EBSeqHMM: A Bayesian approach for identifying gene-expression changes in ordered RNA-seq experiments
GetDEResults

Obtain Differential Expression Analysis Results in a Two-condition Test
PlotPostVsRawFC

Plot Posterior FC vs FC
GetMultiPP

Posterior Probability of Each Transcript
GetAllPaths

Obtain all possible gene paths for an RNA-seq experiments with ordered conditions
IsoList

The simulated data for two condition isoform DE analysis
beta.mom

Fit the beta distribution by method of moments
LikefunNBHMM

Likelihood function of the Beta-Negative Binomial HMM Model
GeneExampleData

Simulated gene level data set with 5 ordered conditions
PolyFitPlot

Fit the mean-var relationship using polynomial regression
GetPatterns

Generate all possible patterns in a multiple condition study
GetNormalizedMat

Calculate normalized expression matrix
GetPP

Generate the Posterior Probability of each transcript.
EBSeqHMMTest

Identify DE genes and classify them into their most likely path in an RNA-seq experiment with ordered conditions
GetPPMat

Posterior Probability of Transcripts
Likefun

Likelihood Function of the NB-Beta Model
PlotPattern

Visualize the patterns
EBHMMNBfun

Baum-Welch algorithm for a single hidden markov chain
EBTest

Using EM algorithm to calculate the posterior probabilities of being DE
MultiGeneMat

The simulated data for multiple condition gene DE analysis
f1

The Prior Predictive Distribution of being DE
IsoMultiList

The simulated data for multiple condition isoform DE analysis
crit_fun

Calculate the soft threshold for a target FDR
GetDECalls

Obtain DE gene/isoform list at a certain FDR
IsoExampleList

Simulated isoform level data set with 5 ordered conditions
GetConfidentCalls

Obtain confident gene calls for classifying genes into expression paths
GeneMat

The simulated data for two condition gene DE analysis
DenNHist

Density plot to compare the empirical q's and the simulated q's from the fitted beta distribution.
PostFC

Calculate the posterior fold change for each transcript across conditions
EBTest_ext

Extented EBTest function
PlotExp

Plot expression of a single gene
EBMultiTest

Using EM algorithm to calculate the posterior probabilities of interested patterns in a multiple condition study
QQP

The Quantile-Quantile Plot to compare the empirical q's and simulated q's from fitted beta distribution
RankNorm

Rank Normalization
EBHMMNBfunForMulti

Baum-Welch algorithm for multiple hidden markov chains