propagate: General error propagation function using covariates and Monte Carlo simulations

Description

A general function for calculating the propagated error. This can be calculated for any given function from either replicates or from statistical summary data (mean & standard deviation). If independency of variables cannot be assumed, covariates can be calculated from the replicated data or by supplying a variance-covariance matrix to the input. The resulting distribution of the propagated error can be evaluated by Monte Carlo simulation of the input by generating at least 5000 datapoints from the (normal or multivariate normal) distribution of the variables. The unpropagated error is also supplied by calculating the standard deviation of all evaluated expressions on the replicate raw data or the simulated data.

Usage

propagate(expr, data, type = c("raw", "stat"), cov = FALSE, do.sim = FALSE, 
	  cov.sim = FALSE, nsim = 10000)

Arguments

expr

an expression, such as expression(x/y).

data

a dataframe or matrix containing either a) the replicates in columns or b) the means in the first row and the standard deviations in the second row. The variable names must be defined in the column headers.

type

Either raw if replicates are given, or stat if means and standard deviations are supplied.

cov

logical or variance-covariance matrix with the same column descriptions and column order as vals. If TRUE together with replicates, the covariances are calculated from these. If type = "stat", a square varian

do.sim

logical. Should Monte Carlo simulation be applied to the propagated error?

cov.sim

logical. If TRUE, the simulated data is generated with a multivariate covariance structure similar to the input variables.

nsim

the number of simulations to be performed, minimum is 5000.

Value

A list with the following components:
evalExprthe result from the evaluated expression, obtained from the input variables.
evalSimthe average of the evaluated expression obtained from the Monte Carlo simulation.
errPropthe propagated error (s.d.), obtained from the input variables.
errPropSimthe average of the propagated errors (s.d.) obtained from the Monte Carlo simulation.
errEvalthe error (s.d.) of all evaluated expressions on the input data (unpropagated), if raw data is supplied.
errEvalSimthe error (s.d.) of the evaluated expressions obtained from Monte Carlo simulation. Can be used when errVec indicates strong deviance from normality. See 'Details'.
derivsa list with each of the partial derivatives as items.
covMatthe square variance-covariance matrix that was used for the calculations.
errVecthe propagated errors from the simulation coerced into a vector with length nsim. To be used for subsequent analysis, i.e tests for normality.
evalVecthe evaluated expressions from the simulation coerced into a vector with length nsim.

Details

The propagated error is calculated by gaussian error propagation. Often omitted, but important in models where the variables are dependent (i.e. linear regression), is the second covariance term. $$\sigma_Y^2 = \sum_{i}\left(\frac{\partial f}{\partial X_i} \right)^2 \sigma_i^2 + \sum_{i \neq j}\sum_{j \neq i}\left(\frac{\partial f}{\partial X_i} \right)\left(\frac{\partial f}{\partial X_j} \right) \sigma_{ij}$$ propagate calculates the propagated error either with or without covariances by using the matrix representation $$C_Y = F_XC_XF_X^T$$ with $C_Y$ = the propagated error, $F_X$ = the p x n matrix with the results from the partial derivatives, $C_X$ = the p x p variance-covariance matrix and $F_X^T$ = the n x p transposed matrix of $F_X$. Depending on the input formula, the error propagation may result in an error that is not normally distributed. The Monte Carlo simulation, starting with the distributions of the variables, can clarify this. For acquisition of even more realistic estimates from the simulation process, a multivariate dataset having the same covariance structure can be created. A high tendency from deviation of normality is encountered in formulae where the error of the denominator is relatively high or in exponential models where the base has a high error. This is one of the problems that is inherent in real-time PCR analysis, as the classical ratio calculation with efficiencies (i.e. by the delta-ct method) is usually of the exponential type. The user of this function will find that within ratio calculations the propagated error is often non-normal. If this is the case, errEvalSim obtained as the s.d. of all simulated expression evaluations might be more sensible (see 'Examples').

References

Error propagation (in general): Taylor JR (1996). An Introduction to error analysis. University Science Books, New York. A very good technical paper describing error propagation by matrix calculation can be found under www.nada.kth.se/~kai-a/papers/arrasTR-9801-R3.pdf. Error propagation (in qPCR): Nordgard O et al. (2006). Error propagation in relative real-time reverse transcription polymerase chain reaction quantification models: The balance between accuracy and precision. Analytical Biochemistry, 356, 182-193. Hellemans J et al. (2007). qBase relative quantification framework and software for management and analysis of real-time quantitative PCR data. Genome Biology, 8: R19. Multivariate normal distribution: Ripley BD (1987). Stochastic Simulation. Wiley. Page 98. Testing for normal distribution: Thode Jr. HC (2002). Testing for Normality. Marcel Dekker, New York. Royston P (1992). Approximating the Shapiro-Wilk W-test for non-normality. Statistics and Computing, 2, 117-119.

Examples

Run this code

## from summary data just calculate the propagated error.
EXPR <- expression(x/y)
x <- c(5, 0.1)
y <- c(1, 0.01)
DF <- cbind(x, y)
propagate(EXPR, DF, type = "stat")

## Do simulations and evaluate error distribution.
res <- propagate(EXPR, DF, type = "stat", cov = TRUE, do.sim = TRUE)
## Do Shapiro-Wilks test on simulated errors 
## !maximum 5000 datapoints can be used!
## => p.value indicates weak deviation from normality
shapiro.test(res$errVec[1:5000])
## How about a graphical analysis:
qqnorm(res$errVec)
## histogram indicates no or weak skewness or kurtosis
hist(res$errVec, nclass = 100)

## using raw data
## bring all vectors to same length using 'NA'
## so that values are not recycled!
EXPR <- expression(x*y)
x <- c(2, 2.1, 2.2, 2, 2.3, 2.1)
y <- c(4, 4, 3.8, 4.1, NA, NA)
DF <- cbind(x, y)  
propagate(EXPR, DF, type = "raw")

## For replicate data, using relative quantification ratio from qPCR.
## How good is the estimation of the propagated error?
## Done without using covariance in the calculation and simulation.
## STRONG deviation from normality!
EXPR <- expression((E1^cp1)/(E2^cp2))
E1 <- c(1.73, 1.75, 1.77)
cp1 <- c(25.77,26.14,26.33)
E2 <-  c(1.72,1.68,1.65)
cp2 <- c(33.84,34.04,33.33)
DF <- cbind(E1, cp1, E2, cp2)
res <- propagate(EXPR, DF, type = "raw", do.sim = TRUE)
shapiro.test(res$errVec[1:5000])
qqnorm(res$errVec)
## Maybe use the error of all simulated expression
## evaluations instead? No, not better...
qqnorm(res$evalVec)

## Using covariances in calculation and simulation
res <- propagate(EXPR, DF, type = "raw", cov = TRUE, 
		 do.sim = TRUE, cov.sim = TRUE)

## Error propagation in a linear model using the covariance matrix from summary.lm
## Estimate error of y for x=7
x <- 1:10	
set.seed(123)
y <- x + rnorm(10, 0, 1) ##generate random data	
mod <- lm(y ~ x)
summ <- summary(mod)
DF <- t(coef(summ)[, 1:2]) ## make matrix of parameter estimates and standard error
colnames(DF) <- c("b", "m")
CM <- vcov(mod) ## take var-cov matrix
colnames(CM) <- c("b", "m")
propagate(expression(m*7 + b), DF, type = "stat", cov = CM)

## In a x/y regime, when does the propagated error start to
## deviate from normality if error of denominator increases?
## Watch skewness of histogram!
x <- c(5, 1)
for (i in seq(0.001, 0.2, by = 0.001)) {
      y <- c(1, i)
      DF <- cbind(x, y)
      res  <-  propagate(expression(x/y), DF, type = "stat", do.sim = TRUE)
      hist(res$errVec, nclass = 100, main = paste("sd(y):",i))      
}

Run the code above in your browser using DataLab