corphylo: Correlations among Multiple Traits with Phylogenetic Signal

Description

This function calculates Pearson correlation coefficients for multiple continuous traits that may have phylogenetic signal, allowing users to specify measurement error as the standard error of trait values at the tips of the phylogenetic tree. Phylogenetic signal for each trait is estimated from the data assuming that trait evolution is given by a Ornstein-Uhlenbeck process. Thus, the function allows the estimation of phylogenetic signal in multiple traits while incorporating correlations among traits. It is also possible to include independent variables (covariates) for each trait to remove possible confounding effects. corphylo() returns the correlation matrix for trait values, estimates of phylogenetic signal for each trait, and regression coefficients for independent variables affecting each trait.

Usage

corphylo(X, U = list(), SeM = NULL, phy = NULL, REML = TRUE,
method = c("Nelder-Mead", "SANN"), constrain.d = FALSE, reltol = 10^-6,
maxit.NM = 1000, maxit.SA = 1000, temp.SA = 1, tmax.SA = 1, verbose = FALSE)
## S3 method for class 'corphylo':
print(x, digits = max(3, getOption("digits") - 3), ...)

Arguments

a n x p matrix with p columns containing the values for the n taxa. Rows of X should have rownames matching the taxon names in phy.

a list of p matrices corresponding to the p columns of X, with each matrix containing independent variables for the corresponding column of X. The rownames of each matrix within U must be the same as X, or alternatively, the order of values in rows must m

SeM

a n x p matrix with p columns containing standard errors of the trait values in X. The rownames of SeM must be the same as X, or alternatively, the order of values in rows must match those in X. If SeM is omitted, the trait values are assumed to be known

phy

a phylo object giving the phylogenetic tree. The rownames of phy must be the same as X, or alternatively, the order of values in rows must match those in X.

REML

whether REML or ML is used for model fitting.

method

in optim(), either Nelder-Mead simplex minimization or SANN (simulated annealing) minimization is used. If SANN is used, it is followed by Nelder-Mead minimization.

constrain.d

if constrain.d is TRUE, the estimates of d are constrained to be between zero and 1. This can make estimation more stable and can be tried if convergence is problematic. This does not necessarily lead to loss of generality of the results, because before u

reltol

a control parameter dictating the relative tolerance for convergence in the optimization; see optim().

maxit.NM

a control parameter dictating the maximum number of iterations in the optimization with Nelder-Mead minimization; see optim().

maxit.SA

a control parameter dictating the maximum number of iterations in the optimization with SANN minimization; see optim().

temp.SA

a control parameter dictating the starting temperature in the optimization with SANN minimization; see optim().

tmax.SA

a control parameter dictating the number of function evaluations at each temperature in the optimization with SANN minimization; see optim().

verbose

if TRUE, the model logLik and running estimates of the correlation coefficients and values of d are printed each iteration during optimization.

an objects of class corphylo.

digits

the number of digits to be printed.

...

arguments passed to and from other methods.

Value

An object of class "corphylo".
cor.matrixthe p x p matrix of correlation coefficients.
dvalues of d from the OU process for each trait.
Bestimates of the regression coefficients, including intercepts. Coefficients are named according to the list U. For example, B1.2 is the coefficient corresponding to U[[1]][, 2], and if column 2 in U[[1]] is named "colname2", then the coefficient will be B1.colname2. Intercepts have the form B1.0.
B.sestandard errors of the regression coefficients.
B.covcovariance matrix for regression coefficients.
B.zscoreZ scores for the regression coefficients.
B.pvaluetests for the regression coefficients being different from zero.
logLikhe log likelihood for either the restricted likelihood (REML = TRUE) or the overall likelihood (REML = FALSE).
AICAIC for either the restricted likelihood (REML = TRUE) or the overall likelihood (REML = FALSE).
BICBIC for either the restricted likelihood (REML = TRUE) or the overall likelihood (REML = FALSE).
REMLwhether REML is used rather than ML (TRUE or FALSE).
constrain.dwhether or not values of d were constrained to be between 0 and 1 (TRUE or FALSE).
XXvalues of X in vectorized form, with each trait X[, i] standardized to have mean zero and standard deviation one.
UUdesign matrix with values in UU corresponding to XX; each variable U[[i]][, j] is standardized to have mean zero and standard deviation one.
MMvector of measurement standard errors corresponding to XX, with the standard errors suitably standardized.
Vphythe phylogenetic covariance matrix computed from phy and standardized to have determinant equal to one.
Rcovariance matrix of trait values relative to the standardized values of XX.
Voverall estimated covariance matrix of residuals for XX including trait correlations, phylogenetic signal, and measurement error variances. This matrix can be used to simulate data for parametric bootstrapping. See examples.
Cmatrix V excluding measurement error variances.
convcodehe convergence code provided by optim().
niternumber of iterations performed by optim().

Details

For the case of two variables, the function estimates parameters for the model of the form, for example,

$$X[1] = B[1,0] + B[1,1] * u[1,1] + \epsilon[1]$$ $$X[2] = B[2,0] + B[2,1] * u[2,1] + \epsilon[2]$$ $$\epsilon ~ Gaussian(0, V)$$

where $B[1,0]$, $B[1,1]$, $B[2,0]$, and $B[2,1]$ are regression coefficients, and $V$ is a variance-covariance matrix containing the correlation coefficient r, parameters of the OU process $d1$ and $d2$, and diagonal matrices $M1$ and $M2$ of measurement standard errors for $X[1]$ and $X[2]$. The matrix $V$ is $2n x 2n$, with $n x n$ blocks given by

$$V[1,1] = C[1,1](d1) + M1$$ $$V[1,2] = C[1,2](d1,d2)$$ $$V[2,1] = C[2,1](d1,d2)$$ $$V[2,2] = C[2,2](d2) + M2$$

where $C[i,j](d1,d2)$ are derived from phy under the assumption of joint OU evolutionary processes for each trait (see Zheng et al. 2009). This formulation extends in the obvious way to more than two traits.

References

Zheng, L., A. R. Ives, T. Garland, B. R. Larget, Y. Yu, and K. F. Cao. 2009. New multivariate tests for phylogenetic signal and trait correlations applied to ecophysiological phenotypes of nine Manglietia species. Functional Ecology 23:1059--1069.

Examples

Run this code

## Simple example using data without correlations or phylogenetic
## signal. This illustrates the structure of the input data.

phy <- rcoal(10, tip.label = 1:10)
X <- matrix(rnorm(20), nrow = 10, ncol = 2)
rownames(X) <- phy$tip.label
U <- list(NULL, matrix(rnorm(10, mean = 10, sd = 4), nrow = 10, ncol = 1))
rownames(U[[2]]) <- phy$tip.label
SeM <- matrix(c(0.2, 0.4), nrow = 10, ncol = 2)
rownames(SeM) <- phy$tip.label

corphylo(X = X, SeM = SeM, U = U, phy = phy, method = "Nelder-Mead")

## Simulation example for the correlation between two variables. The
## example compares the estimates of the correlation coefficients from
## corphylo when measurement error is incorporated into the analyses with
## three other cases: (i) when measurement error is excluded, (ii) when
## phylogenetic signal is ignored (assuming a "star" phylogeny), and (iii)
## neither measurement error nor phylogenetic signal are included.

## In the simulations, variable 2 is associated with a single
## independent variable. This requires setting up a list U that has 2
## elements: element U[[1]] is NULL and element U[[2]] is a n x 1 vector
## containing simulated values of the independent variable.

# Set up parameter values for simulating data
n <- 50
phy <- rcoal(n, tip.label = 1:n)

R <- matrix(c(1, 0.7, 0.7, 1), nrow = 2, ncol = 2)
d <- c(0.3, .95)
B2 <- 1

Se <- c(0.2, 1)
SeM <- matrix(Se, nrow = n, ncol = 2, byrow = T)
rownames(SeM) <- phy$tip.label

# Set up needed matrices for the simulations
p <- length(d)

star <- stree(n)
star$edge.length <- array(1, dim = c(n, 1))
star$tip.label <- phy$tip.label

Vphy <- vcv(phy)
Vphy <- Vphy/max(Vphy)
Vphy <- Vphy/exp(determinant(Vphy)$modulus[1]/n)

tau <- matrix(1, nrow = n, ncol = 1)C <- matrix(0, nrow = p * n, ncol = p * n)
for (i in 1:p) for (j in 1:p) {
	Cd <- (d[i]^tau * (d[j]^t(tau)) * (1 - (d[i] * d[j])^Vphy))/(1 - d[i] * d[j])
	C[(n * (i - 1) + 1):(i * n), (n * (j - 1) + 1):(j * n)] <- R[i, j] * Cd
}
MM <- matrix(SeM^2, ncol = 1)
V <- C + diag(as.numeric(MM))

## Perform a Cholesky decomposition of Vphy. This is used to generate
## phylogenetic signal: a vector of independent normal random variables,
## when multiplied by the transpose of the Cholesky deposition of Vphy will
## have covariance matrix equal to Vphy.
iD <- t(chol(V))

# Perform Nrep simulations and collect the results
Nrep <- 100
cor.list <- matrix(0, nrow = Nrep, ncol = 1)
cor.noM.list <- matrix(0, nrow = Nrep, ncol = 1)
cor.noP.list <- matrix(0, nrow = Nrep, ncol = 1)
cor.noMP.list <- matrix(0, nrow = Nrep, ncol = 1)
d.list <- matrix(0, nrow = Nrep, ncol = 2)
d.noM.list <- matrix(0, nrow = Nrep, ncol = 2)
B.list <- matrix(0, nrow = Nrep, ncol = 3)
B.noM.list <- matrix(0, nrow = Nrep, ncol = 3)
B.noP.list <- matrix(0, nrow = Nrep, ncol = 3)
for (rep in 1:Nrep) {
	XX <- iD	X <- matrix(XX, nrow = n, ncol = 2)
	rownames(X) <- phy$tip.label

	U <- list(NULL, matrix(rnorm(n, mean = 2, sd = 10), nrow = n, ncol = 1))
	rownames(U[[2]]) <- phy$tip.label
	colnames(U[[2]]) <- "V1"
	X[,2] <- X[,2] + B2[1] * U[[2]][,1] - B2[1] * mean(U[[2]][,1])

	z <- corphylo(X = X, SeM = SeM, U = U, phy = phy, method = "Nelder-Mead")
	z.noM <- corphylo(X = X, U = U, phy = phy, method = "Nelder-Mead")
	z.noP <- corphylo(X = X, SeM = SeM, U = U, phy = star, method = "Nelder-Mead")

	cor.list[rep] <- z$cor.matrix[1, 2]
	cor.noM.list[rep] <- z.noM$cor.matrix[1, 2]
	cor.noP.list[rep] <- z.noP$cor.matrix[1, 2]
	cor.noMP.list[rep] <- cor(cbind(lm(X[,1] ~ 1)$residuals, lm(X[,2] ~ U[[2]])$residuals))[1,2]

	d.list[rep, ] <- z$d
	d.noM.list[rep, ] <- z.noM$d

	B.list[rep, ] <- z$B
	B.noM.list[rep, ] <- z.noM$B
	B.noP.list[rep, ] <- z.noP$B

	show(c(rep, z$convcode, z$cor.matrix[1, 2], z$d))
}
correlation <- rbind(R[1, 2], mean(cor.list), mean(cor.noM.list),
                     mean(cor.noP.list), mean(cor.noMP.list))
rownames(correlation) <- c("True", "With SeM and Phy", "Without SeM",
                           "Without Phy", "Without Phy or SeM")
correlation

signal.d <- rbind(d, colMeans(d.list), colMeans(d.noM.list))
rownames(signal.d) <- c("True", "With SeM and Phy", "Without SeM")
signal.d

est.B <- rbind(c(0, 0, B2), colMeans(B.list), colMeans(B.noM.list),
               colMeans(B.noP.list))
rownames(est.B) <- c("True", "With SeM and Phy", "Without SeM", "Without Phy")
colnames(est.B) <- rownames(z$B)
est.B

# Example simulation output
# correlation
                        # [,1]
# True               0.7000000
# With SeM and Phy   0.7055958
# Without SeM        0.3125253
# Without Phy        0.4054043
# Without Phy or SeM 0.3476589

# signal.d
                     # [,1]      [,2]
# True             0.300000 0.9500000
# With SeM and Phy 0.301513 0.9276663
# Without SeM      0.241319 0.4872675

# est.B
                        # B1.0      B2.0     B2.V1
# True              0.00000000 0.0000000 1.0000000
# With SeM and Phy -0.01285834 0.2807215 0.9963163
# Without SeM       0.01406953 0.3059110 0.9977796
# Without Phy       0.02139281 0.3165731 0.9942140

Run the code above in your browser using DataLab