The hypotheses are specified in R format as model formulae or model terms. The phylogeny can be specified either using an internal "phy" format or a matrix of taxonomic variables -- conversion functions are supplied from the standard newick and phylo formats. Node heights can be specified to determine the phylogenetic covariance structure of the error, or left to default values. The user can specify a value of rho (a parameter that determines whether branch lengths are bigger near the root or near the tips), or ask that it be fitted by maximum likelihood.
To get acquainted, bother only with the first three arguments (control model, test terms, and dataset) and then either the fourth or fifth to specify the phylogeny in one or other of two ways. The rest can wait!
The default outputs provides a p-value and F-ratio, and repeats the control and control+test models. Much more information is available, though much of that needs to be interpreted with care.
phyreg(control, test, data, subset, phydata, taxmatrix, heightsdata,
rho = -1, lorho = 0.3, hirho = 0.6, errrho = 0.02, minrho = 1e-04,
tolerance = 1e-06, oppf = 5, opdf = 0, parmx = 0, parmxz = 0,
opfunccall = 0, addDF = 0, linputs = FALSE, sinputs = FALSE, means = FALSE,
lmshortx = FALSE, lmshortxz = FALSE, lmlongx = FALSE, lmlongxz = FALSE,
hinput = FALSE, paper = FALSE, dfwarning = TRUE, oprho = FALSE, plot=FALSE,
reset=FALSE)
y~1
. (y~0
is not permitted.)
y~b
, and test for a*x
, you can specify (i) test="a*x"
or (ii) test=.~b+a*x
(Note the LHS *must* be a dot for this "update-style" specification), or (iii) test=~+a*x
. (Note the "+" and the absence of the response variable in this final, "adding terms", form.)
subset=NULL
phydata
or taxmatrix
must be supplied. To unspecify, say phydata=NULL
. A phylogeny in newick or phylo format can be converted into the internal format, preserving height information, with phyfromnewick
or phyfromphylo
.
taxmatrix
is specified, you can alternatively specify heightsdata
with a vector containing one height for each level you've given, plus a height for the root, or if that is NULL
the default "Figure 2" heights of Grafen (1989) will be used. Either taxmatrix
or phydata
(-- not both) must be supplied. Despecify with taxmatrix=NULL
.
phydata
, as the root has a height but no parent. If taxmatrix
is specified, then heightsdata
can also be specified as a vector of length one plus the number of taxonomic vectors, in which case each node is given a height corresponding to its level in the taxonomy -- the extra element is the height of the root, and the species are assumed to have a height of zero. A parent node must be higher than its daughter -- zero branch lengths are not allowed, but are equivalent to combining the daughters of the two nodes under a single node.
rho
will be fitted by maximum likelihood. If positive, that value will taken for rho and no fitting will be done. Once the node heights are scaled to lie between 0 and 1, each height is raised to the power rho before being used to determine the error structure in the model. See Grafen (1989) for details.
errrho
minrho
. A zero value of rho is problematic for the program.
phyreg
, with ndigits=oppf
. If zero, they are not printed out.
phyreg
.
phyreg
.
phyreg
. Note these should be regarded with caution, as the value of rho was fitted just on the control formula. For better estimates, run phyreg
again with the old control+test as the new control, and set opparmx=1
. But opparmxz
will usually give a good rough guide.
x<-phyreg(...)
as x$details$influence$RCT
and x$details$influence$DR
. The horizontal axis, RCT
, is the residual of the point after fitting the control+test model. The vertical axis, DR
, contains the improvement in that point through including the test terms, measured as the difference between the squares of the residuals. In each case, the residuals are adjusted for degrees of freedom, i.e. divided by the residual degree of freedom of the control or control+test model as appropriate. The variables are then scaled to make them easier to read and interpret. DR
is scaled so it adds to 100, and each point then indicates the percentage of the improvement in SS that is due to a higher node (possibly negative if it fits less well after the test variables are added). RCT
is scaled so the mean square error is 1
, and so Normal errors would give 95% of points within 1.96 of the mean (but remember the regression is through the origin so the average of RCT
does not have to be zero). The significance of the plot is that extreme points on the horizontal axis fit badly after adding the test variables, while those near zero fit well. Points near zero on the vertical axis have been little changed in their contribution to badness-of-fit by adding the test variables, while highly positive points have had a lot of their deviation explained by the test variables.TRUE
or 1
, the inputs to the long regression will be stored. After x<-phyreg(...,linputs=1)
, it will be found that x$linputs$y
, x$linputs$designxz
and x$linputs$w
will contain the response, design matrix, and weights for the long regression, respectively. Use with care. The whole point of the the problem of unrecognised phylogeny is that the standard errors from this regression cannot be trusted, though with a dependable binary phylogeny, the output can be accepted at face value.
TRUE
or 1
, the inputs to the short regression will be stored. After x<-phyreg(...,sinputs=1)
, it will be found that x$sinputs$y
, x$sinputs$designxz
and x$sinputs$w
will contain the response, design matrix, and weights for the short regression, respectively. Use with care. Not only can the standard errors from this regression not be trusted, but the parameter estimates are biassed!
TRUE
or 1
, the phylogenetically-weighted means will be printed for the response and for each variable in the design matrix of the control+test model (in the original species dataset, which are the same as the weighted means in the long regression).
TRUE
or 1
, then after x<-phyreg(...,lmshortx=1)
, the lm.wfit()
output from the short regression with the control formula will be stored in x$lmshortx
. Use with care -- the parameter estimates are biassed...
TRUE
or 1
, then after x<-phyreg(...,lmshortxz=1)
, the lm.wfit()
output from the short regression with the control+test formula will be stored in x$lmshortxz
. Use with care -- the parameter estimates are biassed...
TRUE
or 1
, then after x<-phyreg(...,lmlongx=1)
, the lm.wfit()
output from the long regression with the control formula will be stored in x$lmlongx
. The SEs cannot be trusted for non-binary phylogenies.
TRUE
or 1
, then after x<-phyreg(...,lmlongxz=1)
, the lm.wfit()
output from the long regression with the control+test formula will be stored in x$lmlongxz
. The SEs cannot be trusted for non-binary phylogenies.
TRUE
or 1
, then after x<-phyreg(...,hinput=1)
, the heights used will be stored in x$hinput
. If you use taxmatrix=...
, this is a way to obtain the internal "phy" version of the heights for each higher node. These heights have not been modulated by rho i.e. they are the starting heights before rho is applied.
TRUE
or 1
, then after x<-phyreg(...,paper=1)
, various matrices that appear in the appendix of the source paper (Grafen 1989) will be stored under x$paper
. Type names(x$paper)
to see which ones!
TRUE
or 1
, the loss of degrees of freedom for phylogenetic reasons is detailed. It is a "warning", so if none are lost, nothing is printed, and it doesn't occupy space in your output.ndigits=oprho
. If rho was fitted, the value of rho and the log-likelihood will be given, and how the search ended (lower boundary, or an internal maximum).
TRUE
or 1, the parameters are all reset to the start-of-session values before applying the other parameters you specify in the same call of phyreg
. Useful if you want to start a new project, or just be sure there's no previous setting that is lurking and affecting your results.subset
and on missing valuessubset
)shornode[5]
gives the ID number of the node for the 5th datapoint.$poff
(the p-value), $testf
(the F-ratio), $testnumdf
(the numerator DF), $testdendf
(the denominator DF), $nspec
(total number of species, included and excluded, so it equals nrow(input_dataframe))
, $nommiss
(the number of species omitted because of missing values (not counting those already omitted because of the argument subset
)), $nspecactive
(number of species included in the analysis), $nomspuse
(number of species omitted because of the argument subset
), $nphyhinodes
(the number of non-species nodes in the supplied phylogeny (ie. with all species included)), nhinodes
(number of higher nodes in the phylogeny as used, which may be reduced by missing data or by subset
), $hilostomsp
(the number of higher nodes lost through species omitted because of the argument subset
, $hilostvar
(number of higher nodes lost to the short regression through lack of variability in the long regression -- see "Phylogenetic Degrees of Freedom" in Details), $shortotdf
(total DF in short regression), $dflx
(rank of the long regression with the control model), $dfxlost
(loss of numerator DF in short regression -- see "Phylogenetic Degrees of Freedom" in Details), $dfsx
(rank of short regression with control model), $testlongdf
(=$dflxz-$dflx
, the degrees of freedom for the test terms in the long regression), $testlost
(degrees of freedom for the test variables, lost for phylogenetic reasons -- see "Phylogenetic Degrees of Freedom" in Details)), $shortcondf
(control degrees of freedom in the short regression), $addDF
(retains the argument addDF
as it's needed in calculations), $rho
(the value of rho used in the final test), $lik
(the log-likelihood of that value of rho), $edge
(coded -1 for error, 2 for rho set by the user, and, with a fitted rho, 0 for an internal maximum of the likelihood, and 1 for a maximum at the lower edge of the search region (specified by minrho
)), $missingnodes
(the numbers of the higher nodes omitted for lack of variability in the short regression, but an extra 0 at the start -- see Details), and finally a data frame $influence
containing three variables about the addition of the test variables to the short regression (NN
identifies the higher node with its ID number, and $RCT
and $DR
are measures for each node explained under the plot
argument above -- use the function who
for help pinning down what each higher node is).phyreg
phydata
if you specified the phylogeny that way, but otherwise has been converted from taxonomic vectors.usedphy
than in fullphy
, because only higher nodes with two or more daughters are retained. This vector is useful if you want to study the matrices from the paper (see the argument and value paper
) as many are indexed by usedphy
. All node numbers reported by the program elsewhere are the original IDs, and so usedphy
is important only for internal and/or technical reasons.usedphy
, which node in fullphy
it corresponds to. All the usual input already uses those fullphy
names, so you would need this only if you were going exploring in lmshortx(z)
or studying usedphy
.sinputs$y
, sinputs$design
and sinputs$w
for the response, design matrix and weights. Choose to store using the argument of the same name.linputs$y
, linputs$design
and linputs$w
for the response, design matrix and weights. Choose to store using the argument of the same name.lm.wfit()
output from the long regression with the control model. Choose to store using the argument of the same name.lm.wfit()
output from the long regression with the control+test model. Choose to store using the argument of the same name.lm.wfit()
output from the short regression with the control model. Choose to store using the argument of the same name.lm.wfit()
output from the short regression with the control+test model. Choose to store using the argument of the same name.means
among the other values. You can calculate residuals as residuals <- y - pglsGVx
pglsFVx
, mutatis mutandis, except the fitted values are based on the control+test model.phydata
if you use that way of specifying a phylogeny. When converting from a newick-style phylogeny, again the order of the species in the newick format is carried over to the $phy
variable, and so determines their ID numbers. If phyfromphylo
is used, the code numbers of the species in the "phylo" object determine the order they are placed in the internally formatted "phy" vector, so the original code numbers become their ID numbers. These species ID numbers matter because they must be aligned with the dataset, and are fixed no matter whether that species or another is excluded by the subset
parameter of phyreg
, or for missing data. Higher nodes have ID numbers too. In every phylogeny, each node's parent has a higher ID number than it does, except the root, which has no parent. The numbers for higher nodes do vary when species are omitted for any reason. A function is provided to identify higher nodes for you, by listing either their daughters or all the species below them -- see who
.
The program is a linear model, and so allows the independent variables to be continuous or categorical, and allows models to contain interactions. R's conventions for model formulae are followed, so see help(formula)
for the symbols for interaction and so on, and help(factor)
for how to declare a variable as categorical (the default is continuous).Missing data is handled automatically: you needn't do anything special, though a species is simply omitted from an analysis for which it has missing data in the response or in the control or test variables. It is assumed that missing data has the standard R value of NA
.
You can choose which species to include or omit in a given analysis using the subset
argument, set to 0 or 1 for omission or inclusion. Clear by asserting subset=NULL
, and all species will be included, subject to missing data (see just above).
The main feature of phylogenetic data that the Phylogenetic Regression deals with differently from other methods is unrecognised phylogeny. It operates on the principle that each higher node should provide one datapoint to a valid test, and not more, and does so by choosing just one linear contrast across the daughters of each higher node. (This is different from Purvis and Garland (1993)'s suggestion of calculating a test allowing more datapoints per higher node, and then looking up the critical value as though you hadn't.) The contrast coefficients come from the residuals of the long regression on the control model. For full mathematical and conceptual justifications of this choice, see Grafen (1989, 1992). With binary phylogenies, of course, there is no unrecognised phylogeny. A parameter of the branch lengths is fitted automatically, unless the user wishes to impose a value, which allows the strength of phylogenetic association to be make weaker or stronger. Simulation studies in Grafen (1989) show that the method has good properties, and also that ignoring unrecognised phylogeny can create serious problems.
For a discussion of the place of the Phylogenetic Regression in current theory and methods, see the phyreg-package
.
On some occasions, degrees of freedom are lost "for phylogenetic reasons" (Grafen 1989, section 3(e)). A whole node may be lost to the final test if the residuals of its daughter nodes are all zero in the long regression. This can happen for various reasons, most often when (i) the response is in fact binary, and so there is no variation in it below a node, or (ii) a categorical variable has so many values restricted to one part of the tree that a subset of its parameter values can adjust to render all the residuals zero in that part of the tree. That is called a node being lost in the denominator. The other possibility is that once the contrasts have been taken across each higher node, the design matrix for the model has lower rank than it did before, which is called losing a degree of freedom in the numerator (it is transferred to the denominator). You can choose to be warned when degrees of freedom are lost for phylogenetic reasons (use dfwarning=1
, or to see a whole breakdown of degrees of freedom including any lost (use opdf=1
). If nodes are lost in the denominator, their ID numbers are stored in the output $details$missingnodes
, though note this also contains an initial 0
for programming convenience. These phylogenetic degree of freedom issues need to be handled properly to provide a valid test.
The data variables (data
, phydata
, heightsdata
, taxmatrix
) are stored by name, and so if you change a dataset, and then call phyreg
, those changes will be incorporated, even when that argument is just carried on by default from a previous call. The other variables, including subset
, are stored by value, and so remain the same until set again in phyreg
.
The data for long and short regressions, and the lm.wfit()
output are provided on request, but must be used with great caution. The whole point of the phylogenetic regression is that neither type of regression provides results that can be taken at face value, except the long regression under binary phylogenies. Use at your own risk! See Grafen (1989) for details.
Some simulated data is included to facilitate the examples below -- see myd0
phyreg-package
. The function inf
displays information about stored phyreg
output. phyfromnewick
and phyfromphylo
convert phylogenies from two different standard formats into the internal format for the package. For simulated data see myd0
. A function printparms
shows you the current remembered set of parameters for phyreg
. To identify higher nodes from their ID numbers, use who
.
## First get some data
data(myd0,myd1, myd2, myd3, extax)
##
## Then do our first analysis
phyreg(y~X1,"A",myd0,taxmatrix=extax)
## and test instead for "B", noticing that only the changed parameter need be given
phyreg(test="B")
## and we do more complicated analysis involving an interaction with an existing term
phyreg(y~X1+X2,"A*X1")
##
## Now we choose to see the output relating to rho and to how the degrees of freedom
## are determined, and we also wish to see the means for each variable, and the
## parameters from the long regression on control+test variables
phyreg(oprho=6, opdf=1, means=1,parmxz=1)
## To illustrate inf, we store the results of an analysis in m1
m1<-phyreg(y~A,"X1+X2")
## Note we still get the extra output from the previous call, because those parameters
## too are remembered within a session. But we can see it again, whether or not we
## saw it the first time, with inf. inf reminds you if you forget quite how to use it
inf()
inf(m1)
inf(m1,oppf=3)
inf(m1,oppf=7, oprho=5)
inf(m1, oppf=5, "means", "parmx")
inf(m1,"sinputs","shortx")
## The final call asks for things m1 doesn't have because it wasn't stored at the time.
## inf tells you about remembered output. To find out about your own input you can
printparms()
## This is the set of remembered internal parameters for phyreg.
##
## The phylogeny has so far been determined by a data frame of taxonomic variables
## in the argument taxmatrix. If we have the phylogeny available in newick style, we can
## convert to the interal format, and then use that instead. Fortunately, one is provided.
## Note it is good form to unset the other method of specifying a phylogeny (which is
## being remembered by the package) with taxmatrix=NULL
data(newickstr)
z<-phyfromnewick(str=newickstr)
phyreg(phydata=z$phy,taxmatrix=NULL)
## ... and if branch lengths were supplied, and we trust them, we can
phyreg(y~X1, "A", phydata=z$phy, heightsdata=z$hts)
##
## Similarly with a phylogeny in phylo format. We obtain one of those by loading "exphylo".
data(exphylo)
phyconverted<-phyfromphylo(exphylo)
phyreg(phydata=phyconverted$phy, heightsdata=NULL)
## If we don't unset heightsdata, we get an error. Although phyconverted and z describe the
## same phylogeny, there is a singleton higher node (i.e. only one daughter) in
## phyconverted, which phyfromphylo has left in. But phyfromnewick strips them out.
## But of course the phy and hts from the same conversion always work together:
phyreg(phydata=phyconverted$phy, heightsdata=phyconverted$hts)
## The results should all be the same (except for where we switch the main dataset) because
## it's the same phylogeny represented in three different ways, with the same heights.
##
## Now we obtain a plot to investigate which higher nodes are important in the influence of the
## test variables
m2<-phyreg(plot=1)
## ... and we look at the data that's been plotted
m2$details$influence
## The bottom right corner has a point that fits much worse after test is added, and the
## table we've just printed tells us that is node 115. To find out which node that really
## is, we just do
who(m2$fullphy)
## and a table appears telling us which species are under each higher node. Note that
## because phyfromphylo retains singleton higher nodes (those with just one daughter)
## that are present in the original phylo structure, two nodes (121 and 126) have the same
## descendant species (81:100), and one of the nodes (126) is omitted from the short
## regression.
##
## Finally, let's do an lm() to find the true values of the parameters in the dataset myd2
lm(y~X1+X2+X3+A+B+error,data=myd2)
## ... and all is laid bare!
##
## Enjoy!
##
## NOTA BENE: if you have actually run the examples, the parameters will have been left as
## they are and will carry on to your session, so you can explore the examples if
## you like. To reset them for your own work, include "reset=TRUE" in the call
## to phyreg.
##
Run the code above in your browser using DataLab