The first argument may be a character vector representing a DNA sequence, a DNA
sequence represented using the SeqFastadna class from the seqinr package,
or a vector containing the relative frequencies of the A, C, G and T nucleotide
bases.
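As an illustration of how these input forms can be reduced to a common representation, the sketch below converts either a character vector of bases or a length-4 vector of frequencies into named relative frequencies. The helper name base_freqs is hypothetical and is not part of the documented function. It assumes that symbols other than A, C, G and T are ignored and that a numeric input is ordered A, C, G, T.

```r
# Hypothetical helper (not part of the package): reduce the supported
# inputs to a named vector of relative frequencies for A, C, G and T.
base_freqs <- function(x) {
  bases <- c("A", "C", "G", "T")
  if (is.numeric(x)) {
    # Assumption: a numeric input holds frequencies in the order A, C, G, T.
    stopifnot(length(x) == 4, all(x >= 0), sum(x) > 0)
    f <- x / sum(x)
  } else {
    # Works for plain character vectors and for seqinr SeqFastadna objects,
    # which store one (usually lower-case) base per element.
    s <- toupper(as.character(x))
    counts <- table(factor(s, levels = bases))  # other symbols are dropped
    stopifnot(sum(counts) > 0)
    f <- as.numeric(counts) / sum(counts)
  }
  names(f) <- bases
  f
}

base_freqs(c("a", "c", "g", "t", "a", "a", "g", "g"))
```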
Let A, C, G and T denote the relative frequencies of the nucleotide bases
appearing in a DNA sequence. This function carries out a statistical hypothesis
test of whether the relative frequencies satisfy the relation \(A+G=C+T\), that is,
whether purines \(\{A, G\}\) occur as often as pyrimidines \(\{C,T\}\) in a DNA sequence.
Subtracting \(G+T\) from both sides rewrites the relation as \(A-T=C-G\). This shows
that the property being tested is a generalisation of Chargaff's second parity
rule for mononucleotides, which states that \(A=T\) and \(C=G\): under that rule both
sides equal zero, so the relation holds. The test is set up with the relation as the
alternative hypothesis:
\(H_0\): \(A+G \neq C+T\)
\(H_1\): \(A+G = C+T\)
If type is set to “simplex”, the vector \((A,C,G,T)\) is
assumed to come from a Dirichlet(1,1,1,1) distribution on the 3-simplex under
the null hypothesis. Otherwise, if type is set to
“interval”, it is assumed under the null hypothesis that
\((A+G,C+T) \sim \text{Dirichlet}(1,1)\); in other words, \(A+G\) and \(C+T\) are each
uniformly distributed on the unit interval, subject to \(A+G+C+T=1\).
In both cases, the test statistic is \(\eta_V^* = |A+G-0.5|\).
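The statistic and its null distribution under each setting can be sketched as follows. This is a minimal illustration, not the package's implementation: the function names eta_stat and eta_pvalue are hypothetical, and the p-value is computed under the assumption that small values of \(\eta_V^*\) count as evidence against \(H_0\). It relies on two standard facts: if \((A,C,G,T)\) is Dirichlet(1,1,1,1) then \(A+G\) is Beta(2,2), and if \((A+G,C+T)\) is Dirichlet(1,1) then \(A+G\) is uniform on the unit interval.

```r
# Hypothetical helpers (not part of the package).

# Test statistic |A + G - 0.5| from a named vector of relative frequencies.
eta_stat <- function(freqs) {
  abs(freqs[["A"]] + freqs[["G"]] - 0.5)
}

# Probability under the null of a statistic at least as small as `eta`.
# Assumption: small values of the statistic are evidence against H0.
eta_pvalue <- function(eta, type = c("simplex", "interval")) {
  type <- match.arg(type)
  stopifnot(eta >= 0, eta <= 0.5)
  if (type == "simplex") {
    # A + G ~ Beta(2, 2) when (A, C, G, T) ~ Dirichlet(1, 1, 1, 1)
    pbeta(0.5 + eta, 2, 2) - pbeta(0.5 - eta, 2, 2)
  } else {
    # A + G ~ Uniform(0, 1) when (A + G, C + T) ~ Dirichlet(1, 1)
    2 * eta
  }
}

freqs <- c(A = 0.30, C = 0.20, G = 0.25, T = 0.25)
eta <- eta_stat(freqs)
eta_pvalue(eta, "simplex")
eta_pvalue(eta, "interval")
```

For the same observed statistic the two settings give different probabilities, because the Beta(2,2) distribution concentrates more mass near 0.5 than the uniform distribution does.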