AhRs: Sample data for partitionMetric
Description
This small dataset contains aligned protein sequences for seven alleles of the
aryl hydrocarbon receptor (AhR).
Format
The format is a character matrix in which column $i$ represents
the $i$'th position in the alignment, and contains an amino
acid code or "-" indicating an indel. Row names contain the
animal species.
Source
This dataset was derived from NCBI HomoloGene:1224.
Details
A DNA or protein sequence has an associated index set
$\{1, 2, \ldots, n\}$ that labels the $n$
positions of its nucleotides or amino acids (AA).
This index set can be partitioned so that all indices referring to
the same nucleotide or AA fall in the same block.
For example, given the sequence ATGTA and its index
set $\{1, 2, \ldots, 5\}$, the "A" block
contains the subset $\{1, 5\}$, the "T" block contains
$\{2, 4\}$, and the "G" block contains $\{3\}$.
Given two aligned sequences and their respective partitions of the
index set, a metric distance between these partitions can be computed. See
partitionMetric for such a metric, along with an example
of clustering this AhR dataset.
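The partitioning described above can be sketched in a few lines of base R. The helper names index_partition and variation_of_information are hypothetical and are not part of the package. The distance shown is the variation of information, a well-known metric on partitions, used here only as an illustrative stand-in; it is not claimed to be the metric that partitionMetric computes.

```r
# Partition the index set of a sequence by residue: each block holds the
# positions that share the same character. (Hypothetical helper.)
index_partition <- function(s) {
  residues <- strsplit(s, "", fixed = TRUE)[[1]]
  split(seq_along(residues), residues)
}

p <- index_partition("ATGTA")
# p$A holds positions 1 and 5, p$T holds 2 and 4, p$G holds 3.

# Variation of information between the partitions induced by two aligned
# sequences of equal length. This is a stand-in metric for illustration,
# not necessarily the one implemented by partitionMetric.
variation_of_information <- function(s1, s2) {
  a <- strsplit(s1, "", fixed = TRUE)[[1]]
  b <- strsplit(s2, "", fixed = TRUE)[[1]]
  stopifnot(length(a) == length(b))
  # Shannon entropy of the block sizes of a partition given as labels.
  entropy <- function(labels) {
    prob <- as.numeric(table(labels)) / length(labels)
    -sum(prob * log(prob))
  }
  # Pairing the labels position by position gives the common refinement
  # of the two partitions.
  joint <- paste(a, b, sep = "|")
  2 * entropy(joint) - entropy(a) - entropy(b)
}

d_same <- variation_of_information("ATGTA", "CGTGC")
d_diff <- variation_of_information("AAAA", "ACGT")
```

Because only the grouping of positions matters, ATGTA and CGTGC induce the same partition and lie at distance zero (up to floating-point rounding) even though they share no residues at any position. A row of a character matrix in the format described above could be collapsed with paste(row, collapse = "") before being passed to either helper.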
References
Mark Hahn, Aryl hydrocarbon receptors: diversity and evolution. Chem
Biol Interact, 2002, 141, 131-160