Format
MotifDb object of length 426; to access metadata use mcols(hocomoco)
- providerName
- Name provided by HOCOMOCO
- providerId
- ID provided by HOCOMOCO including experiment type
- dataSource
"HOCOMOCO"
- geneSymbol
- Gene symbol for the transcription factor
- geneId
- Entrez gene id for the transcription factor
- geneIdType
"ENTREZ"
- proteinId
- UNIPROT id for the transcription factor
- proteinIdType
"UNIPROT"
- organism
"Hsapiens"
- sequenceCount
- Number of sequences evaluated for producing the PWM
- bindingSequence
- Consensus sequence for the motif
- bindingDomain
NA (incomplete)
- tfFamily
NA (incomplete)
- experimentType
- from http://autosome.ru/HOCOMOCO/Details.php#200,
quoted here: "TFBS model identification modes
To construct TFBS models ChIPMunk was run four times: two times (f1) and (f2)
with uniform model positional prior and two times (si) and (do) with
informative model positional prior. The min-to-max (f1) model length estimation mode was used with the min length
of 7 bp and increasing it by 1 bp until the default max length of 25 bp was
reached following the optimal length selection procedure as in Kulakovskiy
and Makeev, Biophysics, 2009. For max-to-min (f2) model length estimation
mode we started from 25 bp and searched for the best alignment decreasing the
length by 1 bp until the minimal length of 7 bp. We also used the single (si)
and double box (do) model positional priors in order to simulate DNA helix
turn. For a single box, the positional weights are to be distributed as
cos^2(pi * n / T), where T = 10.5 is the DNA helix pitch, n is the coordinate
within the alignment, and the center of the alignment of the length L is at
n=0. During the internal cycle of PWM optimization the PWM column scores are
multiplied by prior values so the columns closer to the center of the
alignment (n=0) receive no score penalty while the columns around (n =
5,6,-5,-6) contribute much less to the score of the PWM under optimization.
The single box model prior was used along with the min-to-max length
estimation mode (si). We also used the double box model prior with a shape
prior equal to sin^2(pi * n / T), which was used to search for possibly longer
double box models in the max-to-min length estimation mode (do).
Model quality assignment
The resulting models were rated (from A to F) according to their quality.
Model quality rates from A-to-D were assigned to proteins known to be TFs,
including those listed in Schaefer et al., Nucleic Acids Research, 2011 with
addition of a number of proteins having relevant models and sufficient
evidence to be TFs. The ratings were assigned by human curation according to
the following criteria:
  * Relevant distribution of position-specific information content over alignment
    columns, which means a model LOGO representation displaying well formed core
    positions with a high information content surrounded by flanking letters with
    lower information content; the information content at flanking positions
    decreasing with the distance from the model core.
  * "Stability", which means that in more than one of the ChIPMunk modes we
    obtained models with a similar length, consensus, and comparable number of
    aligned binding sites, along with a similar shape of model LOGO
    representation.
  * "Similarity" of the model to the binding sequence consensus
    for this TF given in the UniProt or other databases, which means similarity
    of the shape of the model LOGO and TFBS lengths to those of other TFs from
    the same TF family.
  * "A total number of binding sites" was also considered as
    a quality measure, as a large set of binding regions (mostly but not limited
    to ChIP-Seq and parallel SELEX) implies that there are many observations of
    each letter in any position of the alignment, particularly many observations
    of non-consensus letters in core positions. In positions with low information
    content, where there is no strong consensus, all variants have many
    observations, and thus the observed letter frequencies are less dependent on
    statistical fluctuations.
Quality A was assigned to high confidence models complying with all four
criteria listed in the section above.
Quality B was assigned to models built from large sequence sets that failed
no more than one out of the three remaining criteria.
Quality C was assigned to models built from small sequence sets but (with a
number of specifically marked exceptions) complying with the three remaining
criteria.
Quality D models missed part of the known consensus sequence or had no
clearly significant core positions in the TFBS model.
Quality E (error) was assigned to models for proteins not convincingly shown
to be TFs or to models exhibiting an irrelevant LOGO shape or a wrong
consensus sequence.
Quality F (failure) was assigned to TFs for which there was no reliable model
identified."
- pubmedID
"23175603"
see Source for more details
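
The positional priors described in the experimentType quotation above can be tabulated with a few lines of base R. This is only an illustrative sketch of the two formulas as quoted (cos^2 for the single box prior, sin^2 for the double box prior, with T = 10.5); the helper names single_box_prior and double_box_prior are hypothetical and are not part of ChIPMunk, MotifDb, or this data set.

```r
# Positional prior weights as described in the quoted HOCOMOCO text.
# pitch = 10.5 is the DNA helix pitch T given there; n is the column
# coordinate relative to the center of the alignment (n = 0).
helix_pitch <- 10.5

# Hypothetical helpers (names are assumptions made for this sketch).
single_box_prior <- function(n, pitch = helix_pitch) cos(pi * n / pitch)^2
double_box_prior <- function(n, pitch = helix_pitch) sin(pi * n / pitch)^2

# Tabulate both priors for columns within 6 bp of the alignment center.
n <- -6:6
priors <- data.frame(
  n          = n,
  single_box = round(single_box_prior(n), 3),
  double_box = round(double_box_prior(n), 3)
)
print(priors)
```

With the single box prior, the weight is 1 at n = 0 and falls close to 0 around n = 5, 6, -5, -6, which matches the quoted statement that central columns receive no score penalty while columns about half a helix turn away contribute much less. The double box prior is the complement of the single box prior at every position.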
Source
Kulakovskiy, I.V., Medvedeva, Y.A., Schaefer, U., Kasianov, A.S.,
Vorontsov, I.E., Bajic, V.B. and Makeev, V.J. (2013) HOCOMOCO: a comprehensive
collection of human transcription factor binding sites models. Nucleic
Acids Research, 41, D195-D202.