dClust: Clustering sequences based on domain sequence

Description

Proteins are clustered by their domain sequence, i.e. the ordered sequence of protein domains found in each protein. All proteins with identical domain sequences are assigned to the same cluster.

Usage

dClust(hmmer.table)

Arguments

hmmer.table

A data.frame of results from a hmmerScan run against a domain database.

Value

The output is a numeric vector with one element for each unique sequence in the Query column of the input hmmer.table. Sequences with the same number belong to the same cluster. The name of each element identifies the sequence.

This vector also has an attribute called cluster.info, which is a character vector containing the domain sequences. The first element is the domain sequence for cluster 1, the second for cluster 2, etc. In this way you can, in addition to clustering the sequences, also see which domains the sequences of a particular cluster share.
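As a minimal sketch of how such a result could be inspected, the snippet below builds a toy stand-in for the returned vector by hand instead of calling dClust. The protein names, the domain names and the comma separator inside cluster.info are all hypothetical; only the overall shape (a named numeric vector with a cluster.info attribute) follows the description above.

```r
# Toy stand-in for a dClust() result; all names and the "," separator are hypothetical
clustering <- c(GID1_seq1 = 1, GID1_seq2 = 2, GID2_seq1 = 1, GID3_seq1 = 2)
attr(clustering, "cluster.info") <- c("PF00001,PF00002", "PF00005")

# Which sequences belong to cluster 1?
members <- names(clustering)[clustering == 1]

# Which domain sequence do the members of cluster 1 share?
domains <- attr(clustering, "cluster.info")[1]

# Look up the domain sequence of every protein via its cluster number
per.protein <- attr(clustering, "cluster.info")[clustering]
names(per.protein) <- names(clustering)
```

Because the cluster number doubles as an index into cluster.info, the last two lines give a per-protein view of the domain sequences without any further bookkeeping.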

Details

A domain sequence is simply the ordered list of domains occurring in a protein. Not all proteins contain known domains, but those that do will have from one to several domains, and these can be ordered to form a sequence. Since domains can be more or less conserved, two proteins can be quite different in their amino acid sequence and still share the same domains. Describing and grouping proteins by their domain sequence was proposed by Snipen & Ussery (2012) as an alternative to clusters based on pairwise alignments, see bClust. Domain sequence clusters are less influenced by gene prediction errors.
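The idea can be illustrated with a small base R sketch. This is not the code inside dClust; it is a toy reconstruction under stated assumptions: the hit table is assumed to have the columns Query, Hit and Start (only Query is named on this page), the domain and protein names are made up, and domains are joined with a comma.

```r
# Toy hit table; the column names Hit and Start are assumptions, the data are made up
hits <- data.frame(
  Query = c("protA", "protA", "protB", "protB", "protC"),
  Hit   = c("PF00002", "PF00001", "PF00001", "PF00002", "PF00005"),
  Start = c(120, 10, 15, 140, 30),
  stringsAsFactors = FALSE
)

# Order the hits by protein, then by start position within each protein
hits <- hits[order(hits$Query, hits$Start), ]

# Collapse the ordered domains of each protein into one string: its domain sequence
dom.seq <- tapply(hits$Hit, hits$Query, paste, collapse = ",")

# Proteins with identical domain sequences receive the same cluster number
cluster.info <- unique(as.character(dom.seq))
clustering <- match(as.character(dom.seq), cluster.info)
names(clustering) <- names(dom.seq)
```

Here protA and protB end up in the same cluster because both carry PF00001 followed by PF00002, even though the domains start at different positions in the two proteins.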

The input is a data.frame of the type produced by readHmmer. Typically, it is the result of scanning proteins (using hmmerScan) against Pfam-A or any other HMMER3 database of protein domains. It is highly recommended that you remove overlapping hits in hmmer.table before you pass it as input to dClust. Use the function hmmerCleanOverlap for this. Overlapping hits are in some cases real hits, but often the poorest of them are artifacts.
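To make the notion of overlapping hits concrete, here is one possible greedy filter written in base R. It is only an illustration and should not be read as the algorithm used by hmmerCleanOverlap. The helper drop_overlaps is hypothetical, as are the column names Hit, Start, Stop and Evalue and the toy data.

```r
# Toy hits for one protein; the second hit overlaps the first and has a poorer E-value
hits <- data.frame(
  Query  = c("protA", "protA", "protA"),
  Hit    = c("PF00001", "PF00003", "PF00002"),
  Start  = c(10, 40, 120),
  Stop   = c(90, 100, 200),
  Evalue = c(1e-30, 1e-5, 1e-20),
  stringsAsFactors = FALSE
)

# Hypothetical helper: visit hits from best to poorest E-value and
# keep a hit only if it does not overlap any hit already kept
drop_overlaps <- function(tab) {
  tab <- tab[order(tab$Evalue), ]
  keep <- logical(nrow(tab))
  for (i in seq_len(nrow(tab))) {
    kept <- tab[keep, ]
    overlaps <- any(tab$Start[i] <= kept$Stop & tab$Stop[i] >= kept$Start)
    keep[i] <- !overlaps
  }
  tab[keep, ]
}

# Apply the filter separately to the hits of each protein
cleaned <- do.call(rbind, lapply(split(hits, hits$Query), drop_overlaps))
```

In this toy case PF00003 is discarded because it overlaps the stronger PF00001 hit, while PF00002 is kept because it covers a separate region of the protein.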

References

Snipen, L., Ussery, D.W. (2012). A domain sequence approach to pangenomics: Applications to Escherichia coli. F1000Research, 1:19.

Examples

library(micropan)

# Using HMMER3 result files in the micropan package
# We need to uncompress them first...
extdata.path <- file.path(path.package("micropan"),"extdata")
filenames <- c("GID1_vs_Pfam-A.hmm.txt",
"GID2_vs_Pfam-A.hmm.txt",
"GID3_vs_Pfam-A.hmm.txt")
pth <- lapply( file.path( extdata.path, paste( filenames, ".xz", sep="" ) ), xzuncompress )

# ...reading the HMMER3 results...
hmmer.table <- NULL
for(i in seq_along(filenames)){
  htab <- readHmmer(file.path(extdata.path,filenames[i]))
  htab <- hmmerCleanOverlap(htab)   # Cleaning the results by removing overlapping hits
  hmmer.table <- rbind(hmmer.table,htab)
}
# ...and compressing the result files again...
pth <- lapply( file.path( extdata.path, filenames ), xzcompress )

# Finally, the clustering
clustering.domains <- dClust(hmmer.table)
