Descriptors-class: Class "Descriptors"

Description

The Descriptors class extends the data.frame class. In addition to the descriptors themselves, it holds any response data and the p-values that describe how the sequences differ from the space of all possible sequences. Objects of this class should be created by a call to descriptors (see arguments and details below) or simpleDescriptors.

Arguments

Objects from the Class

Objects can be created by calls of the form

descriptors(seqs, response=numeric(0), base.matrix=NA, do.var=TRUE, alags=c(1,2,3), do.mean=TRUE, do.counts=TRUE, do.position=TRUE, alphabet=seqs@alphabet, include.statistics=TRUE, accuracy=0.01)

Details

The descriptor calculation methods used here are not as sophisticated as those provided in some of the more complete QSAR packages. Instead, they rely on combining, in various ways, descriptors calculated on single amino acids. There are two reasons for this. First, the descriptors can be calculated quickly, without relying on another program. Second, it is easier to calculate the distribution of the descriptors over the sequence space. Whether the descriptors can be calculated across the whole sequence space depends on the number of descriptors and the chain length of the sequences. The advantage of knowing the descriptors over the whole sequence space is that it is easy to determine whether a descriptor on the sequences is significant. For example, if the number of hydrogen bond donors is three standard deviations above the mean number of hydrogen bond donors over all sequence space, then that is a significant descriptor. This is expressed as a p-value, which is calculated with wilcox.test, the Wilcoxon rank-sum test, a non-parametric alternative to Student's t-test.
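The "three standard deviations" idea can be illustrated in base R for a short chain, where the whole sequence space can be enumerated. This is only a minimal sketch and not the package's implementation: the four-letter alphabet and the hydrogen bond donor counts in `hbd` are hypothetical values chosen for illustration.

```r
# Hypothetical per-residue descriptor: hydrogen bond donors for a toy alphabet.
hbd <- c(A = 0, S = 1, K = 3, R = 5)

# Enumerate every sequence of length 3 over the toy alphabet (4^3 = 64 rows).
space <- expand.grid(rep(list(names(hbd)), 3), stringsAsFactors = FALSE)

# Sequence-level descriptor: total number of donors along each chain.
space.total <- apply(space, 1, function(s) sum(hbd[s]))

# Mean and standard deviation over the full sequence space.
# The population formula is used because the whole space is enumerated.
mu <- mean(space.total)
sigma <- sqrt(mean((space.total - mu)^2))

# Number of standard deviations between one observed sequence and the mean.
observed <- sum(hbd[c("R", "R", "R")])
z <- (observed - mu) / sigma
z
```

Full enumeration grows as the alphabet size raised to the chain length, which is why this approach is only workable for short chains.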

The calculations are based on the base.matrix parameter. That matrix contains the descriptors calculated on the individual amino acids, and many sequence-level descriptors can be derived from it. If the means are being calculated (do.mean=TRUE), then the mean of the descriptors for each sequence is calculated. This doubles the number of descriptors. The same is true of do.var, which uses the variance along the sequence. The autocorrelation function can also be calculated along the chain, again increasing the number of resulting descriptors. This may be useful for describing alternating patterns. The position-specific descriptors are simply the individual descriptors at a certain position, for example the number of hydrogen bond donors at position 2.
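The way sequence-level descriptors follow from a per-residue matrix can be sketched in base R. Everything here is an assumption made for illustration: the toy matrix `base`, the helper `seqDescriptors`, the output names, and in particular the autocorrelation definition (the average product of values k positions apart), which may differ from what descriptors actually computes.

```r
# Hypothetical base matrix: one row per residue, one column per descriptor.
base <- matrix(c(0, 1, 3, 5,              # hydrogen bond donors
                 1.8, -0.8, -3.9, -4.5),  # illustrative hydropathy score
               ncol = 2,
               dimnames = list(c("A", "S", "K", "R"), c("hbd", "hyd")))

# Hypothetical helper: sequence-level descriptors for one chain.
# Assumes the chain is longer than the largest lag.
seqDescriptors <- function(residues, base, lags = c(1, 2)) {
  per.pos <- base[residues, , drop = FALSE]  # one row per chain position
  n <- nrow(per.pos)
  d <- colnames(base)

  means <- setNames(colMeans(per.pos), paste0(d, ".mean"))
  vars <- setNames(apply(per.pos, 2, var), paste0(d, ".var"))

  # Average product of values k positions apart (assumed definition).
  acs <- unlist(lapply(lags, function(k) {
    ac <- colMeans(per.pos[1:(n - k), , drop = FALSE] *
                   per.pos[(k + 1):n, , drop = FALSE])
    setNames(ac, paste0(d, ".lag", k))
  }))

  # Position-specific descriptors: the raw value at each position.
  pos <- setNames(as.vector(per.pos),
                  paste0(rep(d, each = n), ".pos", rep(seq_len(n), times = length(d))))

  c(means, vars, acs, pos)
}

desc <- seqDescriptors(c("K", "A", "K", "A"), base)
desc
```

For the alternating chain K-A-K-A, the lag-2 value for hbd is large while the lag-1 value is zero, which is the kind of alternating pattern the autocorrelation descriptors are meant to pick up.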

One is often more interested in understanding what is common amongst the active sequences. This may be done by comparing a descriptor on the active sequences to the same descriptor on the inactive sequences. Since inactive sequences are rarely collected in peptide libraries, we may approximate the inactive sequences as all sequences. This assumption only holds if the number of active sequences is low relative to the size of the sequence space. This is often the case, but it must be checked during the experiment. With this assumption, p-values may be calculated for each descriptor. These p-values do not assume normality and are a measure of the overlap between the active sequences and inactive sequences. They are calculated using the Wilcoxon rank-sum test (wilcox.test). A low p-value is considered significant, and such a descriptor may be considered to be related to activity. Remember that a descriptor may be important in connection with a motif, so it is important to do both descriptor analysis and motif discovery. Setting include.statistics=TRUE calculates the p-values for each of the descriptors. This is only practical for shorter sequences, with lengths less than 10.
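The comparison of active sequences against a background can be sketched with wilcox.test from the stats package. The data below are simulated, and the descriptor names `hbd.mean` and `charge.mean` are hypothetical; this shows the kind of test described above, not how descriptors organises it internally.

```r
set.seed(1)

# Simulated descriptor values standing in for "all sequences" (background).
background <- data.frame(hbd.mean = rnorm(500, mean = 2, sd = 1),
                         charge.mean = rnorm(500, mean = 0, sd = 1))

# Simulated active sequences: shifted in hbd.mean, unchanged in charge.mean.
active <- data.frame(hbd.mean = rnorm(20, mean = 3.5, sd = 1),
                     charge.mean = rnorm(20, mean = 0, sd = 1))

# One Wilcoxon rank-sum p-value per descriptor, active vs. background.
pvals <- sapply(names(active), function(d) {
  wilcox.test(active[[d]], background[[d]])$p.value
})
pvals
```

A small p-value for a descriptor suggests that its distribution on the active sequences differs from the background. When many descriptors are tested at once, some low p-values are expected by chance alone.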

If base.matrix is NA, then the default, defaultBaseMatrix, will be used. See the documentation on that dataset for more information.

Extends

Class "data.frame", directly. Class "list", by class "data.frame", distance 2. Class "oldClass", by class "data.frame", distance 2. Class "vector", by class "data.frame", distance 3.
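What extending data.frame means in practice can be shown with a small S4 class. This is a sketch with a hypothetical class `ToyDescriptors` and an assumed slot name `response`; the slots of the real Descriptors class are not listed here and may differ.

```r
library(methods)

# Hypothetical toy class that extends data.frame and adds one slot.
setClass("ToyDescriptors",
         contains = "data.frame",
         slots = c(response = "numeric"))

# The unnamed argument supplies the data.frame part; named arguments fill slots.
toy <- new("ToyDescriptors",
           data.frame(hbd.mean = c(0.5, 1.0, 1.5)),
           response = c(0.1, 0.7, 0.9))

# The object is still a data.frame as far as inheritance checks are concerned.
is(toy, "data.frame")
extends("ToyDescriptors", "data.frame")

# The extra information travels with the object in its slot.
toy@response
```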

Examples

#load the package that provides descriptors() and the example data
library(peplib)

#calculate some descriptors
data(SHP2Sequences)

#turn off most of the descriptors so it goes fast
SHP2desc <- descriptors(SHP2Sequences, do.var=FALSE,
                        alags=c(), do.mean=TRUE, do.counts=FALSE,
                        do.position=FALSE, include.statistics=FALSE)

#plot them
plot(SHP2desc)

#get some sequences and their response set
data(AMPSequences)
data(AMPSequences.response)

#calculate descriptors again, this time attaching the first response column
AMPdesc <- descriptors(AMPSequences, response=AMPSequences.response[,1],
                       do.var=FALSE, alags=c(), do.mean=TRUE, do.counts=FALSE,
                       do.position=FALSE, include.statistics=FALSE)

#plot the descriptors
plot(AMPdesc)