Annotation Specific Kernel
Assign annotation metadata to sequences and create a kernel object which evaluates annotation information.
Show biological sequence together with annotation.
showAnnotatedSeq(x, sel = 1, ann = TRUE, pos = TRUE, start = 1, end = width(x)[sel], width = NA)
## S4 method for signature 'XStringSet'
annotationMetadata(x, annCharset= ...) <- value
## S4 method for signature 'BioVector'
annotationMetadata(x, annCharset= ...) <- value
## S3 method for class 'BioVector'
annotationMetadata(x, ...) <- value
## S3 method for class 'XStringSet'
annotationMetadata(x)
## S3 method for class 'BioVector'
annotationMetadata(x)
## S3 method for class 'XStringSet'
annotationCharset(x)
## S3 method for class 'BioVector'
annotationCharset(x)
- x: biological sequences in the form of an XStringSet (or BioVector)
- sel: single index into x for displaying a specific sequence. Default=1
- ann: show annotation information along with the sequence. Default=TRUE
- pos: show position information. Default=TRUE
- start: first position to be displayed; by default the full sequence is shown
- end: last position to be displayed; alternatively use the parameter 'width'
- width: number of positions to be displayed; alternatively use the parameter 'end'
- ...: additional parameters which are passed transparently.
- value: character vector with annotation strings, with the same length as the number of sequences. Each annotation string must have the same number of characters as the corresponding sequence. In addition to the characters defined in the annotation character set, the character "-" can be used in the annotation strings for masking sequence parts.
- annCharset: character string listing all characters used in annotation, sorted ascending according to the C locale; up to 32 characters are possible
Annotation information for sequences
For the annotation specific kernel, additional annotation information is
added to the sequence data. The annotation for one sequence consists of a
character string with a single annotation character per position, i.e.
the annotation string has the same length as the sequence. The character
set used for annotation is user-defined at the level of the XStringSet
and can contain up to 32 different characters. Each biological sequence needs
an associated annotation string consisting of characters from
this character set. The evaluation of annotation information as part of
the kernel processing during generation of a kernel matrix or an explicit
representation can be activated per kernel object.
Assignment of annotation information
The annotation character set consists of a character string listing all
allowed annotation characters in ascending order according to the C locale.
Any single byte ASCII character from the decimal range between 32 and 126,
except 45, is allowed. The character '-' (ASCII dec. 45) is used for masking
sequence parts which should not be evaluated. Because it has this special
masking function, it must not be used in annotation character sets.
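The rules for an annotation character set can be checked with a few lines of base R. The following is only a minimal sketch: the helper isValidAnnCharset is hypothetical and not part of the package, and the requirements that the string is non-empty and free of duplicates are assumptions that the text above does not state explicitly.

```r
## Hypothetical helper, not part of the package: checks a candidate
## annotation character set against the rules described above.
isValidAnnCharset <- function(annCharset) {
    if (!is.character(annCharset) || length(annCharset) != 1 ||
        is.na(annCharset))
        return(FALSE)
    codes <- utf8ToInt(annCharset)
    n <- length(codes)
    ## up to 32 characters (non-empty is an assumption of this sketch)
    if (n < 1 || n > 32)
        return(FALSE)
    ## printable ASCII only (decimal 32 to 126), without '-' (decimal 45)
    if (any(codes < 32 | codes > 126) || any(codes == 45))
        return(FALSE)
    ## strictly ascending byte values, i.e. sorted as in the C locale;
    ## rejecting duplicates is an assumption of this sketch
    all(diff(codes) > 0)
}

isValidAnnCharset("ei")    # sorted, allowed characters only
isValidAnnCharset("ie")    # not sorted ascending
isValidAnnCharset("e-i")   # contains the masking character
```

Comparing the byte values directly avoids any dependence on the locale of the running R session.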
The annotation character set is assigned to the sequence set with the
annotationMetadata function (see below). It is stored in the
metadata list as the named element annotationCharset and can be kept
along with other metadata assigned to the sequence set. The annotation
strings for the individual sequences are represented as a character vector
and can be assigned to the XStringSet, together with the annotation
character set, as element related metadata. Element related
metadata is stored in a DataFrame, and the columns of this data frame
represent the different types of metadata that can be assigned in parallel.
The column name for the sequence related annotation information is
"annotation" (see the Example section for an example of annotation metadata
assignment). Annotation metadata can be assigned to a sequence set together
with position metadata (see positionMetadata).
Annotation Specific Kernel Processing
The annotation specific variant of a kernel, e.g. of the spectrum kernel,
appends the annotation characters corresponding to a specific k-mer to this
k-mer and treats the resulting pattern as one feature, the basic unit for
similarity determination. The full feature space of an annotation specific
spectrum kernel is the Cartesian product of the set of all possible sequence
patterns with the set of all possible annotation patterns. Depending on the
number of characters in the annotation character set, the feature space
grows drastically compared to the normal spectrum kernel: for k-mers of
length k and c annotation characters it is larger by a factor of c^k.
However, through annotation the similarity consideration between two
sequences can be split into independent parts that are considered
separately, e.g. coding/non-coding, exon/intron, etc. For amino acid
sequences, for example, a heptad annotation (consisting of a usually
periodic pattern of 7 characters, a to g) can be used as annotation, as in
the prediction of coiled coil structures (see reference Mahrenholz, 2011).
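The construction of annotation specific features can be illustrated in base R. This is a simplified sketch for the spectrum kernel only, not the package implementation: the helper annSpecKmers is hypothetical, and masking and invalid characters are ignored here.

```r
## Hypothetical helper, not part of the package: lists the annotation
## specific k-mer features of one sequence by appending the annotation
## characters of each k-mer to the k-mer itself.
annSpecKmers <- function(seq, ann, k) {
    stopifnot(nchar(seq) == nchar(ann), k >= 1, nchar(seq) >= k)
    starts <- seq_len(nchar(seq) - k + 1)
    kmers <- substring(seq, starts, starts + k - 1)
    annKmers <- substring(ann, starts, starts + k - 1)
    paste0(kmers, annKmers)
}

## the 3-mer ACG occurs twice, but in two different annotation contexts,
## so it yields two distinct features
feats <- annSpecKmers("ACGACG", "eeeiii", k = 3)
print(feats)

## size of the full feature space for DNA, k = 3 and character set "ei":
## number of sequence patterns times number of annotation patterns
nSeqPatterns <- 4^3
nAnnPatterns <- nchar("ei")^3
featureSpaceSize <- nSeqPatterns * nAnnPatterns
print(featureSpaceSize)
```

In a standard spectrum kernel both occurrences of ACG would count towards the same feature; with annotation they are counted separately, which is what splits the similarity into independent parts.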
The parameter annSpec, passed during creation of a kernel object, controls
whether annotation information is evaluated by the kernel (see functions
spectrumKernel, gappyPairKernel, motifKernel).
In this way, sequences with annotation can be evaluated both annotation
specifically and without annotation by using two different kernel objects
(see examples below). The annotation specific kernel variant is available
for all kernels in this package except for the mismatch kernel.
With the function annotationMetadata, annotation metadata can be assigned
to sequences defined as XStringSet (or BioVector). The sequence annotation
strings are stored as element related information and can be retrieved with
the method mcols. The characters used for annotation are stored as
annotation character set for the sequence set and can be retrieved
with the method metadata. For the assignment of annotation
metadata to biological sequences this function should be used instead of the
lower level functions metadata and mcols. The function
annotationMetadata performs several checks and also takes care
that other metadata or element metadata assigned to the object is kept.
Annotation metadata is deleted if the annotation is set to NULL.
The function showAnnotatedSeq displays individual sequences aligned with
the annotation string, with 50 positions per line. The two header lines show
the start position for each block of 10 characters.
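The idea of an aligned display can be sketched in base R as follows. The helper alignedBlocks is hypothetical and deliberately simplified: it splits sequence and annotation into blocks of 50 positions and prints them one below the other, but it does not reproduce the header lines or the exact output format of showAnnotatedSeq.

```r
## Hypothetical helper, not the package implementation: splits a sequence
## and its annotation string into aligned blocks of 'lineWidth' positions.
alignedBlocks <- function(seq, ann, lineWidth = 50) {
    stopifnot(nchar(seq) == nchar(ann), nchar(seq) >= 1)
    starts <- seq(1, nchar(seq), by = lineWidth)
    ends <- pmin(starts + lineWidth - 1, nchar(seq))
    data.frame(start = starts,
               sequence = substring(seq, starts, ends),
               annotation = substring(ann, starts, ends),
               stringsAsFactors = FALSE)
}

## toy data: 120 positions result in blocks of 50, 50 and 20 positions
blocks <- alignedBlocks(strrep("ACGT", 30), strrep("eeii", 30))

## print each sequence block with its annotation block directly below
for (i in seq_len(nrow(blocks))) {
    cat(sprintf("%5d  %s\n       %s\n",
                blocks$start[i], blocks$sequence[i], blocks$annotation[i]))
}
```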
The method annotationMetadata<- assigns annotation metadata to a sequence
set. The annotation character set must also be specified in the assignment.
Annotation characters which are not listed in the character set are treated
like invalid sequence characters. They interrupt open patterns and lead
to a restart of the pattern search at this position.
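The effect of characters outside the annotation character set can be illustrated for the simple case of a spectrum kernel, where an interrupted pattern means that no k-mer overlapping the affected position is counted. The sketch below is an assumption-based illustration, not the package implementation; the helper validKmerStarts is hypothetical, and the behavior for other kernels may differ in detail.

```r
## Hypothetical helper, not part of the package: returns the start positions
## of all k-mers whose annotation window contains only characters from the
## annotation character set.
validKmerStarts <- function(ann, annCharset, k) {
    annChars <- strsplit(ann, "", fixed = TRUE)[[1]]
    allowed <- strsplit(annCharset, "", fixed = TRUE)[[1]]
    valid <- annChars %in% allowed
    n <- length(annChars)
    if (n < k)
        return(integer(0))
    starts <- seq_len(n - k + 1)
    keep <- vapply(starts, function(s) all(valid[s:(s + k - 1)]), logical(1))
    starts[keep]
}

## the masking character at position 3 interrupts the pattern search;
## it restarts at position 4
starts <- validKmerStarts("ee-eeii", "ei", k = 2)
print(starts)
```

Because the masking character "-" is never part of the annotation character set, masked sequence parts are excluded by the same check in this sketch.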
annotationMetadata: a character vector with the annotation strings
annotationCharset: a character vector with the annotation character set
## load the package providing the functions used below
library(kebabs)

## create a set of annotated DNA sequences
## instead of user provided sequences in XStringSet format
## for this example a set of DNA sequences is created
x <- DNAStringSet(c("AGACTTAAGGGACCTGGTCACCACGCTCGGTGAGGGGGACGGGGTGT",
                    "ATAAAGGTTGCAGACATCATGTCCTTTTTGTCCCTAATTATTTCAGC",
                    "CAGGAATCAGCACAGGCAGGGGCACGGCATCCCAAGACATCTGGGCC",
                    "GGACATATACCCACCGTTACGTGTCATACAGGATAGTTCCACTGCCC",
                    "ATAAAGGTTGCAGACATCATGTCCTTTTTGTCCCTAATTATTTCAGC"))
names(x) <- paste0("S", seq_along(x))

## define the character set used in annotation
## the masking character '-' is not part of the character set
anncs <- "ei"

## annotation strings for each sequence as character vector
## in the third and fourth sample a part of the sequence is masked
annotStrings <- c("eeeeeeeeeeeeiiiiiiiiieeeeeeeeeeeeeeeeiiiiiiiiii",
                  "eeeeeeeeeiiiiiiiiiiiiiiiiiiieeeeeeeeeeeeeeeeeee",
                  "---------eeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiiiiiiii",
                  "eeeeeeeeeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiiiiii----",
                  "eeeeeeeeeeeeiiiiiiiiiiiiiiiiiiiiiiieeeeeeeeeeee")

## assign metadata to DNAStringSet object
annotationMetadata(x, annCharset=anncs) <- annotStrings

## show annotation
annotationMetadata(x)
annotationCharset(x)

## show sequence 3 aligned with annotation string
showAnnotatedSeq(x, sel=3)

## create annotation specific spectrum kernel
speca <- spectrumKernel(k=3, annSpec=TRUE, normalized=FALSE)

## show details of kernel object
kernelParameters(speca)

## this kernel object can now be used in a classification or regression
## task in the usual way or you can use the kernel for example to generate
## the kernel matrix for use with another learning method in another R
## package.
kma <- speca(x)
kma[1:5,1:5]

## generate a dense explicit representation for annotation-specific kernel
era <- getExRep(x, speca, sparse=FALSE)
era[1:5,1:8]

## when a standard spectrum kernel is used with annotated
## sequences the annotation information is not evaluated
spec <- spectrumKernel(k=3, normalized=FALSE)
km <- spec(x)
km[1:5,1:5]

## finally delete annotation metadata if no longer needed
annotationMetadata(x) <- NULL

## show empty metadata
annotationMetadata(x)
annotationCharset(x)