
linWeight(d, sigma = 1)
expWeight(d, sigma = 1)
gaussWeight(d, sigma = 1)
swdWeight(d)
## S4 method for signature 'XStringSet'
## positionMetadata(x) <- value
## S4 method for signature 'BioVector'
## positionMetadata(x) <- value
## S3 method for class 'XStringSet':
positionMetadata(x)
## S3 method for class 'BioVector':
positionMetadata(x)
Sequences are passed as DNAStringSet, RNAStringSet, AAStringSet (or as
BioVector).
The position-specific kernel is selected by setting distWeight to 1 in the
functions spectrumKernel, gappyPairKernel, motifKernel. This parameter value
is in fact interpreted as a numeric vector with 1 for zero distance and 0 for
all other distances.
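The role of the distance weights can be illustrated with a small base R sketch. It is a toy, unnormalized computation and not the package's implementation; the helper name toyDistWeightedKernel and the two short sequences are assumptions made for illustration only. The sketch sums the weight of the position difference over all pairs of identical k-mers in two sequences.

```r
## toy sketch (hypothetical helper, unnormalized, not the package code):
## sums w(q - p) over all pairs of identical k-mers found at position p
## in s1 and position q in s2
toyDistWeightedKernel <- function(s1, s2, k, w)
{
    kmers <- function(s)
    {
        if (nchar(s) < k)
            stop("sequence shorter than k")
        starts <- seq_len(nchar(s) - k + 1)
        substring(s, starts, starts + k - 1)
    }

    ## logical matrix: row p, column q is TRUE for identical patterns
    same <- outer(kmers(s1), kmers(s2), "==")
    idx <- which(same, arr.ind=TRUE)

    ## weight each matching pair by its relative distance
    sum(w(idx[, 2] - idx[, 1]))
}

## weight 1 for zero distance and 0 for all other distances
posSpecific <- function(d) as.numeric(d == 0)
## weight 1 for every distance, i.e. positions are ignored
posIndependent <- function(d) rep(1, length(d))

s1 <- "ACGTACGT"
s2 <- "TACGTACG"

toyDistWeightedKernel(s1, s1, k=3, w=posSpecific)      # 6
toyDistWeightedKernel(s1, s2, k=3, w=posSpecific)      # 0
toyDistWeightedKernel(s1, s2, k=3, w=posIndependent)   # 9
```

The two sequences share all of their 3-mers, but never at the same position, so the position-specific weighting gives 0 while the position-independent weighting counts all 9 matching pairs.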
Positive Definiteness
Standard SVMs only support positive definite kernels / kernel matrices.
This means that the distance weighting function must be chosen such that
the resulting kernel is positive definite. Symmetry of the distance
weighting function is also important for positive definiteness. Unlike usual
distances, the relative distance value here can be positive or negative,
depending on whether the pattern in the second sequence is located at a
higher or lower position than the pattern in the first sequence. The
predefined distance weighting functions except for swdWeight deliver a
positive definite kernel for all parameter settings. According to Sonnenburg
et al. 2005, the SWD kernel has empirically shown positive definiteness, but
this has not been proved for this kernel. If a weight vector with predefined
weights per distance is passed to the kernel instead of a distance weighting
function, positive definiteness of the kernel must also be ensured by an
adequate selection of the weight values.
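As a rough numerical sanity check (not a proof), the eigenvalues of a symmetric matrix can be inspected with base R. The helper isPsdMatrix below is hypothetical and not part of the package, and the tolerance is an arbitrary assumption. It is applied to two toy matrices with entries w(i - j) for positions 1 to 10, built with toeplitz(): one from exponentially decaying weights and one from a weight vector that gives weight 1 to distances 0 and 1 only.

```r
## hypothetical helper, not part of the package: TRUE if a symmetric
## matrix has no eigenvalue below a small negative tolerance
isPsdMatrix <- function(m, tol=1e-8)
{
    if (!is.matrix(m) || nrow(m) != ncol(m))
        stop("'m' must be a square matrix")
    if (!isSymmetric(unname(m)))
        return(FALSE)

    ev <- eigen(m, symmetric=TRUE, only.values=TRUE)$values
    min(ev) >= -tol * max(abs(ev), 1)
}

## toy weight matrices with entries w(i - j) for positions 1..10
d <- 0:9
wExp <- toeplitz(exp(-d / 5))            # exponential decay, sigma = 5
wBad <- toeplitz(c(1, 1, rep(0, 8)))     # weight 1 for distances 0 and 1

isPsdMatrix(wExp)    # TRUE
isPsdMatrix(wBad)    # FALSE
```

The second matrix has negative eigenvalues, which shows that an arbitrary weight vector does not automatically lead to a positive definite result. The same check can be applied to a kernel matrix once it is available as a plain numeric matrix.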
User-Defined Distance Function
For user-defined distance functions, symmetry and positive definiteness of
the resulting kernel are important. Such a function receives a numeric
distance vector 'x' as input (and possibly other parameters controlling the
weighting behavior) and returns a weight vector of identical length. When
called with a missing parameter x, all other parameters must be supplied or
have appropriate default values. In this case the function must return a
new function with just the single parameter x, which calls the original
user-defined function with x and all the other parameters set to the values
passed in the call.
This behavior is needed for assigning the function with missing parameter x
to the distWeight parameter of the kernel. At the time of kernel definition
the actual distance values are not available. Later, when sequence data is
passed to this kernel for generation of a kernel matrix or an explicit
representation, this single-argument function is called to get the
distance-dependent weights. The code for the predefined expWeight function
in the example section below shows how a user-specific function can be set
up.
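The requirements described above can be checked on a candidate function before it is passed to a kernel. The following base R sketch assumes a stand-alone variant of the exponential weighting; the name expWeightSketch is hypothetical, and the parameter check is written with base R only so that the snippet runs without loading any package.

```r
## base R variant of an exponential weighting function;
## the name expWeightSketch is hypothetical
expWeightSketch <- function(x, sigma=1)
{
    ## validate all parameters except for x first
    if (!is.numeric(sigma) || length(sigma) != 1 || is.na(sigma) ||
        sigma <= 0)
        stop("'sigma' must be a positive number")

    ## without x return a single-argument function with sigma fixed
    if (missing(x))
        return(function(x) expWeightSketch(x, sigma=sigma))

    if (!is.numeric(x))
        stop("'x' must be a numeric vector")

    exp(-abs(x) / sigma)
}

w5 <- expWeightSketch(sigma=5)
d <- -20:20

is.function(w5)                                 # closure is returned
names(formals(w5))                              # single parameter "x"
length(w5(d)) == length(d)                      # identical length
all.equal(w5(d), w5(-d))                        # symmetric in distance
all.equal(w5(d), expWeightSketch(d, sigma=5))   # same weights as direct call
```

These checks cover the calling convention and symmetry only; positive definiteness of the resulting kernel still has to be ensured separately.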
Offset
To allow flexible alignment of sequence positions without redefining the
XStringSet or BioVector, an additional metadata element named offset can be
assigned to the sequence set via positionMetadata<- (see example below).
Position metadata is a numeric vector with the same number of elements as
the sequence set and gives for each sequence an offset to position 1. When
position metadata is not assigned to a sequence set, position 1 is
associated with the first character in each sequence of the sequence set,
i.e. in this case the sequences should be aligned such that all have the
same starting position with respect to the learning task (e.g. all sequences
start at a transcription start site). Offset information is only evaluated
in position dependent kernel variants.
sequenceKernel
in three different ways representing the full range of position dependency:
presence in functions spectrumKernel, gappyPairKernel, motifKernel)
in the sequences into account for similarity determination.
spectrumKernel, gappyPairKernel, motifKernel, annotationMetadata, metadata,
mcols
## plot predefined weighting functions for sigma=10
curve(linWeight(x, sigma=10), from=-20, to=20, xlab="pattern distance",
      ylab="weight", main="Predefined Distance Weighting Functions",
      col="green")
curve(expWeight(x, sigma=10), from=-20, to=20, col="blue", add=TRUE)
curve(gaussWeight(x, sigma=10), from=-20, to=20, col="red", add=TRUE)
curve(swdWeight(x), from=-20, to=20, col="orange", add=TRUE)
legend('topright', inset=0.03, title="Weighting Functions",
       c("linWeight", "expWeight", "gaussWeight", "swdWeight"),
       fill=c("green", "blue", "red", "orange"))
text(14, 0.70, "sigma = 10")
## instead of user provided sequences in XStringSet format
## for this example a set of DNA sequences is created
## RNA- or AA-sequences can be used as well with the motif kernel
dnaseqs <- DNAStringSet(c("AGACTTAAGGGACCTGGTCACCACGCTCGGTGAGGGGGACGGGGTGT",
                          "ATAAAGGTTGCAGACATCATGTCCTTTTTGTCCCTAATTATTTCAGC",
                          "CAGGAATCAGCACAGGCAGGGGCACGGCATCCCAAGACATCTGGGCC",
                          "GGACATATACCCACCGTTACGTGTCATACAGGATAGTTCCACTGCCC",
                          "ATAAAGGTTGCAGACATCATGTCCTTTTTGTCCCTAATTATTTCAGC"))
names(dnaseqs) <- paste("S", 1:length(dnaseqs), sep="")
## create a distance weighted spectrum kernel with linear decrease of
## weights in a range of 20 bases
spec20 <- spectrumKernel(k=3, distWeight=linWeight(sigma=20))
## show details of kernel object
kernelParameters(spec20)
## this kernel can now be used in a classification or regression task
## in the usual way or a kernel matrix can be generated for use with
## another learning method
km <- spec20(x=dnaseqs, selx=1:5)
km[1:5,1:5]
## instead of a distance weighting function also a weight vector can be
## passed in the distWeight parameter but the values must be chosen such
## that they lead to a positive definite kernel
##
## in this example only patterns within a 5 base range are considered with
## slightly decreasing weights
specv <- spectrumKernel(k=3, distWeight=c(1,0.95,0.9,0.85,0.8))
km <- specv(dnaseqs)
km[1:5,1:5]
## position specific spectrum kernel
specps <- spectrumKernel(k=3, distWeight=1)
## get position specific kernel matrix
km <- specps(dnaseqs)
km[1:5,1:5]
## example with offset to align sequence positions (e.g. the
## transcription start site), the value gives the offset to position 1
positionOne <- c(9,6,3,1,6)
positionMetadata(dnaseqs) <- positionOne
## show position metadata
positionMetadata(dnaseqs)
## generate kernel matrix with position-specific spectrum kernel
km1 <- specps(dnaseqs)
km1[1:5,1:5]
## example for a user defined weighting function
## please stick to the order as described in the comments below and
## make sure that the resulting kernel is positive definite
expWeightUserDefined <- function(x, sigma=1)
{
    ## check presence and validity of all parameters except for x
    if (!isSingleNumber(sigma))
        stop("'sigma' must be a number")
    ## if x is missing the function returns a closure where all parameters
    ## except for x have a defined value
    if (missing(x))
        return(function(x) expWeightUserDefined(x, sigma=sigma))
    ## pattern distance vector x must be numeric
    if (!is.numeric(x))
        stop("'x' must be a numeric vector")
    ## create vector of distance weights from the
    ## input vector of pattern distances x
    exp(-abs(x)/sigma)
}
## define kernel object with user defined weighting function
specud <- spectrumKernel(k=3, distWeight=expWeightUserDefined(sigma=5),
                         normalized=FALSE)