## kbsvm(...., kernel=list(kernel1, kernel2), pkg=pkg1, svm=svm1,
## cost=cost1, ...., cross=0, noCross=1, ....)
## kbsvm(...., kernel=kernel1, pkg=pkg1, svm=svm1,
## cost=c(cost1, cost2), ...., cross=0, noCross=1, ....)
## kbsvm(...., kernel=kernel1, pkg=c(pkg1, pkg1, pkg1),
## svm=c(svm1, svm2, svm3), cost=c(cost1, cost2, cost3), ....,
## cross=0, noCross=1, ....)
## kbsvm(...., kernel=kernel1, pkg=c(pkg1, pkg2, pkg3),
## svm=c(svm1, svm2, svm3), cost=c(cost1, cost2, cost3), ....,
## cross=0, noCross=1, ....)
## kbsvm(...., kernel=list(kernel1, kernel2, kernel3), pkg=c(pkg1, pkg2),
## svm=c(svm1, svm2), cost=c(cost1, cost2), ...., cross=0,
## noCross=1, ....)
## for details see below
The grid search results are stored in the KeBABS model returned by kbsvm and can be retrieved with the accessor modelSelResult (see KBModel).
To simplify the selection of an appropriate sequence kernel (including the setting of the kernel parameters), of the SVM implementation and of the SVM hyperparameters, KeBABS provides grid search functionality. Beyond running the same learning task for different settings of the SVM hyperparameters, grid search is understood here in the broader sense of finding good values for all major variable parts of the learning task, which includes the selection of the sequence kernel and its kernel parameters, the selection of the SVM implementation (via package and SVM type) and the setting of the SVM hyperparameters.
KeBABS supports the joint variation of any combination of these learning aspects together with cross validation (CV) to find the best setting based on the cross validation performance. After the grid search the performance values of the different settings and the best setting of the grid search run can be retrieved from the KeBABS model with the accessor modelSelResult.
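As a minimal sketch of this workflow (assuming the TFBS example data that ships with KeBABS and the LiblineaR C-svc SVM; the kernel and cost values are illustrative choices, not recommendations):

## minimal grid search: one kernel, several values for the hyperparameter cost
## (illustrative parameter values)
library(kebabs)
data(TFBS)
gappyK1M3 <- gappyPairKernel(k=1, m=3)
model <- kbsvm(x=enhancerFB, y=yFB, kernel=gappyK1M3,
               pkg="LiblineaR", svm="C-svc",
               cost=c(0.1, 1, 10), explicit="yes", cross=5)
## retrieve the grid search result stored in the model
modelSelResult(model)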
Grid search is started with the method kbsvm by passing multiple values to parameters for which in regular training only a single value is used. Multiple values can be passed for the parameter kernel as a list of kernel objects and for the parameters pkg, svm and the hyperparameters of the used SVMs as vectors (numeric or integer vectors, depending on the hyperparameter). The parameter cost in the usage section above is just one representative of the SVM hyperparameters that can be varied in grid search. The following types of grid search are supported (for examples see below):
- variation of one or multiple hyperparameters of a given SVM from a given package by passing multiple values for the hyperparameter(s)
- variation of the SVM and/or the package providing it, with or without variation of the hyperparameter values
- variation of the sequence kernel by passing multiple kernel objects, which can differ in kernel type and kernel parameters and can be used with or without position specific functionality (see positionMetadata) or with or without annotation specific functionality (see annotationMetadata) using one specific or multiple annotations, resulting in considerable variation possibilities on the kernel side; the kernel objects for the different parameter settings of the kernel must be precreated and are passed as a list to kbsvm
Usually each kernel performs best at different hyperparameter values. Therefore, in general, varying only the kernel parameters without also varying the hyperparameter values does not make sense; both must be varied together as described below.
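A sketch of such a joint variation (reusing the TFBS data from the sketch above; kernel and cost values are illustrative assumptions): passing a list of kernel objects together with a vector of cost values evaluates every kernel with every cost setting.

## joint variation of kernel parameters and SVM hyperparameter cost
## (illustrative parameter values)
specK25 <- spectrumKernel(k=2:5)            ## list of four spectrum kernels
cost <- c(0.01, 0.1, 1, 10, 100)            ## five cost values per kernel
model <- kbsvm(x=enhancerFB, y=yFB, kernel=specK25,
               pkg="LiblineaR", svm="C-svc", cost=cost,
               explicit="yes", cross=5)
modelSelResult(model)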
For collecting performance values grid search is organized in a matrix-like manner, with the different kernel objects representing the rows and the different hyperparameter settings or SVM and hyperparameter settings representing the columns of the matrix. If multiple hyperparameters are used for a single SVM, the same entry in all hyperparameter vectors is used as one parameter set corresponding to a single column in the grid matrix. The same applies to multiple SVMs, i.e. even when multiple SVMs are used from the same package the pkg parameter must have one entry for each entry in the svm parameter (see examples below). The best performing setting is reported dependent on the performance objective.
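For illustration, a sketch of grid columns defined element-wise by pkg, svm and cost (reusing gappyK1M3 and the TFBS data from the sketches above; the chosen SVM types and cost values are assumptions for demonstration only):

## two SVM formulations from the same package, each with two cost values;
## pkg must still have one entry for each entry in svm
## (illustrative parameter values)
pkg  <- c("LiblineaR", "LiblineaR", "LiblineaR",  "LiblineaR")
svm  <- c("C-svc",     "C-svc",     "l2rl2l-svc", "l2rl2l-svc")
cost <- c(1,           100,         1,            100)
model <- kbsvm(x=enhancerFB, y=yFB, kernel=gappyK1M3,
               pkg=pkg, svm=svm, cost=cost, explicit="yes", cross=5)
modelSelResult(model)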
Instead of a single training and test cycle for each grid point, cross validation should be used to get more representative results. In this case CV is executed for each parameter setting. For larger datasets or kernels with higher complexity the runtime for the full grid search should be limited through adequate selection of the parameter cross.
Performance measures and performance objective
The usual performance measure for grid search is the cross validation error, which is stored by default for each grid point. For datasets with, e.g., a non-symmetrical class distribution, other performance measures can be more expressive. For such situations the accuracy, the balanced accuracy and the Matthews correlation coefficient can also be stored for each grid point (see parameter perfParameters in kbsvm). (The accuracy corresponds fully to the CV error because it is just the inverted measure; it is included for easier comparability with the balanced accuracy.) The performance values can be retrieved from the model selection result object with the accessor performance. The objective for selecting the best performing parameter setting is by default the CV error. With the parameter perfObjective in kbsvm one of the other performance measures mentioned above can be chosen as objective for the best setting instead of the cross validation error.
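A sketch of storing additional performance measures and selecting by balanced accuracy (reusing gappyK1M3 and the TFBS data from above; the chosen measures and cost values are illustrative assumptions):

## store accuracy, balanced accuracy and Matthews correlation coefficient
## for each grid point and use balanced accuracy as selection objective
## (illustrative parameter values)
model <- kbsvm(x=enhancerFB, y=yFB, kernel=gappyK1M3,
               pkg="LiblineaR", svm="C-svc", cost=c(0.1, 1, 10),
               cross=5, perfParameters=c("ACC", "BACC", "MCC"),
               perfObjective="BACC")
## performance values for all grid points
performance(modelSelResult(model))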
Runtime Hints
When the parameter showCVTimes in kbsvm is set to TRUE, the runtime of the individual cross validation runs is shown for each grid point. In this way quick runtime estimates can be gathered by running the grid search for a reduced grid and extrapolating the runtimes to the full grid. A progress indication for grid search is available with the parameter showProgress in kbsvm.
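A sketch of gathering runtime estimates on a reduced grid (reusing gappyK1M3 and the TFBS data from above; the reduced grid shown here is an arbitrary assumption):

## reduced grid with CV timing and progress output for runtime estimation
## (illustrative parameter values)
model <- kbsvm(x=enhancerFB, y=yFB, kernel=gappyK1M3,
               pkg="LiblineaR", svm="C-svc", cost=c(1, 10),
               explicit="yes", cross=3, showProgress=TRUE, showCVTimes=TRUE)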
Dependent on the number of sequences, the complexity of the kernel processing, the type of chosen cross validation and the degree of variation of parameters in grid search, the runtime can grow drastically. One possible strategy for reducing the runtime is a stepwise approach: search for areas with good performance in a first coarse grid search run and then refine these areas with additional, more fine-grained grid searches.
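A sketch of such a stepwise approach (reusing gappyK1M3 and the TFBS data from above; the grids are illustrative and the refinement region must be chosen from the actual coarse results):

## step 1: coarse grid over several orders of magnitude for cost
costCoarse <- 10^(-2:3)
model <- kbsvm(x=enhancerFB, y=yFB, kernel=gappyK1M3,
               pkg="LiblineaR", svm="C-svc", cost=costCoarse,
               explicit="yes", cross=5)
modelSelResult(model)
## step 2: refine around the best performing cost value, e.g. around 10
## (assumed refinement region for illustration)
costFine <- c(2, 5, 10, 20, 50)
model <- kbsvm(x=enhancerFB, y=yFB, kernel=gappyK1M3,
               pkg="LiblineaR", svm="C-svc", cost=costFine,
               explicit="yes", cross=5)
modelSelResult(model)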
The implementation of the sequence kernels was done with a strong focus on runtime performance, which brings a considerable improvement compared to other implementations. KeBABS also provides an interface to the very fast SVM implementations in the package LiblineaR. Beyond these performance improvements, KeBABS supports the generation of sparse explicit representations for every sequence kernel, which can be used instead of the kernel matrix for learning. In many cases, especially with a large number of samples where the kernel matrix would become too large, this alternative provides additional runtime benefits. The current implementation of grid search does not make use of multi-core infrastructures; the entire processing is done on a single core.
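A brief sketch of the explicit representation (reusing gappyK1M3 and the TFBS data from above; getExRep and the kbsvm parameters explicit and explicitType are assumed here as described in the KeBABS help pages):

## sparse explicit representation of the gappy pair kernel
exRep <- getExRep(enhancerFB, kernel=gappyK1M3, sparse=TRUE)
dim(exRep)
## in kbsvm learning via the explicit representation is requested with 'explicit'
## (illustrative parameter values)
model <- kbsvm(x=enhancerFB, y=yFB, kernel=gappyK1M3,
               pkg="LiblineaR", svm="C-svc", cost=10,
               explicit="yes", explicitType="sparse", cross=5)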
kbsvm, spectrumKernel, mismatchKernel, gappyPairKernel, motifKernel, positionMetadata, annotationMetadata, performModelSelection
## load transcription factor binding site data
data(TFBS)
enhancerFB
## The C-svc implementation from LiblineaR is chosen for most of the
## examples because it is the fastest SVM implementation. With SVMs from
## other packages slightly better results could be achievable.
## To get a realistic image of possible performance values, kernel behavior
## and speed of grid search together with 10-fold cross validation a
## reasonable number of sequences is needed, which would exceed the runtime
## restrictions for automatically executed examples. Therefore the grid
## search examples must be run manually. In these examples we use the full
## dataset for grid search.
train <- sample(1:length(enhancerFB), length(enhancerFB))
## grid search with single kernel object and multiple hyperparameter values
## create gappy pair kernel with normalization
gappyK1M3 <- gappyPairKernel(k=1, m=3)
## show details of single gappy pair kernel object
gappyK1M3
## grid search for a single kernel object and multiple values for cost
pkg <- "LiblineaR"
svm <- "C-svc"
cost <- c(0.01,0.1,1,10,100,1000)
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=gappyK1M3,
pkg=pkg, svm=svm, cost=cost, explicit="yes", cross=3)
## show grid search results
modelSelResult(model)
## Not run:
# ## create the list of spectrum kernel objects with normalization and
# ## kernel parameter values for k from 1 to 5
# specK15 <- spectrumKernel(k=1:5)
# ## show details of the five spectrum kernel objects
# specK15
#
# ## run grid search with several kernel parameter settings for the
# ## spectrum kernel with a single SVM parameter setting
# ## ATTENTION: DO NOT USE THIS VARIANT!
# ## This variant does not bring comparable performance for the different
# ## kernel parameter settings because usually the best performing
# ## hyperparameter values could be quite different for different kernel
# ## parameter settings or between different kernels; grid search for
# ## multiple kernel objects should be done as shown in the next example
# pkg <- "LiblineaR"
# svm <- "C-svc"
# cost <- 2
# model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK15,
# pkg=pkg, svm=svm, cost=cost, explicit="yes", cross=10)
#
# ## show grid search results
# modelSelResult(model)
#
# ## grid search with multiple kernel objects and multiple values for
# ## hyperparameter cost
# pkg <- "LiblineaR"
# svm <- "C-svc"
# cost <- c(0.01,0.1,1,10,50,100,150,200,500,1000)
# model <- kbsvm(x=enhancerFB, sel=train, y=yFB[train], kernel=specK15,
# pkg=pkg, svm=svm, cost=cost, explicit="yes", cross=10,
# showProgress=TRUE)
#
# ## show grid search results
# modelSelResult(model)
#
# ## grid search for a single kernel object with multiple SVMs
# ## from different packages
# ## here with display of cross validation runtimes for each grid point
# ## pkg, svm and cost vectors must have the same length; the corresponding
# ## entries in these vectors together define one SVM + hyperparameter setting
# pkg <- rep(c("kernlab", "e1071", "LiblineaR"),3)
# svm <- rep("C-svc", 9)
# cost <- rep(c(0.01,0.1,1),each=3)
# model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=gappyK1M3,
# pkg=pkg, svm=svm, cost=cost, explicit="yes", cross=3,
# showCVTimes=TRUE)
#
# ## show grid search results
# modelSelResult(model)
#
# ## run grid search for a single kernel with multiple SVMs from same package
# ## here all from LiblineaR: C-SVM, L2 regularized SVM with L2 loss and
# ## SVM with L1 regularization and L2 loss
# ## attention: for different formulations of the SVM objective use different
# ## values for the hyperparameters even if they have the same name
# pkg <- rep("LiblineaR", 9)
# svm <- rep(c("C-svc","l2rl2l-svc","l1rl2l-svc"), each=3)
# cost <- c(1,150,1000,1,40,100,1,40,100)
# model <- kbsvm(x=enhancerFB, sel=train, y=yFB[train], kernel=gappyK1M3,
# pkg=pkg, svm=svm, cost=cost, explicit="yes", cross=3)
#
# ## show grid search results
# modelSelResult(model)
#
# ## create the list of kernel objects for gappy pair kernel
# gappyK1M15 <- gappyPairKernel(k=1, m=1:5)
# ## show details of kernel objects
# gappyK1M15
#
# ## run grid search with progress indication with ten kernels and ten
# ## hyperparameter values for cost and 10 fold cross validation on full
# ## dataset (500 samples)
# pkg <- rep("LiblineaR", 10)
# svm <- rep("C-svc", 10)
# cost <- c(0.0001,0.001,0.01,0.1,1,10,100,1000,10000,100000)
# model <- kbsvm(x=enhancerFB, y=yFB, kernel=c(specK15, gappyK1M15),
# pkg=pkg, svm=svm, cost=cost, cross=10, explicit="yes",
# showCVTimes=TRUE, showProgress=TRUE)
#
# ## show grid search results
# modelSelResult(model)
# ## End(Not run)