## kbsvm(...., kernel=list(kernel1, kernel2), pkg=pkg1, svm=svm1,
## cost=cost1, ...., cross=0, noCross=1, ....)
## kbsvm(...., kernel=kernel1, pkg=pkg1, svm=svm1,
## cost=c(cost1, cost2), ...., cross=0, noCross=1, ....)
## kbsvm(...., kernel=kernel1, pkg=c(pkg1, pkg1, pkg1),
## svm=c(svm1, svm2, svm3), cost=c(cost1, cost2, cost3), ....,
## cross=0, noCross=1, ....)
## kbsvm(...., kernel=kernel1, pkg=c(pkg1, pkg2, pkg3),
## svm=c(svm1, svm2, svm3), cost=c(cost1, cost2, cost3), ....,
## cross=0, noCross=1, ....)
## kbsvm(...., kernel=list(kernel1, kernel2, kernel3), pkg=c(pkg1, pkg2),
## svm=c(svm1, svm2), cost=c(cost1, cost2), ...., cross=0,
## noCross=1, ....)
## for details see below
The grid search results are stored in the KeBABS model returned by kbsvm and can be retrieved with the accessor modelSelResult (see KBModel).
To simplify the selection of an appropriate sequence kernel (including the setting of the kernel parameters), of the SVM implementation and of the SVM hyperparameters, KeBABS provides grid search functionality. Beyond running the same learning task for different settings of the SVM hyperparameters, grid search is understood here in the broader sense of finding good values for all major variable parts of the learning task, which includes the selection of the sequence kernel and its kernel parameters, the selection of the SVM implementation (via package and SVM type) and the setting of the SVM hyperparameters.
KeBABS supports the joint variation of any combination of these learning aspects together with cross validation (CV) to find the best setting based on the cross validation performance. After the grid search the performance values of the different settings and the best setting of the grid search run can be retrieved from the KeBABS model with the accessor modelSelResult.
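As a minimal sketch of this workflow (assuming the TFBS example data that ships with KeBABS and the LiblineaR C-svc SVM; the kernel and cost values are illustrative choices, not recommendations):

## minimal grid search: one kernel, several values for the hyperparameter cost
## (illustrative parameter values)
library(kebabs)
data(TFBS)
gappyK1M3 <- gappyPairKernel(k=1, m=3)
model <- kbsvm(x=enhancerFB, y=yFB, kernel=gappyK1M3,
               pkg="LiblineaR", svm="C-svc",
               cost=c(0.1, 1, 10), explicit="yes", cross=5)
## retrieve the grid search result stored in the model
modelSelResult(model)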
Grid search is started with the method kbsvm by passing multiple values to parameters for which in regular training only a single value is used. Multiple values can be passed for the parameter kernel as a list of kernel objects and for the parameters pkg, svm and the hyperparameters of the used SVMs as vectors (numeric or integer vectors, depending on the hyperparameter). The parameter cost in the usage section above is just one representative of the SVM hyperparameters that can be varied in grid search. The following types of grid search are supported (for examples see below):
- variation of one or multiple hyperparameters of a given SVM from a given package by passing multiple values for the hyperparameter(s)
- variation of the SVM and/or the package providing it, with or without variation of the hyperparameter values
- variation of the sequence kernel by passing multiple kernel objects, which can differ in kernel type and kernel parameters and can be used with or without position specific functionality (see positionMetadata) or with or without annotation specific functionality (see annotationMetadata) using one specific or multiple annotations, resulting in considerable variation possibilities on the kernel side; the kernel objects for the different parameter settings of the kernel must be precreated and are passed as a list to kbsvm
Usually each kernel performs best at different hyperparameter values. Therefore, in general, varying only the kernel parameters without also varying the hyperparameter values does not make sense; both must be varied together as described below.
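A sketch of such a joint variation (reusing the TFBS data from the sketch above; kernel and cost values are illustrative assumptions): passing a list of kernel objects together with a vector of cost values evaluates every kernel with every cost setting.

## joint variation of kernel parameters and SVM hyperparameter cost
## (illustrative parameter values)
specK25 <- spectrumKernel(k=2:5)            ## list of four spectrum kernels
cost <- c(0.01, 0.1, 1, 10, 100)            ## five cost values per kernel
model <- kbsvm(x=enhancerFB, y=yFB, kernel=specK25,
               pkg="LiblineaR", svm="C-svc", cost=cost,
               explicit="yes", cross=5)
modelSelResult(model)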
For collecting performance values grid search is organized in a matrix-like manner, with the different kernel objects representing the rows and the different hyperparameter settings or SVM and hyperparameter settings representing the columns of the matrix. If multiple hyperparameters are used for a single SVM, the same entry in all hyperparameter vectors is used as one parameter set corresponding to a single column in the grid matrix. The same applies to multiple SVMs, i.e. even when multiple SVMs are used from the same package the pkg parameter must have one entry for each entry in the svm parameter (see examples below). The best performing setting is reported dependent on the performance objective.
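For illustration, a sketch of grid columns defined element-wise by pkg, svm and cost (reusing gappyK1M3 and the TFBS data from the sketches above; the chosen SVM types and cost values are assumptions for demonstration only):

## two SVM formulations from the same package, each with two cost values;
## pkg must still have one entry for each entry in svm
## (illustrative parameter values)
pkg  <- c("LiblineaR", "LiblineaR", "LiblineaR",  "LiblineaR")
svm  <- c("C-svc",     "C-svc",     "l2rl2l-svc", "l2rl2l-svc")
cost <- c(1,           100,         1,            100)
model <- kbsvm(x=enhancerFB, y=yFB, kernel=gappyK1M3,
               pkg=pkg, svm=svm, cost=cost, explicit="yes", cross=5)
modelSelResult(model)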
Instead of a single training and test cycle for each grid point, cross validation should be used to get more representative results. In this case CV is executed for each parameter setting. For larger datasets or kernels with higher complexity the runtime for the full grid search should be limited through adequate selection of the parameter cross.
Performance measures and performance objective
The usual performance measure for grid search is the cross validation error, which is stored by default for each grid point. For datasets with, e.g., a non-symmetrical class distribution, other performance measures can be more expressive. For such situations the accuracy, the balanced accuracy and the Matthews correlation coefficient can also be stored for each grid point (see parameter perfParameters in kbsvm). (The accuracy corresponds fully to the CV error because it is just the inverted measure; it is included for easier comparability with the balanced accuracy.) The performance values can be retrieved from the model selection result object with the accessor performance. The objective for selecting the best performing parameter setting is by default the CV error. With the parameter perfObjective in kbsvm one of the other performance measures mentioned above can be chosen as objective for the best setting instead of the cross validation error.
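A sketch of storing additional performance measures and selecting by balanced accuracy (reusing gappyK1M3 and the TFBS data from above; the chosen measures and cost values are illustrative assumptions):

## store accuracy, balanced accuracy and Matthews correlation coefficient
## for each grid point and use balanced accuracy as selection objective
## (illustrative parameter values)
model <- kbsvm(x=enhancerFB, y=yFB, kernel=gappyK1M3,
               pkg="LiblineaR", svm="C-svc", cost=c(0.1, 1, 10),
               cross=5, perfParameters=c("ACC", "BACC", "MCC"),
               perfObjective="BACC")
## performance values for all grid points
performance(modelSelResult(model))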
Runtime Hints
When the parameter showCVTimes in kbsvm is set to TRUE, the runtime of the individual cross validation runs is shown for each grid point. In this way quick runtime estimates can be gathered by running the grid search for a reduced grid and extrapolating the runtimes to the full grid. A progress indication for grid search is available with the parameter showProgress in kbsvm.
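A sketch of gathering runtime estimates on a reduced grid (reusing gappyK1M3 and the TFBS data from above; the reduced grid shown here is an arbitrary assumption):

## reduced grid with CV timing and progress output for runtime estimation
## (illustrative parameter values)
model <- kbsvm(x=enhancerFB, y=yFB, kernel=gappyK1M3,
               pkg="LiblineaR", svm="C-svc", cost=c(1, 10),
               explicit="yes", cross=3, showProgress=TRUE, showCVTimes=TRUE)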
Dependent on the number of sequences, the complexity of the kernel processing, the type of chosen cross validation and the degree of variation of parameters in grid search, the runtime can grow drastically. One possible strategy for reducing the runtime is a stepwise approach: search for areas with good performance in a first coarse grid search run and then refine these areas with additional, more fine-grained grid searches.
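A sketch of such a stepwise approach (reusing gappyK1M3 and the TFBS data from above; the grids are illustrative and the refinement region must be chosen from the actual coarse results):

## step 1: coarse grid over several orders of magnitude for cost
costCoarse <- 10^(-2:3)
model <- kbsvm(x=enhancerFB, y=yFB, kernel=gappyK1M3,
               pkg="LiblineaR", svm="C-svc", cost=costCoarse,
               explicit="yes", cross=5)
modelSelResult(model)
## step 2: refine around the best performing cost value, e.g. around 10
## (assumed refinement region for illustration)
costFine <- c(2, 5, 10, 20, 50)
model <- kbsvm(x=enhancerFB, y=yFB, kernel=gappyK1M3,
               pkg="LiblineaR", svm="C-svc", cost=costFine,
               explicit="yes", cross=5)
modelSelResult(model)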
The implementation of the sequence kernels was done with a strong focus on runtime performance, which brings a considerable improvement compared to other implementations. KeBABS also provides an interface to the very fast SVM implementations in the package LiblineaR. Beyond these performance improvements, KeBABS supports the generation of sparse explicit representations for every sequence kernel, which can be used instead of the kernel matrix for learning. In many cases, especially with a large number of samples where the kernel matrix would become too large, this alternative provides additional runtime benefits. The current implementation of grid search does not make use of multi-core infrastructures; the entire processing is done on a single core.
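A brief sketch of the explicit representation (reusing gappyK1M3 and the TFBS data from above; getExRep and the kbsvm parameters explicit and explicitType are assumed here as described in the KeBABS help pages):

## sparse explicit representation of the gappy pair kernel
exRep <- getExRep(enhancerFB, kernel=gappyK1M3, sparse=TRUE)
dim(exRep)
## in kbsvm learning via the explicit representation is requested with 'explicit'
## (illustrative parameter values)
model <- kbsvm(x=enhancerFB, y=yFB, kernel=gappyK1M3,
               pkg="LiblineaR", svm="C-svc", cost=10,
               explicit="yes", explicitType="sparse", cross=5)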
kbsvm, spectrumKernel, mismatchKernel, gappyPairKernel, motifKernel, positionMetadata, annotationMetadata, performModelSelection
## load transcription factor binding site data
data(TFBS)
enhancerFB
## The C-svc implementation from LiblineaR is chosen for most of the
## examples because it is the fastest SVM implementation. With SVMs from
## other packages slightly better results could be achievable.
## To get a realistic image of possible performance values, kernel behavior
## and speed of grid search together with 10-fold cross validation a
## reasonable number of sequences is needed, which would exceed the runtime
## restrictions for automatically executed examples. Therefore the grid
## search examples must be run manually. In these examples we use the full
## dataset for grid search.
train <- sample(1:length(enhancerFB), length(enhancerFB))
## grid search with single kernel object and multiple hyperparameter values
## create gappy pair kernel with normalization
gappyK1M3 <- gappyPairKernel(k=1, m=3)
## show details of single gappy pair kernel object
gappyK1M3
## grid search for a single kernel object and multiple values for cost
pkg <- "LiblineaR"
svm <- "C-svc"
cost <- c(0.01,0.1,1,10,100,1000)
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=gappyK1M3,
pkg=pkg, svm=svm, cost=cost, explicit="yes", cross=3)
## show grid search results
modelSelResult(model)
## Not run:
# ## create the list of spectrum kernel objects with normalization and
# ## kernel parameter values for k from 1 to 5
# specK15 <- spectrumKernel(k=1:5)
# ## show details of the five spectrum kernel objects
# specK15
#
# ## run grid search with several kernel parameter settings for the
# ## spectrum kernel with a single SVM parameter setting
# ## ATTENTION: DO NOT USE THIS VARIANT!
# ## This variant does not bring comparable performance for the different
# ## kernel parameter settings because usually the best performing
# ## hyperparameter values could be quite different for different kernel
# ## parameter settings or between different kernels; grid search for
# ## multiple kernel objects should be done as shown in the next example
# pkg <- "LiblineaR"
# svm <- "C-svc"
# cost <- 2
# model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=specK15,
# pkg=pkg, svm=svm, cost=cost, explicit="yes", cross=10)
#
# ## show grid search results
# modelSelResult(model)
#
# ## grid search with multiple kernel objects and multiple values for
# ## hyperparameter cost
# pkg <- "LiblineaR"
# svm <- "C-svc"
# cost <- c(0.01,0.1,1,10,50,100,150,200,500,1000)
# model <- kbsvm(x=enhancerFB, sel=train, y=yFB[train], kernel=specK15,
# pkg=pkg, svm=svm, cost=cost, explicit="yes", cross=10,
# showProgress=TRUE)
#
# ## show grid search results
# modelSelResult(model)
#
# ## grid search for a single kernel object with multiple SVMs
# ## from different packages
# ## here with display of cross validation runtimes for each grid point
# ## pkg, svm and cost vectors must have the same length; the corresponding
# ## entries in these vectors together define one SVM + hyperparameter setting
# pkg <- rep(c("kernlab", "e1071", "LiblineaR"),3)
# svm <- rep("C-svc", 9)
# cost <- rep(c(0.01,0.1,1),each=3)
# model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=gappyK1M3,
# pkg=pkg, svm=svm, cost=cost, explicit="yes", cross=3,
# showCVTimes=TRUE)
#
# ## show grid search results
# modelSelResult(model)
#
# ## run grid search for a single kernel with multiple SVMs from same package
# ## here all from LiblineaR: C-SVM, L2 regularized SVM with L2 loss and
# ## SVM with L1 regularization and L2 loss
# ## attention: for different formulations of the SVM objective use different
# ## values for the hyperparameters even if they have the same name
# pkg <- rep("LiblineaR", 9)
# svm <- rep(c("C-svc","l2rl2l-svc","l1rl2l-svc"), each=3)
# cost <- c(1,150,1000,1,40,100,1,40,100)
# model <- kbsvm(x=enhancerFB, sel=train, y=yFB[train], kernel=gappyK1M3,
# pkg=pkg, svm=svm, cost=cost, explicit="yes", cross=3)
#
# ## show grid search results
# modelSelResult(model)
#
# ## create the list of kernel objects for gappy pair kernel
# gappyK1M15 <- gappyPairKernel(k=1, m=1:5)
# ## show details of kernel objects
# gappyK1M15
#
# ## run grid search with progress indication with ten kernels and ten
# ## hyperparameter values for cost and 10 fold cross validation on full
# ## dataset (500 samples)
# pkg <- rep("LiblineaR", 10)
# svm <- rep("C-svc", 10)
# cost <- c(0.0001,0.001,0.01,0.1,1,10,100,1000,10000,100000)
# model <- kbsvm(x=enhancerFB, y=yFB, kernel=c(specK15, gappyK1M15),
# pkg=pkg, svm=svm, cost=cost, cross=10, explicit="yes",
# showCVTimes=TRUE, showProgress=TRUE)
#
# ## show grid search results
# modelSelResult(model)
# ## End(Not run)