getExRep: Explict Representation

Description

Create an explicit representation

Usage

getExRep(x, kernel = spectrumKernel(), sparse = TRUE,
  zeroFeatures = FALSE, features = NULL, useRowNames = TRUE,
  useColNames = TRUE, selx = NULL)
getExRepQuadratic(exRepLin, useRowNames = TRUE, useColNames = TRUE,
  zeroFeatures = FALSE)

Arguments

one or multiple biological sequences in the form of a DNAStringSet, RNAStringSet, AAStringSet (or as BioVector)

kernel

a sequence kernel object. The feature map of this kernel object is used to generate the explicit representation.

sparse

boolean that indicates whether a sparse or dense explicit representation should be generated. Default=TRUE

zeroFeatures

indicates whether columns with zero feature counts across all samples should be included in the explicit representation. (see below) Default=FALSE

features

feature subset of the specified kernel in the form of a character vector. When a feature subset is passed to the function all other features in the feature space are not considered for the explicit representation. (see below)

useRowNames

if this parameter is set the sample names will be set as row names if available in the provided sequence set. Default=TRUE

useColNames

if this parameter is set the features will be set as column names in the explicit representation. Default=TRUE

selx

subset of indices into x. When this parameter is present the explicit representation is generated for the specified subset of samples only. default=NULL

exRepLin

a linear explicit representation

Value

getExRep: upon successful completion, dependent on the flag sparse the function returns either a dense explicit representation of class ExplicitRepresentationDense or a sparse explicit representation of class ExplicitRepresentationSparse.
getExRepQuadratic: upon successful completion, the function returns a quadratic explicit representation

Details

Creation of an explicit representation The function 'getExRep' creates an explicit representation of the given sequence set using the feature map of the specified kernel. It contains the feature counts in a matrix format. The rows of the matrix represent the samples, the columns the features. For a dense explicit representation of class ExplicitRepresentationDense the count data is stored in a dense matrix. To allow efficient storage all features that do not occur in the sequence set are removed from the explicit representation by default. When the parameter zeroFeatures is set to TRUE these features are also included resulting an explicit representation which contains the full feature space. For feature spaces larger than one million features the inclusion of zero features is not possible. In case of large feature spaces a sparse explicit representation of class ExplicitRepresentationSparse is much more efficient by storing the count data as dgRMatrix from package Matrix). The class ExplicitRepresentationSparse is derived from dgRMatrix. As zero features are not stored in a sparse matrix the flag zeroFeatures only controls whether the column names of features not occuring in the sequences are included or not. Both the dense and the sparse explicit representation also contain the kernel object which was used for it's creation. For an explicit representation without zero features column names are mandatory. An explicit representation can be created for position independent and annotation specific kernel variants (for details see annotationMetadata). In annotation specific kernels the annotation characters are included as postfix in the features. For kernels with normalization the explicit representation is normalized resulting in row vectors normalized to the unit sphere. For feature subsets used with normalized kernels all features of the feature space are used in the normalization. Usage of explicit representations Learning with linear SVMs (e.g. ksvmin package kernlab or svm in package e1071) can be performed either through passing a kernel matrix of similarity values or an explicit representation and a linear kernel to the SVM. The SVMs in package kernlab support a dense explicit representation or kernel matrix as data representations. The SVMs in packages e1071) and LiblineaR support dense or sparse explicit representations. In many cases there can be considerable performance differences between the two variants of passing data to the SVM. And especially for larger feature spaces the sparse explicit representation not only brings higher memory efficiency but also leads to drastically improved runtimes during training and prediction. Starting with kebabs version 1.2.0 kernel matrix support is also available for package e1071 via the dense LIBSVM implementation integrated in package kebabs. In general all of the complexity of converting the sequences with a specific kernel to an explicit representation or a kernel matrix and adapting the formats and parameters to the specific SVM is hidden within the KeBABS training and predict methods (see kbsvm, predict) and the user can concentrate on the actual data analysis task. During training via kbsvm the parameter explicit controls the training via kernel matrix or explicit representation and the parameter explicitType determines whether a dense or sparse explicit representation is used. Manual generation of explicit representations is only necessary for usage with other learners or analysis methods not supported by KeBABS. Quadratic explicit representation The package LiblineaR only provides linear SVMs which are tuned for efficient processing of larger feature spaces and sample numbers. To allow the use of a quadratic kernel on these SVMs a quadratic explicit representation can be generated from the linear explicit representation. It contains counts for feature pairs and the features combined to one pair are separated by '_' in the column names of the quadratic explicit representation. Please be aware that the dimensionality for a quadratic explicit representation increases considerably compared to the linear one. In the other SVMs a linear explicit representation together with a quadratic kernel is used instead. In training via kbsvm the use of a linear representation with a quadratic kernel or a quadratic explicit representation instead is indicated through setting the parameter featureType to the value "quadratic".

References

http://www.bioinf.jku.at/software/kebabs J. Palme, S. Hochreiter, and U. Bodenhofer (2015) KeBABS: an R package for kernel-based analysis of biological sequences. Bioinformatics, 31(15):2574-2576, 2015. DOI: http://dx.doi.org/10.1093/bioinformatics/btv176{10.1093/bioinformatics/btv176}.

Examples

Run this code

## instead of user provided sequences in XStringSet format
## for this example a set of DNA sequences is created
## RNA- or AA-sequences can be used as well with the spectrum kernel
dnaseqs <- DNAStringSet(c("AGACTTAAGGGACCTGGACACCACACTCAGCTAGGGGGACTGGGAGC",
                          "ATAAAGGGAGCAGACATCATGACCTTTTTGACCCTAATTATTTCAGC",
                          "CAGGAATCAGCACAGGCAGGGGCACTGCATCCCAAGACATCTGGGCC",
                          "GGACATATACCCACCCTTACCTGCCATACAGGATAGGGCCACTGCCC",
                          "ATAAAGGATGCAGACATCATGGCCTTTTTGACCCTAATTATTTCAGC"))
names(dnaseqs) <- paste("S", 1:length(dnaseqs), sep="")

## create the kernel object for dimers with normalization
speck <- spectrumKernel(k=2)
## show details of kernel object
speck

## generate the dense explicit representation for the kernel
erd <- getExRep(dnaseqs, speck, sparse=FALSE)
dim(erd)
erd[1:5,]

## generate the dense explicit representation with zero features
erd <- getExRep(dnaseqs, speck, sparse=FALSE, zeroFeatures=TRUE)
dim(erd)
erd[1:5,]

## generate the sparse explicit representation for the kernel
ers <- getExRep(dnaseqs, speck)
dim(ers)
ers[1:5,]

## generate the sparse explicit representation with zero features
ers <- getExRep(dnaseqs, speck, zeroFeatures=TRUE)
dim(ers)
ers[1:5,]

## generate the quadratic explicit representation
erdq <- getExRepQuadratic(erd)
dim(erdq)
erdq[1:5,1:15]

## run taining and prediction with dense linear explicit representation
data(TFBS)
enhancerFB
train <- sample(1:length(enhancerFB), length(enhancerFB) * 0.7)
test <- c(1:length(enhancerFB))[-train]
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=speck,
               pkg="LiblineaR", svm="C-svc", cost=10, explicit="yes",
               explicitType="dense")
pred <- predict(model, x=enhancerFB[test])
evaluatePrediction(pred, yFB[test], allLabels=unique(yFB))

## run taining and prediction with sparse linear explicit representation
model <- kbsvm(x=enhancerFB[train], y=yFB[train], kernel=speck,
               pkg="LiblineaR", svm="C-svc", cost=10, explicit="yes",
               explicitType="sparse")
pred <- predict(model, x=enhancerFB[test])
evaluatePrediction(pred, yFB[test], allLabels=unique(yFB))

Run the code above in your browser using DataLab