classiKernel: Create a kernel estimator for functional data classification

Description

Creates an efficient kernel estimator for functional data classification. Currently supported distance measures are all metrics implemented in dist and all semimetrics suggested in Fuchs et al. (2015). Additionally, all (semi-)metrics can be used on a derivative of arbitrary order of the functional observations. For kernel functions all kernels implemented in fda.usc are available as well as custom kernel functions.
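The idea behind a kernel estimator for classification can be illustrated in a few lines of base R. This is a minimal sketch under stated assumptions, not the package's implementation: the helper kernel_vote and the toy data are hypothetical, the distance is fixed to the Euclidean distance, and the kernel defaults to dnorm.

```r
# Toy data: 6 curves observed on 5 grid points, two classes
set.seed(1)
grid = seq(0, 1, length.out = 5)
train = rbind(
  t(replicate(3, sin(2 * pi * grid) + rnorm(5, sd = 0.1))),
  t(replicate(3, cos(2 * pi * grid) + rnorm(5, sd = 0.1)))
)
classes = factor(rep(c("sin", "cos"), each = 3))

# Hypothetical helper: kernel-weighted class vote for one new curve
kernel_vote = function(new_curve, train, classes, h = 1, ker = dnorm) {
  # Euclidean distance from the new curve to every training curve
  d = sqrt(rowSums(sweep(train, 2, new_curve)^2))
  # kernel weights with bandwidth h, i.e. K(d) = ker(d / h)
  w = ker(d / h)
  # weighted class frequencies, normalised to sum to one
  tapply(w, classes, sum) / sum(w)
}

probs = kernel_vote(sin(2 * pi * grid), train, classes, h = 0.5)
pred = names(which.max(probs))
```

Training curves close to the new curve receive large weights, so the predicted class is the one whose members lie nearest under the chosen (semi-)metric.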

Usage

classiKernel(classes, fdata, grid = 1:ncol(fdata), h = 1, metric = "L2",
  ker = "Ker.norm", nderiv = 0L, derived = FALSE,
  deriv.method = "base.diff",
  custom.metric = function(x, y, ...) {
    return(sqrt(sum((x - y)^2)))
  },
  custom.ker = function(u) {
    return(dnorm(u))
  }, ...)

Arguments

classes

[factor(nrow(fdata))] factor of length nrow(fdata) containing the classes of the observations.

fdata

[matrix] matrix containing the functional observations as rows.

grid

[numeric(ncol(fdata))] numeric vector of length ncol(fdata) containing the grid on which the functional observations were evaluated.

h

[numeric(1)] controls the bandwidth of the kernel function. All kernel functions ker should be implemented with a bandwidth of 1. The bandwidth is then set via h by using K(x) = ker(x/h) as the kernel function.
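The rescaling K(x) = ker(x/h) can be tried out in base R. The following is a small sketch; the names scaled_kernel, w_narrow and w_wide and the distances in d are made up for illustration.

```r
# Base kernel with unit bandwidth (here the standard normal density)
ker = function(u) dnorm(u)

# Rescaled kernel as described above: K(x) = ker(x / h)
scaled_kernel = function(x, h = 1) ker(x / h)

d = c(0, 0.5, 1, 2)                    # hypothetical distances
w_narrow = scaled_kernel(d, h = 0.5)   # small bandwidth
w_wide = scaled_kernel(d, h = 2)       # large bandwidth

# weight of the farthest observation relative to the closest one
ratio_narrow = w_narrow[4] / w_narrow[1]
ratio_wide = w_wide[4] / w_wide[1]
```

A larger h flattens the weights, so distant observations contribute more to the prediction; a smaller h concentrates the weight on the nearest observations.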

metric

[character(1)] character string specifying the (semi-)metric to be used. For an overview of what is available see the method argument in computeDistMat. For a full list execute metricChoices().

ker

[character(1)] character string describing the kernel function to use. Available are, among others, all kernel functions from Kernel. For the full list execute kerChoices(). A customized kernel function is selected with ker = "custom.ker" and specified in custom.ker.

nderiv

[integer(1)] order of derivation on which the metric shall be computed. The default is 0L.

derived

[logical(1)] Is the data given in fdata already derived? Default is set to FALSE, which will lead to numerical derivation if nderiv >= 1L by applying deriv.fd on a Data2fd representation of fdata.

deriv.method

[character(1)] character string indicating which method should be used for derivation. Currently implemented are "base.diff", the default, and "fda.deriv.fd". "base.diff" uses base::diff for equidistant measurements without missing values, which is faster than transforming the data into the class fd and deriving it with fda::deriv.fd. The second variant implies smoothing, which can be preferable when calculating high order derivatives.
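The following base R sketch shows the kind of finite-difference derivative that base::diff yields on an equidistant grid. It illustrates the principle only and makes no claim about the package's internal handling of grids or higher orders; fdata here is a made-up two-row matrix.

```r
grid = seq(0, 1, by = 0.1)
fdata = rbind(grid^2, sin(grid))   # two toy curves as rows

# first-order differences along each row, divided by the grid spacing
step = diff(grid)[1]
d1 = t(apply(fdata, 1, diff)) / step

# each differentiation shortens the curves by one grid point
dim(d1)
```

For the first row, the forward difference of x^2 equals 2x + step, which approaches the true derivative 2x as the grid gets finer.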

custom.metric

[function(x, y, ...)] only used if metric = "custom.metric". A function of functional observations x and y returning their distance. The default is the L2 distance. See dist for how to implement your own distance function.
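A custom distance is an ordinary R function of two numeric vectors. The sketch below defines a hypothetical supremum distance, supMetric. The commented call assumes that the custom metric is selected with metric = "custom.metric", as listed by metricChoices(); this is an assumption to check against your installed version.

```r
# Hypothetical custom distance: supremum (maximum absolute) distance
supMetric = function(x, y, ...) {
  max(abs(x - y))
}

x = c(0, 1, 2)
y = c(0, 3, 1)
d_xy = supMetric(x, y)

# Assumed usage (not run here; requires the classiFunc package):
# mod = classiKernel(classes = classes, fdata = fdata,
#                    metric = "custom.metric", custom.metric = supMetric)
```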

custom.ker

[function(u)] customized kernel function. This has to be a function with exactly one parameter u, returning the numeric value of the kernel function ker(u). This function is only used if ker == "custom.ker". The bandwidth should be constantly equal to 1 and is controlled via h.

...

further arguments to and from other methods. Hand over additional arguments to computeDistMat, usually additional arguments for the specified (semi-)metric. Also, if deriv.method == "fda.deriv.fd" or fdata is not observed on a regular grid, additional arguments to fdataTransform can be specified which will be passed on to Data2fd.

Value

classiKernel returns an object of class 'classiKernel'. This is a list containing at least the following components:

classes: a factor of length nrow(fdata) coding the response of the training data set.
fdata: the raw functional data as a matrix with the individual observations as rows.
proc.fdata: the preprocessed data (missing values interpolated, derived and evenly spaced). This data is this.fdataTransform(fdata). See this.fdataTransform for more details.
grid: numeric vector containing the grid on which fdata is observed.
h: numeric value giving the bandwidth to be used in the kernel function.
ker: character encoding the kernel function to use.
metric: character string coding the distance metric to be used in computeDistMat.
nderiv: integer giving the order of derivation that is applied to fdata before computing the distances between the observations.
this.fdataTransform: preprocessing function taking new data as a matrix. It is used to transform fdata into proc.fdata and is required to preprocess new data before predicting it. This function ensures that preprocessing (derivation, respacing and interpolation of missing values) is done in exactly the same way for the original training data set and future (test) data sets.
call: the original function call.
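The role of this.fdataTransform can be pictured as a closure that stores the preprocessing settings once and applies them to any data passed in later. The sketch below is a simplified, hypothetical stand-in (make_transform handles only differencing) and not the package's actual function.

```r
# Hypothetical factory: returns a preprocessing function with fixed settings
make_transform = function(nderiv = 0L) {
  force(nderiv)
  function(fdata) {
    if (nderiv == 0L) return(fdata)
    # difference each row nderiv times
    t(apply(fdata, 1, diff, differences = nderiv))
  }
}

train = matrix(1:12, nrow = 3)   # 3 toy curves on 4 grid points
test = matrix(12:1, nrow = 3)

transform = make_transform(nderiv = 1L)
proc_train = transform(train)    # preprocessing of the training data
proc_test = transform(test)      # identical preprocessing for new data
```

Because the settings live inside the returned function, training and test data cannot accidentally be preprocessed with different options.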

References

Fuchs, K., J. Gertheiss, and G. Tutz (2015): Nearest neighbor ensembles for functional data with interpretable feature selection. Chemometrics and Intelligent Laboratory Systems 146, 186 - 197.

Examples

# NOT RUN {
# How to implement your own kernel function
data("ArrowHead")
classes = ArrowHead[,"target"]

set.seed(123)
train_inds = sample(1:nrow(ArrowHead), size = 0.8 * nrow(ArrowHead), replace = FALSE)
test_inds = (1:nrow(ArrowHead))[!(1:nrow(ArrowHead)) %in% train_inds]

ArrowHead = ArrowHead[,!colnames(ArrowHead) == "target"]

# custom kernel
myTriangularKernel = function(u) {
  return((1 - abs(u)) * (abs(u) < 1))
}

# create the model
mod1 = classiKernel(classes = classes[train_inds], fdata = ArrowHead[train_inds,],
                    ker = "custom.ker", h = 2, custom.ker = myTriangularKernel)

# calculate the model predictions
pred1 = predict(mod1, newdata = ArrowHead[test_inds,], predict.type = "response")

# prediction accuracy
mean(pred1 == classes[test_inds])

# create another model using an existing kernel function
mod2 = classiKernel(classes = classes[train_inds], fdata = ArrowHead[train_inds,],
                    ker = "Ker.tri", h = 2)

# calculate the model predictions
pred2 = predict(mod2, newdata = ArrowHead[test_inds,], predict.type = "response")

# prediction accuracy
mean(pred2 == classes[test_inds])
# }
# NOT RUN {
# Parallelize across 2 CPUs
library(parallelMap)
parallelStartSocket(2L) # parallelStartMulticore for Linux
predict(mod1, newdata = ArrowHead[test_inds,], predict.type = "prob", parallel = TRUE, batches = 2L)
parallelStop()
# }