npksum
exists so that you can create your own kernel objects with
or without a variable to be weighted (default \(Y=1\)). With the options
available, you could create new nonparametric tests or even new kernel
estimators. The convolution kernel option would allow you to create,
say, the least squares cross-validation function for kernel density
estimation.
npksum uses highly optimized C code that strives to minimize
its ‘memory footprint’, and it incurs low overhead on
repeated calls (see, by way of
illustration, the example below that conducts leave-one-out
cross-validation for a local constant regression estimator via calls
to the R function nlm, and compares this to the
npregbw function).
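The repeated-call pattern mentioned above can be sketched as follows. This is an illustrative sketch, not the package's shipped example: the simulated data, the rule-of-thumb starting bandwidth, and the penalty returned for non-positive bandwidths are all assumptions made for the demonstration.

```r
library(np)

## Simulated data, for illustration only
set.seed(42)
n <- 100
x <- rnorm(n)
y <- 1 + x^2 + rnorm(n, sd = 0.5)

## Leave-one-out least-squares cross-validation function for the
## local constant (Nadaraya-Watson) estimator, built from npksum
ss <- function(h) {
  ## Penalize non-positive bandwidths proposed by the optimizer
  if (h <= 0) return(.Machine$double.xmax)
  mean.loo <- npksum(txdat = x, tydat = y, bws = h,
                     leave.one.out = TRUE)$ksum /
              npksum(txdat = x, bws = h,
                     leave.one.out = TRUE)$ksum
  mean((y - mean.loo)^2)
}

## Minimize the CV function from a rule-of-thumb starting value,
## then compare with npregbw's cross-validated bandwidth
nlm.out <- nlm(ss, p = 1.06 * sd(x) * n^(-1/5))
nlm.out$estimate
npregbw(y ~ x, regtype = "lc", bwmethod = "cv.ls")$bw
```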
npksum implements a variety of methods for computing
multivariate kernel sums (\(p\)-variate) defined over a set of
possibly continuous and/or discrete (unordered, ordered) data. The
approach is based on Li and Racine (2003) who employ
‘generalized product kernels’ that admit a mix of continuous
and discrete data types.
Three classes of kernel estimators for the continuous data types are
available: fixed, adaptive nearest-neighbor, and generalized
nearest-neighbor. Adaptive nearest-neighbor bandwidths change with
each sample realization in the set, \(x_i\), when estimating
the kernel sum at the point \(x\). Generalized nearest-neighbor
bandwidths change with the point at which the sum is computed,
\(x\). Fixed bandwidths are constant over the support of \(x\).
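The three bandwidth classes above are selected via the bwtype argument. A minimal sketch, assuming that for the nearest-neighbor types the bws value is interpreted as the number of nearest neighbors:

```r
library(np)
set.seed(42)
x <- rnorm(100)

## Fixed bandwidth (the default): constant over the support of x
ks.fixed <- npksum(txdat = x, bws = 1.0, bwtype = "fixed")$ksum

## Generalized nearest-neighbor: bandwidth varies with the
## evaluation point x
ks.gnn <- npksum(txdat = x, bws = 10, bwtype = "generalized_nn")$ksum

## Adaptive nearest-neighbor: bandwidth varies with each sample
## realization x_i
ks.ann <- npksum(txdat = x, bws = 10, bwtype = "adaptive_nn")$ksum
```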
npksum computes \(\sum_{j=1}^{n}{W_j^\prime Y_j
K(X_j)}\), where \(W_j\)
represents a row vector extracted from \(W\). That is, it computes
the kernel weighted sum of the outer product of the rows of \(W\)
and \(Y\). The examples below illustrate uses of such sums.
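For instance, with \(W=1\) the sum reduces to \(\sum_{j=1}^{n}{Y_j K(X_j)}\), so the local constant (Nadaraya-Watson) regression estimator can be written as a ratio of two such sums. A sketch, where the simulated data and the bandwidth value are assumptions for illustration:

```r
library(np)
set.seed(42)
x <- rnorm(100)
y <- sin(x) + rnorm(100, sd = 0.25)
h <- 0.5  ## illustrative fixed bandwidth

## Numerator: sum_j Y_j K((x - X_j)/h)
## Denominator: sum_j K((x - X_j)/h)
ghat <- npksum(txdat = x, tydat = y, bws = h)$ksum /
        npksum(txdat = x, bws = h)$ksum
```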
npksum may be invoked either with a formula-like
symbolic
description of variables on which the sum is to be
performed or through a simpler interface whereby data is passed
directly to the function via the txdat and tydat
parameters. Use of these two interfaces is mutually exclusive.
Data contained in the data frame txdat (and also exdat)
may be a mix of continuous (default), unordered discrete (to be
specified in the data frame txdat using the
factor command), and ordered discrete (to be specified
in the data frame txdat using the ordered
command). Data can be entered in an arbitrary order and data types
will be detected automatically by the routine (see npRmpi
for details).
Data over which the kernel sum is to be computed may be specified
symbolically. A typical description has the form dependent data
~ explanatory data, where dependent data and explanatory
data are both series of variables specified by name, separated by the
separation character '+'. For example, y1 ~ x1 + x2 specifies
that y1 is to be kernel-weighted by x1 and x2
throughout the sum. See below for further examples.
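The two interfaces can be sketched side by side; the simulated variables and bandwidth values here are assumptions for illustration:

```r
library(np)
set.seed(42)
x1 <- rnorm(100)
x2 <- rnorm(100)
y1 <- x1 + x2 + rnorm(100)

## Formula interface: y1 is kernel-weighted by x1 and x2
ks.formula <- npksum(y1 ~ x1 + x2, bws = c(0.5, 0.5))$ksum

## Equivalent direct interface via txdat/tydat (the two interfaces
## are mutually exclusive)
ks.direct <- npksum(txdat = data.frame(x1, x2), tydat = y1,
                    bws = c(0.5, 0.5))$ksum

all.equal(ks.formula, ks.direct)
```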
A variety of kernels may be specified by the user. Kernels implemented
for continuous data types include the second, fourth, sixth, and
eighth order Gaussian and Epanechnikov kernels, and the uniform
kernel. Unordered discrete data types use a variation on Aitchison and
Aitken's (1976) kernel, while ordered data types use a variation of
the Wang and van Ryzin (1981) kernel (see npRmpi for
details).
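As a sketch of selecting among these kernels (the ckertype/ckerorder argument names follow the np package's usual conventions; the data and bandwidths are illustrative assumptions):

```r
library(np)
set.seed(42)
x <- rnorm(100)

## Eighth order Epanechnikov kernel for the continuous variable
ks.ep8 <- npksum(txdat = x, bws = 1.0,
                 ckertype = "epanechnikov", ckerorder = 8)$ksum

## Mixed data: unordered and ordered discrete variables are handled
## by the Aitchison and Aitken and Wang and van Ryzin kernel variants
z.u <- factor(sample(c("a", "b"), 100, replace = TRUE))
z.o <- ordered(sample(1:3, 100, replace = TRUE))
ks.mixed <- npksum(txdat = data.frame(x, z.u, z.o),
                   bws = c(1.0, 0.3, 0.3))$ksum
```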
The option operator= can be used to ‘mix and match’
operator strings to create a ‘hybrid’ kernel provided they
match the dimension of the data. For example, for a two-dimensional
data frame of numeric data types,
operator=c("normal","derivative") will use the normal
(i.e. PDF) kernel for variable one and the derivative of the PDF
kernel for variable two. Please note that applying operators will scale the
results by factors of \(h\) or \(1/h\) where appropriate.
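The hybrid kernel described above can be sketched directly; the simulated two-dimensional data frame and bandwidths are assumptions for illustration:

```r
library(np)
set.seed(42)
X <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
h <- c(0.5, 0.5)

## Hybrid kernel: PDF kernel in dimension one, derivative of the
## PDF kernel in dimension two (one operator string per dimension)
ks.hybrid <- npksum(txdat = X, bws = h,
                    operator = c("normal", "derivative"))$ksum
```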
The option permutation.operator= computes, in addition to the kernel
sum with no operators applied, one extra kernel sum per continuous
dimension in the data, with the given operator applied to that
dimension alone. For example, for a two-dimensional
data frame of numeric data types,
permutation.operator=c("derivative") will return the usual
kernel sum as if operator = c("normal","normal") in the
ksum member, and in the p.ksum member, it will return
kernel sums for operator = c("derivative","normal"), and
operator = c("normal","derivative"). This makes the computation
of gradients much easier.
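A sketch of the permutation pattern just described, with simulated data and bandwidths assumed for illustration:

```r
library(np)
set.seed(42)
X <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
y <- X$x1 + X$x2 + rnorm(100)
h <- c(0.5, 0.5)

out <- npksum(txdat = X, tydat = y, bws = h,
              permutation.operator = "derivative")

out$ksum    ## the usual sum, as if operator = c("normal", "normal")
out$p.ksum  ## sums for operator = c("derivative", "normal") and
            ## operator = c("normal", "derivative"), one per
            ## continuous dimension
```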
The option compute.score= can be used to compute the gradients
with respect to \(h\) in addition to the normal kernel sum. Like
permutations, the additional results are returned in the
p.ksum. This option does not work in conjunction with
permutation.operator.
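A minimal sketch of requesting bandwidth gradients, assuming simulated data and an illustrative bandwidth (and recalling that compute.score cannot be combined with permutation.operator):

```r
library(np)
set.seed(42)
x <- rnorm(100)
h <- 0.5

## $ksum holds the usual kernel sum; $p.ksum holds the gradients of
## the sum with respect to the bandwidth h
out <- npksum(txdat = x, bws = h, compute.score = TRUE)
out$ksum
out$p.ksum
```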
The option compute.ocg= works much like permutation.operator,
but for discrete variables. The kernel is evaluated at a reference
category in each dimension: for ordered data, the next lowest category
is selected, except in the case of the lowest category, where the
second lowest category is selected; for unordered data, the first
category is selected. These additional data are returned in the
p.ksum member. This option can be set simultaneously with
permutation.operator.
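A sketch of compute.ocg= with one continuous and one ordered discrete variable; the data and bandwidths are assumptions for illustration:

```r
library(np)
set.seed(42)
x  <- rnorm(100)
z  <- ordered(sample(1:3, 100, replace = TRUE))
bw <- c(0.5, 0.3)

## For the ordered variable z the kernel is also evaluated at the
## reference category (the next lowest category, per the rule above);
## the additional sums are returned in $p.ksum
out <- npksum(txdat = data.frame(x, z), bws = bw, compute.ocg = TRUE)
out$ksum
out$p.ksum
```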
The option return.kernel.weights=TRUE returns a matrix of
dimension ‘number of training observations’ by ‘number
of evaluation observations’ containing only the generalized product
kernel weights, ignoring all other objects and options that may be
provided to npksum (e.g. bandwidth.divide=TRUE will be
ignored). Summing the columns of the weight matrix and dividing
by ‘number of training observations’ times the product of the
bandwidths (i.e. colMeans(foo$kw)/prod(h)) would produce
the kernel estimator of a (multivariate) density
(operator="normal") or multivariate cumulative distribution
(operator="integral").