
npRmpi (version 0.60-20)

npksum: Kernel Sums with Mixed Data Types

Description

npksum computes kernel sums on evaluation data, given a set of training data, data to be weighted (optional), and a bandwidth specification (any bandwidth object).

Usage

npksum(...)

# S3 method for formula
npksum(formula, data, newdata, subset, na.action, ...)

# S3 method for default
npksum(bws, txdat = stop("training data 'txdat' missing"), tydat = NULL, exdat = NULL, weights = NULL, leave.one.out = FALSE, kernel.pow = 1.0, bandwidth.divide = FALSE, operator = names(ALL_OPERATORS), permutation.operator = names(PERMUTATION_OPERATORS), compute.score = FALSE, compute.ocg = FALSE, return.kernel.weights = FALSE, ...)

# S3 method for numeric
npksum(bws, txdat = stop("training data 'txdat' missing"), tydat, exdat, weights, leave.one.out, kernel.pow, bandwidth.divide, operator, permutation.operator, compute.score, compute.ocg, return.kernel.weights, ...)

Value

npksum returns an npkernelsum object with the following components:

eval

the evaluation points

ksum

the sum at the evaluation points

kw

the kernel weights (when return.kernel.weights=TRUE is specified)

Arguments

formula

a symbolic description of variables on which the sum is to be performed. The details of constructing a formula are described below.

data

an optional data frame, list or environment (or object coercible to a data frame by as.data.frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which the function is called.

newdata

An optional data frame in which to look for evaluation data. If omitted, data is used.

subset

an optional vector specifying a subset of observations to be used.

na.action

a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The ‘factory-fresh’ default is na.omit.

...

additional arguments supplied to specify the parameters to the default S3 method, which is called during estimation.

txdat

a \(p\)-variate data frame of sample realizations (training data) used to compute the sum.

tydat

a numeric vector of data to be weighted. The \(i\)th kernel weight is applied to the \(i\)th element. Defaults to 1.

exdat

a \(p\)-variate data frame of sum evaluation points (if omitted, defaults to the training data itself).

bws

a bandwidth specification. This can be set as any suitable bandwidth object returned from a bandwidth-generating function, or a numeric vector.

weights

a \(n\) by \(q\) matrix of weights which can optionally be applied to tydat in the sum. See details.

leave.one.out

a logical value specifying whether or not to compute leave-one-out sums. Cannot be used together with exdat. Defaults to FALSE.

kernel.pow

an integer specifying the power to which the kernels will be raised in the sum. Defaults to 1.

bandwidth.divide

a logical specifying whether or not to divide continuous kernel weights by their bandwidths. Use this with nearest-neighbor methods. Defaults to FALSE.

operator

a string specifying whether the normal, convolution, derivative, or integral kernels are to be used. Operators scale results by factors of \(h\) or \(1/h\) where appropriate. Defaults to normal and applies to all elements in a multivariate object. See details.

permutation.operator

a string which can have a value of none, normal, derivative, or integral. If set to something other than none (the default), then a separate result will be returned for each term in the product kernel, with the operator applied to that term. Permutation operators scale results by factors of \(h\) or \(1/h\) where appropriate. This is useful for computing gradients, for example.

compute.score

a logical specifying whether or not to return the score (the ‘grad h’ terms) for each dimension in addition to the kernel sum. Cannot be TRUE if a permutation operator other than "none" is selected.

compute.ocg

a logical specifying whether or not to return a separate result for each unordered and ordered dimension, where the product kernel term for that dimension is evaluated at an appropriate reference category. This is used primarily in npRmpi to compute ordered and unordered categorical gradients. See details.

return.kernel.weights

a logical specifying whether or not to return the matrix of generalized product kernel weights. Defaults to FALSE. See details.

Author

Tristen Hayfield tristen.hayfield@gmail.com, Jeffrey S. Racine racinej@mcmaster.ca

Usage Issues

If you are using data of mixed types, construct your input data with the data.frame function rather than cbind, since cbind coerces mixed data types to a single common type and will typically not behave as intended.
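A quick base-R illustration of the point above (a sketch using only base R; no np functions are involved): cbind silently coerces a factor to its underlying integer codes, while data.frame preserves the type information that npksum relies on to detect discrete variables.

```r
x <- rnorm(5)
z <- factor(c("a", "b", "a", "b", "a"))

m  <- cbind(x, z)               # numeric matrix; z is replaced by its 1/2 codes
df <- data.frame(x = x, z = z)  # x stays numeric, z stays a factor

class(m)          # a plain (numeric) matrix -- factor information lost
sapply(df, class) # "numeric" for x, "factor" for z -- information preserved
```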

Details

npksum exists so that you can create your own kernel objects with or without a variable to be weighted (default \(Y=1\)). With the options available, you could create new nonparametric tests or even new kernel estimators. The convolution kernel option would allow you to create, say, the least squares cross-validation function for kernel density estimation.
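As an illustration of building an estimator from kernel sums, the following sketch constructs the Nadaraya-Watson (local constant) regression estimator from two calls to npksum, taking the ratio of the Y-weighted sum to the plain sum (the denominator uses the default tydat = 1). The rule-of-thumb bandwidth via npudensbw and the simulated data are purely illustrative.

```r
library(np)
set.seed(42)
n <- 250
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.25)

## Rule-of-thumb bandwidth for illustration only
h <- npudensbw(dat = x, bwmethod = "normal-reference")$bw

## Numerator: each y_i weighted by its kernel weight;
## denominator: the plain kernel sum (tydat defaults to 1)
ghat <- npksum(txdat = x, tydat = y, bws = h)$ksum /
        npksum(txdat = x, bws = h)$ksum
```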

npksum uses highly-optimized C code that strives to minimize its ‘memory footprint’, and it incurs low overhead when called repeatedly (see, by way of illustration, the example below that conducts leave-one-out cross-validation for a local constant regression estimator via calls to the R function nlm, and compares this with the npregbw function).
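The leave-one-out cross-validation idea can be sketched as follows: an objective function built from leave-one-out kernel sums is minimized over the bandwidth with nlm. The data-generating process and the starting bandwidth are illustrative choices, not part of the package.

```r
library(np)
set.seed(123)
n <- 250
x <- rnorm(n)
y <- 1 + x + rnorm(n, sd = 0.5)

## Leave-one-out least-squares CV function for the local constant estimator
ss <- function(h) {
  if (h <= 0) return(.Machine$double.xmax)  # guard against invalid bandwidths
  mean.loo <- npksum(txdat = x, tydat = y, bws = h, leave.one.out = TRUE)$ksum /
              npksum(txdat = x, bws = h, leave.one.out = TRUE)$ksum
  mean((y - mean.loo)^2)
}

## Minimize over h, starting from an ad hoc rule-of-thumb value
h.cv <- nlm(ss, p = sd(x) * n^(-1/5))$estimate
```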

npksum implements a variety of methods for computing multivariate kernel sums (\(p\)-variate) defined over a set of possibly continuous and/or discrete (unordered, ordered) data. The approach is based on Li and Racine (2003) who employ ‘generalized product kernels’ that admit a mix of continuous and discrete data types.

Three classes of kernel estimators for the continuous data types are available: fixed, adaptive nearest-neighbor, and generalized nearest-neighbor. Adaptive nearest-neighbor bandwidths change with each sample realization in the set, \(x_i\), when estimating the kernel sum at the point \(x\). Generalized nearest-neighbor bandwidths change with the point at which the sum is computed, \(x\). Fixed bandwidths are constant over the support of \(x\).

npksum computes \(\sum_{j=1}^{n}{W_j^\prime Y_j K(X_j)}\), where \(W_j\) represents a row vector extracted from \(W\). That is, it computes the kernel weighted sum of the outer product of the rows of \(W\) and \(Y\). In the examples, the uses of such sums are illustrated.

npksum may be invoked either with a formula-like symbolic description of variables on which the sum is to be performed or through a simpler interface whereby data is passed directly to the function via the txdat and tydat parameters. Use of these two interfaces is mutually exclusive.

Data contained in the data frame txdat (and also exdat) may be a mix of continuous (default), unordered discrete (to be specified in the data frame txdat using the factor command), and ordered discrete (to be specified in the data frame txdat using the ordered command). Data can be entered in an arbitrary order and data types will be detected automatically by the routine (see npRmpi for details).

Data for which bandwidths are to be estimated may be specified symbolically. A typical description has the form dependent data ~ explanatory data, where dependent data and explanatory data are both series of variables specified by name, separated by the separation character '+'. For example, y1 ~ x1 + x2 specifies that y1 is to be kernel-weighted by x1 and x2 throughout the sum. See below for further examples.
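The two interfaces can be seen to compute the same sum in this sketch (npudensbw with a normal-reference rule is used only to supply a bandwidth object; the simulated data are illustrative):

```r
library(np)
set.seed(1)
df <- data.frame(y1 = rnorm(50), x1 = rnorm(50), x2 = rnorm(50))

bw <- npudensbw(~ x1 + x2, data = df, bwmethod = "normal-reference")

## Formula interface: y1 is kernel-weighted by x1 and x2
ks.f <- npksum(y1 ~ x1 + x2, data = df, bws = bw)

## Equivalent default interface
ks.d <- npksum(txdat = df[, c("x1", "x2")], tydat = df$y1, bws = bw)

all.equal(ks.f$ksum, ks.d$ksum)
```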

A variety of kernels may be specified by the user. Kernels implemented for continuous data types include the second, fourth, sixth, and eighth order Gaussian and Epanechnikov kernels, and the uniform kernel. Unordered discrete data types use a variation on Aitchison and Aitken's (1976) kernel, while ordered data types use a variation of the Wang and van Ryzin (1981) kernel (see npRmpi for details).

The option operator= can be used to ‘mix and match’ operator strings to create a ‘hybrid’ kernel, provided they match the dimension of the data. For example, for a two-dimensional data frame of numeric data types, operator=c("normal","derivative") will use the normal (i.e. PDF) kernel for variable one and the derivative of the PDF kernel for variable two. Note that applying operators will scale the results by factors of \(h\) or \(1/h\) where appropriate.

The option permutation.operator= returns, in addition to the kernel sum with no operators applied (in the ksum member), one extra sum for each continuous dimension in the data, with the chosen operator applied to that dimension's term in the product kernel (in the p.ksum member). For example, for a two-dimensional data frame of numeric data types, permutation.operator=c("derivative") will return the usual kernel sum, as if operator = c("normal","normal"), in the ksum member, and in the p.ksum member it will return the kernel sums for operator = c("derivative","normal") and operator = c("normal","derivative"). This makes the computation of gradients much easier.
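A minimal sketch of this option (the ad hoc bandwidths are illustrative only): ksum holds the plain kernel sum, while p.ksum holds one sum per continuous dimension with the derivative kernel applied to that dimension's term.

```r
library(np)
set.seed(7)
x1 <- rnorm(100)
x2 <- rnorm(100)
h <- c(sd(x1), sd(x2)) * 100^(-1/6)  # ad hoc bandwidths for illustration

ks <- npksum(txdat = data.frame(x1, x2), bws = h,
             permutation.operator = "derivative")

## ks$ksum   : the usual kernel sum (no operators applied)
## ks$p.ksum : one kernel sum per continuous dimension, with the
##             derivative kernel applied to that dimension
```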

The option compute.score= can be used to compute the gradients with respect to \(h\) in addition to the usual kernel sum. As with permutation operators, the additional results are returned in the p.ksum member. This option cannot be used in conjunction with permutation.operator.

The option compute.ocg= works much like permutation.operator, but for discrete variables. The kernel is evaluated at a reference category in each dimension: for ordered data, the next lowest category is selected, except in the case of the lowest category, where the second lowest category is selected; for unordered data, the first category is selected. These additional data are returned in the p.ksum member. This option can be set simultaneously with permutation.operator.

The option return.kernel.weights=TRUE returns a matrix of dimension ‘number of training observations’ by ‘number of evaluation observations’ containing only the generalized product kernel weights; all other objects and options that may be provided to npksum are ignored (e.g. bandwidth.divide=TRUE will have no effect). Summing the columns of the weight matrix and dividing by ‘number of training observations’ times the product of the bandwidths (i.e. colMeans(foo$kw)/prod(h)) produces the kernel estimator of a (multivariate) density (operator="normal") or multivariate cumulative distribution (operator="integral").
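The density recipe described above can be sketched as follows (the normal-reference bandwidth is an illustrative choice); the result agrees with the direct kernel sum computed with bandwidth.divide=TRUE:

```r
library(np)
set.seed(99)
n <- 500
x <- rnorm(n)
h <- 1.06 * sd(x) * n^(-1/5)  # ad hoc normal-reference bandwidth

## Raw generalized product kernel weights (n training by n evaluation)
foo <- npksum(txdat = x, bws = h, return.kernel.weights = TRUE)

## Kernel density estimate at the training points, per the recipe above
fhat <- colMeans(foo$kw) / prod(h)

## Equivalent direct computation via the kernel sum
fhat2 <- npksum(txdat = x, bws = h, bandwidth.divide = TRUE)$ksum / n

all.equal(fhat, fhat2)
```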

References

Aitchison, J. and C.G.G. Aitken (1976), “Multivariate binary discrimination by the kernel method,” Biometrika, 63, 413-420.

Li, Q. and J.S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press.

Li, Q. and J.S. Racine (2003), “Nonparametric estimation of distributions with categorical and continuous data,” Journal of Multivariate Analysis, 86, 266-292.

Pagan, A. and A. Ullah (1999), Nonparametric Econometrics, Cambridge University Press.

Scott, D.W. (1992), Multivariate Density Estimation. Theory, Practice and Visualization, New York: Wiley.

Silverman, B.W. (1986), Density Estimation, London: Chapman and Hall.

Wang, M.C. and J. van Ryzin (1981), “A class of smooth estimators for discrete distributions,” Biometrika, 68, 301-309.

Examples

if (FALSE) {
## Not run in checks: excluded to keep MPI examples stable and check times short.
## The following example is adapted for interactive parallel execution
## in R. Here we spawn 1 slave so that there will be two compute nodes
## (master and slave).  Kindly see the batch examples in the demos
## directory (npRmpi/demos) and study them carefully. Also kindly see
## the more extensive examples in the np package itself. See the npRmpi
## vignette for further details on running parallel np programs via
## vignette("npRmpi",package="npRmpi").

## Start npRmpi for interactive execution. If slaves are already running and
## `options(npRmpi.reuse.slaves=TRUE)` (default on some systems), this will
## reuse the existing pool instead of respawning. To change the number of
## slaves, call `npRmpi.stop(force=TRUE)` then restart.
npRmpi.start(nslaves=1)

n <- 100000
x <- rnorm(n)
x.eval <- seq(-4, 4, length=50)

mpi.bcast.Robj2slave(x)
mpi.bcast.Robj2slave(x.eval)

mpi.bcast.cmd(bw <- npudensbw(dat=x, bwmethod="normal-reference"),
              caller.execute=TRUE)

mpi.bcast.cmd(den.ksum <- npksum(txdat=x, exdat=x.eval, bws=bw$bw,
                   bandwidth.divide=TRUE)$ksum/n,
              caller.execute=TRUE)

den.ksum

## For interactive runs we close the slaves here, perhaps to proceed
## with other examples and so forth; this step is redundant in batch mode.

## Note: on some systems (notably macOS+MPICH), repeatedly spawning and
## tearing down slaves in the same R session can lead to hangs/crashes.
## npRmpi may therefore keep slave daemons alive by default and
## `npRmpi.stop()` performs a "soft close". Use `force=TRUE` to
## actually shut down the slaves.
##
## You can disable reuse via `options(npRmpi.reuse.slaves=FALSE)` or by
## setting the environment variable `NP_RMPI_NO_REUSE_SLAVES=1` before
## loading the package.

npRmpi.stop()               ## soft close (may keep slaves alive)
## npRmpi.stop(force=TRUE)  ## hard close

## Note that in order to exit npRmpi properly avoid quit(), and instead
## use mpi.quit() as follows.

## mpi.bcast.cmd(mpi.quit(),
##               caller.execute=TRUE)
} 
