npcdens: Kernel Conditional Density Estimation with Mixed Data Types

Description

npcdens computes kernel conditional density estimates on \(p+q\)-variate evaluation data, given a set of training data (both explanatory and dependent) and a bandwidth specification (a conbandwidth object or a bandwidth vector, bandwidth type, and kernel type) using the method of Hall, Racine, and Li (2004). The data may be continuous, discrete (unordered and ordered factors), or some combination thereof.

Usage

npcdens(bws, ...)
# S3 method for formula
npcdens(bws, data = NULL, newdata = NULL, ...)
# S3 method for call
npcdens(bws, ...)
# S3 method for conbandwidth
npcdens(bws,
        txdat = stop("invoked without training data 'txdat'"),
        tydat = stop("invoked without training data 'tydat'"),
        exdat,
        eydat,
        gradients = FALSE,
        ...)
# S3 method for default
npcdens(bws, txdat, tydat, ...)

Value

npcdens returns a condensity object. The generic accessor functions fitted, se, and gradients extract estimated values, asymptotic standard errors on estimates, and gradients, respectively, from the returned object. Furthermore, the functions predict, summary, and plot support objects of this class. The returned objects have the following components:

xbw: bandwidth(s), scale factor(s) or nearest neighbours for the explanatory data, txdat
ybw: bandwidth(s), scale factor(s) or nearest neighbours for the dependent data, tydat
xeval: the evaluation points of the explanatory data
yeval: the evaluation points of the dependent data
condens: estimates of the conditional density at the evaluation points
conderr: standard errors of the conditional density estimates
congrad: if invoked with gradients = TRUE, estimates of the gradients at the evaluation points
congerr: if invoked with gradients = TRUE, standard errors of the gradients at the evaluation points
log_likelihood: log likelihood of the conditional density estimate

Arguments

bws

a bandwidth specification. This can be set as a conbandwidth object returned from a previous invocation of npcdensbw, or as a \(p+q\)-vector of bandwidths, with each element \(i\) up to \(i=q\) corresponding to the bandwidth for column \(i\) in tydat, and each element \(i\) from \(i=q+1\) to \(i=p+q\) corresponding to the bandwidth for column \(i-q\) in txdat. If specified as a vector, then additional arguments will need to be supplied as necessary to specify the bandwidth type, kernel types, training data, and so on.
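The ordering of a bandwidth vector described above can be sketched in base R. This is only an illustration: split_bws is a hypothetical helper that is not part of npRmpi, and the values of p, q, and the bandwidths are made up.

```r
# Hypothetical helper (not part of npRmpi): split a (p+q)-vector of
# bandwidths into the parts that apply to tydat and to txdat.
split_bws <- function(bws, p, q) {
  stopifnot(length(bws) == p + q)
  list(
    ybw = bws[seq_len(q)],      # elements 1..q     -> columns of tydat
    xbw = bws[q + seq_len(p)]   # elements q+1..p+q -> columns of txdat
  )
}

# Illustrative values: q = 1 dependent variable, p = 2 explanatory variables.
parts <- split_bws(c(0.8, 0.3, 1.5), p = 2, q = 1)
print(parts)
```

Here the first element would apply to the single column of tydat and the remaining two to the columns of txdat, in order.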

gradients

a logical value specifying whether to return estimates of the gradients at the evaluation points. Defaults to FALSE.

...

additional arguments supplied to specify the bandwidth type, kernel types, and so on. This is necessary if you specify bws as a \(p+q\)-vector and not a conbandwidth object, and you do not desire the default behaviours. To do this, you may specify any of bwmethod, bwscaling, bwtype, cxkertype, cxkerorder, cykertype, cykerorder, uxkertype, uykertype, oxkertype, oykertype, as described in npcdensbw.

data

an optional data frame, list or environment (or object coercible to a data frame by as.data.frame) containing the variables in the model. If not found in data, the variables are taken from environment(bws), typically the environment from which npcdensbw was called.

newdata

an optional data frame in which to look for evaluation data. If omitted, the training data are used.

txdat

a \(p\)-variate data frame of sample realizations of explanatory data (training data). Defaults to the training data used to compute the bandwidth object.

tydat

a \(q\)-variate data frame of sample realizations of dependent data (training data). Defaults to the training data used to compute the bandwidth object.

exdat

a \(p\)-variate data frame of explanatory data on which conditional densities will be evaluated. By default, evaluation takes place on the data provided by txdat.

eydat

a \(q\)-variate data frame of dependent data on which conditional densities will be evaluated. By default, evaluation takes place on the data provided by tydat.

Author

Tristen Hayfield tristen.hayfield@gmail.com, Jeffrey S. Racine racinej@mcmaster.ca

Usage Issues

If you are using data of mixed types, then it is advisable to use the data.frame function to construct your input data and not cbind, since cbind will typically not work as intended on mixed data types and will coerce the data to the same type.
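The difference can be seen in a small base R sketch. The variable names and values are illustrative only.

```r
x <- c(1.5, 2.5, 3.5)                                # continuous
f <- factor(c("a", "b", "a"))                        # unordered discrete
o <- ordered(c("low", "high", "low"),
             levels = c("low", "high"))              # ordered discrete

m  <- cbind(x, f, o)        # numeric matrix: factors collapse to integer codes
df <- data.frame(x, f, o)   # each column keeps its own type

is.numeric(m)      # the factor information is gone
is.factor(df$f)    # still an unordered factor
is.ordered(df$o)   # still an ordered factor
```

Because the matrix returned by cbind holds a single type, the routine would have no way to tell the discrete columns from the continuous one.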

Details

npcdens implements a variety of methods for estimating multivariate conditional distributions (\(p+q\)-variate) defined over a set of possibly continuous and/or discrete (unordered, ordered) data. The approach is based on Li and Racine (2004) who employ ‘generalized product kernels’ that admit a mix of continuous and discrete data types.
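As a rough illustration of the idea, the following base R sketch estimates a conditional density as the ratio of a joint kernel estimate to a marginal kernel estimate, for one continuous explanatory variable and one continuous dependent variable. It is not the npcdens implementation: cond_dens is a hypothetical function, the simulated data are made up, and the bandwidths hx and hy are chosen by hand rather than by npcdensbw. With mixed data, the product would also include kernels for the discrete variables.

```r
# Sketch of a kernel conditional density estimate with second-order
# Gaussian kernels and fixed bandwidths (assumed, hand-picked values).
cond_dens <- function(y, x, ydat, xdat, hy, hx) {
  kx <- dnorm((x - xdat) / hx) / hx   # kernel weights in x
  ky <- dnorm((y - ydat) / hy) / hy   # kernel weights in y
  sum(kx * ky) / sum(kx)              # joint estimate / marginal estimate
}

set.seed(1)
xdat <- rnorm(200)
ydat <- 1 + 0.5 * xdat + rnorm(200, sd = 0.5)

# Conditional density of y at y = 1, given x = 0
f_at_1 <- cond_dens(1, 0, ydat, xdat, hy = 0.3, hx = 0.4)

# For a fixed x, the estimate should integrate to about 1 over y
ygrid <- seq(-5, 7, length.out = 2001)
fgrid <- vapply(ygrid, cond_dens, numeric(1), x = 0,
                ydat = ydat, xdat = xdat, hy = 0.3, hx = 0.4)
area <- sum(fgrid) * (ygrid[2] - ygrid[1])
```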

Three classes of kernel estimators for the continuous data types are available: fixed, adaptive nearest-neighbor, and generalized nearest-neighbor. Adaptive nearest-neighbor bandwidths change with each sample realization in the set, \(x_i\), when estimating the density at the point \(x\). Generalized nearest-neighbor bandwidths change with the point at which the density is estimated, \(x\). Fixed bandwidths are constant over the support of \(x\).
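The distinction between the three bandwidth types can be sketched in base R for a single continuous variable. The helper kth_dist, the sample xdat, the neighbour count k, and the fixed bandwidth h are all assumptions made for illustration; the package's own scaling conventions may differ.

```r
xdat <- c(0.0, 0.1, 0.2, 1.0, 3.0)
k <- 2      # number of nearest neighbours (illustrative)
h <- 0.5    # fixed bandwidth (illustrative)

# Distance from a point to its k-th nearest point in `others`
kth_dist <- function(point, others, k) sort(abs(point - others))[k]

# Fixed: the same bandwidth at every evaluation point
bw_fixed <- function(x) h

# Generalized nearest-neighbour: depends on the evaluation point x
bw_generalized <- function(x) kth_dist(x, xdat, k)

# Adaptive nearest-neighbour: one bandwidth per sample point x_i,
# measured to its k-th nearest neighbour among the other sample points
bw_adaptive <- vapply(seq_along(xdat),
                      function(i) kth_dist(xdat[i], xdat[-i], k),
                      numeric(1))
```

In this sketch the nearest-neighbour bandwidths widen where the sample is sparse (near 3.0) and narrow where it is dense (near 0.1), while the fixed bandwidth stays the same.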

Training and evaluation input data may be a mix of continuous (default), unordered discrete (to be specified in the data frames using factor), and ordered discrete (to be specified in the data frames using ordered). Data can be entered in an arbitrary order and data types will be detected automatically by the routine (see npRmpi for details).

A variety of kernels may be specified by the user. Kernels implemented for continuous data types include the second, fourth, sixth, and eighth order Gaussian and Epanechnikov kernels, and the uniform kernel. Unordered discrete data types use a variation on Aitchison and Aitken's (1976) kernel, while ordered data types use a variation of the Wang and van Ryzin (1981) kernel.
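For orientation, the two discrete kernels can be written in their textbook forms as below. The package uses variations of these, so the exact weights are an assumption; aa_kernel and wvr_kernel are hypothetical names, and lambda is an illustrative smoothing parameter.

```r
# Aitchison-Aitken kernel for an unordered variable with n_cat categories
aa_kernel <- function(xi, x, lambda, n_cat) {
  ifelse(xi == x, 1 - lambda, lambda / (n_cat - 1))
}

# Wang-van Ryzin kernel for an ordered (integer-coded) variable
wvr_kernel <- function(xi, x, lambda) {
  ifelse(xi == x, 1 - lambda, 0.5 * (1 - lambda) * lambda^abs(xi - x))
}

lambda <- 0.2

# Weights given to categories "a", "b", "c" when evaluating at "a"
aa_weights <- aa_kernel(c("a", "b", "c"), "a", lambda, n_cat = 3)

# Weights given to the integers -60..60 when evaluating at 0
wvr_weights <- wvr_kernel(-60:60, 0, lambda)
```

In both forms, lambda = 0 puts all weight on exact matches, and larger values spread weight to other categories; the ordered kernel lets that weight decay with distance.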

References

Aitchison, J. and C.G.G. Aitken (1976), “Multivariate binary discrimination by the kernel method,” Biometrika, 63, 413-420.

Hall, P., J.S. Racine, and Q. Li (2004), “Cross-validation and the estimation of conditional probability densities,” Journal of the American Statistical Association, 99, 1015-1026.

Li, Q. and J.S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press.

Pagan, A. and A. Ullah (1999), Nonparametric Econometrics, Cambridge University Press.

Scott, D.W. (1992), Multivariate Density Estimation. Theory, Practice and Visualization, New York: Wiley.

Silverman, B.W. (1986), Density Estimation, London: Chapman and Hall.

Wang, M.C. and J. van Ryzin (1981), “A class of smooth estimators for discrete distributions,” Biometrika, 68, 301-309.

Examples

if (FALSE) {
## Not run in checks: excluded to keep MPI examples stable and check times short.
## The following example is adapted for interactive parallel execution
## in R. Here we spawn 1 slave so that there will be two compute nodes
## (master and slave).  Kindly see the batch examples in the demos
## directory (npRmpi/demos) and study them carefully. Also kindly see
## the more extensive examples in the np package itself. See the npRmpi
## vignette for further details on running parallel np programs via
## vignette("npRmpi",package="npRmpi").

## Start npRmpi for interactive execution. If slaves are already running and
## `options(npRmpi.reuse.slaves=TRUE)` (default on some systems), this will
## reuse the existing pool instead of respawning. To change the number of
## slaves, call `npRmpi.stop(force=TRUE)` then restart.
npRmpi.start(nslaves=1)

mpi.bcast.cmd(data("Italy"),
              caller.execute=TRUE)
mpi.bcast.cmd(attach(Italy),
              caller.execute=TRUE)

mpi.bcast.cmd(bw <- npcdensbw(formula=gdp~ordered(year)),
              caller.execute=TRUE)

mpi.bcast.cmd(fhat <- npcdens(bws=bw),
              caller.execute=TRUE)

summary(fhat)

## For the interactive run only we close the slaves perhaps to proceed
## with other examples and so forth. This is redundant in batch mode.

## Note: on some systems (notably macOS+MPICH), repeatedly spawning and
## tearing down slaves in the same R session can lead to hangs/crashes.
## npRmpi may therefore keep slave daemons alive by default and
## `npRmpi.stop()` performs a "soft close". Use `force=TRUE` to
## actually shut down the slaves.
##
## You can disable reuse via `options(npRmpi.reuse.slaves=FALSE)` or by
## setting the environment variable `NP_RMPI_NO_REUSE_SLAVES=1` before
## loading the package.

npRmpi.stop()               ## soft close (may keep slaves alive)
## npRmpi.stop(force=TRUE)  ## hard close

## Note that in order to exit npRmpi properly avoid quit(), and instead
## use mpi.quit() as follows.

## mpi.bcast.cmd(mpi.quit(),
##               caller.execute=TRUE)
}
