kden: Kernel Density Estimation Using Normal Kernel

Description

Density, cumulative distribution function, quantile function and random number generation for the kernel density estimation using the normal kernel with a constant bandwidth lambda. The kernel centres (typically the data) are given by kerncentres.

Usage

dkden(x, kerncentres, lambda = NULL, log = FALSE)

  pkden(q, kerncentres, lambda = NULL, lower.tail = TRUE)

  qkden(p, kerncentres, lambda = NULL, lower.tail = TRUE)

  rkden(n = 1, kerncentres, lambda = NULL)

Arguments

lambda

bandwidth for normal kernel (standard deviation of normal)

kerncentres

kernel centres (typically sample data)

quantile

log

logical, if TRUE then log density

quantile

lower.tail

logical, if FALSE then upper tail probabilities

cumulative probability

sample size (non-negative integer)

Value

dkden gives the density, pkden gives the cumulative distribution function, qkden gives the quantile function and rkden gives a random sample.

Details

Kernel density estimation using normal density as kernel. The density function dkden produces exactly the same density estimate as density when a sequence of x values are provided, see examples. The latter function is far more efficient in this situation as it takes advantage of the computational savings from doing the kernel smoothing in the spectral domain, where the convolution becomes a multiplication. So even after accounting for applying the (Fast) Fourier Transform (FFT) and its inverse it is much more efficient especially for a large sample size. However, this KDE function applies the less efficient convolution using the classic definition: $$\hat{f}_(x) = \frac{1}{n} \sum_{j=1}^{n} K(\frac{x - x_j}{\lambda})$$ where $K(.)$ is the density function for the standard normal. Thus is no restriction on the values x can take. Computationally for a particular x the density is then just mean(dnorm(x, kerncentres, lambda)) for the density and mean(pnorm(x, kerncentres, lambda)) for cumulative distribution function. The random number generation is achieved by treating the KDE as a mixture model with equal probability of coming from each kernel, given by

rnorm(rep(1, n), sample(kerncentres, n,
  replace = TRUE), sd = lambda)

. The sample() function decides which kernel each of the n generated samples comes from and gives the normal random number generator the kernel center as its mean, with lambda as the kernel standard deviation. The quantile function is rather more complicated as there is no closed form solution, so is typically obtained by approximation or numerical solution to $P(X \le x_p) = p$ to find $x_p$. The quantile function qkden evaluates the KDE cumulative distribution function over the range from

c(min(kerncentre) - 5*lambda, max(kerncentre) -
  5*lambda)

as for normal kernel the probability of being outside this range is of the order 1e-7. Outside of the range the quantiles are set to -Inf for lower tail and Inf for upper tail. A sequence of values of length fifty times the number of kernels is first calculated. Spline based interpolation using splinefun, with default fmm method, is then used to approximate the quantile function. This is a similar approach to that taken by Matt Wand in the qkde in the ks package. If no bandwidth is provided lambda=NULL then the normal reference rule is used, from the function bw.nrd0, which is consistent with the density function. At least two kernel centres must be provided as the variance needs to be estimated.

References

http://en.wikipedia.org/wiki/Kernel_density_estimation http://en.wikipedia.org/wiki/Cross-validation_(statistics) Scarrott, C.J. and MacDonald, A. (2012). A review of extreme value threshold estimation and uncertainty quantification. REVSTAT - Statistical Journal 10(1), 33-59. Available from http://www.ine.pt/revstat/pdf/rs120102.pdf Bowman, A.W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika 71(2), 353-360. Duin, R.P.W. (1976). On the choice of smoothing parameters for Parzen estimators of probability density functions. IEEE Transactions on Computers C25(11), 1175-1179. MacDonald, A., Scarrott, C.J., Lee, D., Darlow, B., Reale, M. and Russell, G. (2011). A flexible extreme value mixture model. Computational Statistics and Data Analysis 55(6), 2137-2157.

Examples

Run this code

nk=50
x = rnorm(nk)
xx = seq(-5, 5, 0.01)
plot(xx, dnorm(xx))
rug(x)
for (i in 1:nk) lines(xx, dnorm(xx, x[i], sd = bw.nrd0(x))*0.05)
lines(xx, dkden(xx, x), lwd = 2, col = "red")
lines(density(x), lty = 2, lwd = 2, col = "green")
legend("topright", c("True Density", "KDE Using evmix", "KDE Using density function"),
lty = c(1, 1, 2), lwd = c(1, 2, 2), col = c("black", "red", "green"))

# Estimate bandwidth using cross-validation likelihood
fit = fkden(x)
hist(x, nk/5, freq = FALSE, xlim = c(-5, 5))
rug(x)
for (i in 1:nk) lines(xx, dnorm(xx, x[i], sd = fit$lambda)*0.05)
lines(xx,dnorm(xx), col = "black")
lines(xx, dkden(xx, x, lambda = fit$lambda), lwd = 2, col = "red")
lines(density(x), lty = 2, lwd = 2, col = "green")
lines(density(x, bw = fit$lambda), lwd = 2, lty = 2,  col = "blue")
legend("topright", c("True Density", "KDE fitted evmix",
"KDE Using density, default bandwidth", "KDE Using density, c-v likelihood bandwidth"),
lty = c(1, 1, 2, 2), lwd = c(1, 2, 2, 2), col = c("black", "red", "green", "blue"))

plot(xx, pnorm(xx), type = "l")
rug(x)
for (i in 1:nk) lines(xx, dnorm(xx, x[i], sd = fit$lambda)*0.05)
lines(xx, pkden(xx, x), lwd = 2, col = "red")
lines(xx, pkden(xx, x, lambda = fit$lambda), lwd = 2, col = "green")
# green and blue (quantile) function should be same
p = seq(0, 1, 0.001)
lines(qkden(p, x, lambda = fit$lambda), p, lwd = 2, lty = 2, col = "blue")
legend("topleft", c("True Density", "KDE using evmix, normal reference rule",
"KDE using evmix, c-v likelihood","KDE quantile function, c-v likelihood"),
lty = c(1, 1, 1, 2), lwd = c(1, 2, 2, 2), col = c("black", "red", "green", "blue"))

xnew = rkden(1000, x, lambda = fit$lambda)
hist(xnew, breaks = 200, freq = FALSE, xlim = c(-5, 5))
rug(xnew)
lines(xx,dnorm(xx), col = "black")
lines(xx, dkden(xx, x), lwd = 2, col = "red")
legend("topright", c("True Density", "KDE Using evmix"),
lty = c(1, 2), lwd = c(1, 2), col = c("black", "red"))

Run the code above in your browser using DataLab