feature (version 1.2.2)

featureSignif: Feature significance for kernel density estimation

Description

Identify significant features of kernel density estimates of 1- to 4-dimensional data.

Usage

featureSignif(x, bw, gridsize, scaleData=FALSE, addSignifGrad=TRUE,
   addSignifCurv=TRUE, signifLevel=0.05)

Arguments

x
data matrix
bw
vector of bandwidth(s)
gridsize
vector of estimation grid sizes
scaleData
flag for scaling the data, i.e. transforming each dimension to unit variance (see the sketch below this list)
addSignifGrad
flag for computing significant gradient regions
addSignifCurv
flag for computing significant curvature regions
signifLevel
significance level
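
A rough illustration of what scaleData=TRUE is documented to do (a sketch of the transformation, not the package source; the sample data are made up):

## Rescale each column of a data matrix to unit variance
set.seed(2)
x <- cbind(rnorm(50, sd=3), rnorm(50, sd=0.5))
x.scaled <- sweep(x, 2, apply(x, 2, sd), "/")
apply(x.scaled, 2, sd)   # both columns now have standard deviation 1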

Value

Returns an object of class fs, which is a list with the following fields:

x - data matrix
names - name labels used for plotting
bw - vector of bandwidths
fhat - kernel density estimate on a grid
grad - logical grid for significant gradient
curv - logical grid for significant curvature
gradData - logical vector for significant gradient data points
gradDataPoints - significant gradient data points
curvData - logical vector for significant curvature data points
curvDataPoints - significant curvature data points

Details

Feature significance is based on significance testing of the gradient (first derivative) and curvature (second derivative) of a kernel density estimate. This was developed for 1-d data by Chaudhuri & Marron (1999), for 2-d data by Godtliebsen, Marron & Chaudhuri (2002), and for 3-d and 4-d data by Duong, Cowling, Koch & Wand (2008).

The test statistic for gradient testing at a point $\mathbf{x}$ is $$W(\mathbf{x}) = \Vert \widehat{\nabla f} (\mathbf{x}; \mathbf{H}) \Vert^2$$ where $\widehat{\nabla f} (\mathbf{x};\mathbf{H})$ is the kernel estimate of the gradient of $f(\mathbf{x})$ with bandwidth matrix $\mathbf{H}$, and $\Vert\cdot\Vert$ is the Euclidean norm. $W(\mathbf{x})$ is approximately chi-squared distributed with $d$ degrees of freedom, where $d$ is the dimension of the data.
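
To make the statistic concrete, here is a minimal one-dimensional sketch using base R and a Gaussian kernel (the kde.grad helper and the sample are constructed for this illustration and are not part of the package; the published test also standardises the gradient estimate by its estimated variance before comparing against a chi-squared quantile):

## Kernel estimate of the density gradient: the derivative of the
## Gaussian kernel is phi'(u) = -u * phi(u)
kde.grad <- function(x, data, h)
{
  u <- outer(x, data, "-")/h
  rowMeans(-u*dnorm(u))/h^2
}

set.seed(1)
X <- c(rnorm(100, -2), rnorm(100, 2))    # bimodal sample
x.grid <- seq(-5, 5, length.out=401)
W <- kde.grad(x.grid, X, h=0.4)^2        # unstandardised W(x), d=1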

The analogous test statistic for curvature is $$W^{(2)}(\mathbf{x}) = \Vert \mathrm{vech}\, \widehat{\nabla^{(2)}f} (\mathbf{x}; \mathbf{H})\Vert^2$$ where $\widehat{\nabla^{(2)} f} (\mathbf{x};\mathbf{H})$ is the kernel estimate of the curvature (Hessian) of $f(\mathbf{x})$, and $\mathrm{vech}$ is the vector-half operator, which stacks the lower-triangular elements of a symmetric matrix into a vector. $W^{(2)}(\mathbf{x})$ is approximately chi-squared distributed with $d(d+1)/2$ degrees of freedom.
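
The vector-half operator is simple to state in code. A small illustration (the vech helper is written for this example, not exported by the package):

## vech stacks the lower-triangular (including diagonal) entries of a
## symmetric d x d matrix, giving the d(d+1)/2 distinct entries that
## determine the degrees of freedom of W^(2)
vech <- function(M) M[lower.tri(M, diag=TRUE)]

H <- matrix(c(2, 1, 1, 3), 2, 2)   # symmetric 2 x 2 matrix
vech(H)                            # 2 1 3: length d(d+1)/2 = 3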

Since this involves many dependent hypothesis tests, a multiple comparison (simultaneous) procedure is needed to control the overall significance level; a Hochberg-type procedure is used. See Hochberg (1988) and Duong, Cowling, Koch & Wand (2008).
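
For intuition, the Hochberg step-up adjustment itself is available in base R via p.adjust; the snippet below illustrates the adjustment on made-up p-values and is not how the package organises its computation over the estimation grid:

## Hochberg-adjusted p-values for a family of dependent tests
p <- c(0.001, 0.012, 0.020, 0.035, 0.300)
p.hoch <- p.adjust(p, method="hochberg")
which(p.hoch <= 0.05)   # tests still significant at overall level 0.05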

References

Chaudhuri, P. and Marron, J.S. (1999) SiZer for exploration of structures in curves. Journal of the American Statistical Association, 94, 807-823.

Duong, T., Cowling, A., Koch, I., Wand, M.P. (2008) Feature significance for multivariate kernel density estimation. Computational Statistics and Data Analysis, 52, 4225-4242.

Godtliebsen, F., Marron, J.S. and Chaudhuri, P. (2002) Significance in scale space for bivariate density estimation. Journal of Computational and Graphical Statistics, 11, 1-22.

Hochberg, Y. (1988) A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800-802.

Wand, M.P. and Jones, M.C. (1995) Kernel Smoothing. Chapman & Hall/CRC, London.

See Also

featureSignifGUI, plot.fs

Examples

## Univariate example
library(feature)
data(earthquake)
eq3 <- -log10(-earthquake[,3])   # log-transform the (negative-valued) third column
fs <- featureSignif(eq3, bw=0.1)
plot(fs, addSignifGradRegion=TRUE)

## Bivariate example
library(MASS)
data(geyser)
fs <- featureSignif(geyser)
plot(fs, addSignifCurvRegion=TRUE)

## Trivariate example
data(earthquake)
earthquake[,3] <- -log10(-earthquake[,3])
fs <- featureSignif(earthquake, scaleData=TRUE, bw=c(0.06, 0.06, 0.05))
plot(fs, addKDE=TRUE)
plot(fs, addKDE=FALSE, addSignifCurvRegion=TRUE)
