feature (version 1.1-3)

featureSignif: Feature significance for kernel density estimation

Description

Identify significant features based on kernel density estimates of 1- to 4-dimensional data. The bandwidths for the kernel density estimate can either be chosen interactively or be specified in advance for non-interactive use.

Usage

featureSignif(x, bw, xlab, ylab, zlab, xlim, ylim, zlim,
   addData=FALSE, scaleData=FALSE, addDataNum=1000,
   addKDE=TRUE, jitterRug=TRUE, signifLevel=0.05, 
   addSignifGradRegion=FALSE, addSignifGradData=FALSE,
   addSignifCurvRegion=FALSE, addSignifCurvData=FALSE,
   plotSiZer=FALSE, addAxes3d=TRUE,
   densCol, dataCol="black", gradCol="green4", curvCol="blue",
   axisCol="black", bgCol="white", gridsize)

Arguments

x
data matrix
bw
bandwidth(s) - see below for details on how to specify bandwidths
xlim, ylim, zlim
x-, y-, z-axis limits
xlab, ylab, zlab
x-, y-, z-axis labels
scaleData
flag for scaling the data i.e. transforming to unit variance for each dimension. Default is FALSE.
addData
flag for display of the data. Default is FALSE.
addDataNum
maximum number of data points plotted in displays. Default is 1000.
addKDE
flag for display of kernel density estimates. Default is TRUE. Not available for 4-d data.
jitterRug
flag for jittering of rug-plot for univariate data display. Default is TRUE.
addSignifGradRegion
flag for display of significant gradient regions. Default is FALSE. Not available for 4-d data.
addSignifGradData
flag for display of significant gradient data points. Default is FALSE.
addSignifCurvRegion
flag for display of significant curvature regions. Default is FALSE. Not available for 4-d data.
addSignifCurvData
flag for display of significant curvature data points. Default is FALSE.
plotSiZer
flag for display of 1-d gradient SiZer map. Default is FALSE.
addAxes3d
flag for displaying axes in 3-d displays. Default is TRUE.
signifLevel
significance level. Default is 0.05.
densCol
colour of density estimate curve. Default for 1-d data is "DarkOrange". Default for 2-d data is heat.colors(1000). Default for 3-d data is heat.colors(5).
dataCol
colour of data points. Default is "black".
gradCol
colour of significant gradient regions/points. Default is "green4".
curvCol
colour of significant curvature regions/points. Default is "blue".
axisCol
colour of axes. Default is "black".
bgCol
colour of background. Default is "white".
gridsize
vector of the number of grid points in each direction.
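
The scaleData flag is described above as transforming each dimension to unit variance. A minimal base R sketch of that kind of rescaling follows; the simulated data and the choice not to centre the columns are assumptions made for illustration, and the package's internal implementation may differ.

```r
# Sketch only: rescale each column of a data matrix to unit variance.
# The simulated data below is hypothetical.
set.seed(1)
x <- cbind(a = rnorm(200, sd = 5), b = rnorm(200, sd = 0.2))

# Divide each column by its sample standard deviation (no centring).
x_scaled <- sweep(x, 2, apply(x, 2, sd), "/")

# Each column of x_scaled now has sample variance 1.
apply(x_scaled, 2, var)
```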

Value

If bw is not specified, then a range of possible bandwidths is calculated automatically. For univariate data, bw can be either a scalar or a vector. A scalar is used as the single bandwidth for the KDE. A vector is interpreted as a range of bandwidths.

For multivariate data, bw can be either a vector or a matrix. A vector is used as the single (vector) bandwidth for the KDE. A matrix is interpreted as a range of bandwidths, in which the first row gives the minimum values and the second row the maximum values.

A range of bandwidths puts the function in interactive mode. A single bandwidth puts it in non-interactive mode.
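
The rules above for how bw is interpreted can be summarised in a short sketch. The helper bw_mode is hypothetical and is not part of the feature package; it only restates the rules in code, taking the data dimension d as an explicit argument.

```r
# Hypothetical helper (not part of the 'feature' package): classifies a bw
# argument according to the rules described in the text.
bw_mode <- function(bw = NULL, d) {
  if (is.null(bw)) return("interactive")  # a range is computed automatically
  if (d == 1) {
    # univariate: scalar = single bandwidth, vector = range
    if (length(bw) == 1) "non-interactive" else "interactive"
  } else {
    # multivariate: vector = single bandwidth, 2-row matrix = range
    if (is.matrix(bw)) "interactive" else "non-interactive"
  }
}

bw_mode(0.1, d = 1)                          # scalar, univariate
bw_mode(c(0.05, 0.5), d = 1)                 # range, univariate
bw_mode(c(4.5, 0.37), d = 2)                 # vector, bivariate
bw_mode(rbind(c(1, 0.1), c(5, 0.9)), d = 2)  # min/max rows, bivariate
bw_mode(d = 3)                               # bw not specified
```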

Returns a list with the following fields:

x - data matrix
bw - vector of bandwidths
fhat - kernel density estimate on a grid (output from drvkde)
grad - Boolean matrix which indicates significant gradient on grid
curv - Boolean matrix which indicates significant curvature on grid

In the interactive case, the return values are based on the last bandwidths chosen before the interactive session was ended. In the non-interactive case, the return values are based on the specified bandwidth.

For 1-d data, the gradient SiZer map of Chaudhuri & Marron (1999) is implemented. If this option is selected, the function automatically goes into non-interactive mode. The horizontal axis is the data axis and the vertical axis shows the bandwidths. It returns a list with the following fields:

x.grid - vector of grid points
bw - vector of bandwidths at grid points
SiZer - matrix (rows = grid points, columns = bandwidths) for the SiZer map: 3 = decreasing gradient (red), 2 = increasing gradient (blue), 1 = zero gradient (purple), 0 = sparse region (grey)
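
As an illustration of the SiZer coding described above, the following base R sketch maps the integer codes to their colours. The small matrix is made up for the example and is not output from featureSignif.

```r
# Hypothetical SiZer matrix: rows = grid points, columns = bandwidths.
SiZer <- matrix(c(2, 2, 1, 3,
                  2, 1, 1, 3,
                  0, 1, 1, 1), nrow = 4)

# Codes 0, 1, 2, 3 correspond to grey, purple, blue, red (see text).
code_cols <- c("grey", "purple", "blue", "red")
sizer_cols <- matrix(code_cols[SiZer + 1], nrow = nrow(SiZer))

# Count how many cells fall into each category.
table(factor(SiZer, levels = 0:3,
             labels = c("sparse", "zero", "increasing", "decreasing")))
```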

Details

Feature significance is based on significance testing of the gradient (first derivative) and curvature (second derivative) of a kernel density estimate. This was developed for 1-d data by Chaudhuri & Marron (1999), for 2-d data by Godtliebsen, Marron & Chaudhuri (2002), and for 3-d and 4-d data by Duong, Cowling, Koch & Wand (2006).

The test statistic for gradient testing at a point $\mathbf{x}$ is $$W(\mathbf{x}) = \Vert \widehat{\nabla f} (\mathbf{x}; \mathbf{H}) \Vert^2$$ where $\widehat{\nabla f} (\mathbf{x};\mathbf{H})$ is the kernel estimate of the gradient of $f(\mathbf{x})$ with bandwidth $\mathbf{H}$, and $\Vert\cdot\Vert$ is the Euclidean norm. $W(\mathbf{x})$ is approximately chi-squared distributed with $d$ degrees of freedom, where $d$ is the dimension of the data.
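
A pointwise version of the gradient test can be sketched in base R as follows. The gradient vector is a made-up value, and the sketch assumes it has already been standardised so that the chi-squared approximation applies; no multiple-comparison adjustment is made here.

```r
# Sketch of a pointwise gradient test at a single point x.
d <- 2                     # dimension of the data
signifLevel <- 0.05

# Hypothetical (already standardised) gradient estimate at x.
grad_hat <- c(1.8, -2.1)

W <- sum(grad_hat^2)                              # squared Euclidean norm
crit <- qchisq(1 - signifLevel, df = d)           # chi-squared critical value
p_value <- pchisq(W, df = d, lower.tail = FALSE)  # pointwise p-value

W > crit   # TRUE means the gradient is significant at this point
```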

The test statistic for curvature is analogous to that for gradient testing: $$W^{(2)}(\mathbf{x}) = \Vert \mathrm{vech} \widehat{\nabla^{(2)}f} (\mathbf{x}; \mathbf{H})\Vert ^2$$ where $\widehat{\nabla^{(2)} f} (\mathbf{x};\mathbf{H})$ is the kernel estimate of the curvature of $f(\mathbf{x})$, and vech is the vector-half operator. $W^{(2)}(\mathbf{x})$ is approximately chi-squared distributed with $d(d+1)/2$ degrees of freedom.
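
The vector-half operator stacks the lower triangle of a symmetric matrix, including the diagonal, column by column, which is why the curvature statistic has $d(d+1)/2$ degrees of freedom. A small base R sketch follows; the function vech is defined here for illustration and the curvature matrix is made up.

```r
# vech: stack the lower triangle (with diagonal) column by column.
# Defined here for illustration; not taken from the 'feature' package.
vech <- function(A) A[lower.tri(A, diag = TRUE)]

d <- 3
# Hypothetical symmetric curvature (Hessian) estimate at a point.
curv_hat <- matrix(c(2.0, 0.5, 0.1,
                     0.5, 1.0, 0.3,
                     0.1, 0.3, 4.0), nrow = d)

v <- vech(curv_hat)
length(v)            # number of distinct entries of a symmetric d x d matrix
d * (d + 1) / 2      # degrees of freedom used for the curvature test

W2 <- sum(v^2)       # squared Euclidean norm of vech(curv_hat)
```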

Since this is a situation with many dependent hypothesis tests, we use a multiple comparison or simultaneous test to control the overall level of significance. We use a Hochberg-type procedure. See Hochberg (1988) and Duong, Cowling, Koch & Wand (2006).
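
To give a feel for a Hochberg-style adjustment, the sketch below applies the standard Hochberg (1988) step-up procedure via p.adjust from base R to a handful of made-up p-values. This illustrates the general technique only; the exact variant used inside featureSignif is not shown here and may differ.

```r
# Hypothetical pointwise p-values from several tests.
p <- c(0.001, 0.012, 0.021, 0.04, 0.30)
signifLevel <- 0.05

# Hochberg (1988) step-up adjustment as implemented in stats::p.adjust.
p_hoch <- p.adjust(p, method = "hochberg")

# Tests declared significant at the overall level signifLevel.
reject <- p_hoch <= signifLevel
data.frame(p = p, p_hoch = p_hoch, reject = reject)
```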

References

Chaudhuri, P. and Marron, J.S. (1999) SiZer for exploration of structures in curves. Journal of the American Statistical Association, 94, 807-823.

Duong, T., Cowling, A., Koch, I., Wand, M.P. (2006) Feature significance for multivariate kernel density estimation. Submitted.

Godtliebsen, F., Marron, J.S. and Chaudhuri, P. (2002) Significance in scale space for bivariate density estimation. Journal of Computational and Graphical Statistics, 11, 1-22.

Hochberg, Y. (1988) A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800-802.

Wand, M.P. and Jones, M.C. (1995) Kernel Smoothing. Chapman and Hall.

See Also

bkde (in package 'KernSmooth'), bkde2D (in package 'KernSmooth'), density

Examples

library(feature)  # provides featureSignif and the earthquake data

## Non-interactive examples

## Univariate example
data(earthquake)
eq3 <- -log10(-earthquake[,3])

featureSignif(eq3, addSignifGradRegion=TRUE,xlab="-log(-depth)", bw=0.1)

## combined signif. gradient plot and gradient SiZer plot
layout(matrix(1:2, nrow=2))
featureSignif(eq3, addSignifGradRegion=TRUE,xlab="-log(-depth)", bw=0.1)
xlim <- par()$usr[1:2]
featureSignif(eq3, plotSiZer=TRUE, xlab="-log(-depth)", xlim=xlim)
lines(c(-2, 2), c(0.1, 0.1))
layout(1)

## Bivariate example

library(MASS)
data(geyser)

fs <- featureSignif(geyser, addSignifGradRegion=TRUE,
     addSignifCurvRegion=TRUE, bw=c(4.5, 0.37))
names(fs)

## Trivariate example

data(earthquake)
earthquake[,3] <- -log10(-earthquake[,3])

featureSignif(earthquake, scaleData=TRUE, addData=TRUE,
    bw=c(0.0381, 0.0381, 0.0442))

featureSignif(earthquake, addKDE=FALSE, scaleData=TRUE,
   addSignifGradRegion=TRUE, addSignifCurvRegion=TRUE,
   bw=c(0.0381, 0.0381, 0.0442),
   xlim=c(0.4,0.5), ylim=c(0.4,0.5), zlim=c(0.8,0.9))

## Quadrivariate example

library(MASS)
data(iris) 
featureSignif(iris[,1:4], addSignifGradData=TRUE,
   addSignifCurvRegion=TRUE, bw=c(0.457, 0.210, 0.960, 0.413))

## Interactive examples

library(MASS)
data(geyser)
duration <- geyser$duration 

## Univariate example

featureSignif(duration)
featureSignif(duration, addData=TRUE)
featureSignif(duration, addSignifGradRegion=TRUE,
   addSignifGradData=TRUE)
featureSignif(duration, addSignifCurvRegion=TRUE,
   addSignifCurvData=TRUE)

## Bivariate example

featureSignif(geyser, addData=TRUE, addSignifGradRegion=TRUE,
   addSignifGradData=TRUE, bw=rbind(c(1, 0.1), c(5, 0.9)))
   ## bandwidth ranges: h1 in [1, 5], h2 in [0.1, 0.9]

## Trivariate example

data(earthquake)
earthquake$depth <- -log10(-earthquake$depth)
featureSignif(earthquake, addSignifGradRegion=TRUE, scaleData=TRUE)

## Quadrivariate example

library(MASS)
data(iris)
featureSignif(iris[,1:4], addSignifGradData=TRUE, addSignifCurvRegion=TRUE)
